- Conditionals & subsets
- For loops
- Avoiding for loops with ddply
First we load our MLB data set that contains the career seasonal statistics for 15 different MLB players. Data was collected from https://www.baseball-reference.com/. Variable descriptions can be found there as well.
mlb <- read.csv("MLB Stats.csv")
head(mlb)
##       id year age  tm lg   g  pa  ab  r   h X2b X3b hr rbi sb cs bb  so
## 1 SosaSa 1989  20 TOT AL  58 203 183 27  47   8   0  4  13  7  5 11  47
## 2 SosaSa 1989  20 TEX AL  25  88  84  8  20   3   0  1   3  0  2  0  20
## 3 SosaSa 1989  20 CHW AL  33 115  99 19  27   5   0  3  10  7  3 11  27
## 4 SosaSa 1990  21 CHW AL 153 579 532 72 124  26  10 15  70 32 16 33 150
## 5 SosaSa 1991  22 CHW AL 116 338 316 39  64  10   1 10  33 13  6 14  98
## 6 SosaSa 1992  23 CHC NL  67 291 262 41  68   7   2  8  25 15  7 19  63
##      ba   obp   slg   ops ops.  tb gdp hbp sh sf ibb
## 1 0.257 0.303 0.366 0.669   89  67   6   2  5  2   2
## 2 0.238 0.238 0.310 0.548   52  26   3   0  4  0   0
## 3 0.273 0.351 0.414 0.765  118  41   3   2  1  2   2
## 4 0.233 0.282 0.404 0.687   92 215  10   6  2  6   4
## 5 0.203 0.240 0.335 0.576   59 106   5   2  5  1   2
## 6 0.260 0.317 0.393 0.710   98 103   4   4  4  2   1
We would like to find the career batting average for each of the 15 players.
For one player, Sammy Sosa, we can do it as follows:
ss <- subset(mlb, id == "SosaSa")  # subset() keeps the rows of mlb where the id column equals "SosaSa"
head(ss)
##       id year age  tm lg   g  pa  ab  r   h X2b X3b hr rbi sb cs bb  so
## 1 SosaSa 1989  20 TOT AL  58 203 183 27  47   8   0  4  13  7  5 11  47
## 2 SosaSa 1989  20 TEX AL  25  88  84  8  20   3   0  1   3  0  2  0  20
## 3 SosaSa 1989  20 CHW AL  33 115  99 19  27   5   0  3  10  7  3 11  27
## 4 SosaSa 1990  21 CHW AL 153 579 532 72 124  26  10 15  70 32 16 33 150
## 5 SosaSa 1991  22 CHW AL 116 338 316 39  64  10   1 10  33 13  6 14  98
## 6 SosaSa 1992  23 CHC NL  67 291 262 41  68   7   2  8  25 15  7 19  63
##      ba   obp   slg   ops ops.  tb gdp hbp sh sf ibb
## 1 0.257 0.303 0.366 0.669   89  67   6   2  5  2   2
## 2 0.238 0.238 0.310 0.548   52  26   3   0  4  0   0
## 3 0.273 0.351 0.414 0.765  118  41   3   2  1  2   2
## 4 0.233 0.282 0.404 0.687   92 215  10   6  2  6   4
## 5 0.203 0.240 0.335 0.576   59 106   5   2  5  1   2
## 6 0.260 0.317 0.393 0.710   98 103   4   4  4  2   1
mean(ss$h/ss$ab)  # Mean of the season-by-season batting averages (hits / at bats)
## [1] 0.2675846
This was fairly simple. However, we need an automatic way of calculating this for all 15 players!
Idea of for loops:
Household chores:
Code Skeleton:
result <- rep(NA, length(Indexset))
for (i in Indexset) {
  ... some statements ...
  result[i] <- ...
}
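As a concrete instance of the skeleton, the following minimal sketch fills a result vector with the square of each index. The names `index_set` and `result` and the squaring step are placeholders chosen purely for illustration.

```r
# Index set to loop over (illustrative values)
index_set <- 1:5

# Preallocate the result vector with NA placeholders
result <- rep(NA, length(index_set))

for (i in index_set) {
  # The statement applied to every element of the index set
  result[i] <- i^2
}

result
```

Indexing with `result[i]` works here because the index set is 1, 2, ..., n. For an index set that is not a run of consecutive integers starting at 1, looping over `seq_along(index_set)` is the safer pattern.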
The following for loop computes the career batting average for each player, one player at a time:
players <- unique(mlb$id)  # Creates a vector containing each player id
n <- length(players)       # Finds the length of that vector
ba <- rep(NA, n)           # Creates an "empty" vector of length n
for (i in 1:n) {
  career <- subset(mlb, id == players[i])     # Subsets the mlb data to the player at index i
  ba[i] <- with(career, mean(h/ab, na.rm=T))  # Calculates the career batting average
}
ba
##  [1] 0.2675846 0.2983058 0.2624510 0.2718598 0.2832828 0.2867690 0.2958271
##  [8] 0.2779703 0.2859873 0.2939742 0.2697667 0.3353740 0.2598282 0.2777574
## [15] 0.3257374
summary(ba)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.2598  0.2708  0.2833  0.2862  0.2949  0.3354
The code above prints the career batting averages, but it is not clear which average belongs to which player. On the next slide we present a slightly modified version.
mean_ba <- data.frame(players, rep(NA, n))  # Creates data frame for storage
for (i in 1:n) {
  career <- subset(mlb, id == players[i])
  # Puts the result in the 2nd column; na.rm=T removes any missing values
  mean_ba[i,2] <- with(career, mean(h/ab, na.rm=T))
}
colnames(mean_ba) <- c("Players", "Batting Average")  # Renaming columns
mean_ba
##     Players Batting Average
## 1    SosaSa       0.2675846
## 2   BondsBa       0.2983058
## 3   McGwiMa       0.2624510
## 4  MatsuiHi       0.2718598
## 5   RodriAl       0.2832828
## 6   SheffGa       0.2867690
## 7   RamirMa       0.2958271
## 8  BiggioCr       0.2779703
## 9   PalmeRa       0.2859873
## 10  LarkiBa       0.2939742
## 11  RipkeCa       0.2697667
## 12  GwynnTo       0.3353740
## 13  OzzieSm       0.2598282
## 14  McGriFr       0.2777574
## 15  BoggsWa       0.3257374
Each batting average is now labeled with its player, rather than appearing in an unlabeled sequence of numbers.
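It is worth being precise about what `mean(h/ab)` measures: it averages the season-by-season batting averages, so every season counts equally no matter how many at bats it contains. The sketch below contrasts this with total hits divided by total at bats, using three of Sammy Sosa's seasons (1990 to 1992) copied from the `head(mlb)` output. The names `sosa`, `mean_of_seasons`, and `pooled` are illustrative and not part of the original analysis.

```r
# Three of Sammy Sosa's seasons, copied from the head(mlb) output
sosa <- data.frame(
  year = c(1990, 1991, 1992),
  h    = c(124, 64, 68),    # hits
  ab   = c(532, 316, 262)   # at bats
)

# Mean of the seasonal averages: every season gets equal weight
mean_of_seasons <- with(sosa, mean(h / ab, na.rm = TRUE))

# Pooled average: total hits over total at bats, so seasons with
# more at bats contribute more
pooled <- with(sosa, sum(h, na.rm = TRUE) / sum(ab, na.rm = TRUE))

c(mean_of_seasons = mean_of_seasons, pooled = pooled)
```

The two numbers generally differ whenever the number of at bats varies from season to season.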
Under MLB rules, a player must have played at least 1000 games, with at least as many at bats, to be considered among the greatest all-time hitters.
(the sum() command may be of use)
career.stats <- data.frame(players, rep(NA,n), rep(NA,n), rep(NA,n))  # Creates data frame for storage
for (i in 1:n) {
  career <- subset(mlb, id == players[i])
  career.stats[i,2] <- with(career, mean(h/ab, na.rm=T))  # Calculating career batting average
  career.stats[i,3] <- sum(career$g, na.rm=T)             # sum career games
  career.stats[i,4] <- sum(career$ab, na.rm=T)            # sum career at bats
}
colnames(career.stats) <- c("Players","Batting Average","Total Games","Total At Bats")  # Renaming columns
career.stats
##     Players Batting Average Total Games Total At Bats
## 1    SosaSa       0.2675846        2412          8996
## 2   BondsBa       0.2983058        2986          9847
## 3   McGwiMa       0.2624510        2030          6727
## 4  MatsuiHi       0.2718598        1236          4442
## 5   RodriAl       0.2832828        2784         10566
## 6   SheffGa       0.2867690        2846         10148
## 7   RamirMa       0.2958271        2545          9061
## 8  BiggioCr       0.2779703        2850         10876
## 9   PalmeRa       0.2859873        2831         10472
## 10  LarkiBa       0.2939742        2180          7937
## 11  RipkeCa       0.2697667        3001         11551
## 12  GwynnTo       0.3353740        2440          9288
## 13  OzzieSm       0.2598282        2573          9396
## 14  McGriFr       0.2777574        2757          9827
## 15  BoggsWa       0.3257374        2440          9180
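The eligibility rule can then be applied to the career totals with a logical condition. The sketch below rebuilds three rows of `career.stats` from the output above and adds one player, "RookieXx", who is invented purely for illustration so that the filter has something to exclude. The object name `eligible` is also an assumption.

```r
# Three rows copied from the career.stats output, plus one invented row
career.stats <- data.frame(
  Players = c("SosaSa", "MatsuiHi", "GwynnTo", "RookieXx"),  # "RookieXx" is hypothetical
  ba      = c(0.2675846, 0.2718598, 0.3353740, 0.3000000),
  games   = c(2412, 1236, 2440, 150),
  atbats  = c(8996, 4442, 9288, 480)
)
colnames(career.stats) <- c("Players", "Batting Average",
                            "Total Games", "Total At Bats")

# Keep players with at least 1000 games and at least 1000 at bats
keep <- career.stats[["Total Games"]] >= 1000 &
  career.stats[["Total At Bats"]] >= 1000
eligible <- career.stats[keep, ]

# Sort the eligible players from highest to lowest batting average
eligible <- eligible[order(eligible[["Batting Average"]], decreasing = TRUE), ]
eligible
```

Because the column names contain spaces, the columns are accessed with `[["..."]]` rather than `$`.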
What is difficult?
What was difficult?
Routines from the plyr package help us to avoid loops
Usage: ddply(.data, .variables, .fun = NULL, ...)
Split-apply-combine approach, i.e.
- split the data into subsets, one for each element of an index set
- apply the same statements to each subset
- combine the results
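To make the three steps visible, here is a minimal sketch that writes them out by hand in base R rather than with ddply. It uses five of Sammy Sosa's season rows copied from the `head(mlb)` output and groups them by team. The names `sosa`, `pieces`, `summaries`, and `result` are illustrative.

```r
# Five of Sammy Sosa's season rows, copied from the head(mlb) output.
# The "TOT" row is left out because it is the sum of the TEX and CHW rows.
sosa <- data.frame(
  tm = c("TEX", "CHW", "CHW", "CHW", "CHC"),
  h  = c(20, 27, 124, 64, 68),
  ab = c(84, 99, 532, 316, 262)
)

# Split: one data frame per team
pieces <- split(sosa, sosa$tm)

# Apply: compute the same summary on every piece
summaries <- lapply(pieces, function(piece) {
  data.frame(
    tm     = piece$tm[1],
    ba     = mean(piece$h / piece$ab, na.rm = TRUE),
    atbats = sum(piece$ab, na.rm = TRUE)
  )
})

# Combine: stack the per-team results into one data frame
result <- do.call(rbind, summaries)
result
```

ddply performs these three steps in a single call, which is what makes it a convenient replacement for the explicit loop.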
library(plyr)  # Load relevant libraries
# ddply(dataset, variable, function)
meanstats <- ddply(mlb, .(id), colwise(mean))
head(meanstats)
##         id   year  age tm lg        g       pa       ab         r        h
## 1 BiggioCr 1997.5 31.5 NA NA 142.5000 625.2000 543.8000  92.20000 153.0000
## 2  BoggsWa 1990.5 32.5 NA NA 135.5556 596.6667 510.0000  84.05556 167.2222
## 3  BondsBa 1996.5 31.5 NA NA 135.7273 573.0000 447.5909 101.22727 133.4091
## 4  GwynnTo 1991.5 31.5 NA NA 122.0000 511.6000 464.4000  69.15000 157.0500
## 5  LarkiBa 1995.0 31.0 NA NA 114.7368 476.6842 417.7368  69.94737 123.1579
## 6 MatsuiHi 2007.5 33.5 NA NA 123.6000 506.6000 444.2000  65.60000 125.3000
##        X2b      X3b        hr      rbi        sb       cs        bb
## 1 33.40000 2.750000 14.550000 58.75000 20.700000 6.200000  58.00000
## 2 32.11111 3.388889  6.555556 56.33333  1.333333 1.944444  78.44444
## 3 27.31818 3.500000 34.636364 90.72727 23.363636 6.409091 116.27273
## 4 27.15000 4.250000  6.750000 56.90000 15.950000 6.250000  39.50000
## 5 23.21053 4.000000 10.421053 50.52632 19.947368 4.052632  49.42105
## 6 24.90000 1.200000 17.500000 76.00000  1.300000 0.900000  54.70000
##         so        ba       obp       slg       ops     ops.       tb
## 1 87.65000 0.2779000 0.3575500 0.4281500 0.7854500 109.6000 235.5500
## 2 41.38889 0.3257778 0.4109444 0.4392778 0.8501111 128.1111 225.7778
## 3 69.95455 0.2983636 0.4414545 0.6155909 1.0571364 181.0909 271.6364
## 4 21.70000 0.3354000 0.3848500 0.4554500 0.8403000 129.8000 212.9500
## 5 43.00000 0.2940526 0.3694211 0.4395789 0.8088947 114.3158 185.6316
## 6 68.90000 0.2717000 0.3503000 0.4423000 0.7927000 110.0000 205.1000
##         gdp       hbp        sh       sf       ibb
## 1  7.500000 14.250000 5.0500000 4.050000  3.400000
## 2 13.111111  1.277778 1.6111111 5.333333 10.000000
## 3  7.500000  4.818182 0.1818182 4.136364 31.272727
## 4 12.950000  1.200000 2.2500000 4.250000 10.150000
## 5  9.368421  2.894737 3.1052632 3.526316  3.473684
## 6 10.600000  2.100000 0.0000000 4.600000  3.500000
Note that the non-numeric columns (tm and lg) are given NA values, since means can only be computed on numeric values.
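One way to avoid the NA columns is to average only the numeric columns. The base R sketch below does this for three of Sammy Sosa's seasons copied from the `head(mlb)` output. The names `sosa`, `is_num`, and `num_means` are illustrative. plyr also provides `numcolwise()`, which restricts a column-wise function to numeric columns.

```r
# Three of Sammy Sosa's seasons, copied from the head(mlb) output
sosa <- data.frame(
  id = rep("SosaSa", 3),
  tm = c("CHW", "CHW", "CHC"),
  lg = c("AL", "AL", "NL"),
  h  = c(124, 64, 68),
  ab = c(532, 316, 262)
)

# Flag the columns that hold numeric data
is_num <- sapply(sosa, is.numeric)

# Average only the numeric columns; id, tm and lg are skipped
num_means <- colMeans(sosa[, is_num, drop = FALSE], na.rm = TRUE)
num_means
```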
A special function: summarise (or summarize)
The summarize() function creates a new data frame containing the summary statistics given as its arguments. In addition, we can first restrict the data to a specific player with subset().
library(plyr)
summarize(mlb, ba = mean(h/ab, na.rm=T))  # Calculates batting average over entire data set
##          ba
## 1 0.2866281
summarize(mlb,
          ba = mean(h/ab, na.rm=T),  # Calculates batting average over entire data set
          games = sum(g, na.rm=T),   # Calculates total games over entire data set
          hr = sum(hr, na.rm=T),     # Calculates total home runs over entire data set
          ab = sum(ab, na.rm=T))     # Calculates total at bats over entire data set
##          ba games   hr     ab
## 1 0.2866281 37911 6370 138314
This code is similar to the code above, except that the data are first subset to the rows that correspond to SosaSa.
summarize(subset(mlb, id=="SosaSa"),
          ba = mean(h/ab, na.rm=T),
          games = sum(g, na.rm=T),
          hr = sum(hr, na.rm=T),
          ab = sum(ab, na.rm=T))
##          ba games  hr   ab
## 1 0.2675846  2412 613 8996
Powerful combination to create summary statistics
ddply() splits the MLB data by unique player id, then applies the summarize function to each piece to compute each statistic. Note how much simpler this is than a for loop.
careers <- ddply(mlb, .(id), summarize,       # Gathers mlb data by id, then applies summarize
                 ba = mean(h/ab, na.rm=T),    # Calculates batting average for each id
                 games = sum(g, na.rm=T),     # Calculates total games for each id
                 atbats = sum(ab, na.rm=T)    # Calculates total at bats for each id
                 )
head(careers)
##         id        ba games atbats
## 1 BiggioCr 0.2779703  2850  10876
## 2  BoggsWa 0.3257374  2440   9180
## 3  BondsBa 0.2983058  2986   9847
## 4  GwynnTo 0.3353740  2440   9288
## 5  LarkiBa 0.2939742  2180   7937
## 6 MatsuiHi 0.2718598  1236   4442
# Subsets
summarize(subset(mlb, id=="McGwiMa" & lg=="AL"),
          ba = mean(h/ab, na.rm=T))
##          ba
## 1 0.2606549
ddply(mlb, .(id,lg), summarize,   # Gathers mlb data by id and league, then applies summarize
      ba = mean(h/ab, na.rm=T)    # Calculates batting average for each id and league combination
      )
##          id  lg        ba
## 1  BiggioCr  NL 0.2779703
## 2   BoggsWa  AL 0.3257374
## 3   BondsBa  NL 0.2983058
## 4   GwynnTo  NL 0.3353740
## 5   LarkiBa  NL 0.2939742
## 6  MatsuiHi  AL 0.2718598
## 7   McGriFr  AL 0.2667029
## 8   McGriFr MLB 0.3060429
## 9   McGriFr  NL 0.2846123
## 10  McGwiMa  AL 0.2606549
## 11  McGwiMa MLB 0.2740741
## 12  McGwiMa  NL 0.2644370
## 13  OzzieSm  NL 0.2598282
## 14  PalmeRa  AL 0.2876621
## 15  PalmeRa  NL 0.2764967
## 16  RamirMa  AL 0.2876483
## 17  RamirMa MLB 0.3148175
## 18  RamirMa  NL 0.3322397
## 19  RipkeCa  AL 0.2697667
## 20  RodriAl  AL 0.2832828
## 21  SheffGa  AL 0.2601753
## 22  SheffGa  NL 0.3008480
## 23  SosaSa  AL 0.2395353
## 24  SosaSa  NL 0.2826881
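For comparison, base R's `aggregate()` can produce a similar summary grouped by two variables without plyr. This is a minimal sketch on five of Sammy Sosa's season rows copied from the `head(mlb)` output (the "TOT" row is omitted). The names `sosa`, `season_ba`, and `by_league` are illustrative.

```r
# Five of Sammy Sosa's season rows, copied from the head(mlb) output
sosa <- data.frame(
  id = "SosaSa",
  lg = c("AL", "AL", "AL", "AL", "NL"),
  h  = c(20, 27, 124, 64, 68),
  ab = c(84, 99, 532, 316, 262)
)

# Seasonal batting average for each row
sosa$season_ba <- sosa$h / sosa$ab

# Mean seasonal batting average for each id and league combination
by_league <- aggregate(season_ba ~ id + lg, data = sosa, FUN = mean)
by_league
```

The result has one row per id and league combination, mirroring the layout of the ddply output above.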