Intro to ddply

Outline

Conditionals & subsets
For loops
Avoiding for loops with ddply

Baseball Data

First we load our MLB data set that contains the career seasonal statistics for 15 different MLB players. Data was collected from https://www.baseball-reference.com/. Variable descriptions can be found there as well.

mlb <- read.csv("MLB Stats.csv")
head(mlb)

##       id year age  tm lg   g  pa  ab  r   h X2b X3b hr rbi sb cs bb  so
## 1 SosaSa 1989  20 TOT AL  58 203 183 27  47   8   0  4  13  7  5 11  47
## 2 SosaSa 1989  20 TEX AL  25  88  84  8  20   3   0  1   3  0  2  0  20
## 3 SosaSa 1989  20 CHW AL  33 115  99 19  27   5   0  3  10  7  3 11  27
## 4 SosaSa 1990  21 CHW AL 153 579 532 72 124  26  10 15  70 32 16 33 150
## 5 SosaSa 1991  22 CHW AL 116 338 316 39  64  10   1 10  33 13  6 14  98
## 6 SosaSa 1992  23 CHC NL  67 291 262 41  68   7   2  8  25 15  7 19  63
##      ba   obp   slg   ops ops.  tb gdp hbp sh sf ibb
## 1 0.257 0.303 0.366 0.669   89  67   6   2  5  2   2
## 2 0.238 0.238 0.310 0.548   52  26   3   0  4  0   0
## 3 0.273 0.351 0.414 0.765  118  41   3   2  1  2   2
## 4 0.233 0.282 0.404 0.687   92 215  10   6  2  6   4
## 5 0.203 0.240 0.335 0.576   59 106   5   2  5  1   2
## 6 0.260 0.317 0.393 0.710   98 103   4   4  4  2   1

Baseball Data

Goal:

We would like to find the career batting average for each of the 15 players.

For one player, Sammy Sosa, we can do it as follows:

ss <- subset(mlb, id == "SosaSa") 
# subset() keeps the rows of the mlb data where the id column equals "SosaSa"
head(ss)

##       id year age  tm lg   g  pa  ab  r   h X2b X3b hr rbi sb cs bb  so
## 1 SosaSa 1989  20 TOT AL  58 203 183 27  47   8   0  4  13  7  5 11  47
## 2 SosaSa 1989  20 TEX AL  25  88  84  8  20   3   0  1   3  0  2  0  20
## 3 SosaSa 1989  20 CHW AL  33 115  99 19  27   5   0  3  10  7  3 11  27
## 4 SosaSa 1990  21 CHW AL 153 579 532 72 124  26  10 15  70 32 16 33 150
## 5 SosaSa 1991  22 CHW AL 116 338 316 39  64  10   1 10  33 13  6 14  98
## 6 SosaSa 1992  23 CHC NL  67 291 262 41  68   7   2  8  25 15  7 19  63
##      ba   obp   slg   ops ops.  tb gdp hbp sh sf ibb
## 1 0.257 0.303 0.366 0.669   89  67   6   2  5  2   2
## 2 0.238 0.238 0.310 0.548   52  26   3   0  4  0   0
## 3 0.273 0.351 0.414 0.765  118  41   3   2  1  2   2
## 4 0.233 0.282 0.404 0.687   92 215  10   6  2  6   4
## 5 0.203 0.240 0.335 0.576   59 106   5   2  5  1   2
## 6 0.260 0.317 0.393 0.710   98 103   4   4  4  2   1

mean(ss$h/ss$ab) # Calculates the mean of the seasonal batting averages

## [1] 0.2675846

This was fairly simple. However, we need an automatic way of calculating this for all 15 players!
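Note that mean(ss$h/ss$ab) averages the seasonal batting averages, so every season counts equally regardless of how many at bats it had. The following is a minimal sketch with made-up numbers (the data frame toy is hypothetical and not part of the MLB file) showing how this can differ from dividing total hits by total at bats:

```r
# Hypothetical two-season career (toy numbers, not from the MLB file)
toy <- data.frame(h = c(10, 150), ab = c(20, 600))

mean_of_seasons <- mean(toy$h / toy$ab)      # average of 0.50 and 0.25
pooled          <- sum(toy$h) / sum(toy$ab)  # 160 hits in 620 at bats

mean_of_seasons
pooled
```

The slides use the mean of the seasonal averages throughout, so the sketch is only meant to make that choice visible.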

For Loops

Idea of for loops:

repeat the same (set of) statement(s) for each element of an Indexset

Household chores:

Introduce a counter variable (often i)
Reserve space for results

Code Skeleton:

result <- rep(NA, length(Indexset))

for (i in Indexset) {
 ... some statements ...
 result[i] <- ...
}
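As a concrete instance of the skeleton above, here is a small sketch that squares each element of a vector. The vector values is a made-up index set used only for illustration:

```r
# Hypothetical index set: three made-up values to square
values <- c(2, 5, 7)

result <- rep(NA, length(values))   # reserve space for the results
for (i in seq_along(values)) {      # i runs over the positions 1, 2, 3
  result[i] <- values[i]^2          # store the i-th result
}
result
```

Looping over seq_along(values) rather than over the values themselves keeps i usable as a position in result.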

All baseball players' careers

In the following for loop, we do the following:

Create a vector, players, that contains the id of each player
Define n to be the number of players. In this case n = 15
Create a vector, ba (Batting Average), that contains NA values of length n
Iterate through the for loop with the following logic:

career: is the subset of the MLB data for a given player
ba[i]: fills in the i-th entry of ba with the mean of the yearly batting averages for the subsetted player

players <- unique(mlb$id) # Creates a vector containing each player id
n <- length(players) # Finds length of such vector
ba <- rep(NA, n) # Creates "empty" vector of length n
for (i in 1:n) {
  career <- subset(mlb, id == players[i]) # Subsetting mlb data based on player in index i
  ba[i] <- with(career, mean(h/ab, na.rm=T)) # Calculating career batting average
}
ba

##  [1] 0.2675846 0.2983058 0.2624510 0.2718598 0.2832828 0.2867690 0.2958271
##  [8] 0.2779703 0.2859873 0.2939742 0.2697667 0.3353740 0.2598282 0.2777574
## [15] 0.3257374

summary(ba)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2598  0.2708  0.2833  0.2862  0.2949  0.3354

Create Data Frame Finding Mean Batting Average for a Player's Career

The code above prints the career batting averages, but it is not clear which value belongs to which player. The slightly modified version below stores the player ids alongside the averages.

mean_ba <- data.frame(players,rep(NA,n)) # Creates data frame for storage
for (i in 1:n) {
  career <- subset(mlb, id == players[i])
  mean_ba[i,2] <- with(career, mean(h/ab, na.rm=T)) # Puts the mean in the 2nd column,
  # ignoring any missing values
}
colnames(mean_ba) <- c("Players","Batting Average") # Renaming columns
mean_ba

##     Players Batting Average
## 1    SosaSa       0.2675846
## 2   BondsBa       0.2983058
## 3   McGwiMa       0.2624510
## 4  MatsuiHi       0.2718598
## 5   RodriAl       0.2832828
## 6   SheffGa       0.2867690
## 7   RamirMa       0.2958271
## 8  BiggioCr       0.2779703
## 9   PalmeRa       0.2859873
## 10  LarkiBa       0.2939742
## 11  RipkeCa       0.2697667
## 12  GwynnTo       0.3353740
## 13  OzzieSm       0.2598282
## 14  McGriFr       0.2777574
## 15  BoggsWa       0.3257374

From this we can easily see which player corresponds to which batting average as opposed to a sequence of numbers.

Your Turn

MLB rules for the greatest all-time hitters are that players have to have played at least 1000 games with at least as many at bats in order to be considered.

Extend the for loop above to collect the additional information, i.e. introduce and collect data for two new variables, games and atbats. Create a data frame that is easy to read, with player and career information. (Hint: the sum() command may be of use)

Answers

1.

career.stats <- 
  data.frame(players,rep(NA,n),rep(NA,n),rep(NA,n)) # Creates data frame for storage
for (i in 1:n) {
  career <- subset(mlb, id == players[i])
  career.stats[i,2] <- with(career, mean(h/ab, na.rm=T)) # Calculating career batting average
  career.stats[i,3] <- sum(career$g, na.rm=T) # sum career games
  career.stats[i,4] <- sum(career$ab, na.rm=T) # sum career at bats
}
colnames(career.stats) <- 
  c("Players","Batting Average","Total Games","Total At Bats") # Renaming columns

career.stats

##     Players Batting Average Total Games Total At Bats
## 1    SosaSa       0.2675846        2412          8996
## 2   BondsBa       0.2983058        2986          9847
## 3   McGwiMa       0.2624510        2030          6727
## 4  MatsuiHi       0.2718598        1236          4442
## 5   RodriAl       0.2832828        2784         10566
## 6   SheffGa       0.2867690        2846         10148
## 7   RamirMa       0.2958271        2545          9061
## 8  BiggioCr       0.2779703        2850         10876
## 9   PalmeRa       0.2859873        2831         10472
## 10  LarkiBa       0.2939742        2180          7937
## 11  RipkeCa       0.2697667        3001         11551
## 12  GwynnTo       0.3353740        2440          9288
## 13  OzzieSm       0.2598282        2573          9396
## 14  McGriFr       0.2777574        2757          9827
## 15  BoggsWa       0.3257374        2440          9180

How did the Your Turn go?

What was difficult?

household chores distract from 'real work'
indices are error-prone
loops often result in slow code, because they do not make use of R's optimized vectorized operations
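One example of how indices can go wrong: when the index set is empty, 1:n counts down from 1 to 0 instead of skipping the loop. The sketch below uses a hypothetical empty vector players_empty to show the difference between 1:n and seq_len(n):

```r
players_empty <- character(0)   # hypothetical: no players at all
n0 <- length(players_empty)     # 0

visited_colon <- 0
for (i in 1:n0) visited_colon <- visited_colon + 1      # 1:0 is c(1, 0)

visited_seq <- 0
for (i in seq_len(n0)) visited_seq <- visited_seq + 1   # seq_len(0) is empty

c(colon = visited_colon, seq_len = visited_seq)
```

With the 15 players in the MLB data both forms behave the same; the difference only appears when n is 0.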

plyr package

Routines from the plyr package help us to avoid loops
Usage: ddply(.data, .variables, .fun = NULL, ...)
Split-apply-combine approach:

split the data into subsets, one for each element of an index set
apply the same statements to each subset
combine the results
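To make the three steps visible, here is a rough sketch of split-apply-combine written with base R only. It illustrates the idea and is not how ddply is implemented internally; the data frame toy and its values are made up:

```r
# Hypothetical toy data (not the MLB file): two players, two seasons each
toy <- data.frame(id = c("A", "A", "B", "B"),
                  h  = c(50, 60, 30, 90),
                  ab = c(200, 200, 100, 300))

pieces  <- split(toy, toy$id)             # split: one data frame per id
applied <- lapply(pieces, function(d) {   # apply: same statements for each piece
  data.frame(id = d$id[1], ba = mean(d$h / d$ab))
})
combined <- do.call(rbind, applied)       # combine: stack the results
combined
```

ddply wraps these three steps in a single call and takes care of labelling the result with the grouping variable.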

Example

Separates baseball data into one subset for each player
Computes the mean for all columns of the subset

library(plyr) # Load relevant libraries
# ddply(dataset, variable, function)
meanstats <- ddply(mlb, .(id),colwise(mean))

head(meanstats)

##         id   year  age tm lg        g       pa       ab         r        h
## 1 BiggioCr 1997.5 31.5 NA NA 142.5000 625.2000 543.8000  92.20000 153.0000
## 2  BoggsWa 1990.5 32.5 NA NA 135.5556 596.6667 510.0000  84.05556 167.2222
## 3  BondsBa 1996.5 31.5 NA NA 135.7273 573.0000 447.5909 101.22727 133.4091
## 4  GwynnTo 1991.5 31.5 NA NA 122.0000 511.6000 464.4000  69.15000 157.0500
## 5  LarkiBa 1995.0 31.0 NA NA 114.7368 476.6842 417.7368  69.94737 123.1579
## 6 MatsuiHi 2007.5 33.5 NA NA 123.6000 506.6000 444.2000  65.60000 125.3000
##        X2b      X3b        hr      rbi        sb       cs        bb
## 1 33.40000 2.750000 14.550000 58.75000 20.700000 6.200000  58.00000
## 2 32.11111 3.388889  6.555556 56.33333  1.333333 1.944444  78.44444
## 3 27.31818 3.500000 34.636364 90.72727 23.363636 6.409091 116.27273
## 4 27.15000 4.250000  6.750000 56.90000 15.950000 6.250000  39.50000
## 5 23.21053 4.000000 10.421053 50.52632 19.947368 4.052632  49.42105
## 6 24.90000 1.200000 17.500000 76.00000  1.300000 0.900000  54.70000
##         so        ba       obp       slg       ops     ops.       tb
## 1 87.65000 0.2779000 0.3575500 0.4281500 0.7854500 109.6000 235.5500
## 2 41.38889 0.3257778 0.4109444 0.4392778 0.8501111 128.1111 225.7778
## 3 69.95455 0.2983636 0.4414545 0.6155909 1.0571364 181.0909 271.6364
## 4 21.70000 0.3354000 0.3848500 0.4554500 0.8403000 129.8000 212.9500
## 5 43.00000 0.2940526 0.3694211 0.4395789 0.8088947 114.3158 185.6316
## 6 68.90000 0.2717000 0.3503000 0.4423000 0.7927000 110.0000 205.1000
##         gdp       hbp        sh       sf       ibb
## 1  7.500000 14.250000 5.0500000 4.050000  3.400000
## 2 13.111111  1.277778 1.6111111 5.333333 10.000000
## 3  7.500000  4.818182 0.1818182 4.136364 31.272727
## 4 12.950000  1.200000 2.2500000 4.250000 10.150000
## 5  9.368421  2.894737 3.1052632 3.526316  3.473684
## 6 10.600000  2.100000 0.0000000 4.600000  3.500000

Note that columns which are factors are given NA values since means can only be computed on numeric values.
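If the NA columns are not wanted, plyr also provides numcolwise(), which applies a function to the numeric columns only. A minimal sketch on a made-up data frame (toy and its columns are hypothetical stand-ins for the MLB data):

```r
library(plyr)

# Hypothetical toy data with one non-numeric column (tm)
toy <- data.frame(id = c("A", "A", "B"),
                  tm = c("X", "Y", "X"),
                  h  = c(50, 60, 30),
                  ab = c(200, 200, 100))

# numcolwise() applies the function to numeric columns only,
# so tm is dropped instead of being filled with NA
num_means <- ddply(toy, .(id), numcolwise(mean))
num_means
```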

summarize

A special function: summarise (or summarize)

summarize() creates a new data frame that contains the summary values computed from the given expressions. In addition, we can first subset our data to a specific player.

library(plyr)

summarize(mlb, ba = mean(h/ab, na.rm=T)) # Calculates batting average over entire data set

##          ba
## 1 0.2866281

summarize(mlb,
 ba = mean(h/ab, na.rm=T), # Calculates batting average over entire data set
 games = sum(g, na.rm=T), # Calculates total games over entire data set
 hr=sum(hr, na.rm=T), # Calculates total home runs over entire data set
 ab = sum(ab, na.rm=T)) # Calculates total at bats over entire data set

##          ba games   hr     ab
## 1 0.2866281 37911 6370 138314

The code below is the same as above, except that the data is first subset to the rows that correspond to SosaSa.

summarize(subset(mlb, id=="SosaSa"),
 ba = mean(h/ab, na.rm=T),
 games = sum(g, na.rm=T),
 hr=sum(hr, na.rm=T),
 ab = sum(ab, na.rm=T))

##          ba games  hr   ab
## 1 0.2675846  2412 613 8996
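The order of the arguments in the calls above matters. plyr's summarize evaluates its expressions one after another, and a new column that reuses an existing name (here ab) replaces the original for every expression that follows. A small sketch with made-up numbers (toy is hypothetical) shows the effect:

```r
library(plyr)

# Hypothetical toy data: two made-up seasons
toy <- data.frame(h = c(50, 60), ab = c(200, 200))

# ba is computed before ab is overwritten, so it uses the seasonal at bats
good <- summarize(toy, ba = mean(h / ab), ab = sum(ab))

# Here ab is already the total (400) by the time ba is computed
bad <- summarize(toy, ab = sum(ab), ba = mean(h / ab))

good$ba
bad$ba
```

Giving summary columns names that differ from the input columns (for example atbats instead of ab) avoids the problem entirely.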

ddply + summarise

Powerful combination to create summary statistics

ddply() splits the MLB data by player id, applies summarize to each subset to compute the statistics, and combines the results into one data frame. Note how much simpler this is than the for loop.

careers <- ddply(mlb, .(id), summarize, # Gathers mlb data by id, then applies summarize
ba = mean(h/ab, na.rm=T), # Calculates batting average for each id
games = sum(g, na.rm=T), # Calculates total games for each id
atbats = sum(ab, na.rm=T) # Calculates total at bats for each id
) 

head(careers)

##         id        ba games atbats
## 1 BiggioCr 0.2779703  2850  10876
## 2  BoggsWa 0.3257374  2440   9180
## 3  BondsBa 0.2983058  2986   9847
## 4  GwynnTo 0.3353740  2440   9288
## 5  LarkiBa 0.2939742  2180   7937
## 6 MatsuiHi 0.2718598  1236   4442
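A summary in this shape makes the eligibility rule from the first Your Turn (at least 1000 games and at least 1000 at bats) easy to apply. The sketch below uses a made-up stand-in, careers_toy, with the same column names as careers; the values are hypothetical:

```r
# Hypothetical stand-in for the careers data frame (made-up values)
careers_toy <- data.frame(id     = c("A", "B", "C"),
                          ba     = c(0.310, 0.280, 0.295),
                          games  = c(2400, 850, 1300),
                          atbats = c(9000, 3100, 4500))

# Keep players with at least 1000 games and at least 1000 at bats
eligible <- subset(careers_toy, games >= 1000 & atbats >= 1000)

# Order the remaining players from highest to lowest batting average
eligible[order(-eligible$ba), ]
```

The same two lines should work on careers itself, assuming it was created as shown above.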

Your Turn

What was Mark McGwire's batting average while he was in the AL (American League)?
Find career batting averages for all players for each league they were part of.

Answers

1.

# Subsets
summarize(subset(mlb, id=="McGwiMa" & lg=="AL"),
 ba = mean(h/ab, na.rm=T))

##          ba
## 1 0.2606549

2.

ddply(mlb, .(id,lg), summarize, # Splits mlb data by id and league, then applies summarize
ba = mean(h/ab, na.rm=T) # Calculates batting average for each id and league
)

##          id  lg        ba
## 1  BiggioCr  NL 0.2779703
## 2   BoggsWa  AL 0.3257374
## 3   BondsBa  NL 0.2983058
## 4   GwynnTo  NL 0.3353740
## 5   LarkiBa  NL 0.2939742
## 6  MatsuiHi  AL 0.2718598
## 7   McGriFr  AL 0.2667029
## 8   McGriFr MLB 0.3060429
## 9   McGriFr  NL 0.2846123
## 10  McGwiMa  AL 0.2606549
## 11  McGwiMa MLB 0.2740741
## 12  McGwiMa  NL 0.2644370
## 13  OzzieSm  NL 0.2598282
## 14  PalmeRa  AL 0.2876621
## 15  PalmeRa  NL 0.2764967
## 16  RamirMa  AL 0.2876483
## 17  RamirMa MLB 0.3148175
## 18  RamirMa  NL 0.3322397
## 19  RipkeCa  AL 0.2697667
## 20  RodriAl  AL 0.2832828
## 21  SheffGa  AL 0.2601753
## 22  SheffGa  NL 0.3008480
## 23   SosaSa  AL 0.2395353
## 24   SosaSa  NL 0.2826881