---
title: "Intro to ddply"
output:
  ioslides_presentation:
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Outline
- Conditionals & subsets
- For loops
- Avoiding for loops with `ddply`
## Baseball Data
First we load our MLB data set, which contains the career seasonal statistics for 15 different MLB players. The data were collected from https://www.baseball-reference.com/, where variable descriptions can also be found.
```{r}
mlb <- read.csv("MLB Stats.csv")
head(mlb)
```
## Baseball Data
### Goal:
We would like to find the career batting average for each of the 15 players.
For one player, Sammy Sosa, we can do it as follows:
```{r}
ss <- subset(mlb, id == "SosaSa")
# subset() keeps the rows of the mlb data where the id column equals "SosaSa"
head(ss)
```
##
```{r}
mean(ss$h/ss$ab) # Mean of Sosa's seasonal batting averages (hits / at bats)
```
This was fairly simple. However, we need an automatic way of calculating this for all 15 players!
## For Loops
Idea of for loops:
- repeat the same (set of) statement(s) for each element of an Indexset
Household chores:
- Introduce a counter variable (often `i`)
- Reserve space for results
Code Skeleton:
```{r,eval = F}
result <- rep(NA, length(Indexset))
for (i in Indexset) {
... some statements ...
result[i] <- ...
}
```
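##

As a concrete instance of the skeleton, here is a minimal sketch that squares each element of a small vector. The names `x`, `index_set`, and `squares` are made up for illustration. Note that `squares[i]` only lines up with the reserved space when the index set is 1, 2, ..., n, which is why the sketch iterates over `seq_along(x)`.

```{r}
x <- c(3, 5, 8)                        # hypothetical input values
index_set <- seq_along(x)              # the index set 1, 2, 3
squares <- rep(NA, length(index_set))  # reserve space for the results
for (i in index_set) {
  squares[i] <- x[i]^2                 # same statement for each element
}
squares
```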
## All baseball players' careers
The for loop on the next slide does the following:
1. Create a vector, **players**, that contains the id of each player
2. Define **n** to be the number of players. In this case **n = 15**
3. Create a vector, **ba** (Batting Average), that contains NA values of length **n**
4. Iterate through the for loop with the following logic:
    - **career**: the subset of the MLB data for a given player
    - **ba[i]**: the **i-th** entry of **ba**, filled with the mean of the yearly batting averages for that player
##
```{r}
players <- unique(mlb$id) # Vector containing each unique player id
n <- length(players) # Number of players
ba <- rep(NA, n) # Creates "empty" vector of length n
for (i in 1:n) {
  career <- subset(mlb, id == players[i]) # Subset of the mlb data for the player at index i
  ba[i] <- with(career, mean(h/ab, na.rm = TRUE)) # Mean of that player's seasonal batting averages
}
ba
summary(ba)
```
## Create Data Frame Finding Mean Batting Average for a Player's Career
The code above prints the career batting averages, but it is not clear which value belongs to which player. The next slide presents a slightly modified version.
##
```{r}
mean_ba <- data.frame(players, rep(NA, n)) # Creates data frame for storage
for (i in 1:n) {
  career <- subset(mlb, id == players[i])
  # Stores the result in the 2nd column; na.rm = TRUE drops missing values
  mean_ba[i, 2] <- with(career, mean(h/ab, na.rm = TRUE))
}
colnames(mean_ba) <- c("Players","Batting Average") # Renaming columns
mean_ba
```
Now each batting average is labeled with its player, instead of appearing in an unlabeled sequence of numbers.
## Your Turn
Under MLB rules, a player must have played at least 1000 games, with at least as many at bats, to be considered among the greatest all-time hitters.
1. Extend the for loop above to collect the additional information, i.e. introduce and collect data for two new variables, `games` and `atbats`. Create a data frame that is easily readable, with player and career information. (Hint: the `sum()` command may be of use)
## Answers
### 1.
```{r}
career.stats <-
  data.frame(players, rep(NA, n), rep(NA, n), rep(NA, n)) # Creates data frame for storage
for (i in 1:n) {
  career <- subset(mlb, id == players[i])
  career.stats[i, 2] <- with(career, mean(h/ab, na.rm = TRUE)) # Calculating career batting average
  career.stats[i, 3] <- sum(career$g, na.rm = TRUE) # Sum of career games
  career.stats[i, 4] <- sum(career$ab, na.rm = TRUE) # Sum of career at bats
}
colnames(career.stats) <-
  c("Players", "Batting Average", "Total Games", "Total At Bats") # Renaming columns
```
##
```{r}
career.stats
```
## How did the Your Turn go?
What was difficult?
- household chores distract from 'real work'
- indices are error-prone
- loops often result in slow code, because they do not make use of R's optimized vectorized operations
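
To illustrate the last point, here is a minimal sketch comparing a loop with the equivalent vectorized expression. The vectors `toy_h` and `toy_ab` are made-up numbers, not taken from the MLB data.

```{r}
# Hypothetical hits and at bats for four seasons
toy_h  <- c(50, 60, 0, 45)
toy_ab <- c(200, 240, 10, 150)
# Loop version: one division per iteration, plus the household chores
avg_loop <- rep(NA, length(toy_h))
for (i in seq_along(toy_h)) {
  avg_loop[i] <- toy_h[i] / toy_ab[i]
}
# Vectorized version: a single expression divides all elements at once
avg_vec <- toy_h / toy_ab
all.equal(avg_loop, avg_vec)
```

Both versions give the same values; the vectorized one needs no counter and no reserved space.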
## plyr package
- Routines from the plyr package help us to avoid loops
- Usage: `ddply(.data, .variables, .fun = NULL, ...)`
- Split-apply-combine approach, i.e.
    - split the data into subsets, one for each element of an index set
    - apply the same statements to each subset
    - combine the results
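
The three steps can be sketched by hand in base R before handing them over to `ddply`. This is only an illustration on a made-up data frame; `toy`, `pieces`, `avgs`, and `toy_result` are hypothetical names.

```{r}
# Hypothetical toy data: two players, two seasons each
toy <- data.frame(
  id = c("A", "A", "B", "B"),
  h  = c(50, 60, 30, 60),
  ab = c(200, 240, 150, 150)
)
pieces <- split(toy, toy$id)                                   # split by player
avgs <- sapply(pieces, function(d) mean(d$h / d$ab))           # apply to each piece
toy_result <- data.frame(id = names(avgs), ba = unname(avgs))  # combine
toy_result
```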
## Example
- Separates baseball data into one subset for each player
- Computes the mean for all columns of the subset
```{r,warning =F}
library(plyr) # Load relevant libraries
# ddply(dataset, variable, function)
meanstats <- ddply(mlb, .(id),colwise(mean))
head(meanstats)
```
##
Note that non-numeric columns (factors or character strings) are given NA values, since means can only be computed on numeric values.
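
If only the numeric columns are of interest, plyr also provides `numcolwise()`, which applies a function to the numeric columns only. A minimal sketch on a made-up data frame (`toy_stats` and its columns are hypothetical, not part of the MLB file):

```{r}
library(plyr)
toy_stats <- data.frame(
  id = c("A", "A", "B", "B"),
  lg = c("AL", "NL", "AL", "AL"),  # character column with no meaningful mean
  h  = c(50, 60, 30, 60),
  ab = c(200, 240, 150, 150)
)
# Only the numeric columns h and ab are averaged; lg is dropped
toy_means <- ddply(toy_stats, .(id), numcolwise(mean))
toy_means
```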
## summarize
A special function: `summarise()` (or `summarize()`)
`summarize()` creates a new data frame containing the summary statistics given as its arguments. We can also subset the data to a specific player before summarizing.
```{r}
library(plyr)
summarize(mlb, ba = mean(h/ab, na.rm=T)) # Calculates batting average over entire data set
summarize(mlb,
ba = mean(h/ab, na.rm=T), # Calculates batting average over entire data set
games = sum(g, na.rm=T), # Calculates total games over entire data set
hr=sum(hr, na.rm=T), # Calculates total home runs over entire data set
ab = sum(ab, na.rm=T)) # Calculates total at bats over entire data set
```
##
This code is the same as above, except that the data are first subset to the rows that correspond to `SosaSa`.
```{r}
summarize(subset(mlb, id=="SosaSa"),
ba = mean(h/ab, na.rm=T),
games = sum(g, na.rm=T),
hr=sum(hr, na.rm=T),
ab = sum(ab, na.rm=T))
```
## ddply + summarise
A powerful combination for creating summary statistics.
`ddply()` splits the MLB data by unique player id, applies `summarize()` to each subset to compute the statistics, and combines the results into one data frame. Note how much simpler this is than the for loop.
```{r}
careers <- ddply(mlb, .(id), summarize, # Gathers mlb data by id, then applies summarize
ba = mean(h/ab, na.rm=T), # Calculates batting average for each id
games = sum(g, na.rm=T), # Calculates total games for each id
atbats = sum(ab, na.rm=T) # Calculates total at bats for each id
)
head(careers)
```
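##

Note that `mean(h/ab)` averages the seasonal batting averages, so every season counts equally. A career batting average is conventionally total hits divided by total at bats, and the two can differ when the number of at bats varies across seasons. A minimal sketch of both on made-up numbers (the `toy_seasons` data frame and the column names `ba_mean` and `ba_pooled` are hypothetical):

```{r}
library(plyr)
toy_seasons <- data.frame(
  id = c("A", "A", "B", "B"),
  h  = c(10, 150, 30, 60),
  ab = c(100, 500, 150, 150)
)
toy_careers <- ddply(toy_seasons, .(id), summarize,
  ba_mean   = mean(h / ab, na.rm = TRUE),                   # mean of seasonal averages
  ba_pooled = sum(h, na.rm = TRUE) / sum(ab, na.rm = TRUE)  # total hits / total at bats
)
toy_careers
```

For player B the two agree because both seasons have the same number of at bats; for player A they do not.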
## Your Turn
1. What was Mark McGwire's batting average while he was in the AL (American League)?
2. Find career batting averages for all players for each league they were part of.
## Answers
### 1.
```{r}
# Subsets to Mark McGwire's AL seasons, then summarizes
summarize(subset(mlb, id=="McGwiMa" & lg=="AL"),
ba = mean(h/ab, na.rm=T))
```
##
### 2.
```{r}
ddply(mlb, .(id,lg), summarize, # Splits mlb data by id and league, then applies summarize
ba = mean(h/ab, na.rm=T) # Calculates batting average for each id and league
)
```