---
title: "Motivating Example"
output:
  ioslides_presentation:
    smaller: true
---
## Motivating Example
- Explore a real data set using R
- Goal: get a good understanding of using R for data management and exploration
- Don't worry about understanding all of the code right away
- We will go back and explain how it all works in detail
## NBA Draft Dataset
- Contains statistics for players drafted in 2012, covering their first two seasons in the NBA
- Several pieces of information are recorded:
    - Pick, Team, Minutes Played, Total Points, Points Per Game, etc.
- Follow along using `NBA Draft 2012.csv`
## First look at data in R
Let's use R to look at the top few rows of the NBA data set. First, we load the data into a data frame named `nba` using `read.csv()`:
```{r}
nba <- read.csv("NBA Draft 2012.csv")
```
## Looking at the Data
The `head()` function shows the first 6 rows of the data, and `tail()` shows the last 6 rows.
```{r}
head(nba)
```
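##
Both functions accept an optional `n` argument that changes how many rows are returned. A minimal sketch, using a small made-up data frame `toy` (hypothetical values, not the NBA file):

```{r}
# Hypothetical toy data, for illustration only
toy <- data.frame(Pick = 1:8,
                  Games = c(82, 70, 75, 60, 81, 40, 66, 55))

head(toy, n = 3)  # first 3 rows
tail(toy, n = 2)  # last 2 rows
```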
## NBA Data Attributes
The command `str()`, short for structure, shows the type and first few values of each variable, along with the dimensions of the data frame.
```{r, fig.height=4, fig.width=7}
str(nba)
```
As we can see, the `nba` data frame has 30 observations (rows) and 18 variables (columns).
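##
The dimensions reported by `str()` can also be read off directly with `dim()`, `nrow()`, and `ncol()`. A small sketch on a made-up data frame `toy` (hypothetical values, not the NBA data):

```{r}
# Hypothetical toy data, for illustration only
toy <- data.frame(Pick = 1:4,
                  Position = c("PG", "SF", "C", "PG"),
                  Games = c(82, 70, 75, 60))

dim(toy)   # number of rows, then number of columns
nrow(toy)  # rows only
ncol(toy)  # columns only
```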
## NBA Variables Summary
Let's summarize the values of each variable in `nba` with the `summary()` command.
```{r}
summary(nba)
```
With this command we immediately have summary statistics of each variable.
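##
For a numeric variable, `summary()` reports the minimum, quartiles, median, mean, and maximum. A minimal sketch on a made-up vector `minutes_toy` (hypothetical values), with `quantile()` shown for comparison:

```{r}
# Hypothetical minutes-per-game values, for illustration only
minutes_toy <- c(5, 12, 18, 20, 25, 31, 38)

s <- summary(minutes_toy)
s
quantile(minutes_toy, probs = c(0.25, 0.5, 0.75))  # the three quartiles
```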
## Scatterplots
Let's look at the relationship between pick selection and total points. First, we need to install and load ggplot2, a special package for plotting.
```{r, eval = FALSE}
install.packages("ggplot2")
```
```{r}
library(ggplot2)
```
##
Using the `qplot()` command we can create a simple scatter plot.
```{r, fig.height=3, fig.width=7}
qplot(Pick, Total.Points, geom="point", data = nba, main = "Total Points vs. Pick")
```
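##
Recent ggplot2 releases mark `qplot()` as deprecated, so it may help to see the same kind of plot written with `ggplot()`. This is only a sketch on a made-up data frame `toy` (hypothetical values); the column names mirror the ones used above.

```{r, fig.height=3, fig.width=7}
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(Pick = 1:6,
                  Total.Points = c(1500, 900, 1100, 400, 700, 300))

# aes() maps columns to the x and y axes; geom_point() draws the points
p <- ggplot(toy, aes(x = Pick, y = Total.Points)) +
  geom_point() +
  ggtitle("Total Points vs. Pick (toy data)")
p
```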
## More Scatterplots
```{r, fig.height=4, fig.width=7}
qplot(Pick, Total.Points, geom = "point", data = nba, colour = Position,
      main = "Total Points vs. Pick")
```
## Even More Scatterplots
```{r, fig.height=4, fig.width=7}
qplot(Pick, Total.Points, geom = "point", data = nba,
      main = "Total Points vs. Pick with Regression Line") +
  geom_smooth(method = "lm")
```
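##
`geom_smooth(method = "lm")` draws a fitted straight line. To see the numbers behind such a line, one option is to fit the model directly with `lm()`. A sketch on a made-up data frame `toy` (hypothetical values chosen to lie exactly on a line):

```{r}
# Hypothetical toy data, for illustration only
toy <- data.frame(Pick = 1:5,
                  Total.Points = c(900, 800, 700, 600, 500))

fit <- lm(Total.Points ~ Pick, data = toy)  # Total.Points as a linear function of Pick
coef(fit)                                   # intercept and slope of the fitted line
```

A negative slope means that later picks tend to have fewer total points.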
##
Do not let the `qplot()` command or any of its parameters confuse you. We will discuss this function in detail in later lessons.
## Creating A New Variable
We will make a new variable in the `nba` data set for minutes per game, that is, `Minutes per game = minutes / games`.
```{r}
# Creates new column Minutes.Per.Game in nba data
nba$Minutes.Per.Game <- nba$Minutes / nba$Games
```
Notice that we placed periods between the words in the new column name. Syntactic names in R cannot contain spaces, so periods (or underscores) are commonly used to separate words.
```{r}
summary(nba$Minutes.Per.Game)
names(nba) # Provides the name of each column
```
One way to interpret this: on average, a player from this draft class played about 19.52 minutes per game over his first two seasons in the league.
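##
The periods in names such as `Total.Points` follow the same rule: by default, `read.csv()` converts column headers into syntactic names. A small sketch, assuming a header that contains a space (the CSV text below is made up for illustration):

```{r}
# make.names() shows how a name with spaces is converted
make.names("Minutes Per Game")

# Hypothetical two-row CSV, for illustration only
csv_text <- "Pick,Total Points\n1,1562\n2,618"
names(read.csv(text = csv_text))  # the space in the header becomes a period
```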
## Minutes Per Game Histogram
We now plot a histogram of the Minutes Per Game to see its distribution.
```{r, fig.height=3, fig.width=7}
# binwidth is the width of each bar, in minutes per game
qplot(Minutes.Per.Game, data = nba, binwidth = 2.5,
      main = "Histogram of Minutes Per Game")
```
## Someone Played a Lot of Minutes
We now take a closer look at the player who averaged the most minutes per game, Damian Lillard.
```{r}
nba[which.max(nba$Minutes.Per.Game),]
```
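##
The indexing above works in two steps: `which.max()` returns the position of the largest value, and `[i, ]` selects that row with all of its columns. A sketch on a made-up data frame `toy` (hypothetical players and values):

```{r}
# Hypothetical toy data, for illustration only
toy <- data.frame(Player = c("A", "B", "C"),
                  Minutes.Per.Game = c(12.5, 38.6, 24.0))

i <- which.max(toy$Minutes.Per.Game)  # index of the largest value
i
toy[i, ]                              # row i, all columns

# order() with a minus sign sorts in decreasing order; keep the top two rows
toy[order(-toy$Minutes.Per.Game), ][1:2, ]
```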
## Find Minutes By Position
Looking at the average minutes for each position, we notice that point guards average the most minutes per game.
```{r}
mean((nba$Minutes.Per.Game)[nba$Position == "PG"])
mean((nba$Minutes.Per.Game)[nba$Position == "SG"])
mean((nba$Minutes.Per.Game)[nba$Position == "SF"])
```
##
```{r}
mean((nba$Minutes.Per.Game)[nba$Position == "PF"])
mean((nba$Minutes.Per.Game)[nba$Position == "C"])
```
As a breakdown:
- `mean()` - returns the mean of its argument
- `nba$Minutes.Per.Game` - selects the column in the nba data set named Minutes.Per.Game
- `[nba$Position == "XX"]` - keeps only the entries of that column for players at the chosen position XX
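##
Instead of repeating the same line for every position, the group means can be computed in one call with `tapply()`. A minimal sketch on a made-up data frame `toy` (hypothetical values, not the NBA data):

```{r}
# Hypothetical toy data, for illustration only
toy <- data.frame(Position = c("PG", "PG", "SG", "C", "C"),
                  Minutes.Per.Game = c(30, 20, 18, 12, 16))

# Apply mean() to Minutes.Per.Game separately within each Position
group_means <- tapply(toy$Minutes.Per.Game, toy$Position, mean)
group_means

# Same value as the one-position-at-a-time approach
mean(toy$Minutes.Per.Game[toy$Position == "PG"])
```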
## Combine Positions
We label both power forwards and small forwards as simply forwards, and similarly label both point guards and shooting guards as guards.
This code creates two logical vectors, **forward** and **guard**, that are `TRUE` for the rows whose Position is in `c("PF", "SF")` and `c("PG", "SG")`, respectively.
```{r}
forward <- (nba$Position %in% c("PF", "SF")) # TRUE for power and small forwards
head(forward)
```
```{r}
guard <- (nba$Position %in% c("PG", "SG"))
head(guard)
```
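##
To see what `%in%` produces, here is a small sketch on a made-up character vector `pos_toy` (hypothetical values). Each element is checked against the set on the right, giving one `TRUE` or `FALSE` per element.

```{r}
# Hypothetical positions, for illustration only
pos_toy <- c("PG", "SF", "C", "PF", "SG")

is_forward <- pos_toy %in% c("PF", "SF")
is_forward           # one logical value per element of pos_toy
sum(is_forward)      # TRUE counts as 1, so this is the number of forwards
pos_toy[is_forward]  # a logical vector can be used to subset
```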
## Average Points Per Game for Position
We find the mean points per game for each of the newly created position groups:
```{r}
mean(nba$Points.Per.Game[guard])
mean(nba$Points.Per.Game[forward])
```
## Box Plots
Perhaps we are interested in how minutes per game differ across positions. We can compare them with a side-by-side boxplot.
```{r, fig.height=3, fig.width=7}
qplot(Position, Minutes.Per.Game, geom = "boxplot", data = nba,
      main = "Box Plot of Minutes Per Game by Position")
```
From this, we notice that the median minutes per game for small forwards is slightly higher than for the other positions.
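##
The center line of each box is the group median, so a reading from the plot can be checked numerically. One way to do this is `aggregate()` with a formula; the sketch below uses a made-up data frame `toy` (hypothetical values, not the NBA data).

```{r}
# Hypothetical toy data, for illustration only
toy <- data.frame(Position = c("PG", "PG", "PG", "SF", "SF", "SF"),
                  Minutes.Per.Game = c(10, 20, 33, 15, 25, 26))

# Median of Minutes.Per.Game within each Position
medians <- aggregate(Minutes.Per.Game ~ Position, data = toy, FUN = median)
medians
```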
## Your Turn
Try playing with chunks of code from this session to further investigate the NBA data:
1. Get a summary of the Assists.Per.Game values.
2. Make a boxplot comparing Rebounds.Per.Game for different positions.
3. Find the average Assists.Per.Game for the guards (point and shooting) and forwards (small and power).
## Answers
### 1.
```{r}
summary(nba$Assists.Per.Game)
```
##
### 2.
```{r}
qplot(Position, Rebounds.Per.Game, geom = "boxplot", data = nba,
      main = "Box Plot of Rebounds Per Game by Position")
```
##
### 3.
```{r}
guard <- (nba$Position %in% c("PG", "SG"))
forward <- (nba$Position %in% c("PF", "SF"))
mean(nba$Assists.Per.Game[guard])
mean(nba$Assists.Per.Game[forward])
```