## Motivating Example

• Explore a real data set using R
• Goal: get good understanding of using R for data management and exploration
• Don't worry about understanding all of the code right away
• We will go back and explain how it all works in detail

## NBA Draft Dataset

• Shows the statistics of players drafted in 2012 through their first two seasons in the NBA
• Several statistics are recorded for each player
• Pick, Team, Minutes Played, Total Points, Points Per Game, etc.
• Follow along using NBA Historical Data.csv

## First look at data in R

Let's use R to look at the top few rows of the NBA data set. First, we load the data with read.csv():

nba <- read.csv("NBA Draft 2012.csv")
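On R 4.0.0 and later, read.csv() no longer converts text columns to factors by default, so the Factor columns shown in the str() output later in this session may appear as character columns instead. The sketch below illustrates the difference. It reads three rows copied from the head() output through the text argument, which is an assumption made only so the example runs without the course CSV file.

```r
# Three rows copied from the head() output, supplied inline so the
# sketch runs without the course CSV file.
csv_text <- "Year,Pick,Team,Player,Position
2012,1,NOH,Anthony Davis,PF
2012,2,CHA,Michael Kidd-Gilchrist,SF
2012,3,WAS,Bradley Beal,SG"

# With the defaults of R >= 4.0.0, text columns are read as character vectors.
nba_chr <- read.csv(text = csv_text)

# stringsAsFactors = TRUE asks for factor columns, the older default behaviour.
nba_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(nba_chr$Team)
class(nba_fct$Team)
```

If your own str() output shows chr where these slides show Factor, adding stringsAsFactors = TRUE to the read.csv() call should bring the two in line.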

## Looking at the Data

The head() function shows the first 6 rows of the data by default, and tail() shows the last 6 rows.

head(nba)
##   Year Pick Team                 Player Position                College
## 1 2012    1  NOH          Anthony Davis       PF University of Kentucky
## 2 2012    2  CHA Michael Kidd-Gilchrist       SF University of Kentucky
## 3 2012    3  WAS           Bradley Beal       SG  University of Florida
## 4 2012    4  CLE           Dion Waiters       SG    Syracuse University
## 5 2012    5  SAC        Thomas Robinson       PF   University of Kansas
## 6 2012    6  POR         Damian Lillard       PG Weber State University
##   Games Minutes Total.Points Total.Rebounds Total.Assists
## 1   131    4204         2261           1195           168
## 2   140    3527         1152            779           169
## 3   129    4275         2029            484           380
## 4   131    3828         2007            344           392
## 5   140    1929          672            622            80
## 6   164    6104         3257            545           988
##   Field.Goal.Percentage Three.Point.Percentage Free.Throw.Percentage
## 1                 0.518                  0.133                 0.777
## 2                 0.464                  0.167                 0.682
## 3                 0.416                  0.396                 0.787
## 4                 0.424                  0.342                 0.714
## 5                 0.454                      0                 0.543
## 6                 0.427                  0.381                 0.859
##   Points.Per.Game Rebounds.Per.Game Assists.Per.Game Win.Share
## 1            17.3               9.1              1.3      16.5
## 2             8.2               5.6              1.2       5.2
## 3            15.7               3.8              2.9       7.0
## 4            15.3               2.6              3.0       2.5
## 5             4.8               4.4              0.6       1.6
## 6            19.9               3.3              6.0      15.4
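Both head() and tail() accept an n argument when 6 rows is not the number you want. Here is a minimal sketch on a small stand-in data frame, nba_small, built from three of the columns printed above (the name nba_small is only for illustration):

```r
# Stand-in data frame built from the six rows printed by head(nba).
nba_small <- data.frame(
  Pick = 1:6,
  Player = c("Anthony Davis", "Michael Kidd-Gilchrist", "Bradley Beal",
             "Dion Waiters", "Thomas Robinson", "Damian Lillard"),
  Total.Points = c(2261, 1152, 2029, 2007, 672, 3257)
)

first_two  <- head(nba_small, n = 2)  # first 2 rows
last_three <- tail(nba_small, n = 3)  # last 3 rows

first_two
last_three
```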

## NBA Data Attributes

The command str(), short for structure, gives us a summary of each variable along with the size of the "data frame".

str(nba)
## 'data.frame':    30 obs. of  18 variables:
##  $ Year                  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Pick                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Team                  : Factor w/ 23 levels "ATL","BOS","CHA",..: 15 3 23 5 21 20 9 22 8 15 ...
##  $ Player                : Factor w/ 30 levels "Andre Drummond",..: 3 22 6 8 28 7 12 27 1 5 ...
##  $ Position              : Factor w/ 5 levels "C","PF","PG",..: 2 4 5 5 2 3 4 4 1 5 ...
##  $ College               : Factor w/ 19 levels "Baylor University",..: 15 15 12 10 14 19 16 17 11 2 ...
##  $ Games                 : int  131 140 129 131 140 164 159 154 141 130 ...
##  $ Minutes               : int  4204 3527 4275 3828 1929 6104 4262 3398 3862 2757 ...
##  $ Total.Points          : int  2261 1152 2029 2007 672 3257 1486 1346 1571 907 ...
##  $ Total.Rebounds        : int  1195 779 484 344 622 545 644 396 1528 238 ...
##  $ Total.Assists         : int  168 169 380 392 80 988 214 132 65 287 ...
##  $ Field.Goal.Percentage : num  0.518 0.464 0.416 0.424 0.454 0.427 0.419 0.417 0.618 0.39 ...
##  $ Three.Point.Percentage: Factor w/ 22 levels "0","0.133","0.167",..: 2 3 21 12 1 20 16 18 7 13 ...
##  $ Free.Throw.Percentage : Factor w/ 27 levels "0.25","0.402",..: 20 13 21 15 6 26 16 22 2 9 ...
##  $ Points.Per.Game       : num  17.3 8.2 15.7 15.3 4.8 19.9 9.3 8.7 11.1 7 ...
##  $ Rebounds.Per.Game     : num  9.1 5.6 3.8 2.6 4.4 3.3 4.1 2.6 10.8 1.8 ...
##  $ Assists.Per.Game      : num  1.3 1.2 2.9 3 0.6 6 1.3 0.9 0.5 2.2 ...
##  $ Win.Share             : num  16.5 5.2 7 2.5 1.6 15.4 6 5.1 14.4 -0.5 ...

As we can see, the nba data frame has 30 observations (rows) and 18 variables (columns).
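The same size and type information can be pulled out piece by piece when the full str() listing is more than we need. The sketch below uses a small stand-in data frame, nba_small, with three columns taken from the head() output; the real nba data frame would report 30 rows and 18 columns.

```r
# Stand-in data frame with one integer, one factor, and one numeric column.
nba_small <- data.frame(
  Pick = 1:6,
  Position = factor(c("PF", "SF", "SG", "SG", "PF", "PG")),
  Points.Per.Game = c(17.3, 8.2, 15.7, 15.3, 4.8, 19.9)
)

dim(nba_small)    # number of rows, then number of columns
nrow(nba_small)   # rows only
ncol(nba_small)   # columns only

# The class of every column, as a named character vector.
col_classes <- sapply(nba_small, class)
col_classes
```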

## NBA Variables Summary

Let's summarize the values for each variable in NBA with the summary() command.

summary(nba)
##       Year           Pick            Team                 Player
##  Min.   :2012   Min.   : 1.00   HOU    : 3   Andre Drummond  : 1
##  1st Qu.:2012   1st Qu.: 8.25   BOS    : 2   Andrew Nicholson: 1
##  Median :2012   Median :15.50   CLE    : 2   Anthony Davis   : 1
##  Mean   :2012   Mean   :15.50   GSW    : 2   Arnett Moultrie : 1
##  3rd Qu.:2012   3rd Qu.:22.75   NOH    : 2   Austin Rivers   : 1
##  Max.   :2012   Max.   :30.00   POR    : 2   Bradley Beal    : 1
##                                 (Other):17   (Other)         :24
##  Position                         College       Games
##  C :7     University of Kentucky      : 4   Min.   :  3.00
##  PF:7     University of North Carolina: 4   1st Qu.: 94.25
##  PG:3     Duke University             : 2   Median :116.50
##  SF:5     Syracuse University         : 2   Mean   :109.20
##  SG:8     University of Connecticut   : 2   3rd Qu.:140.00
##           University of Washington    : 2   Max.   :164.00
##           (Other)                     :14
##     Minutes      Total.Points    Total.Rebounds   Total.Assists
##  Min.   :   9   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0
##  1st Qu.:1230   1st Qu.: 429.2   1st Qu.: 199.5   1st Qu.: 54.5
##  Median :2310   Median : 958.0   Median : 381.0   Median :125.5
##  Mean   :2399   Mean   : 968.6   Mean   : 455.0   Mean   :170.8
##  3rd Qu.:3495   3rd Qu.:1243.2   3rd Qu.: 638.5   3rd Qu.:168.8
##  Max.   :6104   Max.   :3257.0   Max.   :1528.0   Max.   :988.0
##
##  Field.Goal.Percentage Three.Point.Percentage Free.Throw.Percentage
##  Min.   :0.0000        NoAttempts: 5          0.667  : 2
##  1st Qu.:0.4200        0         : 3          0.8    : 2
##  Median :0.4380        0.3       : 2          0.833  : 2
##  Mean   :0.4404        0.381     : 2          0.25   : 1
##  3rd Qu.:0.4953        0.133     : 1          0.402  : 1
##  Max.   :0.6180        0.167     : 1          0.521  : 1
##                        (Other)   :16          (Other):21
##  Points.Per.Game  Rebounds.Per.Game Assists.Per.Game   Win.Share
##  Min.   : 0.000   Min.   : 0.000    Min.   :0.000    Min.   :-0.800
##  1st Qu.: 4.500   1st Qu.: 1.925    1st Qu.:0.500    1st Qu.: 1.500
##  Median : 7.150   Median : 3.350    Median :1.000    Median : 2.800
##  Mean   : 7.673   Mean   : 3.693    Mean   :1.377    Mean   : 4.117
##  3rd Qu.: 9.525   3rd Qu.: 4.775    3rd Qu.:1.375    3rd Qu.: 5.200
##  Max.   :19.900   Max.   :10.800    Max.   :6.100    Max.   :16.500
## 

With this command we immediately have summary statistics of each variable.
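Notice that Three.Point.Percentage is summarized as counts of levels rather than as quartiles. The likely reason is the text value "NoAttempts" visible in the output, which makes R store the whole column as a factor. Here is a minimal sketch of turning such a column into numbers. The vector three_pt is a made-up stand-in (five values from the head() output plus one "NoAttempts"), and treating "NoAttempts" as missing is an assumption about how we want to handle those players.

```r
# Stand-in for the column: five values from head(nba) plus the text marker.
three_pt <- factor(c("0.133", "0.167", "0.396", "0.342", "0", "NoAttempts"))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the percentages themselves.
codes <- as.numeric(three_pt)

# Convert to character first, then mark "NoAttempts" as missing (NA).
three_pt_chr <- as.character(three_pt)
three_pt_chr[three_pt_chr == "NoAttempts"] <- NA
three_pt_num <- as.numeric(three_pt_chr)

summary(three_pt_num)              # now a numeric summary, with an NA count
mean(three_pt_num, na.rm = TRUE)   # na.rm = TRUE skips the missing value
```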

## Scatterplots

Let's look at the relationship between pick selection and total points. First, we need to install and load ggplot2, a special package for plotting.

install.packages("ggplot2")
library(ggplot2)

Using the qplot() command we can create a simple scatter plot.

qplot(Pick, Total.Points, geom="point", data = nba, main = "Total Points vs. Pick")
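As of ggplot2 3.4.0, qplot() is deprecated, so newer installations may print a warning when it is called. The same scatter plot can be written with ggplot(), as sketched below. The sketch uses a six-row stand-in data frame, nba_small, built from the head() output, so the picture will have fewer points than the full data set.

```r
library(ggplot2)

# Stand-in data frame built from the six rows printed by head(nba).
nba_small <- data.frame(
  Pick = 1:6,
  Total.Points = c(2261, 1152, 2029, 2007, 672, 3257),
  Position = c("PF", "SF", "SG", "SG", "PF", "PG")
)

# aes() maps columns to the x and y axes; geom_point() draws the points.
p <- ggplot(nba_small, aes(x = Pick, y = Total.Points)) +
  geom_point() +
  ggtitle("Total Points vs. Pick")
print(p)

# Mapping colour to Position colours each point by the player's position.
p_colour <- ggplot(nba_small, aes(x = Pick, y = Total.Points, colour = Position)) +
  geom_point() +
  ggtitle("Total Points vs. Pick")
print(p_colour)
```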

## More Scatterplots

qplot(Pick, Total.Points, geom = "point", data = nba, colour = Position,
main = "Total Points vs. Pick" )

## Even More Scatterplots

qplot(Pick, Total.Points, geom = "point", data = nba,
main = "Total Points vs. Pick with Regression Line") +
geom_smooth(method = "lm")

Do not let the qplot() command or any of its parameters confuse you. We will discuss this function in detail in later lessons.
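The line drawn by geom_smooth(method = "lm") is a least squares regression line. To see the numbers behind such a line, we can fit the same kind of model directly with lm(). The sketch below does this on six rows taken from the head() output, so its intercept and slope will not match the line for the full 30-player data set.

```r
# Stand-in data frame built from the six rows printed by head(nba).
nba_small <- data.frame(
  Pick = 1:6,
  Total.Points = c(2261, 1152, 2029, 2007, 672, 3257)
)

# Fit Total.Points as a straight-line function of Pick.
fit <- lm(Total.Points ~ Pick, data = nba_small)

coef(fit)   # intercept and slope of the fitted line

# Predicted total points for a hypothetical third pick.
predict(fit, newdata = data.frame(Pick = 3))
```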

## Creating A New Variable

We will make a new variable in the nba data set for minutes per game, that is, minutes per game = minutes / games.

# Creates new column Minutes.Per.Game in nba data
nba$Minutes.Per.Game <- nba$Minutes / nba$Games

Notice that we had to place periods between the words. R variable names cannot contain spaces unless they are wrapped in backticks, so periods are used as separators instead.

summary(nba$Minutes.Per.Game)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.00   14.32   20.57   19.52   25.06   37.22
names(nba) # Provides the name of each column
##  [1] "Year"                   "Pick"
##  [3] "Team"                   "Player"
##  [5] "Position"               "College"
##  [7] "Games"                  "Minutes"
##  [9] "Total.Points"           "Total.Rebounds"
## [11] "Total.Assists"          "Field.Goal.Percentage"
## [13] "Three.Point.Percentage" "Free.Throw.Percentage"
## [15] "Points.Per.Game"        "Rebounds.Per.Game"
## [17] "Assists.Per.Game"       "Win.Share"
## [19] "Minutes.Per.Game"

One way to interpret this is that the average player in this draft class played about 19.52 minutes per game over their first two seasons in the league.
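The periods in names such as Minutes.Per.Game follow the same convention read.csv() applies to column headers: by default it passes them through make.names(), which replaces spaces with periods. The sketch below shows that conversion and, for comparison, a column name that keeps its spaces. It assumes the CSV headers contain spaces (for example "Total Points"), which we have not verified, and nba_small is a three-row stand-in.

```r
# make.names() turns arbitrary text into valid R names.
make.names(c("Total Points", "Points Per Game"))
# returns "Total.Points" "Points.Per.Game"

# Stand-in data frame with three rows from head(nba).
nba_small <- data.frame(
  Games = c(131, 140, 129),
  Minutes = c(4204, 3527, 4275)
)

# The usual style: periods between words.
nba_small$Minutes.Per.Game <- nba_small$Minutes / nba_small$Games

# A name with spaces also works, but needs backticks every time it is used.
nba_small$`Minutes Per Game` <- nba_small$Minutes / nba_small$Games

names(nba_small)
```

Names with periods are easier to type, which is one reason to prefer them.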

## Minutes Per Game Histogram

We now plot a histogram of the Minutes Per Game to see its distribution.

qplot(Minutes.Per.Game, data = nba, binwidth = 2.5,
main = "Histogram of Minutes Per Game")

# binwidth is the width of each rectangular bar, in units of the x-axis (here, minutes per game)
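To get a feel for what a bin width of 2.5 means, we can count by hand how many players fall into each 2.5-minute interval. The sketch below uses the six players from the head() output. Starting the intervals at 0 is an assumption made for illustration; ggplot2 may place its bin edges differently, so the counts need not match the bars exactly.

```r
# Minutes and games for the six players printed by head(nba).
minutes <- c(4204, 3527, 4275, 3828, 1929, 6104)
games   <- c(131, 140, 129, 131, 140, 164)
mpg     <- minutes / games

# Interval edges every 2.5 minutes from 0 to 40.
breaks <- seq(0, 40, by = 2.5)

# cut() assigns each value to an interval; table() counts values per interval.
bin_counts <- table(cut(mpg, breaks = breaks))

# Show only the intervals that contain at least one player.
bin_counts[bin_counts > 0]
```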

## Someone Played a lot of Minutes

We now take a closer look at a particular player, that is, Damian Lillard.
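A minimal sketch of how one player's row might be pulled out with a logical test inside square brackets. The stand-in data frame nba_small holds six rows from the head() output, and this is only one possible way to write the lookup.

```r
# Stand-in data frame built from the six rows printed by head(nba).
nba_small <- data.frame(
  Player = c("Anthony Davis", "Michael Kidd-Gilchrist", "Bradley Beal",
             "Dion Waiters", "Thomas Robinson", "Damian Lillard"),
  Position = c("PF", "SF", "SG", "SG", "PF", "PG"),
  Games = c(131, 140, 129, 131, 140, 164),
  Minutes = c(4204, 3527, 4275, 3828, 1929, 6104),
  stringsAsFactors = FALSE
)

# TRUE for the row whose Player is Damian Lillard, FALSE elsewhere.
is_lillard <- nba_small$Player == "Damian Lillard"

# Keep the matching row and all columns (note the comma before the bracket closes).
nba_small[is_lillard, ]

# The same idea on a single column: minutes of every point guard.
nba_small$Minutes[nba_small$Position == "PG"]
```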

• [nba$Position == "XX"] - selects the entries of a column for the chosen position XX

## Combine Positions

We label both power forwards and small forwards as just forwards, and similarly we label point guards and shooting guards as guards. This code works by creating two new logical vectors, forward and guard, that flag the rows containing c("PF", "SF") and c("PG", "SG") in the Position column, respectively.

forward <- (nba$Position %in% c("PF", "SF"))
head(forward)
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE
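The %in% operator asks, for each position, whether it appears anywhere in the given set. The sketch below repeats the forward test on the six positions printed by head() and checks it against the longer form written with == and the "or" operator |.

```r
# Positions of the six players printed by head(nba).
position <- c("PF", "SF", "SG", "SG", "PF", "PG")

# TRUE where the position is one of the listed values.
forward <- position %in% c("PF", "SF")

# Equivalent, but longer: test each value separately and combine with |.
forward_alt <- position == "PF" | position == "SF"

identical(forward, forward_alt)   # TRUE: both approaches agree
sum(forward)                      # 3 forwards, since TRUE counts as 1
```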
guard <- (nba$Position %in% c("PG", "SG"))
head(guard)
## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

## Average Points Per Game for Position

We find the mean points per game for each of the newly created position groups.

mean(nba$Points.Per.Game[guard])
## [1] 8.881818
mean(nba$Points.Per.Game[forward])
## [1] 7.416667

## Box Plots

Perhaps we are interested in how minutes per game differ across positions. We can compare them with a side-by-side box plot.

qplot(Position, Minutes.Per.Game, geom = "boxplot", data = nba,
main = "Box Plot of Minutes Per Game by Position")

From this, we notice that the median minutes per game of small forwards is slightly higher than that of the other positions.

## Your Turn

Try playing with chunks of code from this session to further investigate the NBA data:

1. Get a summary of the Assists.Per.Game values.
2. Make a boxplot comparing Rebounds.Per.Game for different positions.
3. Find the average Assists.Per.Game for the guards (point and shooting) and forwards (small and power).

## Answers

### 1.

summary(nba$Assists.Per.Game)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   0.500   1.000   1.377   1.375   6.100

### 2.

qplot(Position, Rebounds.Per.Game, geom = "boxplot", data = nba,
main = "Box Plot of Rebounds Per Game by Position")

### 3.

guard <- (nba$Position %in% c("PG", "SG"))
forward <- (nba$Position %in% c("PF", "SF"))
mean(nba$Assists.Per.Game[guard])
## [1] 2.536364
mean(nba$Assists.Per.Game[forward])
## [1] 0.7833333
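The two means above can also be computed in a single call by first building a group label and then using tapply(). The sketch below does this for the six players from the head() output, so its results differ from the full-data answers above. The labels "Guard", "Forward", and "Center" are names chosen here for illustration.

```r
# Positions and assists per game of the six players printed by head(nba).
position <- c("PF", "SF", "SG", "SG", "PF", "PG")
apg      <- c(1.3, 1.2, 2.9, 3.0, 0.6, 6.0)

# Collapse the five positions into three groups.
group <- ifelse(position %in% c("PG", "SG"), "Guard",
         ifelse(position %in% c("PF", "SF"), "Forward", "Center"))

# tapply() applies mean() to apg separately within each group.
group_means <- tapply(apg, group, mean)
group_means
```

With the full nba data frame, the same pattern would use nba$Position and nba$Assists.Per.Game in place of the two stand-in vectors.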