Motivating Example

Explore a real data set using R
Goal: get a good understanding of using R for data management and exploration
Don't worry about understanding all of the code right away
We will go back and explain how it all works in detail

NBA Draft Dataset

Shows the statistics of players drafted in 2012 over their first two seasons in the NBA
Several pieces of information are recorded for each player
Pick, Team, Minutes Played, Total Points, Points Per Game, etc.
Follow along using NBA Draft 2012.csv

First look at data in R

Let's use R to look at the top few rows of the NBA data set. First, we load the data into a data frame called nba using read.csv():

nba <- read.csv("NBA Draft 2012.csv")

Looking at the Data

The head() function shows the first 6 rows of the data, and tail() shows the last 6 rows.

head(nba)

##   Year Pick Team                 Player Position                College
## 1 2012    1  NOH          Anthony Davis       PF University of Kentucky
## 2 2012    2  CHA Michael Kidd-Gilchrist       SF University of Kentucky
## 3 2012    3  WAS           Bradley Beal       SG  University of Florida
## 4 2012    4  CLE           Dion Waiters       SG    Syracuse University
## 5 2012    5  SAC        Thomas Robinson       PF   University of Kansas
## 6 2012    6  POR         Damian Lillard       PG Weber State University
##   Games Minutes Total.Points Total.Rebounds Total.Assists
## 1   131    4204         2261           1195           168
## 2   140    3527         1152            779           169
## 3   129    4275         2029            484           380
## 4   131    3828         2007            344           392
## 5   140    1929          672            622            80
## 6   164    6104         3257            545           988
##   Field.Goal.Percentage Three.Point.Percentage Free.Throw.Percentage
## 1                 0.518                  0.133                 0.777
## 2                 0.464                  0.167                 0.682
## 3                 0.416                  0.396                 0.787
## 4                 0.424                  0.342                 0.714
## 5                 0.454                      0                 0.543
## 6                 0.427                  0.381                 0.859
##   Points.Per.Game Rebounds.Per.Game Assists.Per.Game Win.Share
## 1            17.3               9.1              1.3      16.5
## 2             8.2               5.6              1.2       5.2
## 3            15.7               3.8              2.9       7.0
## 4            15.3               2.6              3.0       2.5
## 5             4.8               4.4              0.6       1.6
## 6            19.9               3.3              6.0      15.4
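Both head() and tail() accept an n argument when six rows is too many or too few. The sketch below is a minimal illustration: nba_top6 is a hypothetical stand-in built from a few columns of the head(nba) output above, not the full data set.

```r
# Hypothetical stand-in for nba, copied from the head(nba) output above
nba_top6 <- data.frame(
  Pick = 1:6,
  Player = c("Anthony Davis", "Michael Kidd-Gilchrist", "Bradley Beal",
             "Dion Waiters", "Thomas Robinson", "Damian Lillard"),
  Total.Points = c(2261, 1152, 2029, 2007, 672, 3257)
)

first_three <- head(nba_top6, n = 3)  # first 3 rows instead of the default 6
last_two <- tail(nba_top6, n = 2)     # last 2 rows

print(first_three)
print(last_two)
```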

NBA Data Attributes

The command str(), short for structure, gives us a summary of each variable along with the size of the "data frame".

str(nba)

## 'data.frame':    30 obs. of  18 variables:
##  $ Year                  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Pick                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Team                  : Factor w/ 23 levels "ATL","BOS","CHA",..: 15 3 23 5 21 20 9 22 8 15 ...
##  $ Player                : Factor w/ 30 levels "Andre Drummond",..: 3 22 6 8 28 7 12 27 1 5 ...
##  $ Position              : Factor w/ 5 levels "C","PF","PG",..: 2 4 5 5 2 3 4 4 1 5 ...
##  $ College               : Factor w/ 19 levels "Baylor University",..: 15 15 12 10 14 19 16 17 11 2 ...
##  $ Games                 : int  131 140 129 131 140 164 159 154 141 130 ...
##  $ Minutes               : int  4204 3527 4275 3828 1929 6104 4262 3398 3862 2757 ...
##  $ Total.Points          : int  2261 1152 2029 2007 672 3257 1486 1346 1571 907 ...
##  $ Total.Rebounds        : int  1195 779 484 344 622 545 644 396 1528 238 ...
##  $ Total.Assists         : int  168 169 380 392 80 988 214 132 65 287 ...
##  $ Field.Goal.Percentage : num  0.518 0.464 0.416 0.424 0.454 0.427 0.419 0.417 0.618 0.39 ...
##  $ Three.Point.Percentage: Factor w/ 22 levels "0","0.133","0.167",..: 2 3 21 12 1 20 16 18 7 13 ...
##  $ Free.Throw.Percentage : Factor w/ 27 levels "0.25","0.402",..: 20 13 21 15 6 26 16 22 2 9 ...
##  $ Points.Per.Game       : num  17.3 8.2 15.7 15.3 4.8 19.9 9.3 8.7 11.1 7 ...
##  $ Rebounds.Per.Game     : num  9.1 5.6 3.8 2.6 4.4 3.3 4.1 2.6 10.8 1.8 ...
##  $ Assists.Per.Game      : num  1.3 1.2 2.9 3 0.6 6 1.3 0.9 0.5 2.2 ...
##  $ Win.Share             : num  16.5 5.2 7 2.5 1.6 15.4 6 5.1 14.4 -0.5 ...

As we can see, the nba data frame has 30 observations (rows) and 18 variables (columns).
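The text columns above appear as Factor because older versions of R converted strings to factors when reading a CSV. Since R 4.0.0, read.csv() leaves them as character unless stringsAsFactors = TRUE is set. The following sketch shows the difference on a tiny inline CSV; csv_text and the two data frame names are made up for illustration.

```r
# Tiny inline CSV (illustrative only, three rows taken from the output above)
csv_text <- "Pick,Team,Position
1,NOH,PF
2,CHA,SF
3,WAS,SG"

# Text columns become factors, as in the str(nba) output above
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
# Text columns stay as character strings
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(as_factor)
str(as_char)

# dim() returns the number of rows and columns without the per-variable detail
dim(as_factor)
```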

NBA Variables Summary

Let's summarize the values of each variable in nba with the summary() command.

summary(nba)

##       Year           Pick            Team                 Player  
##  Min.   :2012   Min.   : 1.00   HOU    : 3   Andre Drummond  : 1  
##  1st Qu.:2012   1st Qu.: 8.25   BOS    : 2   Andrew Nicholson: 1  
##  Median :2012   Median :15.50   CLE    : 2   Anthony Davis   : 1  
##  Mean   :2012   Mean   :15.50   GSW    : 2   Arnett Moultrie : 1  
##  3rd Qu.:2012   3rd Qu.:22.75   NOH    : 2   Austin Rivers   : 1  
##  Max.   :2012   Max.   :30.00   POR    : 2   Bradley Beal    : 1  
##                                 (Other):17   (Other)         :24  
##  Position                         College       Games       
##  C :7     University of Kentucky      : 4   Min.   :  3.00  
##  PF:7     University of North Carolina: 4   1st Qu.: 94.25  
##  PG:3     Duke University             : 2   Median :116.50  
##  SF:5     Syracuse University         : 2   Mean   :109.20  
##  SG:8     University of Connecticut   : 2   3rd Qu.:140.00  
##           University of Washington    : 2   Max.   :164.00  
##           (Other)                     :14                   
##     Minutes      Total.Points    Total.Rebounds   Total.Assists  
##  Min.   :   9   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:1230   1st Qu.: 429.2   1st Qu.: 199.5   1st Qu.: 54.5  
##  Median :2310   Median : 958.0   Median : 381.0   Median :125.5  
##  Mean   :2399   Mean   : 968.6   Mean   : 455.0   Mean   :170.8  
##  3rd Qu.:3495   3rd Qu.:1243.2   3rd Qu.: 638.5   3rd Qu.:168.8  
##  Max.   :6104   Max.   :3257.0   Max.   :1528.0   Max.   :988.0  
##                                                                  
##  Field.Goal.Percentage Three.Point.Percentage Free.Throw.Percentage
##  Min.   :0.0000        NoAttempts: 5          0.667  : 2           
##  1st Qu.:0.4200        0         : 3          0.8    : 2           
##  Median :0.4380        0.3       : 2          0.833  : 2           
##  Mean   :0.4404        0.381     : 2          0.25   : 1           
##  3rd Qu.:0.4953        0.133     : 1          0.402  : 1           
##  Max.   :0.6180        0.167     : 1          0.521  : 1           
##                        (Other)   :16          (Other):21           
##  Points.Per.Game  Rebounds.Per.Game Assists.Per.Game   Win.Share     
##  Min.   : 0.000   Min.   : 0.000    Min.   :0.000    Min.   :-0.800  
##  1st Qu.: 4.500   1st Qu.: 1.925    1st Qu.:0.500    1st Qu.: 1.500  
##  Median : 7.150   Median : 3.350    Median :1.000    Median : 2.800  
##  Mean   : 7.673   Mean   : 3.693    Mean   :1.377    Mean   : 4.117  
##  3rd Qu.: 9.525   3rd Qu.: 4.775    3rd Qu.:1.375    3rd Qu.: 5.200  
##  Max.   :19.900   Max.   :10.800    Max.   :6.100    Max.   :16.500  
##

With this command we immediately have summary statistics of each variable.
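Notice that Three.Point.Percentage is summarized as counts rather than quartiles. The column contains the text value NoAttempts, so R read it as a factor instead of a number. One possible way to recover numeric values is sketched below. The vector three_pt is a made-up sample that mixes values seen in the output above, and turning NoAttempts into NA is an assumption about how we would want to treat those players.

```r
# Made-up sample: a factor that mixes numeric-looking text with "NoAttempts"
three_pt <- factor(c("0.133", "0.167", "0.396", "NoAttempts", "0", "0.381"))

# as.numeric() applied directly to a factor returns the internal level codes,
# so we convert to character first; non-numeric text becomes NA (with a warning)
three_pt_num <- suppressWarnings(as.numeric(as.character(three_pt)))

print(three_pt_num)
mean(three_pt_num, na.rm = TRUE)  # na.rm = TRUE skips the NA entries
```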

Scatterplots

Let's look at the relationship between pick selection and total points. First, we need to install and load ggplot2, a special package for plotting.

install.packages("ggplot2")

library(ggplot2)

Using the qplot() command we can create a simple scatter plot.

qplot(Pick, Total.Points, geom = "point", data = nba, main = "Total Points vs. Pick")

More Scatterplots

qplot(Pick, Total.Points, geom = "point", data = nba, colour = Position,
      main = "Total Points vs. Pick" )

Even More Scatterplots

qplot(Pick, Total.Points, geom = "point", data = nba, 
      main = "Total Points vs. Pick with Regression Line") +
  geom_smooth(method = "lm")

Do not let the qplot() command or any of its parameters confuse you. We will discuss this function in detail in later lessons.
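Recent ggplot2 releases mark qplot() as deprecated in favor of the more explicit ggplot() interface. As a rough equivalent of the colored scatter plot above, here is a sketch using ggplot(). The data frame nba_top6 is a hypothetical stand-in built from the first six rows shown earlier, so the plot has only six points.

```r
library(ggplot2)

# Hypothetical stand-in for nba, copied from the head(nba) output
nba_top6 <- data.frame(
  Pick = 1:6,
  Total.Points = c(2261, 1152, 2029, 2007, 672, 3257),
  Position = c("PF", "SF", "SG", "SG", "PF", "PG")
)

# aes() maps columns to the x axis, y axis, and point colour
p <- ggplot(nba_top6, aes(x = Pick, y = Total.Points, colour = Position)) +
  geom_point() +
  ggtitle("Total Points vs. Pick")

print(p)  # draws the plot
```

Adding + geom_smooth(method = "lm") to p would overlay the regression line, as in the qplot() version.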

Creating A New Variable

We will make a new variable in the nba data set for minutes per game, that is, Minutes.Per.Game = Minutes / Games.

# Creates new column Minutes.Per.Game in nba data
nba$Minutes.Per.Game <- nba$Minutes / nba$Games

Notice that we placed periods between the words. R variable names cannot contain spaces (unless the name is wrapped in backticks), so periods are a common way to separate words.
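The periods in the existing column names likely come from the same rule: by default, read.csv() passes column headers through make.names(), which replaces characters that are not valid in a name, such as spaces, with periods. That the original CSV headers contained spaces is an assumption. The sketch below shows make.names() and the backtick alternative on a small made-up data frame called toy.

```r
# make.names() turns arbitrary strings into syntactically valid R names
clean <- make.names(c("Total Points", "Minutes Per Game"))
print(clean)

# Made-up two-row data frame (values from the first two rows of nba)
toy <- data.frame(Minutes = c(4204, 3527), Games = c(131, 140))

# A name with spaces is possible, but must be wrapped in backticks every time
toy$`Minutes Per Game` <- toy$Minutes / toy$Games
names(toy)
```

Periods are usually the more convenient choice, because backticked names have to be quoted wherever they are used.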

summary(nba$Minutes.Per.Game)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   14.32   20.57   19.52   25.06   37.22

names(nba) # Provides the name of each column

##  [1] "Year"                   "Pick"                  
##  [3] "Team"                   "Player"                
##  [5] "Position"               "College"               
##  [7] "Games"                  "Minutes"               
##  [9] "Total.Points"           "Total.Rebounds"        
## [11] "Total.Assists"          "Field.Goal.Percentage" 
## [13] "Three.Point.Percentage" "Free.Throw.Percentage" 
## [15] "Points.Per.Game"        "Rebounds.Per.Game"     
## [17] "Assists.Per.Game"       "Win.Share"             
## [19] "Minutes.Per.Game"

One way to interpret the mean: across their first two seasons in the league, these players averaged about 19.52 minutes per game.

Minutes Per Game Histogram

We now plot a histogram of the Minutes Per Game to see its distribution.

qplot(Minutes.Per.Game, data = nba, binwidth = 2.5,
      main = "Histogram of Minutes Per Game")

# binwidth is the width of each rectangular bar

Someone Played a lot of Minutes

We now take a closer look at the player with the most minutes per game, Damian Lillard.

nba[which.max(nba$Minutes.Per.Game),]

##   Year Pick Team         Player Position                College Games
## 6 2012    6  POR Damian Lillard       PG Weber State University   164
##   Minutes Total.Points Total.Rebounds Total.Assists Field.Goal.Percentage
## 6    6104         3257            545           988                 0.427
##   Three.Point.Percentage Free.Throw.Percentage Points.Per.Game
## 6                  0.381                 0.859            19.9
##   Rebounds.Per.Game Assists.Per.Game Win.Share Minutes.Per.Game
## 6               3.3                6      15.4         37.21951
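The command above works because which.max() returns the position of the largest value, and that position is then used as a row index. A minimal sketch of the same idea, using the Minutes and Games values of the first six players from the head(nba) output (the vector name mpg is made up):

```r
# Minutes and Games for the first six picks, from the head(nba) output
minutes <- c(4204, 3527, 4275, 3828, 1929, 6104)
games <- c(131, 140, 129, 131, 140, 164)
mpg <- minutes / games

top_index <- which.max(mpg)  # position of the largest minutes per game
print(top_index)
print(mpg[top_index])

# order() generalizes this: positions of the three largest values
top_three <- head(order(mpg, decreasing = TRUE), 3)
print(top_three)
```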

Find Minutes By Position

Looking at the average minutes per game for each position, we notice that point guards average the most.

mean((nba$Minutes.Per.Game)[nba$Position == "PG"])

## [1] 22.97461

mean((nba$Minutes.Per.Game)[nba$Position == "SG"])

## [1] 19.44996

mean((nba$Minutes.Per.Game)[nba$Position == "SF"])

## [1] 21.93133

mean((nba$Minutes.Per.Game)[nba$Position == "PF"])

## [1] 17.46062

mean((nba$Minutes.Per.Game)[nba$Position == "C"])

## [1] 18.4517

As a breakdown:

mean() - returns the mean of its argument
nba$Minutes.Per.Game - selects the column in the nba data set named Minutes.Per.Game
[nba$Position == "XX"] - keeps only the values in that column for rows whose Position is the chosen position XX
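Repeating the same line for every position works, but base R can also compute all the group means in one call with tapply(). This is an alternative to the code above, not a replacement for it. The sketch uses a made-up data frame toy built from the first six rows of nba, so its means differ from the full-data results.

```r
# Made-up stand-in for nba, using the first six rows of the head(nba) output
toy <- data.frame(
  Position = c("PF", "SF", "SG", "SG", "PF", "PG"),
  Minutes = c(4204, 3527, 4275, 3828, 1929, 6104),
  Games = c(131, 140, 129, 131, 140, 164)
)
toy$Minutes.Per.Game <- toy$Minutes / toy$Games

# tapply(values, groups, function): apply mean() within each Position
by_position <- tapply(toy$Minutes.Per.Game, toy$Position, mean)
print(by_position)
```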

Combine Positions

We label both power forwards and small forwards as just forwards, and similarly we label both point guards and shooting guards as guards.

This code creates two logical vectors, forward and guard, that are TRUE for the rows whose Position is in c("PF", "SF") and c("PG", "SG"), respectively.

forward <- (nba$Position %in% c("PF", "SF")) # TRUE for power and small forwards
head(forward)

## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

guard <- (nba$Position %in% c("PG", "SG"))
head(guard)

## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
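Because forward and guard are logical vectors, they can be counted with sum() and combined into a single grouping variable. The sketch below uses the Position values of the first six players; the vector name group and the labels "Guard", "Forward", and "Center" are made up for illustration.

```r
# Position values of the first six picks, from the head(nba) output
position <- c("PF", "SF", "SG", "SG", "PF", "PG")

forward <- position %in% c("PF", "SF")
guard <- position %in% c("PG", "SG")

sum(forward)  # TRUE counts as 1, so this is the number of forwards
sum(guard)    # number of guards

# Collapse the two logical vectors into one label per player;
# anything that is neither a guard nor a forward is labeled "Center"
group <- ifelse(guard, "Guard", ifelse(forward, "Forward", "Center"))
table(group)
```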

Average Points Per Game for Position

We can now find the mean points per game for each of the newly created groups.

mean(nba$Points.Per.Game[guard])

## [1] 8.881818

mean(nba$Points.Per.Game[forward])

## [1] 7.416667

Box Plots

Perhaps we are interested in how minutes per game differ across positions. We can compare them with a side-by-side boxplot.

qplot(Position, Minutes.Per.Game, geom = "boxplot", data = nba,
      main = "Box Plot of Minutes Per Game by Position")

From this, we notice that the median minutes per game for small forwards is slightly higher than for the other positions.

Your Turn

Try playing with chunks of code from this session to further investigate the NBA data:

1. Get a summary of the Assists.Per.Game values.
2. Make a boxplot comparing Rebounds.Per.Game for different positions.
3. Find the average Assists.Per.Game for the guards (point and shooting) and forwards (small and power).

Answers

1.

summary(nba$Assists.Per.Game)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.500   1.000   1.377   1.375   6.100

2.

qplot(Position, Rebounds.Per.Game, geom = "boxplot", data = nba,
      main = "Box Plot of Rebounds Per Game by Position")

3.

guard <- (nba$Position %in% c("PG", "SG"))
forward <- (nba$Position %in% c("PF", "SF"))
mean(nba$Assists.Per.Game[guard])

## [1] 2.536364

mean(nba$Assists.Per.Game[forward])

## [1] 0.7833333