- Explore a real data set using R
- Goal: get a good understanding of using R for data management and exploration
- Don't worry about understanding all of the code right away
- We will go back and explain how it all works in detail
Let's use R to look at the top few rows of the NBA data set. First, we load the NBA data using read.csv():
nba <- read.csv("NBA Draft 2012.csv")
The head() function shows the first 6 rows of the data, and tail() shows the last 6 rows:
head(nba)
##   Year Pick Team                 Player Position                College
## 1 2012    1  NOH          Anthony Davis       PF University of Kentucky
## 2 2012    2  CHA Michael Kidd-Gilchrist       SF University of Kentucky
## 3 2012    3  WAS           Bradley Beal       SG  University of Florida
## 4 2012    4  CLE           Dion Waiters       SG    Syracuse University
## 5 2012    5  SAC        Thomas Robinson       PF   University of Kansas
## 6 2012    6  POR         Damian Lillard       PG Weber State University
##   Games Minutes Total.Points Total.Rebounds Total.Assists
## 1   131    4204         2261           1195           168
## 2   140    3527         1152            779           169
## 3   129    4275         2029            484           380
## 4   131    3828         2007            344           392
## 5   140    1929          672            622            80
## 6   164    6104         3257            545           988
##   Field.Goal.Percentage Three.Point.Percentage Free.Throw.Percentage
## 1                 0.518                  0.133                 0.777
## 2                 0.464                  0.167                 0.682
## 3                 0.416                  0.396                 0.787
## 4                 0.424                  0.342                 0.714
## 5                 0.454                      0                 0.543
## 6                 0.427                  0.381                 0.859
##   Points.Per.Game Rebounds.Per.Game Assists.Per.Game Win.Share
## 1            17.3               9.1              1.3      16.5
## 2             8.2               5.6              1.2       5.2
## 3            15.7               3.8              2.9       7.0
## 4            15.3               2.6              3.0       2.5
## 5             4.8               4.4              0.6       1.6
## 6            19.9               3.3              6.0      15.4
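Both head() and tail() accept an n argument when six rows is not the number you want. The sketch below uses a small made-up data frame (toy is a hypothetical stand-in, not the NBA file) so that it runs on its own:

```r
# Toy data frame standing in for the NBA data (values are made up)
toy <- data.frame(
  Pick   = 1:10,
  Points = c(12, 7, 15, 9, 4, 20, 6, 8, 11, 3)
)

first_three <- head(toy, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(toy, n = 2)  # last 2 rows

print(first_three)
print(last_two)
```

With the real data, head(nba, n = 3) would behave the same way.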
The command str(), short for structure, gives us a summary of each variable along with the size of the "data frame".
str(nba)
## 'data.frame':    30 obs. of  18 variables:
##  $ Year                  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Pick                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Team                  : Factor w/ 23 levels "ATL","BOS","CHA",..: 15 3 23 5 21 20 9 22 8 15 ...
##  $ Player                : Factor w/ 30 levels "Andre Drummond",..: 3 22 6 8 28 7 12 27 1 5 ...
##  $ Position              : Factor w/ 5 levels "C","PF","PG",..: 2 4 5 5 2 3 4 4 1 5 ...
##  $ College               : Factor w/ 19 levels "Baylor University",..: 15 15 12 10 14 19 16 17 11 2 ...
##  $ Games                 : int  131 140 129 131 140 164 159 154 141 130 ...
##  $ Minutes               : int  4204 3527 4275 3828 1929 6104 4262 3398 3862 2757 ...
##  $ Total.Points          : int  2261 1152 2029 2007 672 3257 1486 1346 1571 907 ...
##  $ Total.Rebounds        : int  1195 779 484 344 622 545 644 396 1528 238 ...
##  $ Total.Assists         : int  168 169 380 392 80 988 214 132 65 287 ...
##  $ Field.Goal.Percentage : num  0.518 0.464 0.416 0.424 0.454 0.427 0.419 0.417 0.618 0.39 ...
##  $ Three.Point.Percentage: Factor w/ 22 levels "0","0.133","0.167",..: 2 3 21 12 1 20 16 18 7 13 ...
##  $ Free.Throw.Percentage : Factor w/ 27 levels "0.25","0.402",..: 20 13 21 15 6 26 16 22 2 9 ...
##  $ Points.Per.Game       : num  17.3 8.2 15.7 15.3 4.8 19.9 9.3 8.7 11.1 7 ...
##  $ Rebounds.Per.Game     : num  9.1 5.6 3.8 2.6 4.4 3.3 4.1 2.6 10.8 1.8 ...
##  $ Assists.Per.Game      : num  1.3 1.2 2.9 3 0.6 6 1.3 0.9 0.5 2.2 ...
##  $ Win.Share             : num  16.5 5.2 7 2.5 1.6 15.4 6 5.1 14.4 -0.5 ...
As we can see, the nba data frame has 30 observations (rows) and 18 variables (columns).
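If only the size of a data frame is needed, dim(), nrow() and ncol() report it directly. A minimal sketch, using a hypothetical four-row data frame built from the first few picks shown earlier rather than the full file:

```r
# Small stand-in data frame (first four picks from the head() output)
toy <- data.frame(
  Pick  = 1:4,
  Team  = c("NOH", "CHA", "WAS", "CLE"),
  Games = c(131, 140, 129, 131)
)

size   <- dim(toy)   # c(number of rows, number of columns)
n_rows <- nrow(toy)
n_cols <- ncol(toy)

print(size)
```

On the full data set, dim(nba) should agree with the first line of the str() output.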
Let's summarize the values for each variable in the NBA data with the summary() command.
summary(nba)
##       Year           Pick            Team                 Player
##  Min.   :2012   Min.   : 1.00   HOU    : 3   Andre Drummond  : 1
##  1st Qu.:2012   1st Qu.: 8.25   BOS    : 2   Andrew Nicholson: 1
##  Median :2012   Median :15.50   CLE    : 2   Anthony Davis   : 1
##  Mean   :2012   Mean   :15.50   GSW    : 2   Arnett Moultrie : 1
##  3rd Qu.:2012   3rd Qu.:22.75   NOH    : 2   Austin Rivers   : 1
##  Max.   :2012   Max.   :30.00   POR    : 2   Bradley Beal    : 1
##                                 (Other):17   (Other)         :24
##  Position                         College       Games
##  C :7     University of Kentucky      : 4   Min.   :  3.00
##  PF:7     University of North Carolina: 4   1st Qu.: 94.25
##  PG:3     Duke University             : 2   Median :116.50
##  SF:5     Syracuse University         : 2   Mean   :109.20
##  SG:8     University of Connecticut   : 2   3rd Qu.:140.00
##           University of Washington    : 2   Max.   :164.00
##           (Other)                     :14
##     Minutes      Total.Points    Total.Rebounds   Total.Assists
##  Min.   :   9   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0
##  1st Qu.:1230   1st Qu.: 429.2   1st Qu.: 199.5   1st Qu.: 54.5
##  Median :2310   Median : 958.0   Median : 381.0   Median :125.5
##  Mean   :2399   Mean   : 968.6   Mean   : 455.0   Mean   :170.8
##  3rd Qu.:3495   3rd Qu.:1243.2   3rd Qu.: 638.5   3rd Qu.:168.8
##  Max.   :6104   Max.   :3257.0   Max.   :1528.0   Max.   :988.0
##
##  Field.Goal.Percentage Three.Point.Percentage Free.Throw.Percentage
##  Min.   :0.0000        NoAttempts: 5          0.667  : 2
##  1st Qu.:0.4200        0         : 3          0.8    : 2
##  Median :0.4380        0.3       : 2          0.833  : 2
##  Mean   :0.4404        0.381     : 2          0.25   : 1
##  3rd Qu.:0.4953        0.133     : 1          0.402  : 1
##  Max.   :0.6180        0.167     : 1          0.521  : 1
##                        (Other)   :16          (Other):21
##  Points.Per.Game  Rebounds.Per.Game Assists.Per.Game   Win.Share
##  Min.   : 0.000   Min.   : 0.000    Min.   :0.000    Min.   :-0.800
##  1st Qu.: 4.500   1st Qu.: 1.925    1st Qu.:0.500    1st Qu.: 1.500
##  Median : 7.150   Median : 3.350    Median :1.000    Median : 2.800
##  Mean   : 7.673   Mean   : 3.693    Mean   :1.377    Mean   : 4.117
##  3rd Qu.: 9.525   3rd Qu.: 4.775    3rd Qu.:1.375    3rd Qu.: 5.200
##  Max.   :19.900   Max.   :10.800    Max.   :6.100    Max.   :16.500
##
With this command we immediately have summary statistics of each variable.
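In the output above, Three.Point.Percentage and Free.Throw.Percentage are reported as factors with counts rather than quartiles, and summary() lists a NoAttempts entry, which suggests those columns contain text as well as numbers. The following is only a sketch of one way to recover numeric values; pct is a made-up vector, not the real column:

```r
# Hypothetical values mimicking a percentage column with a text entry
pct <- factor(c("0.133", "0.167", "NoAttempts", "0.396", "0"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(pct)

# Going through character first recovers the numbers; "NoAttempts" becomes NA
values <- suppressWarnings(as.numeric(as.character(pct)))

print(codes)
print(values)
print(mean(values, na.rm = TRUE))  # NA entries are skipped
```

In R 4.0 and later, read.csv() reads such columns as character rather than factor by default, in which case the as.character() step is harmless.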
Let's look at the relationship between pick selection and total points. First, we need to install and load ggplot2, a special package for plotting.
install.packages("ggplot2")
library(ggplot2)
Using the qplot() command we can create a simple scatter plot.
qplot(Pick, Total.Points, geom="point", data = nba, main = "Total Points vs. Pick")
qplot(Pick, Total.Points, geom = "point", data = nba, colour = Position, main = "Total Points vs. Pick" )
qplot(Pick, Total.Points, geom = "point", data = nba, main = "Total Points vs. Pick with Regression Line") + geom_smooth(method = "lm")
Do not let the qplot() command or any of its parameters confuse you. We will discuss this function in detail in later lessons.
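qplot() is a shortcut, and recent ggplot2 releases mark it as deprecated in favour of ggplot(). As a rough sketch, the coloured scatter plot and the regression-line plot could be written as below. The toy data frame holds made-up values so the code runs without the CSV file, and the object names p_colour and p_line are illustrative only.

```r
library(ggplot2)

# Toy data standing in for nba (made-up values)
toy <- data.frame(
  Pick         = 1:8,
  Total.Points = c(2200, 1100, 2000, 1900, 700, 3200, 1500, 1300),
  Position     = c("PF", "SF", "SG", "SG", "PF", "PG", "SF", "SF")
)

# Scatter plot coloured by position
p_colour <- ggplot(toy, aes(x = Pick, y = Total.Points, colour = Position)) +
  geom_point() +
  ggtitle("Total Points vs. Pick")

# Scatter plot with a fitted linear regression line
p_line <- ggplot(toy, aes(x = Pick, y = Total.Points)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  ggtitle("Total Points vs. Pick with Regression Line")

print(p_colour)
print(p_line)
```

Replacing toy with nba should give plots comparable to the qplot() versions.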
We will make a new variable in the nba data set for minutes per game, that is, minutes per game = minutes / games.
# Creates new column Minutes.Per.Game in nba data
nba$Minutes.Per.Game <- nba$Minutes / nba$Games
Notice that the new column name uses periods between words. R names cannot contain spaces unless they are wrapped in backticks, so periods are a common way to separate words.
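The periods in the existing column names most likely come from read.csv() itself, which by default (check.names = TRUE) passes headers through make.names(). The sketch below assumes the CSV headers contained spaces; raw_headers is a hypothetical example, not read from the file:

```r
# Headers as they might appear in the CSV file (assumed, with spaces)
raw_headers <- c("Total Points", "Win Share", "Year")

# make.names() turns invalid characters such as spaces into periods
clean_headers <- make.names(raw_headers)
print(clean_headers)  # "Total.Points" "Win.Share" "Year"

# A name with a space still works if it is wrapped in backticks
df <- data.frame(`Minutes Per Game` = c(30.1, 22.5), check.names = FALSE)
print(df$`Minutes Per Game`)
```

Names with periods are easier to type than backticked names, which is why they are used throughout this session.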
summary(nba$Minutes.Per.Game)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.00   14.32   20.57   19.52   25.06   37.22
names(nba) # Provides the name of each column
##  [1] "Year"                   "Pick"
##  [3] "Team"                   "Player"
##  [5] "Position"               "College"
##  [7] "Games"                  "Minutes"
##  [9] "Total.Points"           "Total.Rebounds"
## [11] "Total.Assists"          "Field.Goal.Percentage"
## [13] "Three.Point.Percentage" "Free.Throw.Percentage"
## [15] "Points.Per.Game"        "Rebounds.Per.Game"
## [17] "Assists.Per.Game"       "Win.Share"
## [19] "Minutes.Per.Game"
One way to interpret the summary of Minutes.Per.Game is that, by their second year in the league, these players averaged about 19.52 minutes per game.
We now plot a histogram of the Minutes Per Game to see its distribution.
qplot(Minutes.Per.Game, data = nba, binwidth = 2.5, main = "Histogram of Minutes Per Game")
# binwidth is the width of each bin (bar) along the x-axis
We now take a closer look at the player with the most minutes per game, Damian Lillard.
nba[which.max(nba$Minutes.Per.Game),]
##   Year Pick Team         Player Position                College Games
## 6 2012    6  POR Damian Lillard       PG Weber State University   164
##   Minutes Total.Points Total.Rebounds Total.Assists Field.Goal.Percentage
## 6    6104         3257            545           988                 0.427
##   Three.Point.Percentage Free.Throw.Percentage Points.Per.Game
## 6                  0.381                 0.859            19.9
##   Rebounds.Per.Game Assists.Per.Game Win.Share Minutes.Per.Game
## 6               3.3                6      15.4         37.21951
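The row lookup above relies on which.max(), which returns the position of the largest value; that position is then used as a row index. A small sketch with made-up numbers (toy and its player labels are hypothetical):

```r
# Made-up minutes-per-game values for six unnamed players
toy <- data.frame(
  Player           = c("A", "B", "C", "D", "E", "F"),
  Minutes.Per.Game = c(32.1, 25.2, 33.1, 29.2, 13.8, 37.2)
)

idx <- which.max(toy$Minutes.Per.Game)  # position of the largest value
top_row <- toy[idx, ]                   # that row, all columns

print(idx)
print(top_row)
```

Leaving the column position empty after the comma, as in toy[idx, ], keeps every column of the selected row.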
Looking at the average minutes per game for each position, we notice that point guards average the most minutes per game.
mean((nba$Minutes.Per.Game)[nba$Position == "PG"])
## [1] 22.97461
mean((nba$Minutes.Per.Game)[nba$Position == "SG"])
## [1] 19.44996
mean((nba$Minutes.Per.Game)[nba$Position == "SF"])
## [1] 21.93133
mean((nba$Minutes.Per.Game)[nba$Position == "PF"])
## [1] 17.46062
mean((nba$Minutes.Per.Game)[nba$Position == "C"])
## [1] 18.4517
As a breakdown:
- mean() provides the mean of its argument
- nba$Minutes.Per.Game selects the column in the nba data set named Minutes.Per.Game
- [nba$Position == "XX"] subsets the column to the chosen position XX
We label both power forwards and small forwards as just forwards, and similarly group point guards and shooting guards as guards.
This code works by creating two logical vectors, forward and guard, that are TRUE for the rows whose Position is in c("PF", "SF") and c("PG", "SG"), respectively:
forward <- (nba$Position %in% c("PF", "SF"))
head(forward)
## [1] TRUE TRUE FALSE FALSE TRUE FALSE
guard <- (nba$Position %in% c("PG", "SG"))
head(guard)
## [1] FALSE FALSE TRUE TRUE FALSE TRUE
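To see how %in% and logical indexing fit together, here is a self-contained sketch that uses only the first six rows shown by head(nba) earlier, so the result describes those six players rather than the whole data set:

```r
# Position and Points.Per.Game for the first six picks (from head(nba))
position <- c("PF", "SF", "SG", "SG", "PF", "PG")
points   <- c(17.3, 8.2, 15.7, 15.3, 4.8, 19.9)

# TRUE wherever the position is one of the listed values
is_guard <- position %in% c("PG", "SG")

guard_points <- points[is_guard]  # keeps only the elements where is_guard is TRUE
n_guards     <- sum(is_guard)     # TRUE counts as 1, so this counts the guards

print(is_guard)
print(guard_points)
print(mean(guard_points))
```

The logical vector has one entry per row, which is why it can be used to index any column of the same data frame.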
We can then find the mean points per game for each of the newly created groups:
mean(nba$Points.Per.Game[guard])
## [1] 8.881818
mean(nba$Points.Per.Game[forward])
## [1] 7.416667
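Calling mean() once per group works, but tapply() can compute a statistic for every group in one call. This is a sketch of that alternative, not the approach used in this session; first_six is a hypothetical data frame built from the first six rows of head(nba), and the Group column is introduced only for illustration:

```r
# First six picks from head(nba)
first_six <- data.frame(
  Position = c("PF", "SF", "SG", "SG", "PF", "PG"),
  Minutes  = c(4204, 3527, 4275, 3828, 1929, 6104),
  Games    = c(131, 140, 129, 131, 140, 164)
)
first_six$Minutes.Per.Game <- first_six$Minutes / first_six$Games

# Mean minutes per game for every position in one call
by_position <- tapply(first_six$Minutes.Per.Game, first_six$Position, mean)
print(round(by_position, 2))

# Collapse positions into broader groups, then summarize again
first_six$Group <- ifelse(first_six$Position %in% c("PG", "SG"), "guard",
                   ifelse(first_six$Position %in% c("PF", "SF"), "forward", "center"))
by_group <- tapply(first_six$Minutes.Per.Game, first_six$Group, mean)
print(round(by_group, 2))
```

Storing the group as a column, rather than as separate logical vectors, makes it available to plotting functions as well.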
Perhaps we are interested in how minutes per game differ across positions. We can compare them with a side-by-side boxplot.
qplot(Position, Minutes.Per.Game, geom = "boxplot", data = nba, main = "Box Plot of Minutes Per Game by Position")
From this, we notice that the median minutes per game for small forwards is slightly higher than for the other positions.
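A reading taken from a boxplot can be checked numerically by computing the medians directly. The sketch below uses aggregate() with a formula on a made-up data frame (toy and its values are hypothetical), so its numbers say nothing about the real NBA data:

```r
# Made-up minutes per game for three positions
toy <- data.frame(
  Position         = c("PG", "PG", "PG", "SF", "SF", "SF", "C", "C", "C"),
  Minutes.Per.Game = c(20, 30, 25, 28, 22, 31, 15, 18, 12)
)

# One row per position, with the median of Minutes.Per.Game in each
medians <- aggregate(Minutes.Per.Game ~ Position, data = toy, FUN = median)
print(medians)
```

Running the same call with data = nba would give the medians that the boxplot draws as the line inside each box.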
Try playing with chunks of code from this session to further investigate the NBA data:
summary(nba$Assists.Per.Game)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   0.500   1.000   1.377   1.375   6.100
qplot(Position, Rebounds.Per.Game, geom = "boxplot", data = nba, main = "Box Plot of Rebounds Per Game by Position")
guard <- (nba$Position %in% c("PG", "SG"))
forward <- (nba$Position %in% c("PF", "SF"))
mean(nba$Assists.Per.Game[guard])
## [1] 2.536364
mean(nba$Assists.Per.Game[forward])
## [1] 0.7833333