Data Structures

Data Frames

Data frames are the workhorse of R objects
Structured by rows and columns and can be indexed
Each column has a single variable type
Column names can be used to index a variable
Advice for naming variables applies to editing column names
Can be created by grouping vectors of equal length as columns

Data Frame Indexing

Elements are indexed much like a vector, using [ ]
df[i,j] will select the element in the $i^{th}$ row and $j^{th}$ column
df[ ,j] will select the entire $j^{th}$ column and treat it as a vector
df[i, ] will select the entire $i^{th}$ row and return it as a one-row data frame
Logical vectors can be used in place of i and j to subset the rows and columns
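
A minimal sketch of these indexing rules. The data frame `df` and its values are made up for illustration and are not part of the NBA data:

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(id = 1:4,
                 team = c("MIL", "CHI", "MIL", "BOS"),
                 pick = c(10, 1, 15, 6))

df[2, 3]                    # single element: row 2, column 3
df[, 3]                     # entire third column, returned as a vector
df[2, ]                     # entire second row (a one-row data frame)
df[df$team == "MIL", ]      # logical vector in place of i keeps matching rows
df[, c(TRUE, FALSE, TRUE)]  # logical vector in place of j keeps columns 1 and 3
```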

Adding a New Variable to a Data Frame

Create a new vector that is the same length as the other columns
Append the new column to the data frame using the $ operator
The new column takes the name given after the $, not the name of the vector
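
A short sketch of adding columns with `$`. The data frame and the column names `games` and `ppg` are hypothetical, chosen only for illustration:

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(player = c("A", "B", "C"),
                 points = c(120, 95, 210))

games <- c(10, 8, 15)           # new vector, same length as the existing columns
df$games <- games               # append it; the column is named by what follows $
df$ppg <- df$points / df$games  # a column can also be computed from existing ones
df
```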

Data Frame Demo

Loading previously used NBA data set:

nba <- read.csv("NBA Draft Class.csv")

Select position column (5th column):

nba[,5]

##   [1] PG SF SG PG PF SF SG PG C  PG PF SG SF C  PF C  C  C  C  PF SG C  PF
##  [24] SF PG PF PF PF C  SG SG PG PG PF SG PG SG SG PF PF SF PF PG PG PG PG
##  [47] PG PF SF C  SG PF SF SG PG PG SF C  SF C  PF PF SF SG SF C  SG PF PF
##  [70] C  SF C  PG PG SG SG PF SF SG SF SG PG PF SF PG PF C  PF C  PF C  PG
##  [93] PG SG SG PG PF PF SF C  SG SF SF PF PG PF PG SG SF PG PG SG PF SF SG
## [116] SG PF PG SF SF C  SG C  SG PG PF SF PF C  PF PF SG C  C  SG SG SG C 
## [139] PF SF PG C  SF SG SF C  C  SG SG PG SG PG C  C  SF SG PG PG SG SG C 
## [162] PF SF SG SG SG C  SG PG
## Levels: C PF PG SF SG

Demo (Continued)

Select team column with the $ operator:

nba$Team

##   [1] CHI MIA MIN SEA MEM NYK LAC CHA NJN IND SAC POR GSW PHO PHI TOR WAS
##  [18] CLE CHA NJN ORL UTA SEA HOU SAS NOH DET LAC MEM OKC SAC MIN GSW NYK
##  [35] TOR MIL NJN CHA IND PHO DET CHI PHI MIN ATL UTA NOH POR SAC DAL OKC
##  [52] CHI MEM MIN LAL WAS PHI NJN MIN SAC GSW DET LAC UTA IND NOH MEM TOR
##  [69] HOU MIL MIN CHI OKC BOS SAS POR MIN ATL MEM OKC NJN MEM ORL WAS CLE
##  [86] MIN UTA CLE TOR WAS SAC DET CHA MIL GSW UTA PHO HOU IND PHI NYK WAS
## [103] CHA MIN POR DEN OKC BOS DAL CHI SAS CHI NOH CHA WAS CLE SAC POR GSW
## [120] TOR DET NOH POR HOU PHO MIL PHI HOU DAL HOU ORL DEN BOS BOS ATL CLE
## [137] MEM IND MIA OKC CHI GSW CLE ORL WAS CHA PHO SAC DET MIN POR PHI OKC
## [154] DAL UTA MIL ATL ATL CLE CHI UTA BRK IND NYK LAC MIN DEN OKC PHO
## 32 Levels: ATL BOS BRK CHA CHI CLE DAL DEN DET GSW HOU IND LAC LAL ... WAS

Demo (Continued)

We now determine the row location, in our data, where the team is the Milwaukee Bucks.

bucks <- nba$Team == "MIL" # Logical vector: TRUE where the team is MIL
head(bucks)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

This output doesn't show much. It would be much easier if we could see which positions are TRUE!

which(bucks) # Row numbers where the team is MIL

## [1]  36  70  94 126 156

Demo (Continued)

Displaying part of the NBA data set where the team is Milwaukee by subsetting rows.

nba[nba$Team=="MIL", ]

##     Year Pick Team                Player Position
## 36  2009   10  MIL      Brandon Jennings       PG
## 70  2010   15  MIL         Larry Sanders        C
## 94  2011   10  MIL       Jimmer Fredette       SG
## 126 2012   14  MIL           John Henson       PF
## 156 2013   15  MIL Giannis Antetokounmpo       SG
##                              College Games Minutes Total.Points
## 36                        NoAttempts   371   12796         6187
## 70  Virginia Commonwealth University   206    4036         1320
## 94          Brigham Young University   179    2621         1229
## 126     University of North Carolina   133    2683         1159
## 156                       NoAttempts    77    1897          525
##     Total.Rebounds Total.Assists Field.Goal.Percentage
## 36            1232          2270                 0.390
## 70            1175           151                 0.477
## 94             196           264                 0.417
## 126            794           144                 0.519
## 156            339           150                 0.414
##     Three.Point.Percentage Free.Throw.Percentage Points.Per.Game
## 36                    0.35                 0.799            16.7
## 70                       0                 0.559             6.4
## 94                   0.401                 0.857             6.9
## 126                      0                 0.521             8.7
## 156                  0.347                 0.683             6.8
##     Rebounds.Per.Game Assists.Per.Game Win.Share     X
## 36                3.3              6.1      23.3 0.087
## 70                5.7              0.7       8.7 0.104
## 94                1.1              1.5       2.2 0.040
## 126               6.0              1.1       5.1 0.092
## 156               4.4              1.9       1.2 0.031

Creating our own Data Frame

Creating our own data frame using the data.frame() function:

mydf <- data.frame(NUMS = 1:5, 
                   LETS = letters[1:5],
                   SHOES = c("Nike", "Adidas", "Reebok", "Big Baller Brand", "Adidas"))
mydf

##   NUMS LETS            SHOES
## 1    1    a             Nike
## 2    2    b           Adidas
## 3    3    c           Reebok
## 4    4    d Big Baller Brand
## 5    5    e           Adidas

Note that in a data frame, each column has to have the same length!
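
A brief sketch of what happens when the lengths do not match. The values are made up, and `tryCatch()` is used here only so the example keeps running after the error:

```r
# Columns of length 3 and 2 cannot be combined; data.frame() signals an error.
result <- tryCatch(
  data.frame(x = 1:3, y = c("a", "b")),
  error = function(e) conditionMessage(e)
)
result

# A length-1 value is recycled to fill the whole column.
recycled <- data.frame(x = 1:3, y = "a")
recycled
```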

Renaming columns

We can use the names() function to change the first column name to lowercase:

names(mydf)[1] <- "nums" # Changes the name of the first column in mydf
mydf

##   nums LETS            SHOES
## 1    1    a             Nike
## 2    2    b           Adidas
## 3    3    c           Reebok
## 4    4    d Big Baller Brand
## 5    5    e           Adidas

We can also rename all the columns at once using the colnames() function.

colnames(mydf) <- c("numbers","letters","shoes") # Changes all columns at once
mydf

##   numbers letters            shoes
## 1       1       a             Nike
## 2       2       b           Adidas
## 3       3       c           Reebok
## 4       4       d Big Baller Brand
## 5       5       e           Adidas
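
Renaming by position breaks if the column order changes. One alternative, sketched on a hypothetical data frame `toy`, is to look the column up by its current name:

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(NUMS = 1:3, LETS = letters[1:3])

# Match on the current name instead of the position
names(toy)[names(toy) == "LETS"] <- "lets"
names(toy)
```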

Your Turn

Construct a data frame where column 1 contains 5 Milwaukee Bucks players and column 2 is their Pick number.
Select only the rows where the Pick number is even.
Determine which rows of the nba data set contain the Chicago Bulls.

Answers

1.

mydf <- data.frame(Player = c("Jennings","Sanders","Fredette","Henson","Antetokounmpo"), 
                   Pick = c(10,15,10,14,15)
                   )
mydf

##          Player Pick
## 1      Jennings   10
## 2       Sanders   15
## 3      Fredette   10
## 4        Henson   14
## 5 Antetokounmpo   15

2.

mydf[mydf$Pick %% 2 == 0, ] # Keeps rows whose Pick is divisible by 2

##     Player Pick
## 1 Jennings   10
## 3 Fredette   10
## 4   Henson   14

3.

bulls <- nba$Team == "CHI" 
which(bulls == TRUE)

## [1]   1  42  52  72 110 112 141 160

Lists

Lists are a structured collection of R objects
R objects in a list need not be the same type
Create lists using the list function
Lists indexed using double square brackets [[ ]] to select an object

List Example

Creating a list containing a 2-by-5 matrix, a vector of length 5, and a string:

mylist <- list(matrix(letters[1:10], nrow = 2, ncol = 5),
               c("Brady", "Rodgers", "Romo", "Newton", "Wilson"),
               "The Chicago Cubs won the 2016 World Series")
mylist

## [[1]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "a"  "c"  "e"  "g"  "i" 
## [2,] "b"  "d"  "f"  "h"  "j" 
## 
## [[2]]
## [1] "Brady"   "Rodgers" "Romo"    "Newton"  "Wilson" 
## 
## [[3]]
## [1] "The Chicago Cubs won the 2016 World Series"

Note that unlike data frames, lists can contain elements of varying sizes and structures.

Use indexing to select the third list element:

mylist[[3]] # Selects the third element of mylist

## [1] "The Chicago Cubs won the 2016 World Series"

Your Turn

Create a list containing mydf as well as a vector of length 5 containing NFL wide receivers
Use indexing to select mydf from your list

Answers

1.

mylist <- list(mydf,
               c("Nelson","Bryant","Crabtree","Fitzgerald","Jones"))

2.

mylist[[1]]

##          Player Pick
## 1      Jennings   10
## 2       Sanders   15
## 3      Fredette   10
## 4        Henson   14
## 5 Antetokounmpo   15

Examining Objects

head(x) - View top 6 rows of a data frame
tail(x) - View bottom 6 rows of a data frame
summary(x) - Summary statistics
str(x) - View structure of object
dim(x) - View dimensions of object
length(x) - Returns the length of a vector
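
These functions can be tried on any object. A minimal sketch on a made-up 10-row data frame `toy` (hypothetical, not the NBA data):

```r
# Hypothetical toy data frame with 10 rows, used only for illustration
toy <- data.frame(id = 1:10, score = seq(5, 50, by = 5))

head(toy)          # first 6 rows by default
tail(toy, n = 3)   # last 3 rows
summary(toy)       # per-column summary statistics
str(toy)           # compact view of the structure
dim(toy)           # number of rows, then number of columns
length(toy$score)  # length of one column (a vector)
length(toy)        # for a data frame, length() counts columns
```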

Examining Objects Demo

We can examine the first two rows of an object by passing the n parameter to the head() function:

head(nba, n = 2) # n = 2 displays only the first two rows.

##   Year Pick Team          Player Position                 College Games
## 1 2008    1  CHI    Derrick Rose       PG   University of Memphis   289
## 2 2008    2  MIA Michael Beasley       SF Kansas State University   409
##   Minutes Total.Points Total.Rebounds Total.Assists Field.Goal.Percentage
## 1   10583         6017           1103          1954                  0.46
## 2   10170         5416           2007           539                  0.45
##   Three.Point.Percentage Free.Throw.Percentage Points.Per.Game
## 1                  0.312                 0.815            20.8
## 2                  0.348                 0.758            13.2
##   Rebounds.Per.Game Assists.Per.Game Win.Share     X
## 1               3.8              6.8      29.8 0.135
## 2               4.9              1.3      10.3 0.048

What's its structure?

str(nba)

## 'data.frame':    169 obs. of  19 variables:
##  $ Year                  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ Pick                  : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ Team                  : Factor w/ 32 levels "ATL","BOS","BRK",..: 5 16 18 29 15 21 13 4 19 12 ...
##  $ Player                : Factor w/ 169 levels "Al-Farouq Aminu",..: 44 115 124 139 99 36 55 31 23 82 ...
##  $ Position              : Factor w/ 5 levels "C","PF","PG",..: 3 4 5 3 2 4 5 3 1 3 ...
##  $ College               : Factor w/ 69 levels "Arizona State University",..: 48 18 59 38 38 24 15 61 34 36 ...
##  $ Games                 : int  289 409 435 440 364 285 311 429 342 381 ...
##  $ Minutes               : int  10583 10170 14132 14932 11933 8923 10649 10710 11339 7611 ...
##  $ Total.Points          : int  6017 5416 6447 8834 6989 4138 5430 4354 6168 3221 ...
##  $ Total.Rebounds        : int  1103 2007 1414 2171 4453 1327 792 785 2494 729 ...
##  $ Total.Assists         : int  1954 539 1292 3045 898 546 1021 1731 494 1092 ...
##  $ Field.Goal.Percentage : num  0.46 0.45 0.433 0.433 0.451 0.419 0.442 0.404 0.511 0.411 ...
##  $ Three.Point.Percentage: Factor w/ 96 levels "0","0.038","0.053",..: 41 62 86 37 72 78 77 84 1 67 ...
##  $ Free.Throw.Percentage : Factor w/ 129 levels "0.25","0.402",..: 105 74 107 105 105 114 101 123 91 106 ...
##  $ Points.Per.Game       : num  20.8 13.2 14.8 20.1 19.2 14.5 17.5 10.1 18 8.5 ...
##  $ Rebounds.Per.Game     : num  3.8 4.9 3.3 4.9 12.2 4.7 2.5 1.8 7.3 1.9 ...
##  $ Assists.Per.Game      : num  6.8 1.3 3 6.9 2.5 1.9 3.3 4 1.4 2.9 ...
##  $ Win.Share             : num  29.8 10.3 19.1 42.3 47 23.9 17.4 23.7 31.6 13 ...
##  $ X                     : num  0.135 0.048 0.065 0.136 0.189 0.129 0.078 0.106 0.134 0.082 ...
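
In the str() output, Three.Point.Percentage and Free.Throw.Percentage are stored as factors rather than numbers. This usually happens when a column contains at least one non-numeric entry. Which entries cause it here is not shown, so the sketch below uses a made-up vector `pct` to show one common way to convert such a column:

```r
# Hypothetical column that became a factor because of one text entry
pct <- factor(c("0.35", "0", "0.401", "none"))

as.numeric(pct)  # misleading: returns the internal level codes, not the values

# Convert to character first; entries that are not numbers become NA
pct_num <- suppressWarnings(as.numeric(as.character(pct)))
pct_num
```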

Your Turn

View the top 8 rows of the nba data
What type of object is the nba data set?
How many rows are in the nba data set? (try finding this using dim or indexing + length)

Answers

1.

head(nba,n = 8)

##   Year Pick Team            Player Position
## 1 2008    1  CHI      Derrick Rose       PG
## 2 2008    2  MIA   Michael Beasley       SF
## 3 2008    3  MIN         O.J. Mayo       SG
## 4 2008    4  SEA Russell Westbrook       PG
## 5 2008    5  MEM        Kevin Love       PF
## 6 2008    6  NYK  Danilo Gallinari       SF
## 7 2008    7  LAC       Eric Gordon       SG
## 8 2008    9  CHA     D.J. Augustin       PG
##                                 College Games Minutes Total.Points
## 1                 University of Memphis   289   10583         6017
## 2               Kansas State University   409   10170         5416
## 3     University of Southern California   435   14132         6447
## 4 University of California, Los Angeles   440   14932         8834
## 5 University of California, Los Angeles   364   11933         6989
## 6                            NoAttempts   285    8923         4138
## 7                    Indiana University   311   10649         5430
## 8         University of Texas at Austin   429   10710         4354
##   Total.Rebounds Total.Assists Field.Goal.Percentage
## 1           1103          1954                 0.460
## 2           2007           539                 0.450
## 3           1414          1292                 0.433
## 4           2171          3045                 0.433
## 5           4453           898                 0.451
## 6           1327           546                 0.419
## 7            792          1021                 0.442
## 8            785          1731                 0.404
##   Three.Point.Percentage Free.Throw.Percentage Points.Per.Game
## 1                  0.312                 0.815            20.8
## 2                  0.348                 0.758            13.2
## 3                   0.38                 0.821            14.8
## 4                  0.305                 0.815            20.1
## 5                  0.362                 0.815            19.2
## 6                  0.369                 0.844            14.5
## 7                  0.368                 0.809            17.5
## 8                  0.377                 0.874            10.1
##   Rebounds.Per.Game Assists.Per.Game Win.Share     X
## 1               3.8              6.8      29.8 0.135
## 2               4.9              1.3      10.3 0.048
## 3               3.3              3.0      19.1 0.065
## 4               4.9              6.9      42.3 0.136
## 5              12.2              2.5      47.0 0.189
## 6               4.7              1.9      23.9 0.129
## 7               2.5              3.3      17.4 0.078
## 8               1.8              4.0      23.7 0.106

2.

str(nba)

## 'data.frame':    169 obs. of  19 variables:
##  $ Year                  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ Pick                  : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ Team                  : Factor w/ 32 levels "ATL","BOS","BRK",..: 5 16 18 29 15 21 13 4 19 12 ...
##  $ Player                : Factor w/ 169 levels "Al-Farouq Aminu",..: 44 115 124 139 99 36 55 31 23 82 ...
##  $ Position              : Factor w/ 5 levels "C","PF","PG",..: 3 4 5 3 2 4 5 3 1 3 ...
##  $ College               : Factor w/ 69 levels "Arizona State University",..: 48 18 59 38 38 24 15 61 34 36 ...
##  $ Games                 : int  289 409 435 440 364 285 311 429 342 381 ...
##  $ Minutes               : int  10583 10170 14132 14932 11933 8923 10649 10710 11339 7611 ...
##  $ Total.Points          : int  6017 5416 6447 8834 6989 4138 5430 4354 6168 3221 ...
##  $ Total.Rebounds        : int  1103 2007 1414 2171 4453 1327 792 785 2494 729 ...
##  $ Total.Assists         : int  1954 539 1292 3045 898 546 1021 1731 494 1092 ...
##  $ Field.Goal.Percentage : num  0.46 0.45 0.433 0.433 0.451 0.419 0.442 0.404 0.511 0.411 ...
##  $ Three.Point.Percentage: Factor w/ 96 levels "0","0.038","0.053",..: 41 62 86 37 72 78 77 84 1 67 ...
##  $ Free.Throw.Percentage : Factor w/ 129 levels "0.25","0.402",..: 105 74 107 105 105 114 101 123 91 106 ...
##  $ Points.Per.Game       : num  20.8 13.2 14.8 20.1 19.2 14.5 17.5 10.1 18 8.5 ...
##  $ Rebounds.Per.Game     : num  3.8 4.9 3.3 4.9 12.2 4.7 2.5 1.8 7.3 1.9 ...
##  $ Assists.Per.Game      : num  6.8 1.3 3 6.9 2.5 1.9 3.3 4 1.4 2.9 ...
##  $ Win.Share             : num  29.8 10.3 19.1 42.3 47 23.9 17.4 23.7 31.6 13 ...
##  $ X                     : num  0.135 0.048 0.065 0.136 0.189 0.129 0.078 0.106 0.134 0.082 ...

# data frame

3.

dim(nba)

## [1] 169  19

dim(nba)[1] #Picks first output element

## [1] 169
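
The same count can be reached in several ways. A sketch on a hypothetical 7-row data frame `toy`, including the "indexing + length" route and the nrow() shortcut:

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(id = 1:7, grp = rep(c("a", "b"), length.out = 7))

dim(toy)[1]       # first element of dim() is the row count
length(toy[, 1])  # indexing + length: the length of any one column
nrow(toy)         # built-in shortcut for the same number
```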

Working with Output from a Function

Can save output from a function as an object
The saved object is generally a list of output components
Can pull items out of the output for further computing
Examine objects using functions like str()

Saving Output Demo

Apply a t-test to the NBA data set to see whether Points Per Game differs significantly between players drafted in 2008 and 2010
t.test() can only handle two groups, so we subset the data to just those two years.
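
A minimal sketch of that subsetting step on made-up data. The data frame `toy` and its `ppg` values are hypothetical:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(Year = rep(c(2008, 2009, 2010), each = 3),
                  ppg  = c(12, 9, 15, 7, 8, 11, 6, 10, 5))

two_years <- toy[toy$Year %in% c(2008, 2010), ]  # keep rows from the two years only
unique(two_years$Year)                           # confirm exactly two groups remain
```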

Demo (Continued)

Save the output of the t-test to an object:

tout <- t.test(Points.Per.Game ~ Year, data = nba[nba$Year %in% c(2008, 2010), ])
tout

## 
##  Welch Two Sample t-test
## 
## data:  Points.Per.Game by Year
## t = 2.7584, df = 53.045, p-value = 0.007951
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9010287 5.7028027
## sample estimates:
## mean in group 2008 mean in group 2010 
##          11.277778           7.975862

One interpretation is that there is a statistically significant difference in average points per game between the 2008 and 2010 NBA draft classes. This is one possible way to compare the strength of draft classes.

Let's look at the structure of this object:

str(tout)

## List of 9
##  $ statistic  : Named num 2.76
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 53
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.00795
##  $ conf.int   : atomic [1:2] 0.901 5.703
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 11.28 7.98
##   ..- attr(*, "names")= chr [1:2] "mean in group 2008" "mean in group 2010"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Welch Two Sample t-test"
##  $ data.name  : chr "Points.Per.Game by Year"
##  - attr(*, "class")= chr "htest"

Demo: Extracting the P-Value

Since this is simply a list, we can use our regular indexing:

tout$p.value

## [1] 0.007951372

tout[[3]]

## [1] 0.007951372
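
The other components can be pulled out the same way. A sketch on two made-up samples `g1` and `g2` (hypothetical values, not the NBA data):

```r
# Hypothetical toy data, used only for illustration
g1 <- c(12, 9, 15, 11, 14)
g2 <- c(6, 10, 5, 8, 7)
res <- t.test(g1, g2)

names(res)          # components available in the result
res$estimate        # the two group means
res$conf.int        # confidence interval for the difference in means
res$p.value < 0.05  # components can feed directly into further computation
```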

Your Turn

Pull the p-value from a t.test comparing Win Shares between the 2008 and 2010 NBA draft classes.
What does this p-value imply?

Answers

1.

tout <- t.test(Win.Share ~ Year, data = nba[nba$Year %in% c(2008, 2010), ])
tout

## 
##  Welch Two Sample t-test
## 
## data:  Win.Share by Year
## t = 4.9787, df = 44.16, p-value = 1.027e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.83587 18.49236
## sample estimates:
## mean in group 2008 mean in group 2010 
##          21.529630           8.365517

2.

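Since p = 1.027e-05 < 0.05, we reject the null hypothesis of equal means at the 5% significance level: the mean win shares of the two draft classes differ. The 2008 class has the higher mean, which suggests it is the stronger class, though players drafted in 2008 have also had two more seasons to accumulate win shares.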