Outline

  • Reading files: Excel and .csv

Importing Data

We often need to import in our own data rather than just using built-in datasets.

  • First need to tell R where the data is saved using setwd()
  • Data read in using R functions such as:
    • read.table() for reading in .txt files
    • read.csv() for reading in .csv files
  • Assign the data to new R object when reading in the file

Importing Data Demo

We first create a .csv file. We can use a text editor, excel… Then we load it in:

setwd(/pathtodirectory)
nfl <- read.csv("NFL Games.csv")

Data in Excel

Formats xls and csv - what's the difference? File extensions .xls or .xlsx are proprietary Excel formats, they are binary files .csv is the extension for comma separated value files. They are text files - and directly readable.

Excel file in Excel

Excel file in notepad

.csv file in notepad

Reading Files in R

Text files, usually comma separated or tabular separated (space) can both be read in with read.csv()

read.csv(file, header = TRUE, sep = ",", quote = "\\"",
 dec = ".", fill = TRUE, comment.char = "", ...)

For a comma separated file, we set sep = "," else for a tabular separated file use sep = ""

Reading .csv File

nba <- read.csv("NBA Draft Class.csv")
head(nba)
##   Year Pick Team            Player Position
## 1 2008    1  CHI      Derrick Rose       PG
## 2 2008    2  MIA   Michael Beasley       SF
## 3 2008    3  MIN         O.J. Mayo       SG
## 4 2008    4  SEA Russell Westbrook       PG
## 5 2008    5  MEM        Kevin Love       PF
## 6 2008    6  NYK  Danilo Gallinari       SF
##                                 College Games Minutes Total.Points
## 1                 University of Memphis   289   10583         6017
## 2               Kansas State University   409   10170         5416
## 3     University of Southern California   435   14132         6447
## 4 University of California, Los Angeles   440   14932         8834
## 5 University of California, Los Angeles   364   11933         6989
## 6                            NoAttempts   285    8923         4138
##   Total.Rebounds Total.Assists Field.Goal.Percentage
## 1           1103          1954                 0.460
## 2           2007           539                 0.450
## 3           1414          1292                 0.433
## 4           2171          3045                 0.433
## 5           4453           898                 0.451
## 6           1327           546                 0.419
##   Three.Point.Percentage Free.Throw.Percentage Points.Per.Game
## 1                  0.312                 0.815            20.8
## 2                  0.348                 0.758            13.2
## 3                   0.38                 0.821            14.8
## 4                  0.305                 0.815            20.1
## 5                  0.362                 0.815            19.2
## 6                  0.369                 0.844            14.5
##   Rebounds.Per.Game Assists.Per.Game Win.Share     X
## 1               3.8              6.8      29.8 0.135
## 2               4.9              1.3      10.3 0.048
## 3               3.3              3.0      19.1 0.065
## 4               4.9              6.9      42.3 0.136
## 5              12.2              2.5      47.0 0.189
## 6               4.7              1.9      23.9 0.129
str(nba)
## 'data.frame':    169 obs. of  19 variables:
##  $ Year                  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ Pick                  : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ Team                  : Factor w/ 32 levels "ATL","BOS","BRK",..: 5 16 18 29 15 21 13 4 19 12 ...
##  $ Player                : Factor w/ 169 levels "Al-Farouq Aminu",..: 44 115 124 139 99 36 55 31 23 82 ...
##  $ Position              : Factor w/ 5 levels "C","PF","PG",..: 3 4 5 3 2 4 5 3 1 3 ...
##  $ College               : Factor w/ 69 levels "Arizona State University",..: 48 18 59 38 38 24 15 61 34 36 ...
##  $ Games                 : int  289 409 435 440 364 285 311 429 342 381 ...
##  $ Minutes               : int  10583 10170 14132 14932 11933 8923 10649 10710 11339 7611 ...
##  $ Total.Points          : int  6017 5416 6447 8834 6989 4138 5430 4354 6168 3221 ...
##  $ Total.Rebounds        : int  1103 2007 1414 2171 4453 1327 792 785 2494 729 ...
##  $ Total.Assists         : int  1954 539 1292 3045 898 546 1021 1731 494 1092 ...
##  $ Field.Goal.Percentage : num  0.46 0.45 0.433 0.433 0.451 0.419 0.442 0.404 0.511 0.411 ...
##  $ Three.Point.Percentage: Factor w/ 96 levels "0","0.038","0.053",..: 41 62 86 37 72 78 77 84 1 67 ...
##  $ Free.Throw.Percentage : Factor w/ 129 levels "0.25","0.402",..: 105 74 107 105 105 114 101 123 91 106 ...
##  $ Points.Per.Game       : num  20.8 13.2 14.8 20.1 19.2 14.5 17.5 10.1 18 8.5 ...
##  $ Rebounds.Per.Game     : num  3.8 4.9 3.3 4.9 12.2 4.7 2.5 1.8 7.3 1.9 ...
##  $ Assists.Per.Game      : num  6.8 1.3 3 6.9 2.5 1.9 3.3 4 1.4 2.9 ...
##  $ Win.Share             : num  29.8 10.3 19.1 42.3 47 23.9 17.4 23.7 31.6 13 ...
##  $ X                     : num  0.135 0.048 0.065 0.136 0.189 0.129 0.078 0.106 0.134 0.082 ...

Reading Excel File

Need another package: readxl

install.packages(readxl)
library(readxl)

nba2 <- read_excel("NBA_Playoffs.xlsx",sheet = 1)
head(nba2)
## # A tibble: 6 x 62
##      Team   Opp       Date Number Round  Game Location `W/L` Importance
##     <chr> <chr>     <dttm>  <dbl> <dbl> <dbl>    <chr> <chr>      <dbl>
## 1 Celtics   ATL 2008-04-20      1     1     1     Home     W          0
## 2 Celtics   ATL 2008-04-23      2     1     2     Home     W          0
## 3 Celtics   ATL 2008-04-26      3     1     3     Away     L          0
## 4 Celtics   ATL 2008-04-28      4     1     4     Away     L          0
## 5 Celtics   ATL 2008-04-30      5     1     5     Home     W          0
## 6 Celtics   ATL 2008-05-02      6     1     6     Away     L          0
## # ... with 53 more variables: Tm <dbl>, Oppo <dbl>, Diff <dbl>, `Team
## #   FG` <dbl>, `Team FGA` <dbl>, `Team FG%` <dbl>, `Team 3P` <dbl>, `Team
## #   3PA` <dbl>, `Team 3P%` <dbl>, `Team FT` <dbl>, `Team FTA` <dbl>, `Team
## #   FT%` <dbl>, `Team ORB` <dbl>, `Team TRB` <dbl>, `Team AST` <dbl>,
## #   `Team STL` <dbl>, `Team BLK` <dbl>, `Team TOV` <dbl>, `Team PF` <dbl>,
## #   `Opponent FG` <dbl>, `Opponent FGA` <dbl>, `Opponent FG%` <dbl>,
## #   `Opponent 3P` <dbl>, `Opponent 3PA` <dbl>, `Opponent 3P%` <dbl>,
## #   `Opponent FT` <dbl>, `Opponent FTA` <dbl>, `Opponent FT%` <dbl>,
## #   `Opponent ORB` <dbl>, `Opponent TRB` <dbl>, `Opponent AST` <dbl>,
## #   `Opponent STL` <dbl>, `Opponent BLK` <dbl>, `Opponent TOV` <dbl>,
## #   `Opponent PF` <dbl>, `Advanced ORtg` <dbl>, `Advanced DRtg` <dbl>,
## #   `Advanced Pace` <dbl>, `Advanced FTr` <dbl>, `Advanced 3PAr` <dbl>,
## #   `Advanced TS%` <dbl>, `Advanced TRB%` <dbl>, `Advanced AST%` <dbl>,
## #   `Advanced STL%` <dbl>, `Advanced BLK%` <dbl>, `Offensive Four Factors
## #   eFG%` <dbl>, `Offensive Four Factors TOV%` <dbl>, `Offensive Four
## #   Factors ORB%` <dbl>, `Offensive Four Factors FT/FGA` <dbl>, `Defensive
## #   Four Factors eFG%` <dbl>, `Defensive Four Factors TOV%` <dbl>,
## #   `Defensive Four Factors DRB%` <dbl>, `Defensive Four Factors
## #   FT/FGA` <dbl>

Your Turn

Using the NFL Games.csv file, do the following:

  1. Read the data into an object called nfl.

  2. Use the head() function to display some rows.

  3. Use the str() function to see each variable type.

Answers

1.

nfl <- read.csv("NFL Games.csv")

2.

head(nfl)
##   GID SEAS WEEK DAY   V   H                     STAD TEMP HUMD WSPD WDIR
## 1   1 2000    1 SUN  SF ATL             Georgia Dome   79  \\N  \\N  \\N
## 2   2 2000    1 SUN JAC CLE Cleveland Browns Stadium   78   63    9   NE
## 3   3 2000    1 SUN PHI DAL            Texas Stadium  109   19    5    S
## 4   4 2000    1 SUN NYJ  GB            Lambeau Field   77   66    5    E
## 5   5 2000    1 SUN IND  KC        Arrowhead Stadium   90   50    8    E
## 6   6 2000    1 SUN SEA MIA       Pro Player Stadium   89   59   13    E
##            COND      SURF   OU  SPRV PTSFor PTSAgst
## 1          Dome AstroTurf 42.5   7.0     28      36
## 2         Sunny     Grass 38.0 -10.0     27       7
## 3         Sunny AstroTurf 40.0   6.0     41      14
## 4 Mostly Cloudy     Grass 36.0   2.5     20      16
## 5  Mostly Sunny     Grass 44.0  -3.0     27      14
## 6         Sunny     Grass 36.0   3.0      0      23

3.

str(nfl)
## 'data.frame':    3188 obs. of  17 variables:
##  $ GID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SEAS   : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ WEEK   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DAY    : Factor w/ 8 levels "FRI","MN","SAT",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ V      : Factor w/ 32 levels "ARI","ATL","BAL",..: 28 15 24 22 14 27 6 30 11 1 ...
##  $ H      : Factor w/ 32 levels "ARI","ATL","BAL",..: 2 8 9 12 16 17 18 19 20 21 ...
##  $ STAD   : Factor w/ 132 levels "3COM","3Com Park",..: 38 22 116 46 11 92 69 37 52 39 ...
##  $ TEMP   : Factor w/ 98 levels "-1","10","100",..: 78 77 7 76 90 88 64 70 88 79 ...
##  $ HUMD   : Factor w/ 97 levels "\\N","0","10",..: 1 59 13 62 46 55 1 91 1 76 ...
##  $ WSPD   : Factor w/ 37 levels "\\N","0","1",..: 1 37 33 33 36 7 1 33 1 25 ...
##  $ WDIR   : Factor w/ 21 levels "\\N","Calm","CALM",..: 1 8 12 4 4 4 1 17 1 17 ...
##  $ COND   : Factor w/ 22 levels "\\N","Chance Rain",..: 8 22 22 15 16 22 8 5 8 17 ...
##  $ SURF   : Factor w/ 18 levels "","A Turf Titan",..: 5 11 5 11 11 11 5 11 5 11 ...
##  $ OU     : num  42.5 38 40 36 44 36 47 35.5 39.5 40 ...
##  $ SPRV   : num  7 -10 6 2.5 -3 3 4.5 -3 1 7 ...
##  $ PTSFor : int  28 27 41 20 27 0 27 21 14 16 ...
##  $ PTSAgst: int  36 7 14 16 14 23 30 16 10 21 ...