- Reading files: Excel and .csv
We often need to import in our own data rather than just using built-in datasets.
setwd()
read.table()
for reading in .txt filesread.csv()
for reading in .csv filesWe first create a .csv file. We can use a text editor, excel… Then we load it in:
setwd(/pathtodirectory)
nfl <- read.csv("NFL Games.csv")
Formats xls and csv - what's the difference? File extensions .xls or .xlsx are proprietary Excel formats, they are binary files .csv is the extension for comma separated value files. They are text files - and directly readable.
Text files, usually comma separated or tabular separated (space) can both be read in with read.csv()
read.csv(file, header = TRUE, sep = ",", quote = "\\"", dec = ".", fill = TRUE, comment.char = "", ...)
For a comma separated file, we set sep = ","
else for a tabular separated file use sep = ""
nba <- read.csv("NBA Draft Class.csv") head(nba)
## Year Pick Team Player Position ## 1 2008 1 CHI Derrick Rose PG ## 2 2008 2 MIA Michael Beasley SF ## 3 2008 3 MIN O.J. Mayo SG ## 4 2008 4 SEA Russell Westbrook PG ## 5 2008 5 MEM Kevin Love PF ## 6 2008 6 NYK Danilo Gallinari SF ## College Games Minutes Total.Points ## 1 University of Memphis 289 10583 6017 ## 2 Kansas State University 409 10170 5416 ## 3 University of Southern California 435 14132 6447 ## 4 University of California, Los Angeles 440 14932 8834 ## 5 University of California, Los Angeles 364 11933 6989 ## 6 NoAttempts 285 8923 4138 ## Total.Rebounds Total.Assists Field.Goal.Percentage ## 1 1103 1954 0.460 ## 2 2007 539 0.450 ## 3 1414 1292 0.433 ## 4 2171 3045 0.433 ## 5 4453 898 0.451 ## 6 1327 546 0.419 ## Three.Point.Percentage Free.Throw.Percentage Points.Per.Game ## 1 0.312 0.815 20.8 ## 2 0.348 0.758 13.2 ## 3 0.38 0.821 14.8 ## 4 0.305 0.815 20.1 ## 5 0.362 0.815 19.2 ## 6 0.369 0.844 14.5 ## Rebounds.Per.Game Assists.Per.Game Win.Share X ## 1 3.8 6.8 29.8 0.135 ## 2 4.9 1.3 10.3 0.048 ## 3 3.3 3.0 19.1 0.065 ## 4 4.9 6.9 42.3 0.136 ## 5 12.2 2.5 47.0 0.189 ## 6 4.7 1.9 23.9 0.129
str(nba)
## 'data.frame': 169 obs. of 19 variables: ## $ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ... ## $ Pick : int 1 2 3 4 5 6 7 9 10 11 ... ## $ Team : Factor w/ 32 levels "ATL","BOS","BRK",..: 5 16 18 29 15 21 13 4 19 12 ... ## $ Player : Factor w/ 169 levels "Al-Farouq Aminu",..: 44 115 124 139 99 36 55 31 23 82 ... ## $ Position : Factor w/ 5 levels "C","PF","PG",..: 3 4 5 3 2 4 5 3 1 3 ... ## $ College : Factor w/ 69 levels "Arizona State University",..: 48 18 59 38 38 24 15 61 34 36 ... ## $ Games : int 289 409 435 440 364 285 311 429 342 381 ... ## $ Minutes : int 10583 10170 14132 14932 11933 8923 10649 10710 11339 7611 ... ## $ Total.Points : int 6017 5416 6447 8834 6989 4138 5430 4354 6168 3221 ... ## $ Total.Rebounds : int 1103 2007 1414 2171 4453 1327 792 785 2494 729 ... ## $ Total.Assists : int 1954 539 1292 3045 898 546 1021 1731 494 1092 ... ## $ Field.Goal.Percentage : num 0.46 0.45 0.433 0.433 0.451 0.419 0.442 0.404 0.511 0.411 ... ## $ Three.Point.Percentage: Factor w/ 96 levels "0","0.038","0.053",..: 41 62 86 37 72 78 77 84 1 67 ... ## $ Free.Throw.Percentage : Factor w/ 129 levels "0.25","0.402",..: 105 74 107 105 105 114 101 123 91 106 ... ## $ Points.Per.Game : num 20.8 13.2 14.8 20.1 19.2 14.5 17.5 10.1 18 8.5 ... ## $ Rebounds.Per.Game : num 3.8 4.9 3.3 4.9 12.2 4.7 2.5 1.8 7.3 1.9 ... ## $ Assists.Per.Game : num 6.8 1.3 3 6.9 2.5 1.9 3.3 4 1.4 2.9 ... ## $ Win.Share : num 29.8 10.3 19.1 42.3 47 23.9 17.4 23.7 31.6 13 ... ## $ X : num 0.135 0.048 0.065 0.136 0.189 0.129 0.078 0.106 0.134 0.082 ...
Need another package: readxl
install.packages(readxl)
library(readxl) nba2 <- read_excel("NBA_Playoffs.xlsx",sheet = 1) head(nba2)
## # A tibble: 6 x 62 ## Team Opp Date Number Round Game Location `W/L` Importance ## <chr> <chr> <dttm> <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 Celtics ATL 2008-04-20 1 1 1 Home W 0 ## 2 Celtics ATL 2008-04-23 2 1 2 Home W 0 ## 3 Celtics ATL 2008-04-26 3 1 3 Away L 0 ## 4 Celtics ATL 2008-04-28 4 1 4 Away L 0 ## 5 Celtics ATL 2008-04-30 5 1 5 Home W 0 ## 6 Celtics ATL 2008-05-02 6 1 6 Away L 0 ## # ... with 53 more variables: Tm <dbl>, Oppo <dbl>, Diff <dbl>, `Team ## # FG` <dbl>, `Team FGA` <dbl>, `Team FG%` <dbl>, `Team 3P` <dbl>, `Team ## # 3PA` <dbl>, `Team 3P%` <dbl>, `Team FT` <dbl>, `Team FTA` <dbl>, `Team ## # FT%` <dbl>, `Team ORB` <dbl>, `Team TRB` <dbl>, `Team AST` <dbl>, ## # `Team STL` <dbl>, `Team BLK` <dbl>, `Team TOV` <dbl>, `Team PF` <dbl>, ## # `Opponent FG` <dbl>, `Opponent FGA` <dbl>, `Opponent FG%` <dbl>, ## # `Opponent 3P` <dbl>, `Opponent 3PA` <dbl>, `Opponent 3P%` <dbl>, ## # `Opponent FT` <dbl>, `Opponent FTA` <dbl>, `Opponent FT%` <dbl>, ## # `Opponent ORB` <dbl>, `Opponent TRB` <dbl>, `Opponent AST` <dbl>, ## # `Opponent STL` <dbl>, `Opponent BLK` <dbl>, `Opponent TOV` <dbl>, ## # `Opponent PF` <dbl>, `Advanced ORtg` <dbl>, `Advanced DRtg` <dbl>, ## # `Advanced Pace` <dbl>, `Advanced FTr` <dbl>, `Advanced 3PAr` <dbl>, ## # `Advanced TS%` <dbl>, `Advanced TRB%` <dbl>, `Advanced AST%` <dbl>, ## # `Advanced STL%` <dbl>, `Advanced BLK%` <dbl>, `Offensive Four Factors ## # eFG%` <dbl>, `Offensive Four Factors TOV%` <dbl>, `Offensive Four ## # Factors ORB%` <dbl>, `Offensive Four Factors FT/FGA` <dbl>, `Defensive ## # Four Factors eFG%` <dbl>, `Defensive Four Factors TOV%` <dbl>, ## # `Defensive Four Factors DRB%` <dbl>, `Defensive Four Factors ## # FT/FGA` <dbl>
Using the NFL Games.csv file, do the following:
Read the data into an object called nfl.
Use the head()
function to display some rows.
Use the str()
function to see each variable type.
nfl <- read.csv("NFL Games.csv")
head(nfl)
## GID SEAS WEEK DAY V H STAD TEMP HUMD WSPD WDIR ## 1 1 2000 1 SUN SF ATL Georgia Dome 79 \\N \\N \\N ## 2 2 2000 1 SUN JAC CLE Cleveland Browns Stadium 78 63 9 NE ## 3 3 2000 1 SUN PHI DAL Texas Stadium 109 19 5 S ## 4 4 2000 1 SUN NYJ GB Lambeau Field 77 66 5 E ## 5 5 2000 1 SUN IND KC Arrowhead Stadium 90 50 8 E ## 6 6 2000 1 SUN SEA MIA Pro Player Stadium 89 59 13 E ## COND SURF OU SPRV PTSFor PTSAgst ## 1 Dome AstroTurf 42.5 7.0 28 36 ## 2 Sunny Grass 38.0 -10.0 27 7 ## 3 Sunny AstroTurf 40.0 6.0 41 14 ## 4 Mostly Cloudy Grass 36.0 2.5 20 16 ## 5 Mostly Sunny Grass 44.0 -3.0 27 14 ## 6 Sunny Grass 36.0 3.0 0 23
str(nfl)
## 'data.frame': 3188 obs. of 17 variables: ## $ GID : int 1 2 3 4 5 6 7 8 9 10 ... ## $ SEAS : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ... ## $ WEEK : int 1 1 1 1 1 1 1 1 1 1 ... ## $ DAY : Factor w/ 8 levels "FRI","MN","SAT",..: 6 6 6 6 6 6 6 6 6 6 ... ## $ V : Factor w/ 32 levels "ARI","ATL","BAL",..: 28 15 24 22 14 27 6 30 11 1 ... ## $ H : Factor w/ 32 levels "ARI","ATL","BAL",..: 2 8 9 12 16 17 18 19 20 21 ... ## $ STAD : Factor w/ 132 levels "3COM","3Com Park",..: 38 22 116 46 11 92 69 37 52 39 ... ## $ TEMP : Factor w/ 98 levels "-1","10","100",..: 78 77 7 76 90 88 64 70 88 79 ... ## $ HUMD : Factor w/ 97 levels "\\N","0","10",..: 1 59 13 62 46 55 1 91 1 76 ... ## $ WSPD : Factor w/ 37 levels "\\N","0","1",..: 1 37 33 33 36 7 1 33 1 25 ... ## $ WDIR : Factor w/ 21 levels "\\N","Calm","CALM",..: 1 8 12 4 4 4 1 17 1 17 ... ## $ COND : Factor w/ 22 levels "\\N","Chance Rain",..: 8 22 22 15 16 22 8 5 8 17 ... ## $ SURF : Factor w/ 18 levels "","A Turf Titan",..: 5 11 5 11 11 11 5 11 5 11 ... ## $ OU : num 42.5 38 40 36 44 36 47 35.5 39.5 40 ... ## $ SPRV : num 7 -10 6 2.5 -3 3 4.5 -3 1 7 ... ## $ PTSFor : int 28 27 41 20 27 0 27 21 14 16 ... ## $ PTSAgst: int 36 7 14 16 14 23 30 16 10 21 ...