- What is web scraping?
- Why do we need it?
- Scraping the web
In short, web scraping is a way of converting data embedded in a page's HTML into a format that can be easily accessed and used. Here, that means automatically pulling the data off a website and into a table.
Not all websites offer their tables as .xlsx or .csv downloads, and not everyone has time to copy and paste data into Excel. Automating the process can save a lot of time.
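To make the idea concrete before touching a live site, here is a minimal sketch that parses a small HTML table held in a string. The page content is a stand-in assembled for illustration (its two rows reuse values from the batting table scraped later on), and the names page_source, page, tables, and demo are assumptions of this example rather than part of the original script.

```r
library(rvest)

# A tiny stand-in page held in a string (an assumption for illustration;
# the real pages are fetched from a URL instead)
page_source <- "<html><body>
  <table>
    <tr><th>PLAYER</th><th>TEAM</th><th>AB</th></tr>
    <tr><td>Jose Altuve</td><td>HOU</td><td>477</td></tr>
    <tr><td>Justin Turner</td><td>LAD</td><td>351</td></tr>
  </table>
</body></html>"

# read_html() accepts literal HTML as well as a URL
page <- read_html(page_source)

# Select every <table> element, then parse each one into a data frame
tables <- html_table(html_nodes(page, "table"))

# html_table() on a set of nodes returns a list, one element per table
demo <- as.data.frame(tables[[1]])
print(demo)
```

Because the first row of this stand-in table uses th cells, html_table() picks it up as the header. Real pages are often less tidy.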
    # Load relevant libraries
    library(rvest)

    # URL of the website to gather data from
    url <- 'http://www.espn.com/mlb/stats/batting'

    # Download the page and parse its HTML
    webpage <- read_html(url)

    # Select every <table> element on the page
    # ('table' is a CSS selector; it happens to be all we need for this page)
    data_table <- html_nodes(webpage, 'table')

    # Parse each selected table into a data frame, then print the result
    data <- html_table(data_table)
    data
    ## [[1]]
    ##                  X1               X2               X3               X4
    ## 1  Sortable Batting Sortable Batting Sortable Batting Sortable Batting
    ## 2                RK           PLAYER             TEAM               AB
    ## 3                 1      Jose Altuve              HOU              477
    ## 4                 2    Justin Turner              LAD              351
    ## 5                 3 Charlie Blackmon              COL              504
    ## 6                 4     Bryce Harper              WSH              402
    ## 7                 5    Daniel Murphy              WSH              428
    ## 8                 6   Avisail Garcia              CHW              387
    ## 9                 7     Buster Posey               SF              397
    ## 10                8      Eric Hosmer               KC              473
    ## 11                9 Paul Goldschmidt              ARI              448
    ## 12               10    Eduardo Nunez           BOS/SF              399
    ## 13               RK           PLAYER             TEAM               AB
    ## 14               11       Joey Votto              CIN              438
    ## 15               12    Nolan Arenado              COL              478
    ## 16               13      DJ LeMahieu              COL              464
    ## 17               14   Didi Gregorius              NYY              397
    ## 18               15     Chris Taylor              LAD              382
    ## 19               16      Jean Segura              SEA              402
    ## 20               17     Corey Seager              LAD              439
    ## 21               18    Marcell Ozuna              MIA              472
    ## 22               19       Tommy Pham              STL              338
    ## 23               20  Marwin Gonzalez              HOU              348
    ## 24               RK           PLAYER             TEAM               AB
    ## 25               21   Ryan Zimmerman              WSH              415
    ## 26               22   Dustin Pedroia              BOS              340
    ## 27               23  George Springer              HOU              416
    ## 28               24  Jonathan Schoop              BAL              479
    ## 29               25   Anthony Rendon              WSH              396
    ## 30               26     Jose Ramirez              CLE              467
    ## 31               27    Eddie Rosario              MIN              404
    ## 32               28       Jose Abreu              CHW              495
    ## 33               29    David Peralta              ARI              409
    ## 34               30   Ender Inciarte              ATL              523
    ## 35               RK           PLAYER             TEAM               AB
    ## 36               31     Josh Reddick              HOU              386
    ## 37               32      Tim Beckham           BAL/TB              411
    ## 38               33     Elvis Andrus              TEX              496
    ## 39               34     Yuli Gurriel              HOU              430
    ## 40               35    Melky Cabrera           CHW/KC              480
    ## 41               36       Dee Gordon              MIA              496
    ## 42               37        Ben Gamel              SEA              390
    ## 43               38     Lorenzo Cain               KC              459
    ## 44               39      Nelson Cruz              SEA              432
    ## 45               40     Justin Smoak              TOR              440
    ##                  X5               X6               X7               X8
    ## 1  Sortable Batting Sortable Batting Sortable Batting Sortable Batting
    ## 2                 R                H               2B               3B
    ## 3                84              171               35                3
    ## 4                58              118               22                0
    ## 5               115              168               27               14
    ## 6                92              131               27                1
    ## 7                76              138               36                2
    ## 8                52              123               21                3
    ## 9                52              126               25                0
    ## 10               77              149               23                1
    ## 11               93              141               30                3
    ## 12               50              125               28                0
    ## 13                R                H               2B               3B
    ## 14               87              137               24                1
    ## 15               80              149               38                7
    ## 16               75              144               23                1
    ## 17               54              123               21                0
    ## 18               71              118               30                4
    ## 19               61              124               23                1
    ## 20               74              135               30                0
    ## 21               71              145               25                1
    ## 22               68              103               15                1
    ## 23               53              106               22                0
    ## 24                R                H               2B               3B
    ## 25               73              126               27                0
    ## 26               37              103               17                0
    ## 27               87              126               24                0
    ## 28               75              144               30                0
    ## 29               65              119               29                1
    ## 30               80              140               39                5
    ## 31               59              121               27                2
    ## 32               73              148               34                4
    ## 33               68              122               24                2
    ## 34               78              156               20                2
    ## 35                R                H               2B               3B
    ## 36               66              115               25                3
    ## 37               53              122               14                5
    ## 38               79              147               33                3
    ## 39               56              127               33                1
    ## 40               63              141               21                2
    ## 41               86              145               17                5
    ## 42               58              114               21                4
    ## 43               73              134               22                3
    ## 44               67              126               24                0
    ## 45               71              128               21                1
    ##                  X9              X10              X11              X12
    ## 1  Sortable Batting Sortable Batting Sortable Batting Sortable Batting
    ## 2                HR              RBI               SB               CS
    ## 3                19               67               29                6
    ## 4                17               57                5                1
    ## 5                29               78               12                8
    ## 6                29               87                2                2
    ## 7                20               81                1                0
    ## 8                13               60                5                2
    ## 9                12               54                5                1
    ## 10               20               69                6                0
    ## 11               29               98               16                4
    ## 12                9               49               21                6
    ## 13               HR              RBI               SB               CS
    ## 14               32               87                4                1
    ## 15               28              107                3                2
    ## 16                4               52                6                5
    ## 17               18               58                2                1
    ## 18               17               61               14                3
    ## 19                7               35               18                7
    ## 20               19               64                3                1
    ## 21               29               97                0                2
    ## 22               16               52               16                5
    ## 23               21               72                5                2
    ## 24               HR              RBI               SB               CS
    ## 25               29               86                1                0
    ## 26                6               54                4                3
    ## 27               28               70                5                7
    ## 28               27               93                1                0
    ## 29               22               77                6                2
    ## 30               18               59               12                4
    ## 31               18               53                5                6
    ## 32               25               77                1                0
    ## 33               13               43                7                1
    ## 34               10               44               17                6
    ## 35               HR              RBI               SB               CS
    ## 36               12               61                7                2
    ## 37               17               48                6                4
    ## 38               16               66               23                7
    ## 39               15               61                3                2
    ## 40               16               71                1                1
    ## 41                1               25               43               10
    ## 42                6               39                4                1
    ## 43               13               40               23                2
    ## 44               31              100                1                0
    ## 45               33               80                0                1
    ##                 X13              X14              X15              X16
    ## 1  Sortable Batting Sortable Batting Sortable Batting Sortable Batting
    ## 2                BB               SO              AVG              OBP
    ## 3                46               65             .358             .418
    ## 4                46               38             .336             .425
    ## 5                47              104             .333             .396
    ## 6                66               92             .326             .419
    ## 7                37               52             .322             .378
    ## 8                18               85             .318             .360
    ## 9                54               51             .317             .406
    ## 10               50               80             .315             .379
    ## 11               81              115             .315             .426
    ## 12               15               43             .313             .340
    ## 13               BB               SO              AVG              OBP
    ## 14              103               65             .313             .447
    ## 15               41               85             .312             .367
    ## 16               46               64             .310             .375
    ## 17               17               56             .310             .338
    ## 18               40              110             .309             .379
    ## 19               28               67             .308             .362
    ## 20               60              108             .308             .392
    ## 21               49              112             .307             .371
    ## 22               48               94             .305             .397
    ## 23               37               82             .305             .379
    ## 24               BB               SO              AVG              OBP
    ## 25               35               93             .304             .355
    ## 26               41               37             .303             .378
    ## 27               45               91             .303             .378
    ## 28               29              112             .301             .349
    ## 29               66               67             .301             .404
    ## 30               39               61             .300             .355
    ## 31               26               82             .300             .339
    ## 32               27               95             .299             .348
    ## 33               30               68             .298             .353
    ## 34               37               79             .298             .342
    ## 35               BB               SO              AVG              OBP
    ## 36               33               61             .298             .345
    ## 37               27              126             .297             .342
    ## 38               30               86             .296             .339
    ## 39               13               50             .295             .323
    ## 40               30               65             .294             .335
    ## 41               21               70             .292             .332
    ## 42               33               97             .292             .347
    ## 43               43               86             .292             .356
    ## 44               52              105             .292             .373
    ## 45               51              103             .291             .364
    ##                 X17              X18              X19
    ## 1  Sortable Batting Sortable Batting Sortable Batting
    ## 2               SLG              OPS              WAR
    ## 3              .564             .982              6.8
    ## 4              .544             .970              4.4
    ## 5              .615            1.011              4.3
    ## 6              .614            1.034              4.6
    ## 7              .556             .934              2.0
    ## 8              .488             .848              3.3
    ## 9              .471             .877              4.0
    ## 10             .495             .874              3.3
    ## 11             .589            1.015              5.5
    ## 12             .451             .791              0.7
    ## 13              SLG              OPS              WAR
    ## 14             .591            1.038              5.9
    ## 15             .596             .964              5.8
    ## 16             .390             .765              1.6
    ## 17             .499             .837              3.1
    ## 18             .542             .921              4.3
    ## 19             .423             .784              2.3
    ## 20             .506             .897              4.8
    ## 21             .549             .920              4.5
    ## 22             .497             .895              4.2
    ## 23             .549             .928              3.4
    ## 24              SLG              OPS              WAR
    ## 25             .578             .934              2.0
    ## 26             .406             .784              1.7
    ## 27             .563             .940              4.3
    ## 28             .532             .881              4.3
    ## 29             .545             .949              5.3
    ## 30             .520             .875              3.6
    ## 31             .510             .849              1.2
    ## 32             .535             .883              2.8
    ## 33             .462             .815              2.3
    ## 34             .402             .744              2.4
    ## 35              SLG              OPS              WAR
    ## 36             .472             .816              2.8
    ## 37             .479             .822              3.0
    ## 38             .472             .811              3.7
    ## 39             .481             .804              2.0
    ## 40             .446             .781              0.4
    ## 41             .353             .685              1.8
    ## 42             .413             .759              1.1
    ## 43             .438             .794              4.2
    ## 44             .563             .935              3.1
    ## 45             .568             .933              3.1
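The raw table needs some cleanup: row 1 is a title repeated in every column, the real column names sit in row 2, and that header row repeats after every ten players. Once those are fixed, the table is much easier to read.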
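    # Create a data frame from the one-element list
    data <- data.frame(data)

    # Remove the repeated header rows
    df <- unique(data)

    # Use row 2 (the real header) as the column names
    colnames(df) <- as.character(unlist(df[2, ]))

    # Delete the first two rows (title row and header row)
    df <- df[-c(1:2), ]

    # Show the first 6 rows
    head(df)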
    ##   RK           PLAYER TEAM  AB   R   H 2B 3B HR RBI SB CS BB  SO  AVG  OBP
    ## 3  1      Jose Altuve  HOU 477  84 171 35  3 19  67 29  6 46  65 .358 .418
    ## 4  2    Justin Turner  LAD 351  58 118 22  0 17  57  5  1 46  38 .336 .425
    ## 5  3 Charlie Blackmon  COL 504 115 168 27 14 29  78 12  8 47 104 .333 .396
    ## 6  4     Bryce Harper  WSH 402  92 131 27  1 29  87  2  2 66  92 .326 .419
    ## 7  5    Daniel Murphy  WSH 428  76 138 36  2 20  81  1  0 37  52 .322 .378
    ## 8  6   Avisail Garcia  CHW 387  52 123 21  3 13  60  5  2 18  85 .318 .360
    ##    SLG   OPS WAR
    ## 3 .564  .982 6.8
    ## 4 .544  .970 4.4
    ## 5 .615 1.011 4.3
    ## 6 .614 1.034 4.6
    ## 7 .556  .934 2.0
    ## 8 .488  .848 3.3
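The leading zero is missing from values such as .358, which suggests that the cleaned columns are still character strings rather than numbers. A minimal sketch of converting them follows, assuming a three-row stand-in (df_demo, a name introduced here) with values copied from the output above; text_cols and num_cols are likewise illustrative names.

```r
# A three-row stand-in for the cleaned table; every column starts out
# as a character string
df_demo <- data.frame(
  PLAYER = c("Jose Altuve", "Justin Turner", "Charlie Blackmon"),
  TEAM   = c("HOU", "LAD", "COL"),
  AB     = c("477", "351", "504"),
  H      = c("171", "118", "168"),
  AVG    = c(".358", ".336", ".333"),
  stringsAsFactors = FALSE
)

# Columns that hold text and should stay as they are
text_cols <- c("PLAYER", "TEAM")
num_cols  <- setdiff(names(df_demo), text_cols)

# Convert the remaining columns from character to numeric
df_demo[num_cols] <- lapply(df_demo[num_cols], as.numeric)

str(df_demo)

# With numeric columns, arithmetic works: hits per at-bat, rounded to
# three decimals, can be compared against the scraped AVG column
round(df_demo$H / df_demo$AB, 3)
```

Listing the text columns explicitly keeps the conversion from silently turning names or team codes into NA.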
The previous page contained a single table. The following page contains several.
    # Website URL
    url <- 'http://www.nfl.com/player/antoniobrown/2508061/careerstats'
    webpage <- read_html(url)

    # Select every <table> element on the page and parse each one
    data_table <- html_nodes(webpage, 'table')
    data <- html_table(data_table)
Because html_table() was given a set of table nodes, data is a list holding each table as a separate data frame. The same was true for the previous page, where the list had a single element.
    # Table 1
    data[1]
    ## [[1]]
    ##           X1                  X2        X3        X4        X5        X6
    ## 1  Receiving           Receiving Receiving Receiving Receiving Receiving
    ## 2       Year                Team         G       Rec       Yds       Avg
    ## 3       2016 Pittsburgh Steelers        15       106     1,284      12.1
    ## 4
    ## 5       2015 Pittsburgh Steelers        16       136     1,834      13.5
    ## 6
    ## 7       2014 Pittsburgh Steelers        16       129     1,698      13.2
    ## 8
    ## 9       2013 Pittsburgh Steelers        16       110     1,499      13.6
    ## 10
    ## 11      2012 Pittsburgh Steelers        13        66       787      11.9
    ## 12
    ## 13      2011 Pittsburgh Steelers        16        69     1,108      16.1
    ## 14
    ## 15      2010 Pittsburgh Steelers         9        16       167      10.4
    ## 16
    ## 17     TOTAL               TOTAL       101       632     8,377      13.3
    ##           X7        X8        X9       X10       X11       X12       X13
    ## 1  Receiving Receiving Receiving Receiving Receiving Receiving Receiving
    ## 2      Yds/G       Lng        TD       20+       40+       1st       FUM
    ## 3       85.6        51        12        22         3        64         0
    ## 4
    ## 5      114.6        59        10        25         8        84         1
    ## 6
    ## 7      106.1       63T        13        19         4        85         1
    ## 8
    ## 9       93.7        56         8        23         6        69         0
    ## 10
    ## 11      60.5       60T         5        10         2        43         2
    ## 12
    ## 13      69.2       79T         2        18         3        57         0
    ## 14
    ## 15      18.6        26         0         2         0        10         0
    ## 16
    ## 17      82.9        79        50       119        26       412         4
    # Table 2
    data[2]
    ## [[1]]
    ##         X1                  X2      X3      X4      X5      X6      X7
    ## 1  Rushing             Rushing Rushing Rushing Rushing Rushing Rushing
    ## 2     Year                Team       G     Att   Att/G     Yds     Avg
    ## 3     2016 Pittsburgh Steelers      15       3     0.2       9     3.0
    ## 4
    ## 5     2015 Pittsburgh Steelers      16       3     0.2      28     9.3
    ## 6
    ## 7     2014 Pittsburgh Steelers      16       4     0.2      13     3.3
    ## 8
    ## 9     2013 Pittsburgh Steelers      16       7     0.4       4     0.6
    ## 10
    ## 11    2012 Pittsburgh Steelers      13       7     0.5      24     3.4
    ## 12
    ## 13    2011 Pittsburgh Steelers      16       7     0.4      41     5.9
    ## 14
    ## 15    2010 Pittsburgh Steelers       9      --     0.0      --      --
    ## 16
    ## 17   TOTAL               TOTAL     101      31     0.3     119     3.8
    ##         X8      X9     X10     X11     X12     X13     X14     X15
    ## 1  Rushing Rushing Rushing Rushing Rushing Rushing Rushing Rushing
    ## 2    Yds/G      TD     Lng     1st    1st%     20+     40+     FUM
    ## 3      0.6       0      13       1    33.3       0       0       0
    ## 4
    ## 5      1.8       0      16       1    33.3       0       0       0
    ## 6
    ## 7      0.8       0       9       0     0.0       0       0       0
    ## 8
    ## 9      0.2       0      10       2    28.6       0       0       0
    ## 10
    ## 11     1.8       0      13       2    28.6       0       0       0
    ## 12
    ## 13     2.6       0      10       2    28.6       0       0       0
    ## 14
    ## 15      --      --      --      --      --      --      --      --
    ## 16
    ## 17     1.2       0      16       8    25.8       0       0       0
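The [[1]] at the top of each printout is a hint that data[1] returns a list of length one rather than the table itself. The sketch below illustrates the difference on a small stand-in list (tables, a name introduced here, holding a few values copied from the output above) and shows one way to check how many columns each table has.

```r
# Two small stand-in data frames with different widths, mimicking the
# shape of the scraped list
tables <- list(
  data.frame(Year = c(2016, 2015), Rec = c(106, 136)),
  data.frame(Year = c(2016, 2015), Att = c(3, 3), Yds = c(9, 28))
)

# Single brackets keep the list wrapper: the result is a list of length 1
class(tables[1])     # "list"

# Double brackets extract the element itself: a data frame
class(tables[[1]])   # "data.frame"

# Number of columns in each table
sapply(tables, ncol) # 2 3
```

For further cleaning, tables[[1]] is usually the form to work with, since functions such as head() or colnames() then act on the data frame directly.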
If each individual data frame in the list had the same number of columns, we could combine them into a single data frame as follows:
    data.combine <- do.call("rbind", data)
Unfortunately, the previous command does not work here, since the data frames have different numbers of columns.
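One possible workaround, sketched below under stated assumptions, is to pad each data frame with NA columns until they all share the same set of names, and only then call rbind. The helper pad_columns() is hypothetical and written for this example; receiving and rushing are small stand-ins with a few values copied from the tables above.

```r
# Stand-ins for two scraped tables with different columns
receiving <- data.frame(Year = c(2016, 2015), Rec = c(106, 136))
rushing   <- data.frame(Year = c(2016, 2015), Att = c(3, 3), Yds = c(9, 28))
tables <- list(receiving, rushing)

# rbind() refuses data frames whose columns do not match
combined_try <- tryCatch(do.call("rbind", tables), error = function(e) NULL)
is.null(combined_try)   # TRUE: the call raised an error

# Hypothetical helper: add any missing columns as NA, then put the
# columns in a fixed order
pad_columns <- function(df, all_cols) {
  for (col in setdiff(all_cols, names(df))) {
    df[[col]] <- NA
  }
  df[all_cols]
}

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(tables, names)))

# Pad every table, then stack them
combined <- do.call("rbind", lapply(tables, pad_columns, all_cols = all_cols))
combined
```

If the dplyr package is available, its bind_rows() function performs similar padding in one call. Whether stacking receiving and rushing rows is meaningful is a separate question; the sketch only shows the mechanics.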
    url <- 'https://www.mlssoccer.com/stats/season?year=2017&group=g'
    webpage <- read_html(url)
    data_table <- html_nodes(webpage, 'table')
    data <- html_table(data_table)
    head(data)
    ## [[1]]
    ##                      Player Club POS GP GS MINS  G  A SHTS SOG GWG PKG/A
    ## 1              David Villa  NYC   F 24 23 2065 19  8  109  43   3   4/4
    ## 2          Nemanja Nikolic  CHI   F 25 25 2153 16  3   82  41   4   2/3
    ## 3             Diego Valeri  POR   M 24 24 2109 14  9   64  23   4   3/4
    ## 4           Ignacio Piatti  MTL   M 19 17 1558 14  4   50  27   2   4/4
    ## 5  Bradley Wright-Phillips   NY   F 23 22 1909 14  0   65  29   2   0/0
    ## 6              David Accam  CHI   M 23 19 1677 13  7   49  30   4   2/2
    ## 7       Sebastian Giovinco  TOR   F 20 20 1652 12  6  105  42   3   1/2
    ## 8                CJ Sapong  PHI   F 24 21 2011 12  5   45  24   2   3/3
    ## 9               Ola Kamara  CLB   F 26 25 2173 12  3   65  27   1   2/2
    ## 10            Erick Torres  HOU   F 21 18 1560 12  2   56  21   3   5/6
    ## 11      Maximiliano Urruti  DAL   F 21 21 1853 11  5   77  33   3   1/1
    ## 12           Clint Dempsey  SEA   M 20 18 1655 11  3   80  27   3   2/3
    ## 13       Christian Ramirez  MIN   F 22 21 1844 11  1   53  28   3   0/0
    ## 14            Justin Meram  CLB   M 26 25 2018 10  7   49  23   5   0/0
    ## 15       Chris Wondolowski   SJ   F 25 25 2231 10  5   54  20   1   1/1
    ## 16         Hector Villalba  ATL   M 22 22 1766 10  4   53  25   3   0/0
    ## 17           Fredy Montero  VAN   F 23 19 1776 10  3   65  24   1   2/4
    ## 18             Fanendo Adi  POR   F 22 22 1885 10  3   66  28   1   3/3
    ## 19            Daniel Royer   NY   M 22 20 1604 10  2   45  19   3   3/3
    ## 20        Federico Higuain  CLB   M 18 18 1548  9  5   40  19   2   2/2
    ## 21           Jozy Altidore  TOR   F 21 19 1697  9  5   49  20   3   3/5
    ## 22              Cyle Larin  ORL   F 22 21 1842  9  2   46  26   5   0/1
    ## 23          Josef Martinez  ATL   F  9  6  595  9  0   32  16   1   0/0
    ## 24              Lee Nguyen   NE   M 23 23 1942  8 10   34  15   1   3/3
    ## 25          Miguel Almiron  ATL   M 22 20 1820  8  9   58  25   3   1/1
    ##    HmG RdG G/90min  SC%
    ## 1   13   6    0.83 17.4
    ## 2   13   3    0.67 19.5
    ## 3    9   5    0.60 21.9
    ## 4   10   4    0.81 28.0
    ## 5    6   8    0.66 21.5
    ## 6    7   6    0.70 26.5
    ## 7    9   3    0.65 11.4
    ## 8   10   2    0.54 26.7
    ## 9    7   5    0.50 18.5
    ## 10   9   3    0.69 21.4
    ## 11   6   5    0.53 14.3
    ## 12   3   8    0.60 13.8
    ## 13   7   4    0.54 20.8
    ## 14   7   3    0.45 20.4
    ## 15   9   1    0.40 18.5
    ## 16   4   6    0.51 18.9
    ## 17   6   4    0.51 15.4
    ## 18   7   3    0.48 15.2
    ## 19   7   3    0.56 22.2
    ## 20   6   3    0.52 22.5
    ## 21   4   5    0.48 18.4
    ## 22   6   3    0.44 19.6
    ## 23   6   3    1.36 28.1
    ## 24   5   3    0.37 23.5
    ## 25   6   2    0.40 13.8
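Once a scraped table is in a data frame, saving it produces the .csv file that the website did not provide. The sketch below uses a small stand-in (mls, a name introduced here) with three rows and four columns copied from the output above, and writes to a temporary file so that nothing is left behind; csv_path and mls_back are also illustrative names.

```r
# A small stand-in for the scraped table
mls <- data.frame(
  Player = c("David Villa", "Nemanja Nikolic", "Diego Valeri"),
  Club   = c("NYC", "CHI", "POR"),
  G      = c(19, 16, 14),
  A      = c(8, 3, 9),
  stringsAsFactors = FALSE
)

# Write to a temporary file; swap in a real path to keep the file
csv_path <- tempfile(fileext = ".csv")
write.csv(mls, csv_path, row.names = FALSE)

# Read the file back to confirm the round trip
mls_back <- read.csv(csv_path, stringsAsFactors = FALSE)
mls_back
```

Setting row.names = FALSE keeps R's row numbers out of the file, so the .csv opens in a spreadsheet with only the scraped columns.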