## What Have We Been Predicting?

Up until now, we have been predicting ages, numbers of home runs, and outcomes of games. However, there has been a major flaw in what we have done. Now that we have a basic understanding of how certain algorithms and predictive modelling work, we can address it. Specifically, we have been predicting responses from current data, that is, data recorded at the same time as the response itself.

For instance, when we predicted the outcomes of Miami Heat games, we used statistics from each game to determine its outcome. However, in real life, when we want to predict the outcome of a game, we do not know the final statistics beforehand. We can only use data from previous games to predict what will happen in a future game.

## How Do We Solve This Problem?

A basic approach would be to gather previous data and combine it in a meaningful way. One way would be to take the average statistics of the last 3 games; instead of an average, we could use the median. There is no single right way to do this. Even choosing the last 3 games is completely arbitrary. If the right design were obvious, everyone would be able to predict the outcomes of sporting events.
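As a minimal sketch of this idea, the code below summarises the 3 games before each game, so that no statistic from the game being predicted leaks into its own features. The `games` data frame, its values, and the helper `lagged_summary` are hypothetical and exist only for illustration; they are not part of the data sets used in this document.

```r
# Hypothetical game log for one team (values invented for illustration)
games <- data.frame(
  game   = 1:6,
  points = c(98, 105, 110, 92, 101, 99)
)

# Summarise the k games BEFORE each game (the current game is excluded)
lagged_summary <- function(x, k = 3, fun = mean) {
  sapply(seq_along(x), function(i) {
    if (i <= k) return(NA_real_)   # not enough history yet
    fun(x[(i - k):(i - 1)])
  })
}

# Average and median of the previous 3 games
games$points_avg3 <- lagged_summary(games$points, k = 3, fun = mean)
games$points_med3 <- lagged_summary(games$points, k = 3, fun = median)
games
```

The first 3 rows are `NA` because there is not yet enough history. Changing `k` or `fun` is an easy way to try other arbitrary choices, such as the last 5 games.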

A more advanced approach would be to use the previous idea and also create new features (columns). The process of creating new features is known as feature engineering. For instance, it is known that an NFL quarterback (QB) is vital to a team's success. Hence, if a QB is injured, we should create a new column in our data to account for this. However, feature engineering typically requires specialized domain knowledge.
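A small sketch of the quarterback example might look like the following. The data frame, the player names, and the `qb_out` column are all hypothetical and only illustrate the mechanics of adding an engineered column. How to define "injured" in real data is exactly the kind of choice that needs domain knowledge.

```r
# Hypothetical weekly schedule with the quarterback who started each game
games <- data.frame(
  week        = 1:5,
  starting_qb = c("Smith", "Smith", "Jones", "Jones", "Smith"),
  stringsAsFactors = FALSE
)

# Assumption: "Smith" is the team's regular starter
usual_starter <- "Smith"

# Engineered feature: 1 if the regular starter did not play, 0 otherwise
games$qb_out <- as.integer(games$starting_qb != usual_starter)
games
```

The new column can then be passed to a model like any other predictor.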

There is no perfect way to answer this question. You just have to try everything and anything! To get an idea of what people have attempted, the Journal of Quantitative Analysis in Sports (https://www.degruyter.com/view/j/jqas) is a good starting point and contains articles across various sports.

## Example

We are going to predict the number of receiving yards Antonio Brown of the Pittsburgh Steelers will have in 2017. To do this, we are going to train on all other wide receivers in 2016 not named Antonio Brown. A key part of this is determining what our training set should look like.

# Load data set
wr.ab <- read.csv("NFL WR AB Prediction.csv")

# Display only Antonio Brown
wr.ab[wr.ab$Player == "BrownAn",]
##     Player Year Pos Team Rec Targets Yards AvgYards TD Long X20. YardsPG
## 6  BrownAn 2016  WR  PIT 106     155  1284     12.1 12   51   22    85.6
## 7  BrownAn 2011  WR  PIT  69     124  1108     16.1  2   79    2    69.3
## 8  BrownAn 2012  WR  PIT  66     106   787     11.9  5   60    5    60.5
## 9  BrownAn 2013  WR  PIT 110     167  1499     13.6  8   56    8    93.7
## 10 BrownAn 2014  WR  PIT 129     181  1698     13.2 13   63   13   106.1
## 11 BrownAn 2015  WR  PIT 136     193  1834     13.5 10   59   10   114.6
##    Fumbles YAC X1stDwns
## 6        0 411       64
## 7        0 348       57
## 8        4 357       43
## 9        1 608       69
## 10       2 632       85
## 11       3 587       84

As we can see, we have Antonio Brown's statistics from 2011 to 2016. How should we use this data to predict the number of yards he will have in 2017? We are first going to take his average statistics over the years 2011 to 2016 and use that as the testing data. Hopefully this will act as a good indicator.

# Load relevant library
library(plyr)

# Separate data into training and testing
wr.train <- wr.ab[-which(wr.ab$Player == "BrownAn"),]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]

# Find mean stats using ddply
meanstats <- ddply(wr.ab.test, .(Player),colwise(mean))

# Remove unnecessary information
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
wr.ab.test <- subset(meanstats, select = -c(Player,Year,Pos,Team))

# Train an ordinary least squares model
model <- lm(Yards~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
##        1
## 1307.776

Using ordinary least squares, we predict that Antonio Brown will have about 1,308 receiving yards for the 2017-2018 NFL season. How do we match up against other predictions?

NFL.com: 1205.61 Yards

ESPN.com: 1420.1 Yards

Fantasydata.com: 1495 Yards

Certainly the expert websites have confidence in their own predictions. However, there is no way to know which prediction is closest to being correct until the season is over, and that is part of the excitement!

## Using Median

library(plyr)

# Separate data into training and testing
wr.train <- wr.ab[-which(wr.ab$Player == "BrownAn"),]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]

# Remove unnecessary information
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))

# Find columnwise median of only numeric columns
wr.ab.test <- ddply(wr.ab.test, .(Player),numcolwise(median))

# Train an ordinary least squares model
model <- lm(Yards~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
##        1
## 1344.152

Using the median of his six seasons (2011-2016), we see that Antonio Brown is projected to have about 1,344 receiving yards for the 2017-2018 NFL season. However, there is still no way to see how accurate we are until the season is over.

## Your turn

Using the NFL WR AB Prediction.csv data set, complete the following:

1. Apply an ordinary least squares regression using the average of the six seasons (2011-2016) to predict the number of receiving touchdowns (TD) Antonio Brown will have for the 2017-2018 NFL season.
2. Repeat the same experiment as (1.) using a random forest.
3. Using the aforementioned websites, compare your predictions with the experts.

## Answers

### 1.

# Separate data sets
wr.train <- wr.ab[-which(wr.ab$Player == "BrownAn"),]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]

# Find columnwise mean
meanstats <- ddply(wr.ab.test, .(Player),colwise(mean))

# Remove unnecessary data
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
wr.ab.test <- subset(meanstats, select = -c(Player,Year,Pos,Team))

# Train ordinary least squares
model <- lm(TD~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
##        1
## 7.150246

### 2.

# Load relevant library
library(randomForest)

# Random forest model
model <- randomForest(TD~.,wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
##        1
## 9.228567

### 3.

NFL.com: 8.71 Touchdowns

ESPN.com: 7.6 Touchdowns

Fantasydata.com: 9 Touchdowns
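
One way to make the comparison concrete is to tabulate how far each of our predictions is from each expert projection. The sketch below simply re-enters the touchdown predictions from answers 1 and 2 and the three expert projections listed above. The variable names are chosen here for illustration.

```r
# Touchdown predictions from answers 1 and 2
ours <- c(OLS = 7.150246, RandomForest = 9.228567)

# Expert projections listed above
experts <- c(NFL.com = 8.71, ESPN.com = 7.6, Fantasydata.com = 9)

# Difference (ours minus expert) for every model/website pair
diffs <- outer(ours, experts, "-")
round(diffs, 2)

# Difference between each of our predictions and the average expert projection
round(ours - mean(experts), 2)
```

Positive values mean our model projects more touchdowns than the website does. None of this says which projection is right; that still has to wait until the season is over.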