What Have We Been Predicting?

Up until now, we have been predicting ages, numbers of home runs, and outcomes of games. However, there has been a major flaw in what we have done. Now that we have a basic understanding of how certain algorithms and predictive modelling work, we can address it. Specifically, we have been predicting responses from data recorded at the same time as the response itself.

For instance, when we predicted the outcomes of Miami Heat games, we used statistics from the game itself to determine its outcome. However, in real life, when we want to predict the outcome of a game, we do not know the final statistics beforehand. We can only use data from previous games to determine what will happen in a future game.

How Do We Solve This Problem?

A basic approach is to gather previous data and combine it in a meaningful way. One way is to take the average statistics of the last 3 games; instead of the average, we could use the median. There is no single right way to do this. Even choosing the last 3 games is completely arbitrary. If it were obvious how to design this, everyone would be able to predict the outcomes of sporting events.
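The "last 3 games" idea can be sketched as follows. The data frame `games`, its values, and the helper `lagged_summary` are made up for illustration; they are assumptions, not part of the course data. The point is that the feature for a given game is built only from the games before it.

```r
# Hypothetical per-game statistics for one team, in chronological order
games <- data.frame(
  Game   = 1:6,
  Points = c(98, 105, 110, 92, 101, 120)
)

# Number of previous games to combine (an arbitrary choice)
k <- 3

# For game i, summarise only games i-k, ..., i-1,
# so that no statistic from game i itself is used
lagged_summary <- function(x, k, fun = mean) {
  out <- rep(NA_real_, length(x))
  if (length(x) > k) {
    for (i in (k + 1):length(x)) {
      out[i] <- fun(x[(i - k):(i - 1)])
    }
  }
  out
}

# Average and median of the previous 3 games
games$PointsAvg3 <- lagged_summary(games$Points, k, mean)
games$PointsMed3 <- lagged_summary(games$Points, k, median)
games
```

The first `k` games have no full history, so their new columns are `NA`; those rows would need to be dropped or handled separately before training.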

A more advanced approach is to use the previous idea and also create new features (columns). The process of creating new features is known as feature engineering. For instance, an NFL quarterback (QB) is known to be vital to a team's success. Hence, if a QB is injured, we should create a new column in our data to account for this. However, feature engineering typically requires specialized domain knowledge.
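As a minimal sketch of the quarterback example, the code below adds an indicator column to a small made-up data set. The data frame `nfl.games`, its column names, and the player labels are all hypothetical; real injury data would have to come from another source.

```r
# Hypothetical game-level data; teams, weeks, and player labels are made up
nfl.games <- data.frame(
  Team      = c("PIT", "PIT", "NE", "NE"),
  Week      = c(1, 2, 1, 2),
  StarterQB = c("QB_A", "QB_A", "QB_B", "QB_B"),
  ActualQB  = c("QB_A", "QB_C", "QB_B", "QB_B"),
  stringsAsFactors = FALSE
)

# New feature: 1 if the usual starting QB did not play, 0 otherwise
nfl.games$QBOut <- as.integer(nfl.games$StarterQB != nfl.games$ActualQB)
nfl.games
```

The new `QBOut` column can then be included in a model like any other predictor.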

There is no perfect way to answer this question. You just have to try everything and anything! To get an idea of what people have attempted, the Journal of Quantitative Analysis in Sports (https://www.degruyter.com/view/j/jqas) is a good starting point and contains articles across various sports.

Example

We are going to predict the number of receiving yards Antonio Brown of the Pittsburgh Steelers will have in 2017. To do this, we are going to train on all other wide receivers in 2016 not named Antonio Brown. A key part of this is determining what our training set should look like.

# Load data set
wr.ab <- read.csv("NFL WR AB Prediction.csv")

# Display only Antonio Brown
wr.ab[wr.ab$Player == "BrownAn",]
##     Player Year Pos Team Rec Targets Yards AvgYards TD Long X20. YardsPG
## 6  BrownAn 2016  WR  PIT 106     155  1284     12.1 12   51   22    85.6
## 7  BrownAn 2011  WR  PIT  69     124  1108     16.1  2   79    2    69.3
## 8  BrownAn 2012  WR  PIT  66     106   787     11.9  5   60    5    60.5
## 9  BrownAn 2013  WR  PIT 110     167  1499     13.6  8   56    8    93.7
## 10 BrownAn 2014  WR  PIT 129     181  1698     13.2 13   63   13   106.1
## 11 BrownAn 2015  WR  PIT 136     193  1834     13.5 10   59   10   114.6
##    Fumbles YAC X1stDwns
## 6        0 411       64
## 7        0 348       57
## 8        4 357       43
## 9        1 608       69
## 10       2 632       85
## 11       3 587       84

As we can see, we have Antonio Brown's statistics from 2011 to 2016. How should we use this data to predict the number of yards he will have in 2017? We are first going to take his average statistics over the years 2011 to 2016 and use that as the testing data. Hopefully this will act as a good indicator.

# Load relevant library
library(plyr)

# Separate data into training and testing
wr.train <- wr.ab[-which(wr.ab$Player == "BrownAn"),]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]

# Find mean stats using ddply
# (non-numeric columns such as Pos and Team become NA with a warning; they are dropped below)
meanstats <- ddply(wr.ab.test, .(Player), colwise(mean))

# Remove unnecessary information
wr.train <- subset(wr.train, select = -c(Player, Year, Pos, Team))
wr.ab.test <- subset(meanstats, select = -c(Player, Year, Pos, Team))

# Train an ordinary least squares model
model <- lm(Yards ~ ., wr.train)

# Run prediction on model and testing data
prediction <- predict(model, wr.ab.test)
# Results
prediction
##        1 
## 1307.776

Using ordinary least squares, we predict that Antonio Brown will have about 1308 receiving yards for the 2017-2018 NFL season. How does this compare with other predictions?

NFL.com: 1205.61 Yards

ESPN.com: 1420.1 Yards

Fantasydata.com: 1495 Yards

Certainly the expert websites have confidence in their own predictions. However, there is no way to know which prediction is closest to being correct until the season is over, and that is part of the excitement!
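To make the comparison concrete, the sketch below collects the four predictions quoted above and measures how far apart they are. The helper `score_predictions` is a hypothetical function written for this illustration; it can only be used once the real season total is known.

```r
# Predictions quoted in the text (receiving yards)
predictions <- c(
  "Our OLS model"   = 1307.776,
  "NFL.com"         = 1205.61,
  "ESPN.com"        = 1420.1,
  "Fantasydata.com" = 1495
)

# Hypothetical helper: absolute error of each prediction, smallest first
score_predictions <- function(predictions, actual) {
  sort(abs(predictions - actual))
}

# Gap between the highest and lowest prediction
spread <- diff(range(predictions))
spread
```

Once the season is over, calling `score_predictions(predictions, actual)` with the real season total as `actual` ranks the four predictions from closest to furthest.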

Using Median

# Load relevant library
library(plyr)

# Separate data into training and testing
wr.train <- wr.ab[-which(wr.ab$Player == "BrownAn"),]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]

# Remove unnecessary information
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
# Find columnwise median of only numeric columns
wr.ab.test <- ddply(wr.ab.test, .(Player),numcolwise(median))

# Training ordinary least squares model
model <- lm(Yards~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test) 
# Results
prediction
##        1 
## 1344.152

Using the median of his 2011-2016 season statistics, we see that Antonio Brown is projected to have 1344 receiving yards for the 2017-2018 NFL season. However, there is still no way to see how accurate we are until the season is over.
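To see why the mean and the median lead to different predictions, it can help to compare the two summaries directly. The sketch below re-enters a few of Antonio Brown's columns by hand from the table printed earlier, so that it runs without the CSV file; the names `ab` and `summary.table` are chosen only for this illustration.

```r
# Antonio Brown's seasons, copied from the output shown earlier
ab <- data.frame(
  Year  = c(2016, 2011, 2012, 2013, 2014, 2015),
  Rec   = c(106, 69, 66, 110, 129, 136),
  Yards = c(1284, 1108, 787, 1499, 1698, 1834),
  TD    = c(12, 2, 5, 8, 13, 10)
)

# Columns to summarise
stats <- c("Rec", "Yards", "TD")

# Mean and median of each column, side by side
summary.table <- rbind(
  Mean   = sapply(ab[stats], mean),
  Median = sapply(ab[stats], median)
)
summary.table
```

The median is less affected by an unusually low season such as 2012, which is one reason the two test rows differ.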

Your turn

Using the NFL WR AB Prediction.csv data set, complete the following:

  1. Apply an ordinary least squares regression using the average of the six seasons from 2011 to 2016 to predict the number of receiving touchdowns (TD) Antonio Brown will have for the 2017-2018 NFL season.

  2. Repeat the same experiment as (1.) using a random forest.

  3. Using the aforementioned websites, compare your predictions with the experts.

Answers

1.

# Separate data sets
wr.train <- wr.ab[-which(wr.ab$Player == "BrownAn"),]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]

# Find columnwise mean 
meanstats <- ddply(wr.ab.test, .(Player),colwise(mean))

# Remove unnecessary data
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
wr.ab.test <- subset(meanstats, select = -c(Player,Year,Pos,Team))

# Train ordinary least squares
model <- lm(TD~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test) 
# Results
prediction
##        1 
## 7.150246

2.

# Load relevant library
library(randomForest)

# Random forest model (results vary between runs unless a seed is set with set.seed)
model <- randomForest(TD~.,wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test) 
# Results
prediction
##        1 
## 9.228567

3.

NFL.com: 8.71 Touchdowns

ESPN.com: 7.6 Touchdowns

Fantasydata.com: 9 Touchdowns