---
title: "Predicting the Future"
output:
  ioslides_presentation:
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## What Have We Been Predicting?
Up until now, we have been predicting ages, numbers of home runs, and outcomes of games. However, there has been a major flaw in what we have done. Now that we have a basic understanding of how certain algorithms and predictive modelling work, we can address it. Specifically, we have been predicting responses based on current data, that is, data recorded at the same time as the response itself.
For instance, when we predicted the outcomes of Miami Heat games, we used statistics from the game itself to determine its outcome. In real life, however, we do not know a game's final statistics beforehand. We can only use data from previous games to predict what will happen in a future game.
## How Do We Solve This Problem?
A basic approach would be to gather previous data and combine it in a meaningful way. One way would be to take the average statistics of the last 3 games; instead of an average, we could use the median. There is no *one* way to do this. Even choosing the last 3 games is completely arbitrary. If it were obvious how to design this, everyone would be able to determine the outcomes of sporting events.
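##

As a minimal sketch of the "last 3 games" idea, the chunk below summarises the three games played before each game, so a game's own statistics never enter its features. The `games` data frame, its values, and the helper `lag_summary()` are hypothetical and exist only for illustration.

```{r}
# Hypothetical game log for one team (values invented for illustration)
games <- data.frame(
  game   = 1:6,
  points = c(98, 105, 110, 92, 101, 120)
)

# Summarise the k games played *before* each game, so the current
# game's own statistics never leak into its features
lag_summary <- function(x, k = 3, f = mean) {
  sapply(seq_along(x), function(i) {
    if (i <= k) return(NA_real_)  # not enough history yet
    f(x[(i - k):(i - 1)])
  })
}

# Mean and median of the previous 3 games
games$points_mean3   <- lag_summary(games$points, k = 3, f = mean)
games$points_median3 <- lag_summary(games$points, k = 3, f = median)
games
```

The first three rows are `NA` because fewer than three earlier games exist. Changing `k` or `f` gives other equally arbitrary summaries to try.

##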
A more advanced approach would be to utilize the previous idea and also create new features (columns). The process of creating new features is known as **feature engineering**. For instance, it is known that an NFL quarterback (QB) is vital to a team's success. Hence, if a QB is injured, we should create a new column in our data to account for this. However, **feature engineering** typically requires specialized domain knowledge.
There is no perfect way to answer this question. You just have to try everything and anything! To get an idea of what people have attempted, the *Journal of Quantitative Analysis in Sports* (https://www.degruyter.com/view/j/jqas) is a good starting point and contains articles across various sports.
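##

The quarterback example can be sketched as a single engineered column. The `nfl` data frame, its column names, and its values are hypothetical; a real data set would need an actual injury report as its source.

```{r}
# Hypothetical team-game data (values invented for illustration)
nfl <- data.frame(
  team      = c("PIT", "PIT", "NE", "NE"),
  week      = c(1, 2, 1, 2),
  qb_status = c("active", "injured", "active", "active"),
  pass_yds  = c(310, 150, 280, 295)
)

# Engineered feature: 1 if the starting quarterback is injured, 0 otherwise
nfl$qb_injured <- as.integer(nfl$qb_status == "injured")
nfl
```

A model can then use `qb_injured` like any other numeric column.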
## Example
We are going to predict the number of receiving yards Antonio Brown of the Pittsburgh Steelers will have in 2017.
To do this, we are going to train on all other wide receivers in 2016 not named Antonio Brown. A key part to this is determining what our training set should look like.
```{r}
# Load data set
wr.ab <- read.csv("NFL WR AB Prediction.csv")
# Display only Antonio Brown
wr.ab[wr.ab$Player == "BrownAn",]
```
##
As we can see, we have Antonio Brown's statistics from 2011-2016. How should we make use of this data to predict the number of yards he will have in 2017? We are first going to take his average statistics over the years 2011-2016 and use that as the testing data. Hopefully this will act as a good indicator.
```{r,message = F,warning = F}
# Load relevant library
library(plyr)
# Separate data into training and testing
# (logical indexing avoids -which(), which drops every row when nothing matches)
wr.train <- wr.ab[wr.ab$Player != "BrownAn",]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]
# Find mean stats using ddply
meanstats <- ddply(wr.ab.test, .(Player),colwise(mean))
# Remove unnecessary information
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
wr.ab.test <- subset(meanstats, select = -c(Player,Year,Pos,Team))
# Train an ordinary least squares model
model <- lm(Yards~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
```
##
Using ordinary least squares, we predict that Antonio Brown will have 1307 receiving yards for the 2017-2018 NFL season. How does this match up against other predictions?

- NFL.com: 1205.61 Yards
- ESPN.com: 1420.1 Yards
- Fantasydata.com: 1495 Yards

Certainly the *expert* websites have confidence in their own predictions. However, there is no way to know *a priori* which prediction is closest to being correct; we have to wait until the season is over. That is part of the excitement!
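
To make the comparison concrete, the chunk below places the four yardage projections quoted on this slide side by side. The data frame `projections` and its column names are illustrative only; the numbers are the ones listed above.

```{r}
# Projections quoted on this slide
projections <- data.frame(
  source = c("Our OLS model", "NFL.com", "ESPN.com", "Fantasydata.com"),
  yards  = c(1307, 1205.61, 1420.1, 1495)
)

# Difference between each projection and our own
projections$diff_from_ols <- projections$yards - 1307
projections

# Lowest and highest expert projection
range(projections$yards[-1])
```

Our estimate of 1307 sits between the NFL.com and ESPN.com projections.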
## Using Median
```{r,message = F,warning = F}
library(plyr)
# Separate data into training and testing
wr.train <- wr.ab[wr.ab$Player != "BrownAn",]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]
# Remove unnecessary information
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
# Find columnwise median of only numeric columns
wr.ab.test <- ddply(wr.ab.test, .(Player),numcolwise(median))
# Training ordinary least squares model
model <- lm(Yards~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
```
Using the median of his 2011-2016 season statistics, we see that Antonio Brown is projected to have 1344 receiving yards for the 2017-2018 NFL season. However, there is still no way to know how accurate we are until the season is over.
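##

Although the 2017 result cannot be checked yet, one possible sanity check is a backtest: pretend the most recent completed season is the future, build the test row from earlier seasons only, and compare the forecast with what actually happened. The sketch below uses simulated data; `toy`, the player labels, and the single predictor `Rec` are assumptions made for illustration, not the contents of the real data set.

```{r}
set.seed(1)
# Hypothetical season-level data (simulated, not real NFL statistics)
toy <- expand.grid(Player = paste0("P", 1:8), Year = 2013:2016,
                   stringsAsFactors = FALSE)
toy$Rec   <- round(runif(nrow(toy), 40, 110))
toy$Yards <- round(12 * toy$Rec + rnorm(nrow(toy), sd = 60))

target <- "P1"
# Train on every other player, as in the slides
train <- subset(toy, Player != target, select = c(Rec, Yards))
# Hold out the target's most recent season as the "future"
past   <- subset(toy, Player == target & Year < 2016)
actual <- subset(toy, Player == target & Year == 2016)$Yards
# Summarise the earlier seasons to build the test row
test <- data.frame(Rec = mean(past$Rec))

model    <- lm(Yards ~ Rec, data = train)
forecast <- predict(model, test)
# Absolute error of the backtest
abs(forecast - actual)
```

Repeating this with `median` in place of `mean` gives a rough way to compare the two summaries before the real season is played.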
## Your turn
Using the NFL WR AB Prediction.csv data set, complete the following:
1. Apply an ordinary least squares regression using the average of his 2011-2016 seasons to predict the number of receiving touchdowns (TD) Antonio Brown will have for the 2017-2018 NFL season.
2. Repeat the experiment in (1) using a random forest.
3. Using the aforementioned websites, compare your predictions with those of the *experts*.
## Answers
### 1.
```{r,message =F,warning=F}
# Separate data sets
wr.train <- wr.ab[wr.ab$Player != "BrownAn",]
wr.ab.test <- wr.ab[wr.ab$Player == "BrownAn",]
# Find columnwise mean
meanstats <- ddply(wr.ab.test, .(Player),colwise(mean))
# Remove unnecessary data
wr.train <- subset(wr.train, select = -c(Player,Year,Pos,Team))
wr.ab.test <- subset(meanstats, select = -c(Player,Year,Pos,Team))
# Train ordinary least squares
model <- lm(TD~., wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
```
##
### 2.
```{r,message=F,warning=F}
# Load relevant library
library(randomForest)
# Fix the random seed so the forest is reproducible (the seed value is arbitrary)
set.seed(42)
# Random forest model
model <- randomForest(TD~.,wr.train)
# Run prediction on model and testing data
prediction <- predict(model,wr.ab.test)
# Results
prediction
```
##
### 3.
- NFL.com: 8.71 Touchdowns
- ESPN.com: 7.6 Touchdowns
- Fantasydata.com: 9 Touchdowns