---
title: "Introduction to Regression"
output:
  ioslides_presentation:
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Outline
- Talk about common regression algorithms
- Do some simple exercises
## Regression
Simply put, **regression** is concerned with modeling the relationship between input variables and an output variable. For instance, the input variables could be passing yards per game, interceptions, sacks, etc., and the output could be points scored.
The distinguishing characteristic between classification and regression is that regression attempts to predict a numerical value. That is, regression should not be used if you want to determine whether a team wins or loses, but rather if you want to predict how many points a team scores.
## Predictive Modeling
In this sense, a machine learning algorithm is a model with input factors such as passing yards per game, interceptions, sacks, etc., and an output such as win/lose or total points scored.
Predictive modeling can be separated into two groups:
- Classification: Predicting a categorical variable, e.g. win/loss, above/below, low/medium/high, etc.
- Regression: Predicting a numerical value, e.g. points scored in a game, number of field goals, etc.
## Examples of Machine Learning Algorithms for Regression
- **Ordinary Least Squares**: Finds the hyperplane that minimizes the sum of squared errors between the observed values and the predicted responses
- **Ridge Regression**: Penalized ordinary least squares with a second-order (L2) penalty term
- **Least Absolute Shrinkage and Selection Operator (LASSO)**: Penalized ordinary least squares with a first-order (L1) penalty term
- **Elastic Net**: A combination of the ridge and LASSO penalties
- Many many many more!
In some sense, you can think of ridge regression and the LASSO as special cases of the elastic net.
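To make the relationship between these penalties concrete, the elastic net objective (as parameterized in the glmnet package) can be written as:

$$
\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$

Setting $\alpha = 1$ recovers the LASSO, $\alpha = 0$ recovers ridge regression, and intermediate values mix the two penalties.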
## Applying a Model
Let's take our MLB statistics and see if we can predict the number of home runs (hr) in a season for Sammy Sosa (SosaSa), using everyone else as training data.
For this we will use the ordinary least squares model, which essentially constructs a *line of best fit* through all the data points in multiple dimensions.
##
```{r, warning = F,message=F}
mlb <- read.csv("MLB Stats.csv") # Load data
mlb <- subset(mlb, select = -c(tm,lg)) # Remove tm and lg since they are categorical
mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa
model <- lm(hr ~ ., data = mlb.train[,-1])
# Builds a linear model predicting hr (home runs) from all remaining variables
# We also drop the first column (id) since it is not numeric; tm and lg were removed above
```
##
```{r}
model # See model output
```
##
```{r}
prediction <- predict(model,mlb.test)
# Run prediction function based on our model
prediction
```
Note: Regression coefficients represent the mean change in the predicted value for a one unit change in the predictor variable.
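A quick sketch of that interpretation, using the built-in `mtcars` data rather than our MLB file: the `wt` coefficient equals the change in the predicted value when `wt` increases by one unit, all else held fixed.

```r
# Fit a small linear model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
# Two inputs identical except that wt differs by one unit
new1 <- data.frame(wt = 3, hp = 100)
new2 <- data.frame(wt = 4, hp = 100)
# The difference in predictions equals the wt coefficient
predict(fit, new2) - predict(fit, new1)
coef(fit)["wt"]
```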
To summarize a few things,
`lm(formula, data)` fits an ordinary least squares model, where `hr` is the response variable we want to predict. `~ .` indicates that we are going to predict `hr` using the rest of the columns. Lastly, `data = mlb.train[,-1]` defines the data argument to be our MLB data without the first column, i.e. `id`.
Now, `predict(model, data)` gives us a prediction of `hr` using the model built on the training set and the supplied testing data.
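The same `lm`/`predict` workflow can be sketched end to end with the built-in `mtcars` data (a stand-in for our MLB file, with the Mercedes models playing the role of SosaSa):

```r
# Hold out the Mercedes models as a "test set" and train on the rest,
# mirroring the SosaSa train/test split above
test.rows  <- grepl("Merc", rownames(mtcars))
cars.train <- mtcars[!test.rows, ]
cars.test  <- mtcars[test.rows, ]

fit  <- lm(mpg ~ ., data = cars.train)  # predict mpg from all other columns
pred <- predict(fit, cars.test)         # one prediction per held-out row
pred
```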
##
Create data frame to easily visualize results
```{r, warning = F,message=F}
# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction)
# Selects only relevant columns
results <- results[c("id","year","hr","prediction")]
results
```
##
In this scenario we are able to predict the total home runs of Sammy Sosa perfectly. However, this is an extreme case, and in general we will not be as accurate. What is most likely happening is that some factors, such as runs batted in (RBI), are heavily influencing the predictions.
Let's do it again, but this time we are going to attempt to predict the age of Sammy Sosa throughout his career. If we do this, we should not be as accurate as in the previous example. The reason is that there are few features strongly correlated with age; hence, we should not be able to predict his age accurately. So let's see what happens!
##
```{r, warning = F,message=F}
mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa
model <- lm(age ~., data = mlb.train[,-1])
# Builds linear model predicting age based on all variables
# We drop the first column (id) since it is not numeric; tm and lg were removed earlier
prediction.age <- predict(model,mlb.test)
# Run prediction function based on our model
# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age)
# Selects only relevant columns
results <- results[c("id","age","prediction.age")]
```
##
```{r}
head(results,n = 15)
```
It seems as though our hypothesis was right! It was not easy for our model to predict age, since there were not many factors that correlate heavily with age. This does not imply that we cannot predict age accurately; we may just have to apply a more sophisticated machine learning technique.
## A More Complex Example Using LASSO
Here we run the same experiment, but this time using the LASSO. Note that `glmnet()` has an `alpha` parameter: when `alpha = 0` we have ridge regression, when `alpha = 1` we have the LASSO, and any value in between constitutes an elastic net.
##
```{r,warning = F,message=F}
library(glmnet)
mlb <- read.csv("MLB Stats.csv")
mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa
mlb.train <- subset(mlb.train, select = -c(id,tm,lg)) # Removes features
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age)) # Removes features
mlb.trainx <- as.matrix(subset(mlb.train,select = -age)) # Turns data frame into matrix
mlb.trainy <- as.matrix(subset(mlb.train,select = age)) # Turns data frame into matrix
model <- glmnet(mlb.trainx,mlb.trainy, alpha = 1)
# Builds a LASSO model predicting age from all remaining variables
# (id, tm, and lg were removed above since they are not numeric)
prediction.age.lasso <- predict(model,as.matrix(mlb.test)) # Predict on test data
```
##
```{r}
prediction.age.lasso
```
##
The syntax for `glmnet()` is a bit different. The breakdown is as follows:
`glmnet(x, y, alpha)`: `x` corresponds to the matrix of training predictors, `y` corresponds to the response vector (or matrix), and `alpha` denotes which model we are going to use. If this does not make sense, the documentation of the glmnet package is here: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
As we can see, the `glmnet()` command produces a prediction for every lambda value in its regularization path! However, this isn't very useful on its own. We should have the computer pick the best lambda!
## Picking Optimal Sequence
```{r, warning = F}
model.cv <- cv.glmnet(mlb.trainx, mlb.trainy, alpha = 1)
prediction.age.lasso <- predict(model.cv, as.matrix(mlb.test),
                                s = "lambda.min")
prediction.age.lasso
```
##
```{r}
# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age.lasso)
# Selects only relevant columns
results <- results[c("id","age","X1")]
colnames(results) <- c("id","age","predicted age lasso")
head(results,n = 15)
```
##
In short, `cv.glmnet()` uses cross-validation to pick an optimal lambda value along the sequence. The details are outside the scope of this course; however, for the curious learner there are many online sources that discuss cross-validation.
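To see the idea concretely, here is a minimal sketch of k-fold cross-validation in base R, using the built-in `mtcars` data. `cv.glmnet()` does essentially this for every lambda in its sequence and keeps the one with the smallest cross-validated error.

```r
set.seed(1)                                   # reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv.err <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])  # train on k-1 folds
  pred <- predict(fit, mtcars[folds == i, ])              # predict held-out fold
  mean((mtcars$mpg[folds == i] - pred)^2)                 # fold mean squared error
})
mean(cv.err)  # cross-validated estimate of out-of-sample error
```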
## Your Turn
Using the MLB Stats.csv file, predict the age of Barry Bonds (BondsBa) over his entire career, using all other players as training data, with:
1. Ordinary Least Squares
2. Ridge Regression (`alpha = 0`) with an optimal lambda
## Answers
### 1.
```{r, warning = F,message=F}
mlb <- read.csv("MLB Stats.csv")
mlb <- subset(mlb, select = -c(tm,lg))
mlb.train <- mlb[-which(mlb$id == "BondsBa"),] # Selects rows without BondsBa
mlb.test <- mlb[mlb$id == "BondsBa",] # Selects rows with BondsBa
model <- lm(age ~ ., data = mlb.train[,-1])
# Builds a linear model predicting age from all remaining variables
# We drop the first column (id) since it is not numeric; tm and lg were removed above
prediction <- predict(model,mlb.test)
# Run prediction function based on our model
# Creates data frame with Barry Bonds and the predictions as columns
results <- data.frame(mlb[mlb$id == "BondsBa",],prediction)
# Selects only relevant columns
results <- results[c("id","age","prediction")]
```
##
```{r}
results
```
##
### 2.
```{r, warning = F,message=F}
mlb <- read.csv("MLB Stats.csv")
mlb.train <- mlb[-which(mlb$id == "BondsBa"),] # Selects rows without BondsBa
mlb.test <- mlb[mlb$id == "BondsBa",] # Selects rows with BondsBa
mlb.train <- subset(mlb.train, select = -c(id,tm,lg))
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age))
mlb.trainx <- as.matrix(subset(mlb.train,select = -age))
mlb.trainy <- as.matrix(subset(mlb.train,select = age))
model.cv <- cv.glmnet(mlb.trainx,mlb.trainy, alpha = 0)
prediction.age.ridge <- predict(model.cv, as.matrix(mlb.test),
                                s = "lambda.min")
# Creates data frame with Barry Bonds and the predictions as columns
results <- data.frame(mlb[mlb$id == "BondsBa",],prediction.age.ridge)
# Selects only relevant columns
results <- results[c("id","age","X1")]
colnames(results) <- c("id","age","predicted age ridge")
```
##
```{r}
results
```