---
title: "Measuring Error"
output:
  ioslides_presentation:
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Outline
- Learn about different measures of errors for predictions
## Measuring Error
When it comes to predictive modelling, having a way to measure error is very useful. If we know the actual outcome and our predicted outcome, we should be able to quantify how accurate our predictions are.
In a classification problem this is straightforward. We simply count the number of observations where we predict the correct class, or summarize the results in a *confusion matrix*.
In a regression problem it is not as simple. For instance, if we predict a value of 5 but the actual value was 5.5, are we close enough? Is there a measure to determine how successful we are? Yes there is!
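##
As a quick sketch of the classification case, base R's `table()` produces a confusion matrix and `mean()` gives the accuracy. The labels below are made up for illustration:
```{r}
# Made-up actual and predicted class labels
actual    <- c("yes", "yes", "no", "no", "yes", "no")
predicted <- c("yes", "no",  "no", "no", "yes", "yes")
# Confusion matrix: counts of predicted vs. actual classes
table(predicted, actual)
# Accuracy: proportion of correct predictions
mean(predicted == actual)
```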
## Types of Error Measure
Common measures include:
- **Mean Squared Error (MSE)**: The average of the squared differences between predicted and actual values
- **Root Mean Squared Error (RMSE)**: The square root of the MSE, which puts the error back in the units of the response
- **Mean Absolute Error (MAE)**: The average of the absolute differences between predicted and actual values
## Complicated Math
MSE = $\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2$
RMSE = $\sqrt{\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2}$
MAE = $\dfrac{1}{n}\sum\limits_{i=1}^{n}|\hat{x}_i-x_i|$
where $\hat{x}_i$ is the predicted value and $x_i$ is the actual value.
Simply put, the closer each measure is to zero, the better our predictions.
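##
To make the formulas concrete, here is a small hand-worked sketch with made-up values:
```{r}
actual    <- c(5.0, 3.2, 7.1)   # made-up actual values
predicted <- c(5.5, 3.0, 6.6)   # made-up predictions
# MSE: average of squared differences
mean((predicted - actual)^2)
# RMSE: square root of the MSE
sqrt(mean((predicted - actual)^2))
# MAE: average of absolute differences
mean(abs(predicted - actual))
```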
## Example
Let's go back to our MLB example where we predicted the age of Sammy Sosa using ordinary least squares regression.
```{r,message=F,warning=F}
# Load MLB data
mlb <- read.csv("MLB Stats.csv")
# Subset to remove categorical data
mlb <- subset(mlb, select = -c(tm,lg))
# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),]
# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",]
# Builds linear model predicting age based on all variables
# We need to remove the id variable since it is not numeric
# (tm and lg were already dropped above)
model <- lm(age ~., data = mlb.train[,-1])
# Run prediction function based on our model
prediction.age <- predict(model,mlb.test)
# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age)
# Selects only relevant columns
results <- results[c("id","age","prediction.age")]
```
##
```{r}
# Results
results
```
##
Let's see how accurate we are using all three metrics from the `Metrics` package. The functions take the form:
`mse(actual, predicted)`, `rmse(actual, predicted)`, and `mae(actual, predicted)`
```{r}
# Load relevant library
library(Metrics)
# Mean squared error
mse(results$age,results$prediction.age)
# Root mean squared error
rmse(results$age,results$prediction.age)
# Mean absolute error
mae(results$age,results$prediction.age)
```
## Using LASSO Estimate
```{r,warning = F,message=F}
# Load relevant library
library(glmnet)
# Load MLB data set
mlb <- read.csv("MLB Stats.csv")
# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),]
# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",]
# Subset training data removing categorical data
mlb.train <- subset(mlb.train, select = -c(id,tm,lg))
# Subset training data removing categorical data plus age
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age))
# Subset training to create matrix of features
mlb.trainx <- as.matrix(subset(mlb.train,select = -age))
# Subset training to create vector of response variable
mlb.trainy <- as.matrix(subset(mlb.train,select = age))
# Builds LASSO model predicting age based on all remaining variables
# cv.glmnet picks the optimal lambda value by cross-validation
model.cv <- cv.glmnet(mlb.trainx, mlb.trainy, alpha = 1)
# Prediction based on cv.model and testing data
prediction.age.lasso <- predict(model.cv, as.matrix(mlb.test),
                                s = "lambda.min")
# Creates data frame with Sammy Sosa and the prediction as a column
results2 <- data.frame(mlb[mlb$id == "SosaSa",],
                       pred = as.numeric(prediction.age.lasso))
# Selects only relevant columns
results2 <- results2[c("id","age","pred")]
# Renames columns
colnames(results2) <- c("id","age","predicted age lasso")
```
##
```{r}
# Results
results2
```
##
```{r}
# Mean squared error
mse(results2$age,results2$`predicted age lasso`)
# Root mean squared error
rmse(results2$age,results2$`predicted age lasso`)
# Mean absolute error
mae(results2$age,results2$`predicted age lasso`)
```
Comparing each error measure in turn, we see that the LASSO model consistently outperforms ordinary least squares. Note that it makes little sense to compare different error measures for the same model; each measure is used to compare the error across models.
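##
One way to lay out such a comparison is a small table of each model's errors on the same actual values. The numbers below are made up for two hypothetical models A and B:
```{r}
actual <- c(30, 32, 34, 36)   # made-up actual ages
pred.a <- c(31, 31, 35, 35)   # model A predictions (made-up)
pred.b <- c(33, 29, 37, 33)   # model B predictions (made-up)
# Side-by-side RMSE and MAE for the two models
data.frame(model = c("A", "B"),
           rmse  = c(sqrt(mean((pred.a - actual)^2)),
                     sqrt(mean((pred.b - actual)^2))),
           mae   = c(mean(abs(pred.a - actual)),
                     mean(abs(pred.b - actual))))
```
Both measures agree here that model A has lower error, which is the kind of across-model comparison these metrics are for.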
## Your Turn
1. Create a function for each error measure (MSE, RMSE, MAE).
2. Run your functions on the results from the ordinary least squares to see if your function obtains the same values.
## Answers
### 1.
```{r}
mse.fun <- function(x,y){
mean((x-y)^2)
}
rmse.fun <- function(x,y){
sqrt(mean((x-y)^2))
}
mae.fun <- function(x,y){
mean(abs(x-y))
}
```
##
### 2.
```{r}
# Mean squared error
mse.fun(results$age,results$prediction.age)
# Root mean squared error
rmse.fun(results$age,results$prediction.age)
# Mean absolute error
mae.fun(results$age,results$prediction.age)
```