Outline

  • Learn about different measures of error for predictions

Measuring Error

When it comes to predictive modelling, having a way to measure error is very useful. Namely, if we know the actual outcome and our predicted outcome, we should be able to quantify how accurate we are.

In a classification problem this is straightforward: we simply count the number of observations where we predicted the correct class and report that count (or the proportion correct), or generate a confusion matrix.
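
For instance, with some hypothetical class labels, both the proportion correct and the confusion matrix are one-liners in base R:

# Hypothetical actual and predicted class labels
actual    <- c("yes", "no", "yes", "yes", "no", "no")
predicted <- c("yes", "no", "no",  "yes", "no", "yes")
# Proportion of correct predictions (4 of 6 correct)
mean(actual == predicted)
## [1] 0.6666667
# Confusion matrix: actual classes in rows, predicted in columns
table(actual, predicted)
##       predicted
## actual no yes
##    no   2   1
##    yes  1   2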

In a regression problem it is not as simple. For instance, if we predict a value of 5 but the actual value was 5.5, are we close enough? Is there a measure to determine how successful we are? Yes, there is!

Types of Error Measure

Common measures include:

  • Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values

  • Root Mean Squared Error (RMSE): The square root of the MSE; unlike the MSE, it is in the same units as the response variable

  • Mean Absolute Error (MAE): The average of the absolute differences between the predicted and actual values

Complicated Math

MSE = \(\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2\)

RMSE = \(\sqrt{\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2}\)

MAE = \(\dfrac{1}{n}\sum\limits_{i=1}^{n}|\hat{x}_i-x_i|\)

where \(\hat{x}_i\) is the predicted value and \(x_i\) is the actual value.

Simply put, the closer each measure is to zero, the better our predictions.
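
As a quick, made-up illustration before we turn to real data, we can compute all three measures by hand:

# Toy vectors of actual and predicted values
actual    <- c(5.0, 3.0, 8.0, 6.0)
predicted <- c(5.5, 2.0, 9.0, 6.0)
# MSE: (0.25 + 1 + 1 + 0) / 4
mean((predicted - actual)^2)
## [1] 0.5625
# RMSE: square root of the MSE
sqrt(mean((predicted - actual)^2))
## [1] 0.75
# MAE: (0.5 + 1 + 1 + 0) / 4
mean(abs(predicted - actual))
## [1] 0.625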

Example

Let's go back to our MLB example, where we predicted the age of Sammy Sosa using ordinary least squares (OLS) regression.

# Load MLB data
mlb <- read.csv("MLB Stats.csv")
# Remove the categorical tm and lg variables (id is kept for splitting)
mlb <- subset(mlb, select = -c(tm,lg))
# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),] 
# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] 
# Builds linear model predicting age based on all other variables
# We drop the id column (column 1); tm and lg were already removed above
model <- lm(age ~ ., data = mlb.train[,-1])
# Run prediction function based on our model
prediction.age <- predict(model,mlb.test) 
# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age)
# Selects only relevant columns
results <- results[c("id","age","prediction.age")]
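
Before scoring the predictions, it can be worth a quick look at the fitted model itself (output omitted here for space):

# Coefficients, residuals, and R-squared for the OLS fit
summary(model)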

# Results
results
##        id age prediction.age
## 1  SosaSa  20       23.27857
## 2  SosaSa  20       22.93880
## 3  SosaSa  20       23.20457
## 4  SosaSa  21       18.40951
## 5  SosaSa  22       24.47351
## 6  SosaSa  23       23.91660
## 7  SosaSa  24       22.12227
## 8  SosaSa  25       23.46654
## 9  SosaSa  26       23.26329
## 10 SosaSa  27       23.25116
## 11 SosaSa  28       23.05700
## 12 SosaSa  29       21.71412
## 13 SosaSa  30       25.07838
## 14 SosaSa  31       27.50652
## 15 SosaSa  32       27.46382
## 16 SosaSa  33       30.92094
## 17 SosaSa  34       30.21568
## 18 SosaSa  35       33.61453
## 19 SosaSa  36       35.92567
## 20 SosaSa  38       33.96438

Let's see how accurate we are using all three metrics from the Metrics package. Each function takes the actual values first and the predicted values second: mse(actual, predicted), rmse(actual, predicted), and mae(actual, predicted).

# Load relevant library
library(Metrics)
# Mean squared error
mse(results$age,results$prediction.age)
## [1] 12.10603
# Root mean squared error
rmse(results$age,results$prediction.age)
## [1] 3.479372
# Mean absolute error
mae(results$age,results$prediction.age)
## [1] 3.091913
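
Since age is measured in years, the RMSE and MAE tell us that the OLS predictions are off by roughly three to three and a half years on average.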

Using a LASSO Estimate

# Load relevant library
library(glmnet)
# Load MLB data set
mlb <- read.csv("MLB Stats.csv")
# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),] 
# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] 
# Subset training data removing categorical data
mlb.train <- subset(mlb.train, select = -c(id,tm,lg))
# Subset testing data removing categorical data plus age (the response)
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age))
# Subset training to create matrix of features
mlb.trainx <- as.matrix(subset(mlb.train,select = -age))
# Subset training to create vector of response variable
mlb.trainy <- as.matrix(subset(mlb.train,select = age))
# Builds LASSO model predicting age based on all variables
# (id, tm, and lg were already removed above since they are not numeric)
# cv.glmnet picks the optimal lambda value by cross-validation
model.cv <- cv.glmnet(mlb.trainx, mlb.trainy, alpha = 1)
# Prediction based on the cross-validated model and the testing data
# Note: predict() selects the penalty through the argument s, not lambda
prediction.age.lasso <- as.numeric(predict(model.cv, as.matrix(mlb.test),
                          s = "lambda.min"))
# Creates data frame with Sammy Sosa and the predictions as columns
results2 <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age.lasso)
# Selects only relevant columns
results2 <- results2[c("id","age","prediction.age.lasso")]
# Renames columns
colnames(results2) <- c("id","age","predicted age lasso")
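
If you are curious which penalty the cross-validation settled on, the fitted object stores it (the exact value will vary from run to run, since cv.glmnet assigns folds randomly):

# Optimal lambda chosen by cross-validation
model.cv$lambda.min
# Coefficients retained at that lambda (LASSO may shrink some to zero)
coef(model.cv, s = "lambda.min")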

# Results
results2
##        id age predicted age lasso
## 1  SosaSa  20            25.63168
## 2  SosaSa  20            26.21004
## 3  SosaSa  20            25.60014
## 4  SosaSa  21            20.63886
## 5  SosaSa  22            25.73862
## 6  SosaSa  23            26.07473
## 7  SosaSa  24            23.63417
## 8  SosaSa  25            24.55522
## 9  SosaSa  26            25.10762
## 10 SosaSa  27            25.35091
## 11 SosaSa  28            23.89876
## 12 SosaSa  29            24.70406
## 13 SosaSa  30            25.97751
## 14 SosaSa  31            28.15182
## 15 SosaSa  32            30.01765
## 16 SosaSa  33            31.79388
## 17 SosaSa  34            31.15191
## 18 SosaSa  35            32.42314
## 19 SosaSa  36            34.87235
## 20 SosaSa  38            34.41466

# Mean squared error
mse(results2$age,results2$`predicted age lasso`)
## [1] 11.14407
# Root mean squared error
rmse(results2$age,results2$`predicted age lasso`)
## [1] 3.338274
# Mean absolute error
mae(results2$age,results2$`predicted age lasso`)
## [1] 2.828134

Comparing each error measure across the two models, we see that LASSO consistently achieves the lower error, so it is the better model here. Note that it makes little sense to compare different error measures to one another for the same model; these measures are mainly used to compare the same type of error across models.
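
To make that comparison easier to see, we can collect the measures from both models into one small table (a convenience sketch reusing the objects created above):

# Error measures for both models side by side
data.frame(metric = c("MSE", "RMSE", "MAE"),
           ols    = c(mse(results$age, results$prediction.age),
                      rmse(results$age, results$prediction.age),
                      mae(results$age, results$prediction.age)),
           lasso  = c(mse(results2$age, results2$`predicted age lasso`),
                      rmse(results2$age, results2$`predicted age lasso`),
                      mae(results2$age, results2$`predicted age lasso`)))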

Your Turn

  1. Create a function for each error measure (MSE, RMSE, MAE).
  2. Run your functions on the results from the ordinary least squares to see if your function obtains the same values.

Answers

1.

# Mean squared error: average of the squared differences
mse.fun <- function(x,y){
  mean((x-y)^2)
}

# Root mean squared error: square root of the MSE
rmse.fun <- function(x,y){
  sqrt(mean((x-y)^2))
}

# Mean absolute error: average of the absolute differences
mae.fun <- function(x,y){
  mean(abs(x-y))
}

2.

# Mean squared error
mse.fun(results$age,results$prediction.age)
## [1] 12.10603
# Root mean squared error
rmse.fun(results$age,results$prediction.age)
## [1] 3.479372
# Mean absolute error
mae.fun(results$age,results$prediction.age)
## [1] 3.091913