## Outline

• Learn about different measures of errors for predictions

## Measuring Error

When it comes to predictive modelling having a way to measure error is very useful. Namely, if we know a specific outcome and our predicted outcome, we should be able to have a measure of our accuracy.

In a classification problem it is clear. We simply count the number of observations where we predict the correct class and output a number, or generate a confusion matrix.

In a regression problem it is not as simple. For instance, if we predict a value of 5 but the actual value was 5.5, are we close enough? Is there a measure to determine how successful we are? Yes there is!

## Types of Error Measure

Common measures include:

• Mean Squared Error (MSE): Measures average sum of square distances between predicted and actual value

• Root Mean Squared Error (RMSE): Square root of mean squared error

• Mean Absolute Error (MAE): Measures average sum of absolute distance between actual and predicted

## Complicated Math

MSE = $$\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2$$

RMSE = $$\sqrt{\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2}$$

MAE = $$\dfrac{1}{n}\sum\limits_{i=1}^{n}|\hat{x}_i-x_i|$$

where $$\hat{x}_i$$ is the predicted value and $$x_i$$ is the actual value.

Simply put, if the closer each value is to zero, the better our predictions.

## Example

Let's go back to our MLB example where we predicted the age of Sammy Sosa using an ordinary least squares.

# Load MLB data
# Subset to remove categorical data
mlb <- subset(mlb, select = -c(tm,lg))
# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows with SosaSa mlb.test <- mlb[mlb$id == "SosaSa",]
# Builds linear model predicting age based on all variables
# We need to remove id, tm, and lg variables since they are not numeric
model <- lm(age ~., data = mlb.train[,-1])
# Run prediction function based on our model
prediction.age <- predict(model,mlb.test)
# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age) # Selects only relevant columns results <- results[c("id","age","prediction.age")] # Results results ## id age prediction.age ## 1 SosaSa 20 23.27857 ## 2 SosaSa 20 22.93880 ## 3 SosaSa 20 23.20457 ## 4 SosaSa 21 18.40951 ## 5 SosaSa 22 24.47351 ## 6 SosaSa 23 23.91660 ## 7 SosaSa 24 22.12227 ## 8 SosaSa 25 23.46654 ## 9 SosaSa 26 23.26329 ## 10 SosaSa 27 23.25116 ## 11 SosaSa 28 23.05700 ## 12 SosaSa 29 21.71412 ## 13 SosaSa 30 25.07838 ## 14 SosaSa 31 27.50652 ## 15 SosaSa 32 27.46382 ## 16 SosaSa 33 30.92094 ## 17 SosaSa 34 30.21568 ## 18 SosaSa 35 33.61453 ## 19 SosaSa 36 35.92567 ## 20 SosaSa 38 33.96438 Let's see how accurate we are using all three metrics from the Metric package. The input of the code is: mse(actual,predicted), rmse(actual,predicted), and mae(actual,predicted) # Load relevant library library(Metrics) # Mean squared error mse(results$age,results$prediction.age) ## [1] 12.10603 # Root mean squared error rmse(results$age,results$prediction.age) ## [1] 3.479372 # Mean absolute error mae(results$age,results$prediction.age) ## [1] 3.091913 ## Using LASSO Estimate # Load relevant library library(glmnet) # Load MLB data set mlb <- read.csv("MLB Stats.csv") # Selects rows without SosaSa mlb.train <- mlb[-which(mlb$id == "SosaSa"),]
# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Subset training data removing categorical data mlb.train <- subset(mlb.train, select = -c(id,tm,lg)) # Subset training data removing categorical data plus age mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age)) # Subset training to create matrix of features mlb.trainx <- as.matrix(subset(mlb.train,select = -age)) # Subset training to create vector of response variable mlb.trainy <- as.matrix(subset(mlb.train,select = age)) # Builds LASSO model predicting age based on all variables # We need to remove id, tm, and lg variables since they are not numeric # Picks optimal lambda value model.cv <- cv.glmnet(mlb.trainx,mlb.trainy, alpha = 1) # Prediction based on cv.model and testing data prediction.age.lasso <- predict(model.cv,as.matrix(mlb.test), lambda = "lambda.min") # Creates data frame with Sammy Sosa and the predictions as columns results2 <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age.lasso)
# Selects only relevant columns
results2 <- results2[c("id","age","X1")]
# Renames columns
colnames(results2) <- c("id","age","predicted age lasso")
# Results
results2
##        id age predicted age lasso
## 1  SosaSa  20            25.63168
## 2  SosaSa  20            26.21004
## 3  SosaSa  20            25.60014
## 4  SosaSa  21            20.63886
## 5  SosaSa  22            25.73862
## 6  SosaSa  23            26.07473
## 7  SosaSa  24            23.63417
## 8  SosaSa  25            24.55522
## 9  SosaSa  26            25.10762
## 10 SosaSa  27            25.35091
## 11 SosaSa  28            23.89876
## 12 SosaSa  29            24.70406
## 13 SosaSa  30            25.97751
## 14 SosaSa  31            28.15182
## 15 SosaSa  32            30.01765
## 16 SosaSa  33            31.79388
## 17 SosaSa  34            31.15191
## 18 SosaSa  35            32.42314
## 19 SosaSa  36            34.87235
## 20 SosaSa  38            34.41466
# Mean squared error
mse(results2$age,results2$predicted age lasso)
## [1] 11.14407
# Root mean squared error
rmse(results2$age,results2$predicted age lasso)
## [1] 3.338274
# Mean absolute error
mae(results2$age,results2$predicted age lasso)
## [1] 2.828134

Comparing the each error measure, respectively, we see that LASSO is consistently a better model to use. It should be of note, that it makes little sense to compare different error measures for the same model. It is mainly used to measure the error between models.

1. Create a function for each error measure (MSE, RMSE, MAE).
2. Run your functions on the results from the ordinary least squares to see if your function obtains the same values.

### 1.

mse.fun <- function(x,y){
mean((x-y)^2)
}

rmse.fun <- function(x,y){
sqrt(mean((x-y)^2))
}

mae.fun <- function(x,y){
mean(abs(x-y))
}

### 2.

# Mean squared error
mse.fun(results$age,results$prediction.age)
## [1] 12.10603
# Root mean squared error
rmse.fun(results$age,results$prediction.age)
## [1] 3.479372
# Mean absolute error
mae.fun(results$age,results$prediction.age)
## [1] 3.091913