- Learn about different measures of errors for predictions
When it comes to predictive modelling, having a way to measure error is essential. If we know the actual outcome and our predicted outcome, we should be able to quantify how accurate we are.
In a classification problem this is straightforward: we count the observations where we predicted the correct class and report a number, or generate a confusion matrix.
In a regression problem it is not as simple. For instance, if we predict a value of 5 but the actual value was 5.5, are we close enough? Is there a measure of how successful we are? Yes, there is!
Common measures include:
- Mean Squared Error (MSE): the average of the squared differences between the predicted and actual values
- Root Mean Squared Error (RMSE): the square root of the mean squared error
- Mean Absolute Error (MAE): the average of the absolute differences between the predicted and actual values
MSE = \(\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2\)
RMSE = \(\sqrt{\dfrac{1}{n}\sum\limits_{i=1}^{n}(\hat{x}_i-x_i)^2}\)
MAE = \(\dfrac{1}{n}\sum\limits_{i=1}^{n}|\hat{x}_i-x_i|\)
where \(\hat{x}_i\) is the predicted value and \(x_i\) is the actual value.
Simply put, the closer each measure is to zero, the better our predictions.
Let's go back to our MLB example where we predicted the age of Sammy Sosa using ordinary least squares regression.
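To make the formulas concrete, here is a small hand-worked example with made-up actual and predicted values:

```r
# Hypothetical values for illustration only
actual    <- c(2, 4, 6)
predicted <- c(2.5, 3.5, 7)

mean((predicted - actual)^2)        # MSE:  0.5
sqrt(mean((predicted - actual)^2))  # RMSE: 0.7071068
mean(abs(predicted - actual))       # MAE:  0.6666667
```

The errors are 0.5, -0.5, and 1; squaring, averaging, and taking absolute values gives the three metrics above.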
```r
# Load MLB data
mlb <- read.csv("MLB Stats.csv")

# Subset to remove categorical data
mlb <- subset(mlb, select = -c(tm,lg))

# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),]

# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",]

# Builds linear model predicting age based on all variables
# We need to remove the id variable since it is not numeric
model <- lm(age ~ ., data = mlb.train[,-1])

# Run prediction function based on our model
prediction.age <- predict(model, mlb.test)

# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",], prediction.age)

# Selects only relevant columns
results <- results[c("id","age","prediction.age")]
```
```r
# Results
results
```

```
##        id age prediction.age
## 1  SosaSa  20       23.27857
## 2  SosaSa  20       22.93880
## 3  SosaSa  20       23.20457
## 4  SosaSa  21       18.40951
## 5  SosaSa  22       24.47351
## 6  SosaSa  23       23.91660
## 7  SosaSa  24       22.12227
## 8  SosaSa  25       23.46654
## 9  SosaSa  26       23.26329
## 10 SosaSa  27       23.25116
## 11 SosaSa  28       23.05700
## 12 SosaSa  29       21.71412
## 13 SosaSa  30       25.07838
## 14 SosaSa  31       27.50652
## 15 SosaSa  32       27.46382
## 16 SosaSa  33       30.92094
## 17 SosaSa  34       30.21568
## 18 SosaSa  35       33.61453
## 19 SosaSa  36       35.92567
## 20 SosaSa  38       33.96438
```
Let's see how accurate we are using all three metrics from the Metrics package. The functions are called as mse(actual, predicted), rmse(actual, predicted), and mae(actual, predicted).
```r
# Load relevant library
library(Metrics)

# Mean squared error
mse(results$age, results$prediction.age)
```

```
## [1] 12.10603
```

```r
# Root mean squared error
rmse(results$age, results$prediction.age)
```

```
## [1] 3.479372
```

```r
# Mean absolute error
mae(results$age, results$prediction.age)
```

```
## [1] 3.091913
```
Now let's build a LASSO model with the glmnet package and see whether it predicts more accurately.

```r
# Load relevant library
library(glmnet)

# Load MLB data set
mlb <- read.csv("MLB Stats.csv")

# Selects rows without SosaSa
mlb.train <- mlb[-which(mlb$id == "SosaSa"),]

# Selects rows with SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",]

# Subset training data removing categorical data
mlb.train <- subset(mlb.train, select = -c(id,tm,lg))

# Subset testing data removing categorical data plus age
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age))

# Subset training to create matrix of features
mlb.trainx <- as.matrix(subset(mlb.train, select = -age))

# Subset training to create vector of response variable
mlb.trainy <- as.matrix(subset(mlb.train, select = age))

# Builds LASSO model predicting age based on all variables
# Picks optimal lambda value via cross-validation
model.cv <- cv.glmnet(mlb.trainx, mlb.trainy, alpha = 1)

# Prediction based on cv.model and testing data
# Note: predict() for a cv.glmnet object takes the lambda value
# through the s argument
prediction.age.lasso <- predict(model.cv, as.matrix(mlb.test), s = "lambda.min")

# Creates data frame with Sammy Sosa and the predictions as columns
results2 <- data.frame(mlb[mlb$id == "SosaSa",], prediction.age.lasso)

# Selects only relevant columns
results2 <- results2[c("id","age","X1")]

# Renames columns
colnames(results2) <- c("id","age","predicted age lasso")
```
```r
# Results
results2
```

```
##        id age predicted age lasso
## 1  SosaSa  20            25.63168
## 2  SosaSa  20            26.21004
## 3  SosaSa  20            25.60014
## 4  SosaSa  21            20.63886
## 5  SosaSa  22            25.73862
## 6  SosaSa  23            26.07473
## 7  SosaSa  24            23.63417
## 8  SosaSa  25            24.55522
## 9  SosaSa  26            25.10762
## 10 SosaSa  27            25.35091
## 11 SosaSa  28            23.89876
## 12 SosaSa  29            24.70406
## 13 SosaSa  30            25.97751
## 14 SosaSa  31            28.15182
## 15 SosaSa  32            30.01765
## 16 SosaSa  33            31.79388
## 17 SosaSa  34            31.15191
## 18 SosaSa  35            32.42314
## 19 SosaSa  36            34.87235
## 20 SosaSa  38            34.41466
```
```r
# Mean squared error
mse(results2$age, results2$`predicted age lasso`)
```

```
## [1] 11.14407
```

```r
# Root mean squared error
rmse(results2$age, results2$`predicted age lasso`)
```

```
## [1] 3.338274
```

```r
# Mean absolute error
mae(results2$age, results2$`predicted age lasso`)
```

```
## [1] 2.828134
```
Comparing each error measure in turn, we see that LASSO consistently produces lower errors, making it the better model here. Note that it makes little sense to compare different error measures for the same model; these metrics are mainly used to compare errors between models.
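One convenient way to see this side by side is to tabulate all three metrics for both models in a single data frame. The sketch below defines a small helper for this; compare_models is a hypothetical name, not part of any package.

```r
# Hypothetical helper: tabulate MSE, RMSE, and MAE for two
# sets of predictions against the same actual values
compare_models <- function(actual, pred1, pred2,
                           labels = c("model1", "model2")) {
  # Compute all three metrics for one prediction vector
  errs <- function(p) c(mean((p - actual)^2),
                        sqrt(mean((p - actual)^2)),
                        mean(abs(p - actual)))
  out <- data.frame(metric = c("MSE", "RMSE", "MAE"),
                    errs(pred1), errs(pred2))
  colnames(out)[2:3] <- labels
  out
}
```

With the objects built above, it could be called as compare_models(results$age, results$prediction.age, results2$`predicted age lasso`, c("ols", "lasso")).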
If you prefer not to load an extra package, these metrics are simple enough to write yourself:

```r
# Mean squared error
mse.fun <- function(x,y){
  mean((x-y)^2)
}

# Root mean squared error
rmse.fun <- function(x,y){
  sqrt(mean((x-y)^2))
}

# Mean absolute error
mae.fun <- function(x,y){
  mean(abs(x-y))
}
```
```r
# Mean squared error
mse.fun(results$age, results$prediction.age)
```

```
## [1] 12.10603
```

```r
# Root mean squared error
rmse.fun(results$age, results$prediction.age)
```

```
## [1] 3.479372
```

```r
# Mean absolute error
mae.fun(results$age, results$prediction.age)
```

```
## [1] 3.091913
```
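As a quick sanity check (a sketch, assuming the Metrics package is installed), the hand-rolled functions should agree with the package versions on any input:

```r
# Uses mse.fun, rmse.fun, and mae.fun defined above
library(Metrics)

# Random actual and predicted values for the check
set.seed(1)
actual    <- rnorm(10)
predicted <- rnorm(10)

all.equal(mse.fun(actual, predicted),  mse(actual, predicted))
all.equal(rmse.fun(actual, predicted), rmse(actual, predicted))
all.equal(mae.fun(actual, predicted),  mae(actual, predicted))
```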