- Discuss common classification algorithms
- Do some simple classification exercises
Simply put, classification is concerned with predicting the category into which a new observation will fall. For instance, the input variables could be passing yards per game, interceptions, sacks, etc., and the output could be win or lose.
The distinguishing characteristic between regression and classification is that classification attempts to predict a label for a new response. That is, classification should not be used if you want to determine how many points a team scores, but rather if you want to predict whether a team wins or loses.
Let's take our NBA Playoffs data set and see if we can predict the wins and losses for the Miami Heat. For this we will use the random forest model.
library(randomForest)
nba <- read.csv("NBA Playoffs.csv") # Load data
# Remove some variables that will affect the outcome
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
# Training data consists of all teams not named Heat
nba.train <- nba[-which(nba$Team == "Heat"),]
# Testing data consists of the Miami Heat
nba.test <- nba[nba$Team == "Heat",]
# Random forest model
# Use as.factor() so randomForest knows this is classification, not regression
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train)
model
## 
## Call:
##  randomForest(formula = as.factor(Outcome) ~ ., data = nba.train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 7.14%
## Confusion matrix:
##     L   W class.error
## L 224  20  0.08196721
## W  21 309  0.06363636
The format for randomForest() is much like lm(): the first argument is the formula and the second is the data set. There are, of course, many more parameters; they are documented here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
For classification, the random forest prints out a confusion matrix. This particular matrix can be interpreted as a "built-in" accuracy test for the random forest. The entries along the diagonal are the counts of correctly labeled classes in the training set. If we sum the diagonal entries, divide by the sum of all entries in the matrix, and subtract the result from 1, we obtain the error rate.
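To make the arithmetic concrete, we can recompute the 7.14% OOB error rate directly from the confusion matrix entries printed above:

```r
# Rebuild the confusion matrix from the model output above
# (rows are true labels, columns are predicted labels)
conf <- matrix(c(224, 21, 20, 309), nrow = 2,
               dimnames = list(c("L", "W"), c("L", "W")))
accuracy <- sum(diag(conf)) / sum(conf)  # correctly labeled / total
oob.error <- 1 - accuracy
round(oob.error, 4)  # 0.0714, matching the 7.14% reported above
```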
prediction = predict(model, nba.test)
# Create a data frame with the testing set and the predictions as columns
results <- data.frame(nba.test, prediction)
# Select only the relevant columns
results <- results[c("Team","Outcome","prediction")]
head(results)
##    Team Outcome prediction
## 65 Heat       W          W
## 66 Heat       W          W
## 67 Heat       W          W
## 68 Heat       L          L
## 69 Heat       W          W
## 70 Heat       W          W
Certainly, if we have a lot of instances, our results data frame will be very long. To combat this, we can make a table similar to the output of the random forest using the table() command.
# Create a table of predictions and outcomes in the testing set
sum.table = table(prediction, nba.test$Outcome)
sum.table
##           
## prediction  L  W
##          L 39  0
##          W  1 69
# Sum the diagonal and divide by the sum of all entries
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9908257
From this we are able to see that we can predict the outcomes for the Miami Heat with very high accuracy. However, do not be fooled by this percentage. If it were this easy to predict games, then everyone would be doing it.
Recall that all the data in our testing set was collected during each game as it occurred. Specifically, if we are trying to predict Game 2 of Round 1 for the Miami Heat, we should only use data from Game 1 and earlier. In our model, however, we used statistics from Game 2 itself to predict Game 2.
One way to remedy this is to manually build a different testing data set in which we replace the stats from the game in question with average or median values from prior games. However, we will consider this in a later lesson. The purpose of this introduction is to get accustomed to these machine learning algorithms.
Essentially all we are doing is determining how well we can predict the outcomes of games given that we have all the statistics from that game. So, of course we should be able to predict the outcome fairly well!
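As a rough sketch of the leakage-free idea mentioned above (using a toy data frame with a hypothetical Points column, not the actual NBA data set), each game's stats could be replaced with the running mean of the games before it, so the model never sees statistics from the game it is predicting:

```r
# Toy data: four games with a hypothetical Points stat
games <- data.frame(Game = 1:4, Points = c(102, 95, 110, 99))
# Running mean of all *previous* games (NA for game 1, which has no history)
n <- nrow(games)
games$Points.prior <- c(NA, cumsum(games$Points)[-n] / seq_len(n - 1))
games
```

A model trained on Points.prior instead of Points would only ever use information available before tip-off.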
As mentioned earlier, the random forest is, in short, a collection of uniquely built decision trees. So will a single decision tree do better or worse than the random forest? Let's see.
library(rpart) # Load relevant library
# rpart uses a similar syntax to randomForest and lm
# Use as.factor() so rpart knows this is classification, not regression
model2 <- rpart(as.factor(Outcome) ~ ., data = nba.train)
model2
## n= 574 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 574 244 W (0.425087108 0.574912892)  
##    2) Advanced.DRtg>=107.45 259 81 L (0.687258687 0.312741313)  
##      4) Advanced.ORtg< 113.65 157 9 L (0.942675159 0.057324841)  
##        8) Advanced.DRtg>=110.05 126 1 L (0.992063492 0.007936508) *
##        9) Advanced.DRtg< 110.05 31 8 L (0.741935484 0.258064516)  
##         18) Advanced.ORtg< 108.9 23 0 L (1.000000000 0.000000000) *
##         19) Advanced.ORtg>=108.9 8 0 W (0.000000000 1.000000000) *
##      5) Advanced.ORtg>=113.65 102 30 W (0.294117647 0.705882353)  
##       10) Advanced.DRtg>=119.5 34 8 L (0.764705882 0.235294118)  
##         20) Advanced.ORtg< 124.5 25 0 L (1.000000000 0.000000000) *
##         21) Advanced.ORtg>=124.5 9 1 W (0.111111111 0.888888889) *
##       11) Advanced.DRtg< 119.5 68 4 W (0.058823529 0.941176471) *
##    3) Advanced.DRtg< 107.45 315 66 W (0.209523810 0.790476190)  
##      6) Advanced.ORtg< 101.45 104 41 L (0.605769231 0.394230769)  
##       12) Advanced.DRtg>=95.2 64 5 L (0.921875000 0.078125000) *
##       13) Advanced.DRtg< 95.2 40 4 W (0.100000000 0.900000000) *
##      7) Advanced.ORtg>=101.45 211 3 W (0.014218009 0.985781991) *
These are the splitting criteria the tree used to partition the data. However, it is not immediately clear how to read this output. Instead we can plot a nice image.
library(RColorBrewer) # Load relevant libraries
library(rattle)
fancyRpartPlot(model2, sub = "") # Construct plot
From this we can follow the structure of the tree. Since the Advanced Defensive Rating is the first split, we can interpret that to be the most important factor. We can interpret the rest of the tree by moving down along the edges to each node.
Now let's see how accurate a single decision tree is.
# Prediction using the rpart model
prediction2 <- predict(model2, nba.test, type = "class")
# Create a table similar to the random forest table
sum.table2 <- table(prediction2, nba.test$Outcome)
sum.table2
##            
## prediction2  L  W
##           L 38  1
##           W  2 68
# Compute accuracy
sum(diag(sum.table2))/sum(sum.table2)
## [1] 0.9724771
Using a single decision tree, our accuracy was only 97.2%, which is slightly below that of the random forest. Although there are plenty of exceptions, the random forest will typically outperform a single tree because it aggregates an ensemble of trees and predicts with the majority of the trees' votes.
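The majority-vote idea can be illustrated with a toy example in base R, where a handful of hypothetical tree predictions are tallied and the most common label wins:

```r
# Hypothetical labels cast by five toy trees for a single game
tree.votes <- c("W", "L", "W", "W", "L")
# Tally the votes and take the most common label as the forest's prediction
vote.counts <- table(tree.votes)
forest.prediction <- names(which.max(vote.counts))
forest.prediction  # "W" (3 votes to 2)
```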
There are two main tuning parameters within randomForest(), namely mtry and ntree. mtry is the number of features sampled at each split, while ntree is the number of trees in the ensemble. Using the same code as above, experiment with these two parameters by testing every combination of mtry = 10, 50 and ntree = 100, 500. Is there a noticeable difference?
Using the e1071 library, run the same experiment as above using the Naive Bayes classifier and a Support Vector Machine with default parameters. Get the prediction accuracy of each one. (If you need help, refer to the e1071 documentation on CRAN.)
mtry = 10, ntree = 100
nba <- read.csv("NBA Playoffs.csv") # Load data
# Remove certain features
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
# Build training and testing data
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Random forest model with the given parameters
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 10, ntree = 100)
# Use the model to predict on the testing data
prediction = predict(model, nba.test)
# Create a table of predictions and testing outcomes
sum.table = table(prediction, nba.test$Outcome)
# Get the accuracy
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
mtry = 10, ntree = 500
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 10, ntree = 500)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
mtry = 50, ntree = 100
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 50, ntree = 100)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
mtry = 50, ntree = 500
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 50, ntree = 500)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
All four combinations produce the same accuracy here, so there is no clear indication that either tuning parameter affects the accuracy much on this data set.
Naive Bayes Classifier
library(e1071)
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Only difference is naiveBayes as opposed to randomForest
model <- naiveBayes(as.factor(Outcome) ~ ., data = nba.train)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9449541
Support Vector Machine
library(e1071)
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Only difference is svm as opposed to randomForest
model <- svm(as.factor(Outcome) ~ ., data = nba.train)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9541284