- Discuss common classification algorithms
- Do some simple classification exercises
Simply put, classification is concerned with predicting the category into which a new observation will fall. For instance, the input variables could be passing yards per game, interceptions, sacks, etc., and the output could be win or lose.
The distinguishing characteristic between regression and classification is that classification attempts to predict a label for a new response. That is, classification should not be used if you want to determine how many points a team scores, but rather if you want to predict whether a team wins or loses.
Let's take our NBA Playoffs data set and see if we can predict the wins and losses for the Miami Heat. For this we will use the random forest model.
library(randomForest)
nba <- read.csv("NBA Playoffs.csv") # Load data
# Remove some variables that will affect the outcome
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
# Training data consists of all teams not named Heat
nba.train <- nba[-which(nba$Team == "Heat"),]
# Testing data consists of the Miami Heat
nba.test <- nba[nba$Team == "Heat",]
# Random forest model
# Use as.factor() so randomForest knows this is classification, not regression
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train)
model
## 
## Call:
##  randomForest(formula = as.factor(Outcome) ~ ., data = nba.train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 7.14%
## Confusion matrix:
##     L   W class.error
## L 224  20  0.08196721
## W  21 309  0.06363636
The format for randomForest() is much like lm(): the first argument is the formula and the second is the data set. There are, of course, many more parameters; they are documented here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
For classification, the random forest prints out a confusion matrix. This particular matrix can be interpreted as a "built-in" accuracy test for the random forest. The entries along the diagonal are the counts of correctly labeled classes in the training set. If we sum the diagonal entries, divide by the sum of all entries in the matrix, and subtract the result from 1, we obtain the error rate.
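To make the arithmetic concrete, we can recompute the 7.14% OOB error rate directly from the confusion matrix entries printed above:

```r
# Rebuild the confusion matrix from the model output above
# (rows are true labels, columns are predicted labels)
conf <- matrix(c(224, 21, 20, 309), nrow = 2,
               dimnames = list(c("L", "W"), c("L", "W")))
accuracy <- sum(diag(conf)) / sum(conf)  # correctly labeled / total
oob.error <- 1 - accuracy
round(oob.error, 4)  # 0.0714, matching the 7.14% reported above
```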
prediction = predict(model, nba.test)
# Create a data frame with the testing set and the predictions as columns
results <- data.frame(nba.test, prediction)
# Select only the relevant columns
results <- results[c("Team","Outcome","prediction")]
head(results)
##    Team Outcome prediction
## 65 Heat       W          W
## 66 Heat       W          W
## 67 Heat       W          W
## 68 Heat       L          L
## 69 Heat       W          W
## 70 Heat       W          W
Certainly, if we have a lot of instances, our results data frame will be very long. To combat this, we can make a table similar to the output of the random forest using the table() command.
# Create a table of predictions and outcomes in the testing set
sum.table = table(prediction, nba.test$Outcome)
sum.table
##           
## prediction  L  W
##          L 39  0
##          W  1 69
# Sum the diagonal and divide by the sum of all entries
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9908257
From this we are able to see that we can predict the outcomes for the Miami Heat with very high accuracy. However, do not be fooled by this percentage. If it were this easy to predict games, then everyone would be doing it.
Recall that all the data in our testing set was collected during each game as it occurred. Specifically, if we are trying to predict Game 2 of Round 1 for the Miami Heat, we should only use data from Game 1 and earlier. In our model, however, we used statistics from Game 2 itself to predict Game 2.
One way to remedy this is to manually build a different testing data set in which we replace the stats from the game in question with average or median values from prior games. However, we will consider this in a later lesson. The purpose of this introduction is to get accustomed to these machine learning algorithms.
Essentially all we are doing is determining how well we can predict the outcomes of games given that we have all the statistics from that game. So, of course we should be able to predict the outcome fairly well!
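As a rough sketch of the leakage-free idea mentioned above (using a toy data frame with a hypothetical Points column, not the actual NBA data set), each game's stats could be replaced with the running mean of the games before it, so the model never sees statistics from the game it is predicting:

```r
# Toy data: four games with a hypothetical Points stat
games <- data.frame(Game = 1:4, Points = c(102, 95, 110, 99))
# Running mean of all *previous* games (NA for game 1, which has no history)
n <- nrow(games)
games$Points.prior <- c(NA, cumsum(games$Points)[-n] / seq_len(n - 1))
games
```

A model trained on Points.prior instead of Points would only ever use information available before tip-off.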
As mentioned earlier, the random forest is, in short, a collection of uniquely built decision trees. So will a single decision tree do better or worse than the random forest? Let's see.
library(rpart) # Load relevant library
# rpart uses a similar syntax to randomForest and lm
# Use as.factor() so rpart knows this is classification, not regression
model2 <- rpart(as.factor(Outcome) ~ ., data = nba.train)
model2
## n= 574 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 574 244 W (0.425087108 0.574912892)  
##    2) Advanced.DRtg>=107.45 259 81 L (0.687258687 0.312741313)  
##      4) Advanced.ORtg< 113.65 157 9 L (0.942675159 0.057324841)  
##        8) Advanced.DRtg>=110.05 126 1 L (0.992063492 0.007936508) *
##        9) Advanced.DRtg< 110.05 31 8 L (0.741935484 0.258064516)  
##         18) Advanced.ORtg< 108.9 23 0 L (1.000000000 0.000000000) *
##         19) Advanced.ORtg>=108.9 8 0 W (0.000000000 1.000000000) *
##      5) Advanced.ORtg>=113.65 102 30 W (0.294117647 0.705882353)  
##       10) Advanced.DRtg>=119.5 34 8 L (0.764705882 0.235294118)  
##         20) Advanced.ORtg< 124.5 25 0 L (1.000000000 0.000000000) *
##         21) Advanced.ORtg>=124.5 9 1 W (0.111111111 0.888888889) *
##       11) Advanced.DRtg< 119.5 68 4 W (0.058823529 0.941176471) *
##    3) Advanced.DRtg< 107.45 315 66 W (0.209523810 0.790476190)  
##      6) Advanced.ORtg< 101.45 104 41 L (0.605769231 0.394230769)  
##       12) Advanced.DRtg>=95.2 64 5 L (0.921875000 0.078125000) *
##       13) Advanced.DRtg< 95.2 40 4 W (0.100000000 0.900000000) *
##      7) Advanced.ORtg>=101.45 211 3 W (0.014218009 0.985781991) *
These are the splitting criteria the tree used to partition the data. However, it is not immediately clear how to read this output. Instead we can plot a nice image.
library(RColorBrewer) # Load relevant libraries
library(rattle)
fancyRpartPlot(model2, sub = "") # Construct plot
From this we can follow the structure of the tree. Since the Advanced Defensive Rating is the first split, we can interpret that to be the most important factor. We can interpret the rest of the tree by moving down along the edges to each node.
Now let's see how accurate a single decision tree is.
# Prediction using the rpart model
prediction2 <- predict(model2, nba.test, type = "class")
# Create a table similar to the random forest table
sum.table2 <- table(prediction2, nba.test$Outcome)
sum.table2
##            
## prediction2  L  W
##           L 38  1
##           W  2 68
# Compute accuracy
sum(diag(sum.table2))/sum(sum.table2)
## [1] 0.9724771
Using a single decision tree, our accuracy was only 97.2%, which is slightly below that of the random forest. Although there are plenty of exceptions, the random forest will typically outperform a single tree because it aggregates an ensemble of trees and predicts with the majority of the trees' votes.
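The majority-vote idea can be illustrated with a toy example in base R, where a handful of hypothetical tree predictions are tallied and the most common label wins:

```r
# Hypothetical labels cast by five toy trees for a single game
tree.votes <- c("W", "L", "W", "W", "L")
# Tally the votes and take the most common label as the forest's prediction
vote.counts <- table(tree.votes)
forest.prediction <- names(which.max(vote.counts))
forest.prediction  # "W" (3 votes to 2)
```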
There are two main tuning parameters within randomForest(), namely mtry and ntree. mtry is the number of features sampled at each split, while ntree is the number of trees in the ensemble. Using the same code as above, experiment with these two parameters by testing every combination of mtry = 10, 50 and ntree = 100, 500. Is there a noticeable difference?
Using the e1071 library, run the same experiment as above using the Naive Bayes classifier and a Support Vector Machine with default parameters. Get the prediction accuracy of each one. (If you need help, refer to the e1071 documentation on CRAN.)
mtry = 10, ntree = 100
nba <- read.csv("NBA Playoffs.csv") # Load data
# Remove certain features
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
# Build training and testing data
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Random forest model with the given parameters
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 10, ntree = 100)
# Use the model to predict on the testing data
prediction = predict(model, nba.test)
# Create a table of predictions and testing outcomes
sum.table = table(prediction, nba.test$Outcome)
# Get the accuracy
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
mtry = 10, ntree = 500
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 10, ntree = 500)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
mtry = 50, ntree = 100
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 50, ntree = 100)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
mtry = 50, ntree = 500
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train, mtry = 50, ntree = 500)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9816514
All four combinations produce the same accuracy here, so there is no clear indication that either tuning parameter affects the accuracy much on this data set.
Naive Bayes Classifier
library(e1071)
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Only difference is naiveBayes as opposed to randomForest
model <- naiveBayes(as.factor(Outcome) ~ ., data = nba.train)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9449541
Support Vector Machine
library(e1071)
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Only difference is svm as opposed to randomForest
model <- svm(as.factor(Outcome) ~ ., data = nba.train)
prediction = predict(model, nba.test)
sum.table = table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
## [1] 0.9541284