---
title: "Introduction to Classification"
output:
  ioslides_presentation:
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Outline

- Discuss common classification algorithms
- Do some simple classification exercises

## Classification

Simply put, **classification** is concerned with predicting the category into which a new observation falls. For instance, the input variables could be passing yards per game, interceptions, sacks, and so on, and the output could be win or lose.

The distinguishing characteristic between regression and classification is that classification predicts a label for a new response. That is, classification should not be used if you want to determine how many points a team scores, but rather if you want to predict whether a team wins or loses.

## Examples of Machine Learning Algorithms for Classification

- **Decision Trees**: Partition the data into small homogeneous groups
- **Random Forest**: Ensemble (collection) of decision trees constructed in a unique way, where each tree has a vote towards the class label
- **Naive Bayes Classifier**: Uses conditional probabilities to make predictions
- **Support Vector Machine**: Constructs hyperplanes that separate the data into groups
- Many, many more!

## Applying a Model

Let's take our NBA Playoffs data set and see if we can predict the wins and losses for the Miami Heat. For this we will use the random forest model.
```{r, warning = F, message = F}
library(randomForest)
nba <- read.csv("NBA Playoffs.csv") # Load data
# Remove some variables that would give away the outcome
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff))
# Training data consists of all teams not named Heat
nba.train <- nba[-which(nba$Team == "Heat"),]
# Testing data constists of Miami Heat
nba.test <- nba[nba$Team == "Heat",]
#Random Forest model
#Use as.factor() so randomForest knows this is not regression
model <- randomForest(as.factor(Outcome) ~., data = nba.train)
```

##

```{r}
model
```
The interface of `randomForest()` is much like that of `lm()`: the first argument is the formula and the second is the data set. There are of course many more parameters; one can read about them here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf

For classification, the random forest prints a **confusion matrix**. This matrix can be interpreted as a "built-in" accuracy check for the random forest. The entries along the diagonal are the counts of correctly labeled observations in the training set. If we sum the diagonal entries, divide by the sum of all entries in the matrix, and subtract that fraction from 1, we obtain the error rate.
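
As a quick illustration of that arithmetic, here is a toy confusion matrix. The numbers below are made up for the example, not taken from our NBA data.

```r
# Toy confusion matrix: rows are predicted labels, columns are actual labels
cm <- matrix(c(40,  5,
                3, 52),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("L", "W"), actual = c("L", "W")))
accuracy   <- sum(diag(cm)) / sum(cm)  # correctly labeled / total observations
error.rate <- 1 - accuracy             # the error rate described above
c(accuracy = accuracy, error = error.rate)
```

Here 92 of the 100 observations fall on the diagonal, so the accuracy is 0.92 and the error rate is 0.08.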

## Prediction

```{r}
prediction = predict(model,nba.test)
# Create a data frame with the testing data and the predictions as columns
results <- data.frame(nba.test,prediction)
# Selects only relevant columns
results <- results[c("Team","Outcome","prediction")]
head(results)
```

##

If we have many instances, our results data frame will be very long. To combat this, we can make a table similar to the random forest's output using the `table()` command.
```{r}
#Creates table of predictions and outcome in testing set
sum.table = table(prediction,nba.test$Outcome)
sum.table
# Formula to sum diagonal and divide by sum of all entries
sum(diag(sum.table))/sum(sum.table)
```
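
The same `table()` arithmetic works on any pair of predicted and actual label vectors. A minimal standalone example, with made-up labels rather than our NBA data:

```r
# Made-up actual and predicted labels for six observations
actual    <- c("W", "W", "L", "L", "W", "L")
predicted <- c("W", "L", "L", "L", "W", "W")
tab <- table(predicted, actual)
tab
sum(diag(tab)) / sum(tab)  # accuracy: diagonal entries over all entries
```

Four of the six predictions match, so the accuracy is 4/6.
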

## Do Not Be Tricked

From this we see that we can predict the outcomes for the Miami Heat with very high accuracy. However, do not be fooled by this percentage. If it were this easy to predict games, then everyone would be doing it.

Recall that all the data in our testing set was collected after each game occurred. If we are trying to predict Game 2, Round 1 for the Miami Heat, we can only use data from Game 1 and before. In our model, however, we used data from Game 2 to predict Game 2.

One way to remedy this is to manually build a different testing data set in which we replace the data from the game in question with averages or medians from prior games. We will consider this in a later lesson; the purpose of this introduction is to get accustomed to these machine learning algorithms.

Essentially, all we are doing is determining how well we can predict the outcomes of games given that we already have all the statistics from those games. So, of course, we should be able to predict the outcomes fairly well!
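
To give a flavor of that prior-games replacement, here is a sketch on a tiny made-up data frame. The column names `Game` and `Points.Scored` are hypothetical, not necessarily columns in `NBA Playoffs.csv`; for each game we substitute the mean of the strictly earlier games.

```r
# Hypothetical per-game data for one team (column names are made up)
games <- data.frame(Game = 1:4, Points.Scored = c(98, 104, 91, 110))
# For each game, take the mean of the strictly earlier games;
# the first game has no prior data, so it stays NA
prior.mean <- sapply(seq_len(nrow(games)), function(i) {
  if (i == 1) NA else mean(games$Points.Scored[seq_len(i - 1)])
})
games$Prior.Mean.Points <- prior.mean
games
```

A testing row for Game 3 would then carry 101 (the mean of 98 and 104) instead of the leaked in-game value 91.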

## Running the Experiment Again Using a Decision Tree

As mentioned earlier, the random forest is, in short, a collection of uniquely built decision trees. So will a single decision tree do better or worse than the random forest? Let's see.
```{r}
library(rpart) # Load relevant library
# Using rpart with a similar interface to randomForest and lm
# Use as.factor() so rpart knows this is not regression
model2 <- rpart(as.factor(Outcome) ~., data = nba.train)
```

##

```{r}
model2
```
These are the splitting criteria the tree used to partition the data. However, they are not immediately easy to read. Instead we can plot a nice image.

##

```{r,message=F,warning=F,fig.height= 3.5,fig.width=6}
library(RColorBrewer) # Load relevant libraries
library(rattle)
fancyRpartPlot(model2,sub="") # Construct plot
```
From this we can follow the structure of the tree. Since the *Advanced Defensive Rating* is the first split, we can interpret that to be the most important factor. We can interpret the rest of the tree by moving down along the edges to each node.

##

Now let's see how accurate a single decision tree is.
```{r}
# Prediction using rpart model
prediction2 <- predict(model2,nba.test,type = "class")
#Create table similar to random forest table
sum.table2 <- table(prediction2,nba.test$Outcome)
sum.table2
#Compute accuracy
sum(diag(sum.table2))/sum(sum.table2)
```
Using a single decision tree, our accuracy was 97.2%, slightly below that of the random forest. Although there are plenty of exceptions, a random forest will typically outperform a single tree because it contains an ensemble of trees and predicts with the majority of their votes.
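
The majority vote itself is simple to sketch in base R. Given each tree's predicted label for one observation, the forest's prediction is the most frequent label; the votes below are made up, not actual `randomForest` output.

```r
# Hypothetical votes from five trees for a single observation
votes <- c("W", "L", "W", "W", "L")
# The forest predicts the label with the most votes
majority <- names(which.max(table(votes)))
majority
```

With three "W" votes to two "L" votes, the ensemble predicts "W" even though two of its trees disagree.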

## Your Turn

1. There are two main tuning parameters in `randomForest()`, namely `mtry` and `ntree`: `mtry` is the number of features sampled at each split, while `ntree` is the number of trees in the ensemble. Using the same code as above, experiment with these two parameters by testing every combination of `mtry = 10, 50` and `ntree = 100, 500`. Is there a noticeable difference?
2. Using the `e1071` library, run the same experiment as above using the Naive Bayes classifier and a Support Vector Machine with default parameters, and get the prediction accuracy of each. (If you need help, refer to the `e1071` documentation on CRAN.)

## Answers

### 1.

mtry = 10, ntree = 100
```{r, warning = F, message = F}
nba <- read.csv("NBA Playoffs.csv") # Load data
# Remove certain features
nba <- subset(nba,select = -c(Date,Points,Oppo,Diff))
# Build training and testing data
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Random forest model with certain parameters
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train,
                      mtry = 10, ntree = 100)
# Use model to predict on testing data
prediction = predict(model,nba.test)
# Create table of predictions and testing Outcomes
sum.table = table(prediction,nba.test$Outcome)
# Get percentage
sum(diag(sum.table))/sum(sum.table)
```

##

mtry = 10, ntree = 500
```{r, warning = F, message = F}
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba,select = -c(Date,Points,Oppo,Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train,
                      mtry = 10, ntree = 500)
prediction = predict(model,nba.test)
sum.table = table(prediction,nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```

##

mtry = 50, ntree = 100
```{r, warning = F, message = F}
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba,select = -c(Date,Points,Oppo,Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train,
                      mtry = 50, ntree = 100)
prediction = predict(model,nba.test)
sum.table = table(prediction,nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```

##

mtry = 50, ntree = 500
```{r, warning = F, message = F}
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba,select = -c(Date,Points,Oppo,Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(as.factor(Outcome) ~ ., data = nba.train,
                      mtry = 50, ntree = 500)
prediction = predict(model,nba.test)
sum.table = table(prediction,nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```
There is no clear difference in accuracy across these parameter combinations.

## 2.

Naive Bayes Classifier
```{r,warning = F,message = F}
library(e1071)
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba,select = -c(Date,Points,Oppo,Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Only difference is naiveBayes as opposed to randomForest
model <- naiveBayes(as.factor(Outcome) ~ ., data = nba.train)
prediction = predict(model,nba.test)
sum.table = table(prediction,nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```

##

Support Vector Machine
```{r,warning = F,message = F}
library(e1071)
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba,select = -c(Date,Points,Oppo,Diff))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
# Only difference is svm as opposed to randomForest
model <- svm(as.factor(Outcome) ~ ., data = nba.train)
prediction = predict(model,nba.test)
sum.table = table(prediction,nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```