---
title: "Feature Selection"
output:
  ioslides_presentation:
    smaller: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Outline

- Understand the importance of feature selection
- Apply a few techniques

## Feature Selection

Feature selection is a way of automatically selecting the features (predictors) in your data set that are most relevant to your predictive model. In applying feature selection, we reduce the total number of *columns* in our data set. One may think that a model with as many predictors as possible will be the most accurate; however, this is often not true! Feature selection removes features that are:

- Irrelevant
- Unneeded
- Redundant

In doing this, we not only (hopefully) improve model accuracy but also increase computational efficiency.

Note: Feature selection is an important part of applied predictive modeling. There is much more to learn about it than we can cover here.
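As a quick illustration of spotting a *redundant* feature, the sketch below uses R's built-in `mtcars` data (a stand-in here, since the course data files aren't bundled): two predictors with a very high pairwise correlation carry almost the same information, so one of them can usually be dropped.

```r
# Flag pairs of predictors with |correlation| > 0.9:
# one member of each flagged pair is a candidate for removal.
preds <- mtcars[, c("cyl", "disp", "hp", "wt", "qsec")]
cmat  <- cor(preds)
cmat[upper.tri(cmat, diag = TRUE)] <- NA          # keep each pair only once
high  <- which(abs(cmat) > 0.9, arr.ind = TRUE)
data.frame(var1 = rownames(cmat)[high[, 1]],
           var2 = colnames(cmat)[high[, 2]],
           r    = cmat[high])
```

The 0.9 cutoff is a common rule of thumb, not a fixed rule; in practice it should be tuned to the data set.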

## Types of Feature Selection
- **Wrapper Methods**: Evaluate models with an algorithm that adds and/or removes features, searching for the combination that maximizes accuracy.
- **Filter Methods**: Evaluate subsets of features before applying the algorithm, generally as a pre-processing step. Subsets can be scored with a range of statistical measures, such as Pearson's correlation, or with related techniques such as Principal Component Analysis and Linear Discriminant Analysis.
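A wrapper method can be sketched in a few lines of base R. The example below (using the built-in `mtcars` data, an assumption since the course data isn't bundled) greedily adds, at each step, the candidate predictor that most improves adjusted R-squared, and stops when no addition helps:

```r
# Greedy forward selection (a simple wrapper method) on mtcars.
response   <- "mpg"
candidates <- setdiff(names(mtcars), response)
selected   <- character(0)
repeat {
  # Score each remaining candidate by the adjusted R^2 of the model
  # containing the already-selected features plus that candidate.
  adj.r2 <- sapply(candidates, function(v) {
    f <- reformulate(c(selected, v), response)
    summary(lm(f, data = mtcars))$adj.r.squared
  })
  current <- if (length(selected) == 0) 0 else
    summary(lm(reformulate(selected, response), data = mtcars))$adj.r.squared
  if (max(adj.r2) <= current) break     # no improvement: stop searching
  best       <- names(which.max(adj.r2))
  selected   <- c(selected, best)
  candidates <- setdiff(candidates, best)
}
selected
```

Because a wrapper refits the model for every candidate at every step, it is more expensive than a filter method but evaluates features in the context of the actual model.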
## Example of Feature Selection
Inspecting a correlation matrix to spot highly correlated (redundant) predictors
```{r}
mlb <- read.csv("MLB Stats.csv")
head(cor(mlb[,-c(1,4,5)]))
```
## Another Example
Ranking features by importance using a stepwise regression
```{r}
# Forward selection must start from the null (intercept-only) model,
# with the full model supplied as the search scope.
null.model <- glm(hr ~ 1, data = mlb[,-c(1,4,5)])
full.model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(null.model, scope = formula(full.model), direction = "forward")
```
##
```{r}
model <- glm(hr~., data = mlb[,-c(1,4,5)])
step(model, direction = "backward")
```
##
```{r}
model <- glm(hr~., data = mlb[,-c(1,4,5)])
step(model, direction = "both")
```
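Since `MLB Stats.csv` isn't bundled with these slides, the same `step()` pattern can be tried on R's built-in `mtcars` data. `step()` drops (or adds) terms to minimize AIC, so the reduced model's AIC is never worse than the starting model's:

```r
# Backward stepwise selection on mtcars, guided by AIC.
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)            # the predictors that survive
AIC(reduced) <= AIC(full)   # stepwise never worsens AIC
```

Setting `trace = 0` suppresses the step-by-step log that the slides above show in full.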
## Models With Built-in Feature Selection
Random Forest for regression
```{r,message=F}
library(randomForest)
model <- randomForest(hr ~ ., data = mlb[,-c(1,4,5)])
model$importance
```
Random Forest for classification
```{r}
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
nba$Outcome <- as.factor(nba$Outcome)  # classification needs a factor response
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])  # drop the Team column
head(model$importance)
```
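The split above is a leave-one-group-out design: every Heat game goes to the test set, and the model trains only on other teams' games. The pattern, shown on a tiny hypothetical data frame (the team names and outcomes below are made up for illustration):

```r
# Leave-one-group-out split: hold out all rows belonging to one group.
games <- data.frame(Team    = c("Heat", "Bulls", "Heat", "Celtics"),
                    Outcome = factor(c("W", "L", "L", "W")))
train <- games[games$Team != "Heat", ]   # every non-Heat game
test  <- games[games$Team == "Heat", ]   # every Heat game
nrow(train); nrow(test)
```

This guards against evaluating the model on the same team's games it was trained on.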
## Your Turn
Using the NBA Playoffs.csv data set, complete the following:
1. Run a random forest model and gather the top 5 most important features for predicting game outcomes for the Miami Heat. The following code may be of use:
```{r,eval = F}
feats.df = data.frame(model$importance)
feats.df <- feats.df[order(-feats.df$MeanDecreaseGini), , drop = FALSE]
top.feats <- rownames(feats.df)[1:5]
```
2. Using those top 5 features and the random forest, predict the outcome for the Miami Heat using the rest of the data as training. Output a single number to gauge the accuracy of our model.
3. Repeat (1.) - (2.) using the top 10 and top 20 most important features.
## Answer
### 1. & 2.
```{r}
feats.df <- data.frame(model$importance)
feats.df <- feats.df[order(-feats.df$MeanDecreaseGini), , drop = FALSE]
top.feats <- rownames(feats.df)[1:5]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```
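The last line computes accuracy from the confusion matrix: `diag()` picks out the cells where the prediction agrees with the truth. As a standalone illustration with made-up labels:

```r
# Accuracy = correctly classified / total, read off the confusion matrix.
truth      <- c("W", "W", "L", "L", "W")
prediction <- c("W", "L", "L", "L", "W")
sum.table  <- table(prediction, truth)
sum(diag(sum.table)) / sum(sum.table)   # 4 of 5 correct: 0.8
```

Accuracy is a single convenient number, but for imbalanced outcomes the full confusion matrix is worth inspecting as well.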
##
### 3.
Top 10
```{r}
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
nba$Outcome <- as.factor(nba$Outcome)
top.feats <- rownames(feats.df)[1:10]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```
##
Top 20
```{r}
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
nba$Outcome <- as.factor(nba$Outcome)
top.feats <- rownames(feats.df)[1:20]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table))/sum(sum.table)
```