- Understand the importance of feature selection
- Apply a few feature-selection techniques
Feature selection is a way of automatically selecting the features (predictors) in your data set that are most relevant to your predictive model. In applying feature selection, we reduce the number of columns in our data set. One might think that a model with as many predictors as possible will be the most accurate; oftentimes, this is not true! Feature selection removes the following kinds of features:

- Redundant features: predictors that are highly correlated with other predictors and therefore carry much the same information.
- Irrelevant features: predictors that carry little or no information about the response.
In doing this, we not only (hopefully) improve our model's accuracy but also improve computational efficiency.
Note: Feature selection is an important part of applied predictive modelling, and there is far more to it than we can cover here.
Observing a correlation matrix
mlb <- read.csv("MLB Stats.csv")
# drop the non-numeric columns (id, tm, lg) before computing correlations
head(cor(mlb[,-c(1,4,5)]))
##              year         age           g           pa          ab
## year  1.000000000  0.69292289 -0.09982083 -0.09782626 -0.12951155
## age   0.692922891  1.00000000 -0.01766747 -0.03017634 -0.05672408
## g    -0.099820827 -0.01766747  1.00000000  0.98505867  0.97367704
## pa   -0.097826256 -0.03017634  0.98505867  1.00000000  0.98731807
## ab   -0.129511548 -0.05672408  0.97367704  0.98731807  1.00000000
## r     0.005708515 -0.05365696  0.85355339  0.89113405  0.84297221
##                 r           h         X2b        X3b        hr        rbi
## year  0.005708515 -0.12290342 -0.05207639 -0.4218759 0.3051903 0.21661034
## age  -0.053656960 -0.05583127 -0.01308574 -0.2596250 0.0164766 0.01250502
## g     0.853553393  0.91652459  0.78447018  0.3828360 0.4850828 0.72475073
## pa    0.891134052  0.94672737  0.81693863  0.3887161 0.4932655 0.73735857
## ab    0.842972205  0.95037439  0.81882522  0.4086391 0.4227231 0.69056648
## r     1.000000000  0.87660936  0.78515664  0.3357498 0.6702643 0.81250811
##              sb         cs         bb          so           ba       obp
## year -0.3045645 -0.3909126 0.07763288  0.28276633 -0.008514741 0.1470013
## age  -0.2582114 -0.3605789 0.10599122 -0.05525395  0.059449900 0.1839460
## g     0.3087790  0.3497150 0.65853795  0.55507663  0.347027003 0.3350005
## pa    0.3191405  0.3612025 0.66913518  0.55329085  0.380716062 0.3595305
## ab    0.3192950  0.3761301 0.54551024  0.52979399  0.354141420 0.2571095
## r     0.3193035  0.3119072 0.76222169  0.58882599  0.493634328 0.5393551
##             slg       ops       ops.          tb         gdp        hbp
## year 0.27868426 0.2463425 0.13764186  0.06149245 -0.01891597 0.27473058
## age  0.08032514 0.1218244 0.04732968 -0.03679742  0.01594652 0.06618181
## g    0.27677897 0.3126431 0.33757566  0.86665586  0.61808941 0.32753524
## pa   0.29468430 0.3338308 0.35506861  0.89168608  0.62351496 0.36718216
## ab   0.21149897 0.2392633 0.26011884  0.85875695  0.65087330 0.32517457
## r    0.54502584 0.5722219 0.56871041  0.93739179  0.51136156 0.47240564
##               sh           sf        ibb
## year -0.40837494  0.019734179 0.09329986
## age  -0.13076838 -0.002810514 0.15737826
## g     0.14216405  0.556270172 0.28755856
## pa    0.14042328  0.569150016 0.28412764
## ab    0.16989374  0.554334693 0.17289246
## r    -0.01943339  0.523207089 0.40391119
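Notice that g, pa, and ab are correlated with one another above 0.97: they carry almost the same information, which is exactly the redundancy feature selection tries to remove. To act on the matrix rather than just eyeball it, one option is the findCorrelation() helper from the caret package, which suggests columns to drop above a correlation cutoff. A minimal sketch, assuming caret is installed:

library(caret)

# correlation matrix of the numeric columns (id, tm, lg dropped)
cor.mat <- cor(mlb[,-c(1,4,5)])

# column indices that findCorrelation suggests removing because they are
# pairwise correlated above the cutoff
high.cor <- findCorrelation(cor.mat, cutoff = 0.90)
colnames(cor.mat)[high.cor]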
Ranking features by importance using stepwise regression
# a gaussian glm is just a linear regression; drop id, tm, lg as before
model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(model, direction = "forward")
## Start:  AIC=-17751.01
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + gdp + hbp +
##     sh + sf + ibb
##
## Call:  glm(formula = hr ~ year + age + g + pa + ab + r + h + X2b + X3b +
##     rbi + sb + cs + bb + so + ba + obp + slg + ops + ops. + tb +
##     gdp + hbp + sh + sf + ibb, data = mlb[, -c(1, 4, 5)])
##
## Coefficients:
## (Intercept)         year          age            g           pa
##  -6.749e-13    3.038e-16    6.384e-16   -4.268e-16   -6.105e-16
##          ab            r            h          X2b          X3b
##   7.840e-16   -4.720e-16   -3.333e-01   -3.333e-01   -6.667e-01
##         rbi           sb           cs           bb           so
##  -3.493e-16    6.364e-16   -2.228e-15    6.376e-16   -2.561e-16
##          ba          obp          slg          ops         ops.
##  -1.977e-13    1.984e-12    2.012e-12   -1.890e-12   -2.223e-16
##          tb          gdp          hbp           sh           sf
##   3.333e-01   -4.904e-16    8.365e-16    1.519e-17    5.197e-17
##         ibb
##  -2.958e-16
##
## Degrees of Freedom: 300 Total (i.e. Null);  275 Residual
## Null Deviance:       76810
## Residual Deviance: 3.6e-25   AIC: -17750
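One caveat about the call above: with direction = "forward", step() can only add terms, so when it is handed the full model there is nothing left to add and the search stops where it started. A genuine forward search begins at the intercept-only model and is told how far it may grow via scope; a minimal sketch:

# start from the null (intercept-only) model and allow growth up to the full formula
null.model <- glm(hr ~ 1, data = mlb[,-c(1,4,5)])
full.model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(null.model, scope = formula(full.model), direction = "forward")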
model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(model, direction = "backward")
## Start:  AIC=-17751.01
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + gdp + hbp +
##     sh + sf + ibb
##
##        Df Deviance      AIC
## - gdp   1     0.00 -17829.9
## - cs    1     0.00 -17795.2
## - g     1     0.00 -17761.6
## <none>        0.00 -17751.0
## - obp   1     0.00 -17680.9
## - slg   1     0.00 -17658.8
## - hbp   1     0.00 -17650.1
## - ba    1     0.00 -17631.0
## - bb    1     0.00 -17606.3
## - ops   1     0.00 -17603.1
## - sb    1     0.00 -17602.5
## - ab    1     0.00 -17596.9
## - sh    1     0.00 -17569.4
## - so    1     0.00 -17562.1
## - sf    1     0.00 -17551.1
## - ops.  1     0.00 -17527.2
## - age   1     0.00 -17500.2
## - pa    1     0.00 -17496.1
## - ibb   1     0.00 -17475.2
## - rbi   1     0.00 -17470.4
## - year  1     0.00 -17466.8
## - r     1     0.00 -17392.9
## - X3b   1   355.25    956.1
## - h     1   807.63   1203.3
## - X2b   1   883.14   1230.2
## - tb    1  2659.33   1562.0
##
## Step:  AIC=-17829.86
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + hbp + sh +
##     sf + ibb
##
##        Df Deviance      AIC
## <none>        0.00 -17829.9
## - cs    1     0.00 -17788.6
## - hbp   1     0.00 -17768.9
## - sh    1     0.00 -17741.0
## - ba    1     0.00 -17711.9
## - so    1     0.00 -17707.0
## - sb    1     0.00 -17674.4
## - ibb   1     0.00 -17630.9
## - slg   1     0.00 -17626.3
## - sf    1     0.00 -17588.2
## - rbi   1     0.00 -17550.6
## - g     1     0.00 -17545.1
## - ab    1     0.00 -17542.8
## - obp   1     0.00 -17542.3
## - year  1     0.00 -17496.1
## - ops   1     0.00 -17495.9
## - bb    1     0.00 -17468.3
## - age   1     0.00 -17458.4
## - ops.  1     0.00 -17447.4
## - r     1     0.00 -17439.9
## - pa    1     0.00 -17426.9
## - X3b   1   356.11    954.8
## - h     1   812.98   1203.3
## - X2b   1   884.15   1228.5
## - tb    1  2680.10   1562.3
##
## Call:  glm(formula = hr ~ year + age + g + pa + ab + r + h + X2b + X3b +
##     rbi + sb + cs + bb + so + ba + obp + slg + ops + ops. + tb +
##     hbp + sh + sf + ibb, data = mlb[, -c(1, 4, 5)])
##
## Coefficients:
## (Intercept)         year          age            g           pa
##  -2.302e-13    8.209e-17    7.904e-16   -6.450e-16    4.581e-16
##          ab            r            h          X2b          X3b
##  -2.172e-16   -6.172e-16   -3.333e-01   -3.333e-01   -6.667e-01
##         rbi           sb           cs           bb           so
##  -3.128e-16    6.912e-16   -2.197e-15   -2.952e-16   -2.455e-16
##          ba          obp          slg          ops         ops.
##  -1.900e-13    1.957e-12    1.990e-12   -1.869e-12   -2.164e-16
##          tb          hbp           sh           sf          ibb
##   3.333e-01   -9.068e-17   -8.831e-16   -9.065e-16   -3.007e-16
##
## Degrees of Freedom: 300 Total (i.e. Null);  276 Residual
## Null Deviance:       76810
## Residual Deviance: 2.789e-25     AIC: -17830
model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(model, direction = "both")
## Start:  AIC=-17751.01
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + gdp + hbp +
##     sh + sf + ibb
##
##        Df Deviance      AIC
## - gdp   1     0.00 -17829.9
## - cs    1     0.00 -17795.2
## - g     1     0.00 -17761.6
## <none>        0.00 -17751.0
## - obp   1     0.00 -17680.9
## - slg   1     0.00 -17658.8
## - hbp   1     0.00 -17650.1
## - ba    1     0.00 -17631.0
## - bb    1     0.00 -17606.3
## - ops   1     0.00 -17603.1
## - sb    1     0.00 -17602.5
## - ab    1     0.00 -17596.9
## - sh    1     0.00 -17569.4
## - so    1     0.00 -17562.1
## - sf    1     0.00 -17551.1
## - ops.  1     0.00 -17527.2
## - age   1     0.00 -17500.2
## - pa    1     0.00 -17496.1
## - ibb   1     0.00 -17475.2
## - rbi   1     0.00 -17470.4
## - year  1     0.00 -17466.8
## - r     1     0.00 -17392.9
## - X3b   1   355.25    956.1
## - h     1   807.63   1203.3
## - X2b   1   883.14   1230.2
## - tb    1  2659.33   1562.0
##
## Step:  AIC=-17829.86
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + hbp + sh +
##     sf + ibb
##
##        Df Deviance      AIC
## <none>        0.00 -17829.9
## - cs    1     0.00 -17788.6
## - hbp   1     0.00 -17768.9
## + gdp   1     0.00 -17751.0
## - sh    1     0.00 -17741.0
## - ba    1     0.00 -17711.9
## - so    1     0.00 -17707.0
## - sb    1     0.00 -17674.4
## - ibb   1     0.00 -17630.9
## - slg   1     0.00 -17626.3
## - sf    1     0.00 -17588.2
## - rbi   1     0.00 -17550.6
## - g     1     0.00 -17545.1
## - ab    1     0.00 -17542.8
## - obp   1     0.00 -17542.3
## - year  1     0.00 -17496.1
## - ops   1     0.00 -17495.9
## - bb    1     0.00 -17468.3
## - age   1     0.00 -17458.4
## - ops.  1     0.00 -17447.4
## - r     1     0.00 -17439.9
## - pa    1     0.00 -17426.9
## - X3b   1   356.11    954.8
## - h     1   812.98   1203.3
## - X2b   1   884.15   1228.5
## - tb    1  2680.10   1562.3
##
## Call:  glm(formula = hr ~ year + age + g + pa + ab + r + h + X2b + X3b +
##     rbi + sb + cs + bb + so + ba + obp + slg + ops + ops. + tb +
##     hbp + sh + sf + ibb, data = mlb[, -c(1, 4, 5)])
##
## Coefficients:
## (Intercept)         year          age            g           pa
##  -2.302e-13    8.209e-17    7.904e-16   -6.450e-16    4.581e-16
##          ab            r            h          X2b          X3b
##  -2.172e-16   -6.172e-16   -3.333e-01   -3.333e-01   -6.667e-01
##         rbi           sb           cs           bb           so
##  -3.128e-16    6.912e-16   -2.197e-15   -2.952e-16   -2.455e-16
##          ba          obp          slg          ops         ops.
##  -1.900e-13    1.957e-12    1.990e-12   -1.869e-12   -2.164e-16
##          tb          hbp           sh           sf          ibb
##   3.333e-01   -9.068e-17   -8.831e-16   -9.065e-16   -3.007e-16
##
## Degrees of Freedom: 300 Total (i.e. Null);  276 Residual
## Null Deviance:       76810
## Residual Deviance: 2.789e-25     AIC: -17830
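The near-zero residual deviance is no accident: total bases satisfy tb = h + X2b + 2*X3b + 3*hr (a single is one base, a double two, a triple three, a home run four), so hr is an exact linear function of four of the predictors, and the fitted coefficients (-1/3, -1/3, -2/3, 1/3 on h, X2b, X3b, tb) recover precisely that identity. A quick sanity check:

# hr = (tb - h - X2b - 2*X3b) / 3 for any standard batting line
all.equal(mlb$hr, (mlb$tb - mlb$h - mlb$X2b - 2 * mlb$X3b) / 3)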
Random Forest for regression
library(randomForest)
# fit on the full data set this time (id, tm, lg included)
model <- randomForest(hr ~ ., data = mlb)
model$importance
##      IncNodePurity
## id      3676.41386
## year     254.87907
## age      138.41147
## tm      2620.14926
## lg        64.37073
## g        388.26508
## pa       457.28427
## ab       224.33838
## r       2163.01282
## h        499.86113
## X2b      437.15829
## X3b      109.82679
## rbi    21934.50808
## sb       148.75438
## cs       126.55827
## bb       894.75981
## so      8368.15834
## ba       459.42495
## obp      299.90717
## slg    14496.51064
## ops     4028.04810
## ops.    1931.42202
## tb     10987.49095
## gdp      200.69173
## hbp      385.94407
## sh       158.49883
## sf       179.80985
## ibb      290.81977
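Here rbi, slg, tb, and so dominate. Rather than scanning the raw table, we can let randomForest plot the importances, or sort them ourselves; a short sketch:

# dot chart of variable importance, most important at the top
varImpPlot(model)

# or sort the raw scores from largest to smallest
imp <- model$importance
imp[order(-imp[, "IncNodePurity"]), , drop = FALSE]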
Random Forest for classification
nba <- read.csv("NBA Playoffs.csv")
# note: Oppo is listed twice here, so Opp survives this subset
# (it is dropped in the later blocks)
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Oppo))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train)
head(model$importance)
##          MeanDecreaseGini
## Team            7.9112791
## Opp            13.5127449
## Number          1.1053539
## Round           0.6088538
## Game            1.0720260
## Location        1.8388409
Using the NBA Playoffs.csv data set, complete the following:

1. Rank the features by importance and extract the names of the five most important ones:
feats.df <- data.frame(model$importance)
# sort the features from most to least important
feats.df <- feats.df[order(-feats.df$MeanDecreaseGini), , drop = FALSE]
top.feats <- rownames(feats.df)[1:5]
2. Using those top 5 features and a random forest, predict the outcome for the Miami Heat, using the rest of the data for training. Output a single number that gauges the accuracy of the model.
3. Repeat (1.) and (2.) using the top 10 and top 20 most important features.
Top 5

feats.df <- data.frame(model$importance)
feats.df <- feats.df[order(-feats.df$MeanDecreaseGini), , drop = FALSE]
top.feats <- rownames(feats.df)[1:5]
# keep only the top features plus Team (for the train/test split) and Outcome (the response)
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
# proportion of Heat games classified correctly
sum(diag(sum.table)) / sum(sum.table)
## [1] 0.9908257
Top 10
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
top.feats <- rownames(feats.df)[1:10]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table)) / sum(sum.table)
## [1] 0.9816514
Top 20
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
top.feats <- rownames(feats.df)[1:20]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table)) / sum(sum.table)
## [1] 0.9724771
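The three solutions above differ only in how many top features they keep, so the train/predict/score steps can be wrapped in a helper. A sketch, assuming feats.df from the earlier full-feature fit is still in scope; rf.accuracy is our own name, and the accuracies will wobble a little between runs unless a seed is set:

rf.accuracy <- function(k) {
  nba <- read.csv("NBA Playoffs.csv")
  nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
  nba$Outcome <- as.factor(nba$Outcome)   # ensure classification, not regression
  keep <- c(rownames(feats.df)[1:k], "Team", "Outcome")
  nba <- nba[, colnames(nba) %in% keep]
  nba.train <- nba[nba$Team != "Heat", ]  # train on every other team
  nba.test  <- nba[nba$Team == "Heat", ]  # hold out the Heat
  fit <- randomForest(Outcome ~ ., data = subset(nba.train, select = -Team))
  mean(predict(fit, nba.test) == nba.test$Outcome)  # classification accuracy
}

set.seed(1)
sapply(c(5, 10, 20), rf.accuracy)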