- Understand the importance of feature selection
- Apply a few feature-selection techniques
Feature selection is a way of automatically selecting the features (predictors) in your data set that are most relevant to your predictive model. In applying feature selection, we reduce the number of columns in our data set. One might think that a model with as many predictors as possible will be the most accurate; oftentimes, this is not true! Feature selection removes the following kinds of features:

- Redundant features: predictors that are highly correlated with other predictors and therefore carry much the same information.
- Irrelevant features: predictors that carry little or no information about the response.
In doing this, we not only (hopefully) improve our model's accuracy but also improve computational efficiency.
Note: Feature selection is an important part of applied predictive modelling, and there is far more to it than we can cover here.
Observing a correlation matrix
mlb <- read.csv("MLB Stats.csv")
# drop the non-numeric columns (id, tm, lg) before computing correlations
head(cor(mlb[,-c(1,4,5)]))
##              year         age           g           pa          ab
## year  1.000000000  0.69292289 -0.09982083 -0.09782626 -0.12951155
## age   0.692922891  1.00000000 -0.01766747 -0.03017634 -0.05672408
## g    -0.099820827 -0.01766747  1.00000000  0.98505867  0.97367704
## pa   -0.097826256 -0.03017634  0.98505867  1.00000000  0.98731807
## ab   -0.129511548 -0.05672408  0.97367704  0.98731807  1.00000000
## r     0.005708515 -0.05365696  0.85355339  0.89113405  0.84297221
##                 r           h         X2b        X3b        hr        rbi
## year  0.005708515 -0.12290342 -0.05207639 -0.4218759 0.3051903 0.21661034
## age  -0.053656960 -0.05583127 -0.01308574 -0.2596250 0.0164766 0.01250502
## g     0.853553393  0.91652459  0.78447018  0.3828360 0.4850828 0.72475073
## pa    0.891134052  0.94672737  0.81693863  0.3887161 0.4932655 0.73735857
## ab    0.842972205  0.95037439  0.81882522  0.4086391 0.4227231 0.69056648
## r     1.000000000  0.87660936  0.78515664  0.3357498 0.6702643 0.81250811
##              sb         cs         bb          so           ba       obp
## year -0.3045645 -0.3909126 0.07763288  0.28276633 -0.008514741 0.1470013
## age  -0.2582114 -0.3605789 0.10599122 -0.05525395  0.059449900 0.1839460
## g     0.3087790  0.3497150 0.65853795  0.55507663  0.347027003 0.3350005
## pa    0.3191405  0.3612025 0.66913518  0.55329085  0.380716062 0.3595305
## ab    0.3192950  0.3761301 0.54551024  0.52979399  0.354141420 0.2571095
## r     0.3193035  0.3119072 0.76222169  0.58882599  0.493634328 0.5393551
##             slg       ops       ops.          tb         gdp        hbp
## year 0.27868426 0.2463425 0.13764186  0.06149245 -0.01891597 0.27473058
## age  0.08032514 0.1218244 0.04732968 -0.03679742  0.01594652 0.06618181
## g    0.27677897 0.3126431 0.33757566  0.86665586  0.61808941 0.32753524
## pa   0.29468430 0.3338308 0.35506861  0.89168608  0.62351496 0.36718216
## ab   0.21149897 0.2392633 0.26011884  0.85875695  0.65087330 0.32517457
## r    0.54502584 0.5722219 0.56871041  0.93739179  0.51136156 0.47240564
##               sh           sf        ibb
## year -0.40837494  0.019734179 0.09329986
## age  -0.13076838 -0.002810514 0.15737826
## g     0.14216405  0.556270172 0.28755856
## pa    0.14042328  0.569150016 0.28412764
## ab    0.16989374  0.554334693 0.17289246
## r    -0.01943339  0.523207089 0.40391119
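Notice that g, pa, and ab are correlated with one another above 0.97: they carry almost the same information, which is exactly the redundancy feature selection tries to remove. To act on the matrix rather than just eyeball it, one option is the findCorrelation() helper from the caret package, which suggests columns to drop above a correlation cutoff. A minimal sketch, assuming caret is installed:

library(caret)

# correlation matrix of the numeric columns (id, tm, lg dropped)
cor.mat <- cor(mlb[,-c(1,4,5)])

# column indices that findCorrelation suggests removing because they are
# pairwise correlated above the cutoff
high.cor <- findCorrelation(cor.mat, cutoff = 0.90)
colnames(cor.mat)[high.cor]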
Ranking features by importance using stepwise regression
# a gaussian glm is just a linear regression; drop id, tm, lg as before
model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(model, direction = "forward")
## Start:  AIC=-17751.01
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + gdp + hbp +
##     sh + sf + ibb
##
## Call:  glm(formula = hr ~ year + age + g + pa + ab + r + h + X2b + X3b +
##     rbi + sb + cs + bb + so + ba + obp + slg + ops + ops. + tb +
##     gdp + hbp + sh + sf + ibb, data = mlb[, -c(1, 4, 5)])
##
## Coefficients:
## (Intercept)         year          age            g           pa
##  -6.749e-13    3.038e-16    6.384e-16   -4.268e-16   -6.105e-16
##          ab            r            h          X2b          X3b
##   7.840e-16   -4.720e-16   -3.333e-01   -3.333e-01   -6.667e-01
##         rbi           sb           cs           bb           so
##  -3.493e-16    6.364e-16   -2.228e-15    6.376e-16   -2.561e-16
##          ba          obp          slg          ops         ops.
##  -1.977e-13    1.984e-12    2.012e-12   -1.890e-12   -2.223e-16
##          tb          gdp          hbp           sh           sf
##   3.333e-01   -4.904e-16    8.365e-16    1.519e-17    5.197e-17
##         ibb
##  -2.958e-16
##
## Degrees of Freedom: 300 Total (i.e. Null);  275 Residual
## Null Deviance:       76810
## Residual Deviance: 3.6e-25   AIC: -17750
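One caveat about the call above: with direction = "forward", step() can only add terms, so when it is handed the full model there is nothing left to add and the search stops where it started. A genuine forward search begins at the intercept-only model and is told how far it may grow via scope; a minimal sketch:

# start from the null (intercept-only) model and allow growth up to the full formula
null.model <- glm(hr ~ 1, data = mlb[,-c(1,4,5)])
full.model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(null.model, scope = formula(full.model), direction = "forward")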
model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(model, direction = "backward")
## Start:  AIC=-17751.01
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + gdp + hbp +
##     sh + sf + ibb
##
##        Df Deviance      AIC
## - gdp   1     0.00 -17829.9
## - cs    1     0.00 -17795.2
## - g     1     0.00 -17761.6
## <none>        0.00 -17751.0
## - obp   1     0.00 -17680.9
## - slg   1     0.00 -17658.8
## - hbp   1     0.00 -17650.1
## - ba    1     0.00 -17631.0
## - bb    1     0.00 -17606.3
## - ops   1     0.00 -17603.1
## - sb    1     0.00 -17602.5
## - ab    1     0.00 -17596.9
## - sh    1     0.00 -17569.4
## - so    1     0.00 -17562.1
## - sf    1     0.00 -17551.1
## - ops.  1     0.00 -17527.2
## - age   1     0.00 -17500.2
## - pa    1     0.00 -17496.1
## - ibb   1     0.00 -17475.2
## - rbi   1     0.00 -17470.4
## - year  1     0.00 -17466.8
## - r     1     0.00 -17392.9
## - X3b   1   355.25    956.1
## - h     1   807.63   1203.3
## - X2b   1   883.14   1230.2
## - tb    1  2659.33   1562.0
##
## Step:  AIC=-17829.86
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + hbp + sh +
##     sf + ibb
##
##        Df Deviance      AIC
## <none>        0.00 -17829.9
## - cs    1     0.00 -17788.6
## - hbp   1     0.00 -17768.9
## - sh    1     0.00 -17741.0
## - ba    1     0.00 -17711.9
## - so    1     0.00 -17707.0
## - sb    1     0.00 -17674.4
## - ibb   1     0.00 -17630.9
## - slg   1     0.00 -17626.3
## - sf    1     0.00 -17588.2
## - rbi   1     0.00 -17550.6
## - g     1     0.00 -17545.1
## - ab    1     0.00 -17542.8
## - obp   1     0.00 -17542.3
## - year  1     0.00 -17496.1
## - ops   1     0.00 -17495.9
## - bb    1     0.00 -17468.3
## - age   1     0.00 -17458.4
## - ops.  1     0.00 -17447.4
## - r     1     0.00 -17439.9
## - pa    1     0.00 -17426.9
## - X3b   1   356.11    954.8
## - h     1   812.98   1203.3
## - X2b   1   884.15   1228.5
## - tb    1  2680.10   1562.3
##
## Call:  glm(formula = hr ~ year + age + g + pa + ab + r + h + X2b + X3b +
##     rbi + sb + cs + bb + so + ba + obp + slg + ops + ops. + tb +
##     hbp + sh + sf + ibb, data = mlb[, -c(1, 4, 5)])
##
## Coefficients:
## (Intercept)         year          age            g           pa
##  -2.302e-13    8.209e-17    7.904e-16   -6.450e-16    4.581e-16
##          ab            r            h          X2b          X3b
##  -2.172e-16   -6.172e-16   -3.333e-01   -3.333e-01   -6.667e-01
##         rbi           sb           cs           bb           so
##  -3.128e-16    6.912e-16   -2.197e-15   -2.952e-16   -2.455e-16
##          ba          obp          slg          ops         ops.
##  -1.900e-13    1.957e-12    1.990e-12   -1.869e-12   -2.164e-16
##          tb          hbp           sh           sf          ibb
##   3.333e-01   -9.068e-17   -8.831e-16   -9.065e-16   -3.007e-16
##
## Degrees of Freedom: 300 Total (i.e. Null);  276 Residual
## Null Deviance:       76810
## Residual Deviance: 2.789e-25     AIC: -17830
model <- glm(hr ~ ., data = mlb[,-c(1,4,5)])
step(model, direction = "both")
## Start:  AIC=-17751.01
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + gdp + hbp +
##     sh + sf + ibb
##
##        Df Deviance      AIC
## - gdp   1     0.00 -17829.9
## - cs    1     0.00 -17795.2
## - g     1     0.00 -17761.6
## <none>        0.00 -17751.0
## - obp   1     0.00 -17680.9
## - slg   1     0.00 -17658.8
## - hbp   1     0.00 -17650.1
## - ba    1     0.00 -17631.0
## - bb    1     0.00 -17606.3
## - ops   1     0.00 -17603.1
## - sb    1     0.00 -17602.5
## - ab    1     0.00 -17596.9
## - sh    1     0.00 -17569.4
## - so    1     0.00 -17562.1
## - sf    1     0.00 -17551.1
## - ops.  1     0.00 -17527.2
## - age   1     0.00 -17500.2
## - pa    1     0.00 -17496.1
## - ibb   1     0.00 -17475.2
## - rbi   1     0.00 -17470.4
## - year  1     0.00 -17466.8
## - r     1     0.00 -17392.9
## - X3b   1   355.25    956.1
## - h     1   807.63   1203.3
## - X2b   1   883.14   1230.2
## - tb    1  2659.33   1562.0
##
## Step:  AIC=-17829.86
## hr ~ year + age + g + pa + ab + r + h + X2b + X3b + rbi + sb +
##     cs + bb + so + ba + obp + slg + ops + ops. + tb + hbp + sh +
##     sf + ibb
##
##        Df Deviance      AIC
## <none>        0.00 -17829.9
## - cs    1     0.00 -17788.6
## - hbp   1     0.00 -17768.9
## + gdp   1     0.00 -17751.0
## - sh    1     0.00 -17741.0
## - ba    1     0.00 -17711.9
## - so    1     0.00 -17707.0
## - sb    1     0.00 -17674.4
## - ibb   1     0.00 -17630.9
## - slg   1     0.00 -17626.3
## - sf    1     0.00 -17588.2
## - rbi   1     0.00 -17550.6
## - g     1     0.00 -17545.1
## - ab    1     0.00 -17542.8
## - obp   1     0.00 -17542.3
## - year  1     0.00 -17496.1
## - ops   1     0.00 -17495.9
## - bb    1     0.00 -17468.3
## - age   1     0.00 -17458.4
## - ops.  1     0.00 -17447.4
## - r     1     0.00 -17439.9
## - pa    1     0.00 -17426.9
## - X3b   1   356.11    954.8
## - h     1   812.98   1203.3
## - X2b   1   884.15   1228.5
## - tb    1  2680.10   1562.3
##
## Call:  glm(formula = hr ~ year + age + g + pa + ab + r + h + X2b + X3b +
##     rbi + sb + cs + bb + so + ba + obp + slg + ops + ops. + tb +
##     hbp + sh + sf + ibb, data = mlb[, -c(1, 4, 5)])
##
## Coefficients:
## (Intercept)         year          age            g           pa
##  -2.302e-13    8.209e-17    7.904e-16   -6.450e-16    4.581e-16
##          ab            r            h          X2b          X3b
##  -2.172e-16   -6.172e-16   -3.333e-01   -3.333e-01   -6.667e-01
##         rbi           sb           cs           bb           so
##  -3.128e-16    6.912e-16   -2.197e-15   -2.952e-16   -2.455e-16
##          ba          obp          slg          ops         ops.
##  -1.900e-13    1.957e-12    1.990e-12   -1.869e-12   -2.164e-16
##          tb          hbp           sh           sf          ibb
##   3.333e-01   -9.068e-17   -8.831e-16   -9.065e-16   -3.007e-16
##
## Degrees of Freedom: 300 Total (i.e. Null);  276 Residual
## Null Deviance:       76810
## Residual Deviance: 2.789e-25     AIC: -17830
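The near-zero residual deviance is no accident: total bases satisfy tb = h + X2b + 2*X3b + 3*hr (a single is one base, a double two, a triple three, a home run four), so hr is an exact linear function of four of the predictors, and the fitted coefficients (-1/3, -1/3, -2/3, 1/3 on h, X2b, X3b, tb) recover precisely that identity. A quick sanity check:

# hr = (tb - h - X2b - 2*X3b) / 3 for any standard batting line
all.equal(mlb$hr, (mlb$tb - mlb$h - mlb$X2b - 2 * mlb$X3b) / 3)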
Random Forest for regression
library(randomForest)
# fit on the full data set this time (id, tm, lg included)
model <- randomForest(hr ~ ., data = mlb)
model$importance
##      IncNodePurity
## id      3676.41386
## year     254.87907
## age      138.41147
## tm      2620.14926
## lg        64.37073
## g        388.26508
## pa       457.28427
## ab       224.33838
## r       2163.01282
## h        499.86113
## X2b      437.15829
## X3b      109.82679
## rbi    21934.50808
## sb       148.75438
## cs       126.55827
## bb       894.75981
## so      8368.15834
## ba       459.42495
## obp      299.90717
## slg    14496.51064
## ops     4028.04810
## ops.    1931.42202
## tb     10987.49095
## gdp      200.69173
## hbp      385.94407
## sh       158.49883
## sf       179.80985
## ibb      290.81977
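Here rbi, slg, tb, and so dominate. Rather than scanning the raw table, we can let randomForest plot the importances, or sort them ourselves; a short sketch:

# dot chart of variable importance, most important at the top
varImpPlot(model)

# or sort the raw scores from largest to smallest
imp <- model$importance
imp[order(-imp[, "IncNodePurity"]), , drop = FALSE]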
Random Forest for classification
nba <- read.csv("NBA Playoffs.csv")
# note: Oppo is listed twice here, so Opp survives this subset
# (it is dropped in the later blocks)
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Oppo))
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train)
head(model$importance)
##          MeanDecreaseGini
## Team            7.9112791
## Opp            13.5127449
## Number          1.1053539
## Round           0.6088538
## Game            1.0720260
## Location        1.8388409
Using the NBA Playoffs.csv data set, complete the following:

1. Rank the features by importance and extract the names of the five most important ones:
feats.df <- data.frame(model$importance)
# sort the features from most to least important
feats.df <- feats.df[order(-feats.df$MeanDecreaseGini), , drop = FALSE]
top.feats <- rownames(feats.df)[1:5]
2. Using those top 5 features and a random forest, predict the outcome for the Miami Heat, using the rest of the data for training. Output a single number that gauges the accuracy of the model.
3. Repeat (1.) and (2.) using the top 10 and top 20 most important features.
Top 5

feats.df <- data.frame(model$importance)
feats.df <- feats.df[order(-feats.df$MeanDecreaseGini), , drop = FALSE]
top.feats <- rownames(feats.df)[1:5]
# keep only the top features plus Team (for the train/test split) and Outcome (the response)
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
# proportion of Heat games classified correctly
sum(diag(sum.table)) / sum(sum.table)
## [1] 0.9908257
Top 10
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
top.feats <- rownames(feats.df)[1:10]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table)) / sum(sum.table)
## [1] 0.9816514
Top 20
nba <- read.csv("NBA Playoffs.csv")
nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
top.feats <- rownames(feats.df)[1:20]
nba <- nba[, which(colnames(nba) %in% c(top.feats, "Team", "Outcome"))]
nba.train <- nba[-which(nba$Team == "Heat"),]
nba.test <- nba[nba$Team == "Heat",]
model <- randomForest(Outcome ~ ., data = nba.train[,-1])
prediction <- predict(model, nba.test)
sum.table <- table(prediction, nba.test$Outcome)
sum(diag(sum.table)) / sum(sum.table)
## [1] 0.9724771
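The three solutions above differ only in how many top features they keep, so the train/predict/score steps can be wrapped in a helper. A sketch, assuming feats.df from the earlier full-feature fit is still in scope; rf.accuracy is our own name, and the accuracies will wobble a little between runs unless a seed is set:

rf.accuracy <- function(k) {
  nba <- read.csv("NBA Playoffs.csv")
  nba <- subset(nba, select = -c(Date, Points, Oppo, Diff, Opp))
  nba$Outcome <- as.factor(nba$Outcome)   # ensure classification, not regression
  keep <- c(rownames(feats.df)[1:k], "Team", "Outcome")
  nba <- nba[, colnames(nba) %in% keep]
  nba.train <- nba[nba$Team != "Heat", ]  # train on every other team
  nba.test  <- nba[nba$Team == "Heat", ]  # hold out the Heat
  fit <- randomForest(Outcome ~ ., data = subset(nba.train, select = -Team))
  mean(predict(fit, nba.test) == nba.test$Outcome)  # classification accuracy
}

set.seed(1)
sapply(c(5, 10, 20), rf.accuracy)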