Outline

  • Talk about common regression algorithms
  • Do some simple exercises

Regression

Simply put, regression is concerned with modeling the relationship between input variables and an output variable. For instance, the input variables could be passing yards per game, interceptions, sacks, etc., and the output could be points scored.

The distinguishing characteristic between classification and regression is that regression attempts to predict a numerical value. That is, regression should not be used if you want to determine whether a team wins or loses, but rather if you want to predict how many points a team scores.

Predictive Modeling

In this sense, a machine learning algorithm is a model with input factors, such as passing yards per game, interceptions, and sacks, and an output, say, win/loss or total points scored.

Predictive modeling can be separated into two groups:

  • Classification: Predicting a categorical variable, e.g. win/loss, above/below, low/medium/high, etc.
  • Regression: Predicting a numerical value, e.g. points scored in a game, number of field goals, etc.

Examples of Machine Learning Algorithms for Regression

  • Ordinary Least Squares: Finds the plane that minimizes the sum of squared errors between the observed values and the predicted responses
  • Ridge Regression: Penalized ordinary least squares using a second-order (L2) penalty term
  • Least Absolute Shrinkage and Selection Operator (LASSO): Penalized ordinary least squares using a first-order (L1) penalty term
  • Elastic Net: A combination of both the ridge regression and LASSO penalties
  • Many many many more!

In some sense, you can think of Ridge Regression and LASSO as special cases of the Elastic Net.
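A quick sketch of that relationship (using the glmnet package, which appears later in this lesson, and simulated data, so the variable names here are illustrative only): one fitting function covers all three models, and only the alpha mixing parameter changes.

```r
# Sketch only: simulated data; assumes the glmnet package is installed
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)        # 100 rows, 5 predictors
y <- x %*% c(2, -1, 0, 0, 0.5) + rnorm(100)  # response with some noise

ridge   <- glmnet(x, y, alpha = 0)    # second-order (L2) penalty only
lasso   <- glmnet(x, y, alpha = 1)    # first-order (L1) penalty only
elastic <- glmnet(x, y, alpha = 0.5)  # a mix of both penalties
```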

Applying a Model

Let's take our MLB statistics and see if we can predict the number of home runs (hr) in a season for Sammy Sosa (SosaSa), using everyone else as training data.

For this we will use the ordinary least squares model, which essentially constructs a line of best fit through all the data points in multiple dimensions.

mlb <- read.csv("MLB Stats.csv") # Load data

mlb <- subset(mlb, select = -c(tm,lg)) # Remove tm and lg since they are categorical

mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa

model <- lm(hr ~., data = mlb.train[,-1])
# Builds linear model predicting hr (home runs) from all remaining variables
# mlb.train[,-1] drops the id column; tm and lg were already removed above

model # See model output
## 
## Call:
## lm(formula = hr ~ ., data = mlb.train[, -1])
## 
## Coefficients:
## (Intercept)         year          age            g           pa  
##   1.337e-12   -6.412e-16   -2.194e-16   -5.575e-16    4.149e-16  
##          ab            r            h          X2b          X3b  
##  -3.158e-16   -3.776e-17   -3.333e-01   -3.333e-01   -6.667e-01  
##         rbi           sb           cs           bb           so  
##   1.948e-16    2.483e-16   -3.314e-16   -4.619e-16   -2.663e-17  
##          ba          obp          slg          ops         ops.  
##  -8.563e-14    1.055e-12    8.167e-13   -8.554e-13   -1.609e-16  
##          tb          gdp          hbp           sh           sf  
##   3.333e-01   -3.056e-16   -5.068e-16   -8.591e-16   -6.112e-16  
##         ibb  
##  -1.171e-16

prediction <- predict(model,mlb.test) 
# Run prediction function based on our model
prediction
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  4  1  3 15 10  8 33 25 36 40 36 66 63 50 64 49 40 35 14 21

Note: A regression coefficient represents the mean change in the predicted value for a one-unit change in the corresponding predictor variable, holding the other predictors constant.
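A small illustration of this interpretation, using R's built-in mtcars data rather than our MLB data:

```r
# In mpg ~ wt + hp, the wt coefficient is the mean change in predicted
# mpg for a one-unit (1000 lb) increase in weight, holding hp constant.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)["wt"]  # roughly -3.9: each extra 1000 lb lowers predicted mpg by ~3.9
```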

To summarize a few things,

lm(formula,data) fits an ordinary least squares model, where hr is the response variable we want to predict. The ~. indicates that we are going to predict hr using the rest of the columns. Lastly, data = mlb.train[,-1] defines the data argument to be our mlb data without the first column, i.e. id.

Now, predict(model,data) gives us a prediction of hr using the model built on the training set and the test data we supply.
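The same train/predict pattern can be sketched on R's built-in mtcars data, holding out a few rows as a test set (the split here is arbitrary, for illustration only):

```r
# Train on all cars except the first five, then predict for those five
train <- mtcars[-(1:5), ]
test  <- mtcars[1:5, ]

m <- lm(mpg ~ wt + hp, data = train)  # fit on the training rows only
predict(m, test)                      # predicted mpg for the held-out rows
```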

Create a data frame to easily visualize the results

# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction)

# Selects only relevant columns
results <- results[c("id","year","hr","prediction")]
results
##        id year hr prediction
## 1  SosaSa 1989  4          4
## 2  SosaSa 1989  1          1
## 3  SosaSa 1989  3          3
## 4  SosaSa 1990 15         15
## 5  SosaSa 1991 10         10
## 6  SosaSa 1992  8          8
## 7  SosaSa 1993 33         33
## 8  SosaSa 1994 25         25
## 9  SosaSa 1995 36         36
## 10 SosaSa 1996 40         40
## 11 SosaSa 1997 36         36
## 12 SosaSa 1998 66         66
## 13 SosaSa 1999 63         63
## 14 SosaSa 2000 50         50
## 15 SosaSa 2001 64         64
## 16 SosaSa 2002 49         49
## 17 SosaSa 2003 40         40
## 18 SosaSa 2004 35         35
## 19 SosaSa 2005 14         14
## 20 SosaSa 2007 21         21

In this scenario we are able to predict Sammy Sosa's home run totals perfectly. However, this is an extreme case, and in general we will not be as accurate. What is happening here is that home runs are an exact linear function of other columns: total bases satisfy tb = h + X2b + 2*X3b + 3*hr, so the model can recover hr = (tb - h - X2b - 2*X3b)/3 exactly, which matches the 1/3, -1/3, and -2/3 coefficients in the output above.

Let's do it again, but this time we are going to attempt to predict the age of Sammy Sosa throughout his career. If we do this, we should not be as accurate as in the previous example. The reason is that no column determines age exactly, and few features correlate strongly with it; hence, we should not be able to accurately predict his age. So let's see what happens!

mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa

model <- lm(age ~., data = mlb.train[,-1])
# Builds linear model predicting age from all remaining variables
# mlb.train[,-1] drops the id column; tm and lg were already removed above

prediction.age <- predict(model,mlb.test) 
# Run prediction function based on our model

# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age)

# Selects only relevant columns
results <- results[c("id","age","prediction.age")]

head(results,n = 15)
##        id age prediction.age
## 1  SosaSa  20       23.27857
## 2  SosaSa  20       22.93880
## 3  SosaSa  20       23.20457
## 4  SosaSa  21       18.40951
## 5  SosaSa  22       24.47351
## 6  SosaSa  23       23.91660
## 7  SosaSa  24       22.12227
## 8  SosaSa  25       23.46654
## 9  SosaSa  26       23.26329
## 10 SosaSa  27       23.25116
## 11 SosaSa  28       23.05700
## 12 SosaSa  29       21.71412
## 13 SosaSa  30       25.07838
## 14 SosaSa  31       27.50652
## 15 SosaSa  32       27.46382

It seems our hypothesis was right! It was not easy for our model to predict age, since there were not many features that correlate heavily with age. This does not imply that we cannot predict age accurately; we may just have to apply a more sophisticated machine learning technique.

A More Complex Example Using LASSO

Here we run the same experiment, but this time using the LASSO. Note that our model function, glmnet(), has an alpha parameter. When alpha = 0, we have ridge regression. When alpha = 1, we have the LASSO, and any value in between constitutes an elastic net.

library(glmnet)

mlb <- read.csv("MLB Stats.csv")

mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa

mlb.train <- subset(mlb.train, select = -c(id,tm,lg)) # Removes features
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age)) # Removes features
  
mlb.trainx <- as.matrix(subset(mlb.train,select = -age)) # Turns data frame into matrix
mlb.trainy <- as.matrix(subset(mlb.train,select = age)) # Turns data frame into matrix


model <- glmnet(mlb.trainx,mlb.trainy, alpha = 1)
# Fits the LASSO path predicting age from all other variables
# (id, tm, and lg were removed above since they are not numeric)

prediction.age.lasso <- predict(model,as.matrix(mlb.test)) # Predict on test data

prediction.age.lasso
##          s0       s1       s2       s3       s4       s5       s6       s7
## 1  30.85765 30.53101 30.23339 29.96221 29.71512 29.48998 29.28485 29.09793
## 2  30.85765 30.53101 30.23339 29.96221 29.71512 29.48998 29.28485 29.09793
## 3  30.85765 30.53101 30.23339 29.96221 29.71512 29.48998 29.28485 29.09793
## 4  30.85765 30.57728 30.32181 30.08904 29.87694 29.68369 29.50761 29.34717
## 5  30.85765 30.62354 30.41022 30.21586 30.03876 29.87740 29.73037 29.59640
## 6  30.85765 30.66980 30.49864 30.34268 30.20058 30.07110 29.95313 29.84564
## 7  30.85765 30.71606 30.58706 30.46951 30.36240 30.26481 30.17589 30.09487
## 8  30.85765 30.76233 30.67547 30.59633 30.52422 30.45852 30.39865 30.34410
## 9  30.85765 30.80859 30.76389 30.72315 30.68604 30.65223 30.62141 30.59334
## 10 30.85765 30.85485 30.85230 30.84998 30.84786 30.84593 30.84417 30.84257
## 11 30.85765 30.90112 30.94072 30.97680 31.00968 31.03964 31.06694 31.09181
## 12 30.85765 30.94738 31.02913 31.10363 31.17150 31.23335 31.28970 31.34104
## 13 30.85765 30.99364 31.11755 31.23045 31.33332 31.42705 31.51246 31.59028
## 14 30.85765 31.03990 31.20596 31.35727 31.49514 31.62076 31.73522 31.83951
## 15 30.85765 31.08617 31.29438 31.48410 31.65696 31.81447 31.95798 32.08875
## 16 30.85765 31.13243 31.38280 31.61092 31.81878 32.00817 32.18074 32.33798
## 17 30.85765 31.17869 31.47121 31.73774 31.98060 32.20188 32.40350 32.58721
## 18 30.85765 31.22495 31.55963 31.86457 32.14242 32.39559 32.62626 32.83645
## 19 30.85765 31.27122 31.64804 31.99139 32.30424 32.58929 32.84903 33.08568
## 20 30.85765 31.36374 31.82487 32.24504 32.62788 32.97671 33.29455 33.58415
##          s8       s9      s10      s11      s12      s13      s14      s15
## 1  28.92762 28.77244 28.63105 28.50222 28.38483 28.27787 28.20146 28.14917
## 2  28.92762 28.77244 28.63105 28.50222 28.38483 28.27787 28.25868 28.30568
## 3  28.92762 28.77244 28.63105 28.50222 28.38483 28.27787 28.24385 28.26510
## 4  29.20098 29.06778 28.94641 28.83582 28.73506 28.64325 28.36520 27.95160
## 5  29.47433 29.36311 29.26177 29.16943 29.08530 29.00863 28.85739 28.65248
## 6  29.74769 29.65845 29.57713 29.50304 29.43553 29.37402 29.31356 29.25483
## 7  30.02105 29.95378 29.89249 29.83665 29.78576 29.73940 29.54299 29.23695
## 8  30.29440 30.24912 30.20785 30.17025 30.13600 30.10478 30.01611 29.88566
## 9  30.56776 30.54445 30.52321 30.50386 30.48623 30.47016 30.30911 30.04168
## 10 30.84111 30.83978 30.83857 30.83747 30.83646 30.83555 30.69111 30.44114
## 11 31.11447 31.13512 31.15393 31.17108 31.18670 31.20093 30.98834 30.60875
## 12 31.38783 31.43045 31.46929 31.50468 31.53693 31.56631 31.37670 31.02560
## 13 31.66118 31.72579 31.78465 31.83829 31.88716 31.93169 31.75870 31.42507
## 14 31.93454 32.02112 32.10001 32.17190 32.23740 32.29708 32.14706 31.84193
## 15 32.20789 32.31646 32.41537 32.50551 32.58763 32.66246 32.56085 32.32834
## 16 32.48125 32.61179 32.73074 32.83911 32.93786 33.02784 32.96192 32.77997
## 17 32.75460 32.90712 33.04610 33.17272 33.28810 33.39322 33.34604 33.18523
## 18 33.02796 33.20246 33.36146 33.50633 33.63833 33.75860 33.74923 33.64266
## 19 33.30132 33.49779 33.67682 33.83994 33.98856 34.12399 34.23507 34.32616
## 20 33.84803 34.08846 34.30754 34.50715 34.68903 34.85475 34.93973 34.96279
##         s16      s17      s18      s19      s20      s21      s22      s23
## 1  28.10153 28.05812 28.01857 27.98253 27.94412 27.90589 27.63734 27.36136
## 2  28.34850 28.38752 28.42308 28.45547 28.47829 28.47312 28.24949 28.00482
## 3  28.28447 28.30212 28.31821 28.33286 28.34131 28.29181 27.97314 27.64903
## 4  27.57476 27.23140 26.91854 26.63348 26.36664 26.08917 25.36836 24.71272
## 5  28.46579 28.29568 28.14068 27.99945 27.86331 27.73913 27.47521 27.23964
## 6  29.20131 29.15256 29.10813 29.06765 29.02338 28.96726 28.61866 28.26558
## 7  28.95811 28.70403 28.47253 28.26160 28.06460 27.83974 27.35518 26.89584
## 8  29.76681 29.65851 29.55984 29.46993 29.37947 29.24405 28.69543 28.13583
## 9  29.79801 29.57599 29.37369 29.18937 29.02157 28.85166 28.47549 28.11596
## 10 30.21339 30.00587 29.81678 29.64449 29.48248 29.29574 28.89652 28.49718
## 11 30.26288 29.94774 29.66060 29.39897 29.15717 28.91350 28.30673 27.74076
## 12 30.70570 30.41422 30.14863 29.90664 29.68785 29.49145 29.00572 28.48512
## 13 31.12108 30.84410 30.59172 30.36177 30.14775 29.92875 29.47758 28.99926
## 14 31.56390 31.31058 31.07976 30.86944 30.68455 30.54739 30.25570 29.97357
## 15 32.11649 31.92345 31.74757 31.58731 31.46680 31.47770 31.40194 31.29783
## 16 32.61419 32.46313 32.32549 32.20008 32.08912 32.00460 31.92592 31.84400
## 17 33.03871 32.90521 32.78356 32.67273 32.56887 32.46391 32.30181 32.15073
## 18 33.54556 33.45709 33.37647 33.30301 33.22837 33.12814 32.99235 32.86919
## 19 34.40914 34.48476 34.55366 34.61644 34.66648 34.67807 34.64153 34.62842
## 20 34.98378 35.00292 35.02035 35.03623 35.04243 35.01764 34.96371 34.93770
##         s24      s25      s26      s27      s28      s29      s30      s31
## 1  27.12337 26.88712 26.64907 26.43228 26.23193 26.05661 25.93021 25.82132
## 2  27.78023 27.56533 27.35767 27.16829 26.98202 26.80592 26.65138 26.48940
## 3  27.36387 27.09385 26.84414 26.61563 26.39643 26.20404 26.03384 25.87506
## 4  24.19953 23.67150 23.12467 22.62921 22.19346 21.81405 21.49190 21.17994
## 5  27.06611 26.86455 26.61976 26.39897 26.20684 26.03976 25.92645 25.85942
## 6  27.96061 27.66533 27.37953 27.11908 26.87908 26.66713 26.49607 26.34211
## 7  26.50146 26.09175 25.63140 25.21455 24.86006 24.55246 24.26078 24.03458
## 8  27.62657 27.13218 26.62595 26.16720 25.78617 25.45520 25.18756 24.95801
## 9  27.80648 27.49063 27.14786 26.83728 26.57051 26.33634 26.02721 25.69038
## 10 28.13138 27.75812 27.33826 26.95688 26.62354 26.33522 26.04788 25.79492
## 11 27.27554 26.79181 26.26238 25.78549 25.40638 25.08098 24.77344 24.45362
## 12 27.97141 27.47006 26.92484 26.43304 26.05858 25.73681 25.43649 25.17115
## 13 28.52619 28.07581 27.60659 27.18299 26.85602 26.57095 26.36578 26.22688
## 14 29.72879 29.48137 29.21690 28.98057 28.82300 28.69446 28.56225 28.41173
## 15 31.18084 31.04990 30.86855 30.70495 30.58033 30.48036 30.37361 30.24356
## 16 31.75900 31.68535 31.62141 31.56623 31.55503 31.54450 31.57066 31.65402
## 17 32.02163 31.88254 31.71699 31.56816 31.45504 31.36204 31.29225 31.24179
## 18 32.76574 32.65821 32.53837 32.43257 32.37278 32.32203 32.31066 32.35356
## 19 34.64665 34.65151 34.64866 34.64850 34.66395 34.67429 34.71127 34.77056
## 20 34.94619 34.92955 34.88081 34.83908 34.82289 34.81427 34.75587 34.62890
##         s32      s33      s34      s35      s36      s37      s38      s39
## 1  25.72202 25.63168 25.54938 25.47438 25.40566 25.34338 25.27575 25.21349
## 2  26.34318 26.21004 26.08873 25.97822 25.87722 25.78549 25.67252 25.57385
## 3  25.73113 25.60014 25.48079 25.37205 25.27254 25.18227 25.09614 25.02346
## 4  20.89643 20.63886 20.40418 20.19037 19.99359 19.81602 19.67956 19.55238
## 5  25.79605 25.73862 25.68629 25.63859 25.59445 25.55476 25.54097 25.51820
## 6  26.20210 26.07473 25.95867 25.85293 25.75608 25.66828 25.57201 25.48233
## 7  23.82487 23.63417 23.46041 23.30206 23.15697 23.02538 22.92014 22.80305
## 8  24.74716 24.55522 24.38032 24.22095 24.07538 23.94301 23.84414 23.73306
## 9  25.38516 25.10762 24.85476 24.62439 24.41291 24.22163 24.06263 23.93632
## 10 25.56243 25.35091 25.15817 24.98255 24.82183 24.67594 24.53150 24.38641
## 11 24.16293 23.89876 23.65808 23.43879 23.23715 23.05503 22.88928 22.74532
## 12 24.92668 24.70406 24.50120 24.31634 24.14773 23.99418 23.79536 23.59519
## 13 26.09647 25.97751 25.86909 25.77027 25.68080 25.59869 25.52039 25.41341
## 14 28.27556 28.15182 28.03909 27.93639 27.84188 27.75660 27.66553 27.58899
## 15 30.12525 30.01765 29.91960 29.83028 29.74838 29.67420 29.51836 29.36168
## 16 31.72731 31.79388 31.85452 31.90975 31.96074 32.00655 31.98512 31.93622
## 17 31.19468 31.15191 31.11293 31.07741 31.04472 31.01518 30.94303 30.86817
## 18 32.38998 32.42314 32.45335 32.48085 32.50610 32.52887 32.55530 32.55900
## 19 34.82378 34.87235 34.91660 34.95692 34.99351 35.02695 35.07681 35.11612
## 20 34.51655 34.41466 34.32184 34.23731 34.15886 34.08871 34.02829 33.99832
##         s40      s41      s42      s43      s44      s45      s46      s47
## 1  25.16654 25.12205 25.08199 25.03378 24.95781 24.88408 24.81706 24.75646
## 2  25.48878 25.41122 25.34097 25.26916 25.08712 24.90324 24.73703 24.58389
## 3  24.96284 24.90774 24.85776 24.80564 24.76135 24.71959 24.68137 24.64853
## 4  19.42947 19.31977 19.22010 19.12589 19.05475 18.99299 18.93640 18.88766
## 5  25.50619 25.49272 25.48096 25.45190 25.42367 25.40275 25.38356 25.37027
## 6  25.40503 25.33414 25.26999 25.20281 25.12182 25.04403 24.97319 24.91061
## 7  22.69489 22.59596 22.50641 22.41366 22.36118 22.31907 22.27995 22.24632
## 8  23.61840 23.51571 23.42240 23.33016 23.21176 23.09768 22.99398 22.90152
## 9  23.82395 23.72462 23.63432 23.57143 23.54018 23.50975 23.48133 23.45992
## 10 24.25546 24.13620 24.02826 23.91711 23.82714 23.74448 23.66857 23.60029
## 11 22.62492 22.51469 22.41498 22.34053 22.31599 22.29560 22.27586 22.26085
## 12 23.42618 23.26895 23.12703 22.99646 22.93242 22.87365 22.81846 22.76990
## 13 25.30797 25.21097 25.12323 25.04118 24.98270 24.92771 24.87697 24.83164
## 14 27.52737 27.47116 27.42048 27.39000 27.34816 27.30294 27.26135 27.22722
## 15 29.21763 29.08659 28.96829 28.84642 28.75973 28.67657 28.59991 28.53803
## 16 31.89082 31.84726 31.80835 31.75351 31.71011 31.67111 31.63523 31.60330
## 17 30.81155 30.75707 30.70828 30.65402 30.62423 30.59877 30.57484 30.55514
## 18 32.56083 32.56120 32.56189 32.55279 32.53425 32.51839 32.50394 32.49338
## 19 35.15050 35.18128 35.20934 35.23310 35.24555 35.25929 35.27202 35.27986
## 20 33.97708 33.95933 33.94324 33.93332 33.91908 33.90423 33.89064 33.87912
##         s48      s49      s50      s51      s52      s53      s54      s55
## 1  24.68947 24.63077 24.57693 24.44574 24.33565 24.23790 24.17698 24.11233
## 2  24.43862 24.30751 24.18780 24.04491 23.93331 23.83359 23.76424 23.70564
## 3  24.61039 24.57760 24.54730 24.42879 24.31866 24.21921 24.15238 24.08722
## 4  18.83526 18.79001 18.74845 18.66512 18.60200 18.54874 18.54811 18.53070
## 5  25.36180 25.35564 25.34986 25.24033 25.16613 25.10536 25.09323 25.04742
## 6  24.84724 24.79152 24.74046 24.66443 24.59210 24.52636 24.48035 24.44179
## 7  22.20712 22.17334 22.14237 22.13482 22.12396 22.11566 22.12752 22.13176
## 8  22.81625 22.73999 22.67058 22.72478 22.77565 22.82096 22.85260 22.89287
## 9  23.44948 23.44002 23.43142 23.42918 23.41803 23.40760 23.40592 23.40594
## 10 23.52553 23.45958 23.39939 23.41761 23.42286 23.42678 23.43135 23.44315
## 11 22.23896 22.22107 22.20481 22.24667 22.27550 22.30228 22.35129 22.39580
## 12 22.71202 22.66142 22.61499 22.57184 22.51579 22.46497 22.42637 22.38781
## 13 24.78322 24.74050 24.70157 24.72405 24.74068 24.75562 24.76899 24.78256
## 14 27.20232 27.18040 27.16071 27.19449 27.22160 27.24528 27.24333 27.24529
## 15 28.50515 28.47489 28.44700 28.34212 28.23714 28.14282 28.07843 28.01279
## 16 31.56923 31.53943 31.51203 31.45745 31.39882 31.34524 31.30651 31.27346
## 17 30.53240 30.51322 30.49566 30.46922 30.43728 30.40863 30.39801 30.38591
## 18 32.48727 32.48262 32.47871 32.56563 32.63705 32.70026 32.74460 32.79587
## 19 35.26658 35.25613 35.24673 35.28917 35.32185 35.35021 35.36915 35.39745
## 20 33.86939 33.86080 33.85312 33.82509 33.80135 33.78030 33.77305 33.76442
##         s56      s57      s58      s59      s60      s61      s62      s63
## 1  24.04993 23.98888 23.93521 23.88631 23.84136 23.80064 23.76351 23.73023
## 2  23.65397 23.61461 23.57769 23.54278 23.50965 23.47870 23.45007 23.42430
## 3  24.02560 23.96544 23.91148 23.86240 23.81751 23.77712 23.74042 23.70747
## 4  18.50764 18.47765 18.45269 18.43053 18.41114 18.39381 18.37767 18.36170
## 5  24.99474 24.93063 24.87942 24.83431 24.79365 24.75752 24.72482 24.69493
## 6  24.40646 24.37530 24.34634 24.31956 24.29490 24.27243 24.25186 24.23345
## 7  22.13116 22.12563 22.12227 22.12013 22.11920 22.11883 22.11874 22.11933
## 8  22.93143 22.97397 23.01000 23.04210 23.07173 23.09786 23.12107 23.14240
## 9  23.40495 23.40485 23.40357 23.40274 23.40319 23.40394 23.40437 23.40425
## 10 23.45360 23.46426 23.47295 23.48092 23.48870 23.49566 23.50211 23.50841
## 11 22.43302 22.46439 22.49204 22.51764 22.54255 22.56534 22.58576 22.60340
## 12 22.34905 22.30742 22.27099 22.23873 22.21027 22.18500 22.16248 22.14418
## 13 24.79381 24.80384 24.81275 24.82102 24.82916 24.83644 24.84300 24.85110
## 14 27.24752 27.25313 27.25660 27.25924 27.26239 27.26493 27.26637 27.26759
## 15 27.94909 27.88106 27.82137 27.76765 27.71857 27.67450 27.63497 27.59961
## 16 31.24298 31.21382 31.18796 31.16445 31.14255 31.12279 31.10533 31.09111
## 17 30.37216 30.35503 30.34059 30.32786 30.31651 30.30650 30.29772 30.29009
## 18 32.84455 32.89424 32.93762 32.97668 33.01250 33.04464 33.07374 33.10063
## 19 35.42593 35.45752 35.48423 35.50791 35.52925 35.54827 35.56542 35.58108
## 20 33.75507 33.74535 33.73667 33.72890 33.72206 33.71598 33.71034 33.70358
##         s64      s65      s66      s67      s68      s69      s70      s71
## 1  23.70169 23.67547 23.65299 23.63255 23.61380 23.59645 23.57958 23.56424
## 2  23.40395 23.38551 23.36737 23.35124 23.33586 23.32091 23.30625 23.29211
## 3  23.67787 23.65106 23.62912 23.60880 23.59030 23.57337 23.55757 23.54338
## 4  18.34038 18.31990 18.30243 18.28779 18.27406 18.26005 18.24868 18.23843
## 5  24.66545 24.63601 24.60537 24.57665 24.55049 24.52633 24.50062 24.47763
## 6  24.21804 24.20436 24.19330 24.18402 24.17535 24.16653 24.15604 24.14581
## 7  22.12292 22.12567 22.12217 22.12030 22.11841 22.11533 22.11167 22.10803
## 8  23.16635 23.18886 23.20988 23.23259 23.25304 23.26977 23.28558 23.29920
## 9  23.40130 23.40014 23.39317 23.38785 23.38290 23.37676 23.36640 23.35643
## 10 23.51683 23.52468 23.53117 23.53925 23.54636 23.55142 23.55589 23.55892
## 11 22.61592 22.62865 22.63771 22.64910 22.65910 22.66715 22.67631 22.68425
## 12 22.13645 22.13169 22.12654 22.12305 22.12034 22.11584 22.10524 22.09508
## 13 24.87021 24.89009 24.90868 24.92839 24.94672 24.96166 24.97417 24.98582
## 14 27.27053 27.27619 27.28379 27.29412 27.30393 27.31317 27.32427 27.33458
## 15 27.56781 27.53999 27.52350 27.50813 27.49448 27.48145 27.46760 27.45478
## 16 31.08519 31.08004 31.07650 31.07282 31.06949 31.06522 31.05504 31.04495
## 17 30.28420 30.27878 30.27472 30.27137 30.26823 30.26477 30.25782 30.25072
## 18 33.12965 33.15663 33.18009 33.20406 33.22573 33.24519 33.26344 33.27939
## 19 35.59694 35.61164 35.62479 35.63834 35.65038 35.66080 35.67150 35.68063
## 20 33.68856 33.67391 33.66020 33.64679 33.63420 33.62266 33.61288 33.60356
##         s72      s73      s74      s75      s76      s77      s78      s79
## 1  23.55089 23.53893 23.52827 23.51844 23.51456 23.50066 23.49798 23.48771
## 2  23.27929 23.26757 23.25694 23.24707 23.24255 23.22940 23.22623 23.21590
## 3  23.53116 23.52033 23.51077 23.50204 23.49846 23.48639 23.48393 23.47511
## 4  18.22894 18.22031 18.21248 18.20541 18.20040 18.19327 18.18967 18.18309
## 5  24.45772 24.43980 24.42376 24.40880 24.40379 24.38101 24.37751 24.36122
## 6  24.13648 24.12798 24.12029 24.11323 24.10940 24.10061 24.09779 24.09091
## 7  22.10436 22.10101 22.09796 22.09533 22.09171 22.09127 22.08843 22.08716
## 8  23.31004 23.31946 23.32757 23.33523 23.33403 23.35024 23.34888 23.35879
## 9  23.34704 23.33839 23.33051 23.32340 23.31835 23.31080 23.30736 23.30053
## 10 23.56073 23.56213 23.56320 23.56434 23.56198 23.56712 23.56522 23.56762
## 11 22.69043 22.69590 22.70070 22.70548 22.70422 22.71534 22.71463 22.72070
## 12 22.08595 22.07775 22.07037 22.06381 22.05815 22.05217 22.04789 22.04285
## 13 24.99582 25.00492 25.01305 25.02085 25.02082 25.03597 25.03562 25.04566
## 14 27.34322 27.35091 27.35764 27.36406 27.36529 27.37615 27.37756 27.38364
## 15 27.44385 27.43416 27.42561 27.41778 27.41532 27.40317 27.40179 27.39312
## 16 31.03613 31.02819 31.02110 31.01450 31.01217 31.00219 31.00029 30.99370
## 17 30.24428 30.23842 30.23314 30.22828 30.22600 30.21933 30.21768 30.21277
## 18 33.29285 33.30482 33.31543 33.32528 33.32851 33.34359 33.34584 33.35618
## 19 35.68821 35.69490 35.70081 35.70625 35.70812 35.71650 35.71769 35.72348
## 20 33.59510 33.58734 33.58031 33.57378 33.57141 33.56174 33.56028 33.55288
##         s80      s81      s82      s83      s84      s85
## 1  23.48473 23.47625 23.47367 23.47006 23.46141 23.46129
## 2  23.21243 23.20279 23.19877 23.19268 23.17913 23.17849
## 3  23.47248 23.46570 23.46396 23.46157 23.45586 23.45582
## 4  18.18003 18.17533 18.17365 18.17200 18.17018 18.17030
## 5  24.35702 24.34420 24.34038 24.33587 24.32429 24.32373
## 6  24.08814 24.08256 24.08032 24.07752 24.07246 24.07181
## 7  22.08504 22.08500 22.08435 22.08475 22.08848 22.08805
## 8  23.35886 23.36649 23.36741 23.36919 23.37763 23.37716
## 9  23.29730 23.29296 23.29122 23.28942 23.28870 23.28783
## 10 23.56651 23.56905 23.56937 23.57046 23.57594 23.57575
## 11 22.72105 22.72723 22.72919 22.73229 22.74340 22.74461
## 12 22.03915 22.03614 22.03424 22.03298 22.03523 22.03377
## 13 25.04655 25.05476 25.05636 25.05921 25.06977 25.06947
## 14 27.38550 27.39054 27.39291 27.39494 27.40195 27.40364
## 15 27.39125 27.38494 27.38408 27.38267 27.37958 27.38020
## 16 30.99160 30.98607 30.98370 30.98099 30.97603 30.97449
## 17 30.21108 30.20752 30.20644 30.20536 30.20418 30.20370
## 18 33.35902 33.36723 33.36976 33.37310 33.38181 33.38247
## 19 35.72500 35.72949 35.73066 35.73237 35.73616 35.73646
## 20 33.55107 33.54542 33.54420 33.54240 33.53756 33.53753

The syntax for glmnet() is a bit different. The breakdown is as follows:

glmnet(x,y,alpha): x corresponds to the matrix of training data, y corresponds to the vector (or matrix) of the response variable, and alpha denotes which model we are going to use. For more detail, the documentation of the glmnet package is here: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf

As we can see, the glmnet() command produces predictions for an entire sequence of lambda (penalty) values, one column per value. However, this isn't very useful by itself. We should have the computer pick the best lambda!

Picking the Optimal Lambda

model.cv <- cv.glmnet(mlb.trainx,mlb.trainy, alpha = 1)

prediction.age.lasso <- predict(model.cv,as.matrix(mlb.test), 
                          s = "lambda.min")

prediction.age.lasso
##           1
## 1  25.72202
## 2  26.34318
## 3  25.73113
## 4  20.89643
## 5  25.79605
## 6  26.20210
## 7  23.82487
## 8  24.74716
## 9  25.38516
## 10 25.56243
## 11 24.16293
## 12 24.92668
## 13 26.09647
## 14 28.27556
## 15 30.12525
## 16 31.72731
## 17 31.19468
## 18 32.38998
## 19 34.82378
## 20 34.51655

# Creates data frame with Sammy Sosa and the predictions as columns
results <- data.frame(mlb[mlb$id == "SosaSa",],prediction.age.lasso)


# Selects only relevant columns
results <- results[c("id","age","X1")]
colnames(results) <- c("id","age","predicted age lasso")
head(results,n = 15)
##        id age predicted age lasso
## 1  SosaSa  20            25.72202
## 2  SosaSa  20            26.34318
## 3  SosaSa  20            25.73113
## 4  SosaSa  21            20.89643
## 5  SosaSa  22            25.79605
## 6  SosaSa  23            26.20210
## 7  SosaSa  24            23.82487
## 8  SosaSa  25            24.74716
## 9  SosaSa  26            25.38516
## 10 SosaSa  27            25.56243
## 11 SosaSa  28            24.16293
## 12 SosaSa  29            24.92668
## 13 SosaSa  30            26.09647
## 14 SosaSa  31            28.27556
## 15 SosaSa  32            30.12525

In short, cv.glmnet() uses cross-validation to choose an optimal lambda value from the sequence. The details are outside the scope of this course; however, for the curious learner there are many online resources that discuss cross-validation.
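A minimal sketch of that idea on simulated data (the variable names here are illustrative; assumes the glmnet package): cv.glmnet() fits the whole lambda sequence, estimates out-of-sample error for each value by cross-validation, and stores the winner in lambda.min.

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)        # simulated predictors
y <- x %*% c(2, -1, 0, 0, 0.5) + rnorm(100)  # simulated response

cvfit <- cv.glmnet(x, y, alpha = 1)         # LASSO with cross-validation
cvfit$lambda.min                            # lambda with lowest CV error
predict(cvfit, x[1:3, ], s = "lambda.min")  # predictions at that lambda
```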

Your Turn

Using the MLB Stats.csv file, predict the age of Barry Bonds (BondsBa) over his entire career, using all other players as training data, with:

  1. Ordinary Least Squares

  2. Ridge Regression (alpha = 0) with an optimal lambda

Answers

1.

mlb <- read.csv("MLB Stats.csv")

mlb <- subset(mlb, select = -c(tm,lg))

mlb.train <- mlb[-which(mlb$id == "BondsBa"),] # Selects rows without BondsBa
mlb.test <- mlb[mlb$id == "BondsBa",] # Selects rows with BondsBa

model <- lm(age ~., data = mlb.train[,-1])
# Builds linear model predicting age from all remaining variables
# mlb.train[,-1] drops the id column; tm and lg were already removed above

prediction <- predict(model,mlb.test) 
# Run prediction function based on our model

# Creates data frame with Barry Bonds and the predictions as columns
results <- data.frame(mlb[mlb$id == "BondsBa",],prediction)

# Selects only relevant columns
results <- results[c("id","age","prediction")]

results
##         id age prediction
## 21 BondsBa  21   21.32797
## 22 BondsBa  22   24.00431
## 23 BondsBa  23   24.20358
## 24 BondsBa  24   26.77484
## 25 BondsBa  25   21.42256
## 26 BondsBa  26   24.25045
## 27 BondsBa  27   25.59272
## 28 BondsBa  28   27.69233
## 29 BondsBa  29   28.49327
## 30 BondsBa  30   27.53192
## 31 BondsBa  31   30.17421
## 32 BondsBa  32   32.09397
## 33 BondsBa  33   28.62478
## 34 BondsBa  34   31.88814
## 35 BondsBa  35   34.59521
## 36 BondsBa  36   33.20863
## 37 BondsBa  37   40.65713
## 38 BondsBa  38   41.99880
## 39 BondsBa  39   51.00804
## 40 BondsBa  40   36.91986
## 41 BondsBa  41   43.96660
## 42 BondsBa  42   46.39968

2.

mlb <- read.csv("MLB Stats.csv")

mlb.train <- mlb[-which(mlb$id == "BondsBa"),] # Selects rows without BondsBa
mlb.test <- mlb[mlb$id == "BondsBa",] # Selects rows with BondsBa

mlb.train <- subset(mlb.train, select = -c(id,tm,lg))
mlb.test <- subset(mlb.test, select = -c(id,tm,lg,age))
  
mlb.trainx <- as.matrix(subset(mlb.train,select = -age))
mlb.trainy <- as.matrix(subset(mlb.train,select = age))

model.cv <- cv.glmnet(mlb.trainx,mlb.trainy, alpha = 0)

prediction.age.ridge <- predict(model.cv,as.matrix(mlb.test), 
                          s = "lambda.min")

# Creates data frame with Barry Bonds and the predictions as columns
results <- data.frame(mlb[mlb$id == "BondsBa",],prediction.age.ridge)


# Selects only relevant columns
results <- results[c("id","age","X1")]
colnames(results) <- c("id","age","predicted age ridge")

results
##         id age predicted age ridge
## 21 BondsBa  21            23.15289
## 22 BondsBa  22            22.70597
## 23 BondsBa  23            24.96508
## 24 BondsBa  24            26.82158
## 25 BondsBa  25            24.03458
## 26 BondsBa  26            26.28994
## 27 BondsBa  27            27.82253
## 28 BondsBa  28            28.17418
## 29 BondsBa  29            27.55025
## 30 BondsBa  30            28.15562
## 31 BondsBa  31            30.54194
## 32 BondsBa  32            30.88525
## 33 BondsBa  33            29.45109
## 34 BondsBa  34            30.35081
## 35 BondsBa  35            31.94046
## 36 BondsBa  36            32.18441
## 37 BondsBa  37            38.70461
## 38 BondsBa  38            38.30887
## 39 BondsBa  39            44.70509
## 40 BondsBa  40            34.15943
## 41 BondsBa  41            39.49559
## 42 BondsBa  42            40.80638