## Outline

• Describe machine learning in a nutshell
• Understand terminology

## Machine Learning?

Simply put, machine learning is the use of computers to apply techniques that automatically discover patterns in data.

Nowadays, machine learning is used across many disciplines, from biology to finance to sports. Of course, our focus will be on sports.

We can use machine learning to make predictions or to learn about groups within our data sets. Specifically, in this course we will cover:

• Predictive Modeling
  • Regression
  • Classification
• Clustering

Predictive modeling is concerned with predicting the outcome of an event. For instance, in a regression problem we would predict numeric values such as points scored in a game or number of field goals. In a classification problem, we would predict a categorical variable such as win/loss, above/below, or low/medium/high.
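To make the distinction concrete, here is a minimal sketch in R. The `games` data frame and all of its values are invented for illustration, and linear and logistic regression are only two of many possible models; nothing here is meant as the course's prescribed method.

```r
# Toy game log; every value below is invented for illustration
games <- data.frame(
  FGA    = c(80, 85, 90, 78, 88, 92, 84, 81),
  Points = c(98, 104, 111, 95, 107, 115, 101, 99),
  Win    = c(0, 1, 1, 0, 0, 1, 1, 0)   # 1 = win, 0 = loss
)

# Regression: the response (Points) is numeric
reg.fit <- lm(Points ~ FGA, data = games)
pred.points <- predict(reg.fit, newdata = data.frame(FGA = 86))

# Classification: the response (Win) is categorical
# Logistic regression is just one possible classifier
cls.fit <- glm(Win ~ Points, data = games, family = binomial)
prob.win <- predict(cls.fit, newdata = data.frame(Points = 105),
                    type = "response")
pred.class <- ifelse(prob.win > 0.5, "W", "L")

pred.points  # a number
pred.class   # a label, "W" or "L"
```

The regression model returns a number, while the classifier returns a probability that we convert into one of two labels.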

Clustering is concerned with grouping data points based on some similarity measure. As a basic example in basketball, we could cluster players by position or team. A more advanced example would ask whether, based on player statistics, we can cluster players into groups such as sharpshooter, 3 and D, slasher, post scorer, etc.
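As a rough sketch of the idea, the example below clusters six hypothetical players using k-means on two made-up statistics. The player names, the `ThreePA` and `ORB` columns, and the choice of k-means with two clusters are all assumptions made for illustration.

```r
set.seed(1)  # k-means starts from random centers, so fix the seed

# Hypothetical per-game statistics; all values are invented
players <- data.frame(
  ThreePA = c(8.1, 7.5, 9.0, 1.2, 0.8, 1.5),  # three-point attempts
  ORB     = c(0.5, 0.7, 0.4, 3.8, 4.1, 3.5),  # offensive rebounds
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Scale the columns so that neither statistic dominates the distance
km <- kmeans(scale(players), centers = 2, nstart = 10)

km$cluster  # cluster number assigned to each player
```

Here players A, B, and C (many three-point attempts) should land in one group and D, E, and F (many offensive rebounds) in the other. The cluster numbers themselves carry no meaning; naming a group "sharpshooter" or "post scorer" is our own interpretation.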

## Data Mining Language

The machine learning / data mining community uses some common terminology. Given a data set, we have the following terms:

• Instances/Observations: The rows of the data set
• Features/Attributes: The columns that contain descriptive variables
• Response Variable: The variable we are trying to predict
• Specifically for regression problems:
  • Predictor Variable: A variable used to predict another variable
• Specifically for classification problems:
  • Class: A label that the response variable can take

## Examples

nba <- read.csv("Boston Celtics.csv")
dim(nba) # Gets dimension of nba dataset
## [1] 26 62
names(nba) # Column names of nba dataset
##  [1] "Team"                          "Opp"
##  [3] "Date"                          "Number"
##  [5] "Round"                         "Game"
##  [7] "Location"                      "W.L"
##  [9] "Importance"                    "Points"
## [11] "Oppo"                          "Diff"
## [13] "Team.FG"                       "Team.FGA"
## [15] "Team.FG."                      "Team.3P"
## [17] "Team.3PA"                      "Team.3P."
## [19] "Team.FT"                       "Team.FTA"
## [21] "Team.FT."                      "Team.ORB"
## [23] "Team.TRB"                      "Team.AST"
## [25] "Team.STL"                      "Team.BLK"
## [27] "Team.TOV"                      "Team.PF"
## [29] "Opponent.FG"                   "Opponent.FGA"
## [31] "Opponent.FG."                  "Opponent.3P"
## [33] "Opponent.3PA"                  "Opponent.3P."
## [35] "Opponent.FT"                   "Opponent.FTA"
## [37] "Opponent.FT."                  "Opponent.ORB"
## [39] "Opponent.TRB"                  "Opponent.AST"
## [41] "Opponent.STL"                  "Opponent.BLK"
## [43] "Opponent.TOV"                  "Opponent.PF"
## [55] "Offensive.Four.Factors.eFG."   "Offensive.Four.Factors.TOV."
## [57] "Offensive.Four.Factors.ORB."   "Offensive.Four.Factors.FT.FGA"
## [59] "Defensive.Four.Factors.eFG."   "Defensive.Four.Factors.TOV."
## [61] "Defensive.Four.Factors.DRB."   "Defensive.Four.Factors.FT.FGA"

Note: The number of instances in the nba data set is 26 and the number of features is 62 - 1 = 61, since one column is set aside as the response variable.

If we wanted to predict W/L, our classes would be W and L and our features would be all the other columns.

If we wanted to predict Points, our response variable would be numeric (Points) and our predictor variables would be the rest of the columns.
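The same bookkeeping can be sketched on a small stand-in data frame. The `toy` data frame below is hypothetical: it borrows four column names from the nba data set, but the values are invented.

```r
# Toy stand-in for the nba data frame; values are invented
toy <- data.frame(
  W.L      = c("W", "L", "W", "L"),
  Points   = c(110, 95, 104, 99),
  Team.FGA = c(85, 80, 88, 79),
  Team.AST = c(25, 18, 23, 20)
)

n.instances <- nrow(toy)      # rows of the data set
n.features  <- ncol(toy) - 1  # columns left after setting aside the response

# Classification setup: W.L is the response, its values are the classes
class.labels   <- toy$W.L
class.features <- toy[, names(toy) != "W.L"]

# Regression setup: Points is the response, the rest are predictors
response   <- toy$Points
predictors <- toy[, names(toy) != "Points"]

names(class.features)
names(predictors)
```

Which column counts as the response depends only on the question we ask; the data set itself stays the same.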

## Separation of Data Sets

When we are trying to predict an outcome, we need data to learn from, that is, training data. Specifically, we build our model from this data and use it to predict outcomes on an independent testing data set. It is key for the training data and testing data to be completely independent, or else we are using data from the testing set to predict the testing set. This will be biased and inaccurate!
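One common way to obtain two independent sets is to hold out a random subset of rows. The sketch below assumes a hypothetical `toy` data frame with invented values and an arbitrary hold-out size of 3; it is only one of several ways to split data.

```r
set.seed(42)  # make the random split reproducible

# Hypothetical data set of 10 games; values are invented
toy <- data.frame(
  game   = 1:10,
  Points = c(98, 104, 111, 95, 107, 115, 101, 99, 108, 103)
)

# Randomly pick 3 row numbers for the testing set
test.rows <- sample(nrow(toy), size = 3)

toy.test  <- toy[test.rows, ]   # held-out rows
toy.train <- toy[-test.rows, ]  # everything else

# No game should appear in both sets
length(intersect(toy.train$game, toy.test$game))
```

Every row ends up in exactly one of the two sets, so the model never sees the rows it is later evaluated on.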

## Example of Separating Data Sets

Suppose we have a data set, and we want to make a prediction on a specific player, Sammy Sosa. From the original data set, we would want to create two separate data sets where:

• Training data does not contain Sammy Sosa
• Testing data only contains Sammy Sosa

mlb <- read.csv("MLB Stats.csv") # Loads MLB data set

mlb.train <- mlb[-which(mlb$id == "SosaSa"),] # Selects rows without SosaSa
mlb.test <- mlb[mlb$id == "SosaSa",] # Selects rows with SosaSa

intersect(mlb.train,mlb.test) # See how many rows in common
## data frame with 0 columns and 0 rows

Since the intersect() command returns a 0 by 0 data frame, we are sure that the intersection between the two is empty. As mentioned earlier, it is crucial that the training set and testing set be completely independent to avoid any bias.
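A more direct check is to compare the player ids themselves. The sketch below uses a small hypothetical stand-in for the MLB data; the `year` and `HR` columns and all values are invented, and only the `id` column mirrors the original data set.

```r
# Toy stand-in for the MLB data; year and HR values are invented
mlb.toy <- data.frame(
  id   = c("SosaSa", "SosaSa", "BondsBa", "RipkeCa", "BondsBa"),
  year = c(1, 2, 1, 1, 2),
  HR   = c(40, 35, 30, 15, 28),
  stringsAsFactors = FALSE
)

# A logical mask also behaves sensibly when no row matches
is.sosa <- mlb.toy$id == "SosaSa"

toy.train <- mlb.toy[!is.sosa, ]  # rows without SosaSa
toy.test  <- mlb.toy[is.sosa, ]   # rows with SosaSa

# Ids that appear in both sets; should be empty
shared.ids <- intersect(toy.train$id, toy.test$id)
length(shared.ids)

# Every original row should be in exactly one of the two sets
nrow(toy.train) + nrow(toy.test) == nrow(mlb.toy)
```

If `shared.ids` is empty and the row counts add up, no player contributes to both the training and testing sets.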

1. Create two data sets, one training and one testing, where the testing data consists of Barry Bonds (BondsBa) and Cal Ripken (RipkeCa) and the training data contains everyone else.

2. Verify that the intersection of the two data sets is empty.

### 1.

mlb <- read.csv("MLB Stats.csv")

# Selects rows without Bonds and Ripken
mlb.train <- mlb[-which(mlb$id %in% c("BondsBa","RipkeCa")),]
# Selects rows with Bonds and Ripken
mlb.test <- mlb[mlb$id %in% c("BondsBa","RipkeCa"),]

intersect(mlb.train,mlb.test) # See how many rows in common
## data frame with 0 columns and 0 rows