- Describe machine learning in a nutshell
- Understand terminology

Simply put, machine learning is the use of computers to apply techniques that automatically discover patterns in data.

Nowadays, machine learning is used across many disciplines, from biology to finance to sports. Our focus will be on sports.

We can use machine learning to make predictions or to learn about groups within our data sets. Specifically, this course covers:

- Predictive Modeling
  - Regression
  - Classification
- Clustering

**Predictive modeling** is concerned with predicting the outcome of an event. For instance, in a regression problem we predict numeric values such as points scored in a game or number of field goals. In a classification problem, we predict a categorical variable such as win/loss, above/below, or low/medium/high.
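As a minimal sketch of the two problem types, the toy data below (invented numbers, not the course data) fits a linear regression for a numeric response and a logistic regression for a win/loss response. The column names `fg_pct`, `points`, and `win` are assumptions made for illustration only.

```r
# Toy game log: all values are invented for illustration
games <- data.frame(
  fg_pct = c(0.41, 0.44, 0.47, 0.50, 0.43, 0.52, 0.46, 0.49),
  points = c(92, 98, 104, 111, 101, 113, 99, 108),
  win    = factor(c("L", "L", "W", "W", "W", "W", "L", "L"))
)

# Regression: the response (points) is numeric
reg_fit <- lm(points ~ fg_pct, data = games)
pred_points <- predict(reg_fit, newdata = data.frame(fg_pct = 0.48))

# Classification: the response (win) is categorical
clf_fit <- glm(win ~ fg_pct, data = games, family = binomial)

# Predicted probability of the second factor level ("W")
p_win <- predict(clf_fit, newdata = data.frame(fg_pct = 0.48),
                 type = "response")
pred_class <- ifelse(p_win > 0.5, "W", "L")
```

The only structural difference between the two fits is the type of the response: a number for regression, a label for classification.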

**Clustering** is concerned with grouping data points based on some sort of similarity measure. As a basic example in basketball, we could cluster players by position or team. A more advanced example would use player statistics to cluster players into groups such as sharpshooter, 3-and-D, slasher, post scorer, etc.
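To make the idea concrete, here is a small sketch that clusters six made-up players with k-means, using base R's `kmeans()`. The statistics and the column names `three_pa` and `orb` are hypothetical, and k-means is just one possible choice of clustering method.

```r
# Toy per-game player statistics (invented values, hypothetical columns)
players <- data.frame(
  three_pa = c(8.1, 7.5, 9.0, 0.4, 0.2, 0.6),  # three-point attempts
  orb      = c(0.5, 0.7, 0.4, 3.8, 4.1, 3.5)   # offensive rebounds
)

set.seed(1)  # k-means starts from random centers
km <- kmeans(scale(players), centers = 2, nstart = 10)

# Cluster label assigned to each player
players$cluster <- km$cluster
players
```

Here the first three players (high three-point volume) and the last three (high offensive rebounding) should fall into separate clusters. Note that the cluster numbers themselves are arbitrary labels.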

The machine learning and data mining community uses some common terminology. For a given data set:

- **Instances/Observations**: The rows of the data set (the number of instances is the number of rows)
- **Features/Attributes**: The columns that contain descriptive variables
- **Response Variable**: The variable we are trying to predict
- Specifically for regression problems:
  - **Predictor Variable**: A variable used to predict another variable
- Specifically for classification problems:
  - **Class**: The label taken by a categorical response variable

    nba <- read.csv("Boston Celtics.csv")
    dim(nba)  # Gets dimension of nba dataset

    ## [1] 26 62

    names(nba)  # Column names of nba dataset

    ##  [1] "Team"                          "Opp"
    ##  [3] "Date"                          "Number"
    ##  [5] "Round"                         "Game"
    ##  [7] "Location"                      "W.L"
    ##  [9] "Importance"                    "Points"
    ## [11] "Oppo"                          "Diff"
    ## [13] "Team.FG"                       "Team.FGA"
    ## [15] "Team.FG."                      "Team.3P"
    ## [17] "Team.3PA"                      "Team.3P."
    ## [19] "Team.FT"                       "Team.FTA"
    ## [21] "Team.FT."                      "Team.ORB"
    ## [23] "Team.TRB"                      "Team.AST"
    ## [25] "Team.STL"                      "Team.BLK"
    ## [27] "Team.TOV"                      "Team.PF"
    ## [29] "Opponent.FG"                   "Opponent.FGA"
    ## [31] "Opponent.FG."                  "Opponent.3P"
    ## [33] "Opponent.3PA"                  "Opponent.3P."
    ## [35] "Opponent.FT"                   "Opponent.FTA"
    ## [37] "Opponent.FT."                  "Opponent.ORB"
    ## [39] "Opponent.TRB"                  "Opponent.AST"
    ## [41] "Opponent.STL"                  "Opponent.BLK"
    ## [43] "Opponent.TOV"                  "Opponent.PF"
    ## [45] "Advanced.ORtg"                 "Advanced.DRtg"
    ## [47] "Advanced.Pace"                 "Advanced.FTr"
    ## [49] "Advanced.3PAr"                 "Advanced.TS."
    ## [51] "Advanced.TRB."                 "Advanced.AST."
    ## [53] "Advanced.STL."                 "Advanced.BLK."
    ## [55] "Offensive.Four.Factors.eFG."   "Offensive.Four.Factors.TOV."
    ## [57] "Offensive.Four.Factors.ORB."   "Offensive.Four.Factors.FT.FGA"
    ## [59] "Defensive.Four.Factors.eFG."   "Defensive.Four.Factors.TOV."
    ## [61] "Defensive.Four.Factors.DRB."   "Defensive.Four.Factors.FT.FGA"

Note: The number of instances in the nba data set is `26` and the number of features is `62-1 = 61`, since one column serves as the response variable.

If we wanted to predict W/L (the `W.L` column), our classes would be W and L, and our features would be all the other columns.

If we wanted to predict `Points`, our response variable would be numeric (Points) and our predictor variables would be the rest of the columns.
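The bookkeeping above can be sketched on a small stand-in data frame. The column names are borrowed from the `names(nba)` listing, but the values and the name `nba_toy` are invented for illustration.

```r
# Small stand-in for the nba data frame (values invented)
nba_toy <- data.frame(
  W.L      = c("W", "L", "W"),
  Points   = c(108, 95, 112),
  Team.FG  = c(40, 35, 42),
  Team.AST = c(25, 19, 27)
)

n_instances <- nrow(nba_toy)      # number of rows
n_features  <- ncol(nba_toy) - 1  # all columns except the response

# Classification setup: W.L is the response, the rest are features
class_features <- setdiff(names(nba_toy), "W.L")
classes <- levels(factor(nba_toy$W.L))  # the possible class labels

# Regression setup: Points is the response, the rest are predictors
reg_predictors <- setdiff(names(nba_toy), "Points")
```

Which column plays the role of the response depends entirely on the question being asked, so the same data frame yields different feature sets for the two problems.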

When we are trying to predict an outcome, we need data to learn from, that is, **training data**. Specifically, we build our model from this data, and we use it to predict outcomes on independent **testing data**. It is key for the training data and testing data to be completely independent, or else we are using data from the testing set to predict the testing set. The resulting accuracy estimate would be biased and overly optimistic!
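One common way to obtain independent sets is a random holdout split. This is a sketch on toy data rather than the approach used in this lesson, and the names `toy`, `row_id`, and the 70/30 proportion are assumptions.

```r
# Toy data: 10 rows with a row identifier (invented for illustration)
toy <- data.frame(row_id = 1:10, x = seq(0.1, 1, by = 0.1))

set.seed(42)                             # reproducible split
test_idx <- sample(nrow(toy), size = 3)  # 3 rows held out for testing

toy_train <- toy[-test_idx, ]  # safe here because test_idx is never empty
toy_test  <- toy[test_idx, ]

# Number of row ids that appear in both sets (should be 0)
n_shared <- length(intersect(toy_train$row_id, toy_test$row_id))
```

Because each row index is assigned to exactly one side, no observation can leak from the testing set into the training set.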

Suppose we have a data set, and we want to make a prediction on a specific player, Sammy Sosa. From the original data set, we would want to create two separate data sets where:

- Training data does not contain Sammy Sosa
- Testing data only contains Sammy Sosa

    mlb <- read.csv("MLB Stats.csv")         # Loads MLB data set
    mlb.train <- mlb[mlb$id != "SosaSa", ]   # Selects rows without SosaSa
    mlb.test  <- mlb[mlb$id == "SosaSa", ]   # Selects rows with SosaSa
    intersect(mlb.train, mlb.test)           # Checks for overlap between the two sets

    ## data frame with 0 columns and 0 rows

Since the `intersect()` command returns a `0 by 0` data frame, the intersection between the two is empty. As mentioned earlier, it is crucial that the training set and testing set be completely independent to avoid any bias.
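One caveat: base R's `intersect()` is designed for vectors and, when given data frames, compares them column by column rather than row by row (`dplyr::intersect()` is a row-wise alternative). A more direct check compares the player ids in the two sets, as sketched below. The data frame `mlb_toy` is a stand-in with invented statistics and a hypothetical `hr` column.

```r
# Toy stand-in for the MLB data (ids follow the lesson, stats invented)
mlb_toy <- data.frame(
  id = c("SosaSa", "SosaSa", "BondsBa", "RipkeCa", "RipkeCa"),
  hr = c(36, 66, 46, 21, 34),
  stringsAsFactors = FALSE
)

toy_train <- mlb_toy[mlb_toy$id != "SosaSa", ]  # everyone except SosaSa
toy_test  <- mlb_toy[mlb_toy$id == "SosaSa", ]  # only SosaSa

# Player ids present in both sets; an empty result means no overlap
shared_ids <- intersect(toy_train$id, toy_test$id)
no_overlap <- length(shared_ids) == 0

# Every original row should land in exactly one of the two sets
complete <- nrow(toy_train) + nrow(toy_test) == nrow(mlb_toy)
```

Checking the ids directly tests the property we actually care about: no player appears on both sides of the split.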

Create two data sets, one training and one testing, where the testing data consists of Barry Bonds (BondsBa) and Cal Ripken (RipkeCa) and the training data contains everyone else.

Verify that the intersection of the two data sets is empty.

    mlb <- read.csv("MLB Stats.csv")
    # Selects rows without Bonds and Ripken
    mlb.train <- mlb[!(mlb$id %in% c("BondsBa", "RipkeCa")), ]
    # Selects rows with Bonds and Ripken
    mlb.test <- mlb[mlb$id %in% c("BondsBa", "RipkeCa"), ]
    intersect(mlb.train, mlb.test)  # Checks for overlap between the two sets

    ## data frame with 0 columns and 0 rows

In later lessons, we will learn more about machine learning with specific models and work through more complicated exercises.