Introduction to Clustering

Outline

Discuss fundamentals of clustering
Apply knowledge to simple problems

What is Cluster Analysis?

A simple way of describing cluster analysis is that it divides data into groups (clusters) that are meaningful and useful, based on properties of the data. The goal is for the objects within each cluster to be similar to one another and distinct from the objects in other clusters. The greater the homogeneity (similarity) within a cluster and the greater the difference between clusters, the better the clustering.

As an example, you can cluster the teams in any sports league by skill level: teams with higher overall skill fall into one cluster, while teams with lower overall skill fall into another.

Cluster Analysis

An overarching question in clustering is: how do you find similar groups? Since the task is somewhat subjective, there are many different ways of accomplishing it. Two of the most common types of clustering algorithms are:

K-Means Clustering
Hierarchical Clustering
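
The rest of this tutorial works only with K-Means. For comparison, here is a minimal sketch of hierarchical clustering using base R's hclust() and cutree(). The toy matrix is made up for illustration, and complete linkage is just one reasonable assumption among several linkage methods.

```r
# Made-up data: two loose groups of 5 points each in 2 dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
             matrix(rnorm(10, mean = 6), ncol = 2))

# Build the tree from pairwise distances on the scaled data
hc <- hclust(dist(scale(toy)), method = "complete")

# Cut the tree so that exactly 2 groups remain
groups <- cutree(hc, k = 2)
table(groups)
```

Unlike K-Means, the tree is built once and can then be cut at any number of groups without refitting.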

K-Means Clustering

This algorithm is a partitional clustering technique that attempts to find a user-specified number, k, of clusters. In short, the algorithm starts with k centroids, assigns each point to its nearest centroid, recomputes each centroid as the mean of the points assigned to it, and repeats these two steps until the assignments stop changing.
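
To make the assign-and-update loop concrete, here is a rough sketch of it in base R. The function simple_kmeans is a hypothetical helper written for this illustration; it is not how R's built-in kmeans() is implemented (the built-in defaults to a more refined algorithm), and the toy data are made up.

```r
# Illustrative k-means loop; simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Use k randomly chosen rows as the starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Assign each point to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Made-up data: two well-separated blobs of 10 points each
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centroids are random, different seeds can label the clusters differently or, on harder data, settle on different groupings.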

Pre-Processing Data

When clustering, it is important to center and scale your data. This is a consequence of the distance metric being used: variables measured in large units would otherwise dominate the distance. For example, in the NFL, is 1 unit of rushing yards equivalent to 1 unit of touchdowns? Of course not! 1 touchdown is much more valuable than 1 rushing yard. Hence, when we center and scale our data, every variable contributes to the distance on a comparable footing.
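
A small sketch of what scaling does to distances. The three players and their numbers below are made up for illustration; only scale() and dist() from base R are used.

```r
# Three made-up players measured on very different scales
stats <- data.frame(RushYds = c(1200, 800, 1000),
                    RushTD  = c(12, 4, 8))

# Raw distance between players 1 and 2: sqrt(400^2 + 8^2), about 400.08,
# so the touchdown gap barely registers next to the yardage gap
dist(stats)

# Center each column to mean 0 and scale it to standard deviation 1
scaled <- scale(stats)
colMeans(scaled)
apply(scaled, 2, sd)

# After scaling, both variables contribute to the distance
dist(scaled)
```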

Example

Here we are going to perform cluster analysis on the NFL Running Backs 2016 data set. This data set contains the season statistics of each starting running back (32 total) for the 2016-2017 NFL season.

Let's see if K-Means clustering can identify meaningful groups within this data set. The basic call is kmeans(x, centers = k), where x is the numeric data and k is the number of clusters. There are many more arguments, which are described in the help page opened with ?kmeans.

Note: K-Means clustering can only be applied to numerical values, so you will have to remove all categorical variables from any data set before you cluster it.
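
If you would rather not list the categorical columns by name, one possible approach is to keep only the columns that R reports as numeric. The sketch below uses a three-row excerpt of the running back data; the object names players and players_num are placeholders for this illustration.

```r
# Three-row excerpt with two categorical and two numeric columns
players <- data.frame(Player  = c("JohnsonDa", "FreemanDe", "WestTe"),
                      Team    = c("ARI", "ATL", "BAL"),
                      RushYds = c(1239, 1079, 774),
                      RushTD  = c(16, 11, 5))

# TRUE for each column that holds numbers
numeric_cols <- sapply(players, is.numeric)

# Keep only the numeric columns
players_num <- players[, numeric_cols, drop = FALSE]
str(players_num)
```

Be aware that this filter would keep numeric identifiers such as Year, so columns like that still need to be dropped by name.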

# Load data
nfl <- read.csv("NFL Running Backs 2016.csv")

# Remove categorical variables
nfl.rb <- subset(nfl, select = -c(Player,Year,Team,Pos))

# Give our data frame row names for ease of use
rownames(nfl.rb) <- nfl$Player

# Scale data. This is important!!
nfl.rb <- scale(nfl.rb)

# K-means algorithm with 2 clusters
# (starting centroids are random, so cluster labels can differ between runs)
nfl.rb.cluster <- kmeans(nfl.rb,centers = 2)

# Output results
nfl.rb.cluster

## K-means clustering with 2 clusters of sizes 18, 14
## 
## Cluster means:
##          Age     Number      Games     Starts   RushAtms    RushYds
## 1 -0.1568003 -0.1192355  0.5318690  0.6322527  0.7117896  0.7203365
## 2  0.2016003  0.1533027 -0.6838316 -0.8128964 -0.9151580 -0.9261469
##       RushTD    RushLng    RushYPA    RushYPG RushAtmsPG    Targets
## 1  0.5647368  0.4677834  0.3576346  0.6127593  0.5545440  0.4111109
## 2 -0.7260901 -0.6014358 -0.4598160 -0.7878334 -0.7129851 -0.5285711
##          Rec        Yds     RecYPR      RecTD     RecLng     RecRPG
## 1  0.4327003  0.4018829  0.1471037  0.2699913  0.3608545  0.2849799
## 2 -0.5563290 -0.5167066 -0.1891334 -0.3471317 -0.4639558 -0.3664027
##         RYPG      Ctch. YardfmScrim    TotalTD    Fumbles
## 1  0.2982008  0.2743948   0.7005657  0.6033873  0.3587948
## 2 -0.3834010 -0.3527934  -0.9007273 -0.7757836 -0.4613076
## 
## Clustering vector:
##    JohnsonDa    FreemanDe       WestTe      McCoyLe    StewartJo 
##            1            1            2            1            2 
##     HowardJo       HillJe    JohnsonDu     ElliotEz   AndersonCJ 
##            1            1            2            1            2 
##    RiddickTh       LacyEd     MillerLa       GoreFr      IvoryCh 
##            2            2            1            1            2 
##    CharlesJa     GurleyTo      AjayiJa   PetersonAd     IngramMa 
##            2            1            1            2            1 
## LeGarretteBl   JenningsRa      ForteMa     MurrayLa    MathewsRy 
##            1            2            1            1            2 
##       BellLe     MelvinGo      RawlsTh       HydeCa     MartinDo 
##            1            1            2            1            2 
##     MurrayDe     KelleyRo 
##            1            2 
## 
## Within cluster sum of squares by cluster:
## [1] 276.5100 228.0134
##  (between_SS / total_SS =  29.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

# See cluster
nfl.rb.cluster$cluster

##    JohnsonDa    FreemanDe       WestTe      McCoyLe    StewartJo 
##            1            1            2            1            2 
##     HowardJo       HillJe    JohnsonDu     ElliotEz   AndersonCJ 
##            1            1            2            1            2 
##    RiddickTh       LacyEd     MillerLa       GoreFr      IvoryCh 
##            2            2            1            1            2 
##    CharlesJa     GurleyTo      AjayiJa   PetersonAd     IngramMa 
##            2            1            1            2            1 
## LeGarretteBl   JenningsRa      ForteMa     MurrayLa    MathewsRy 
##            1            2            1            1            2 
##       BellLe     MelvinGo      RawlsTh       HydeCa     MartinDo 
##            1            1            2            1            2 
##     MurrayDe     KelleyRo 
##            1            2
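
The printed summary lists several components that can be pulled out of the fitted object with $. The sketch below shows how the between_SS / total_SS percentage is obtained from them. Because the NFL file is not bundled with R, the built-in mtcars data set stands in for it here; that substitution is an assumption made only for illustration.

```r
# Stand-in data: scaled built-in mtcars instead of the NFL matrix
set.seed(123)
x <- scale(mtcars)
fit <- kmeans(x, centers = 2)

# Number of observations in each cluster
fit$size

# Share of total variation explained by the clustering
fit$betweenss / fit$totss

# Total sum of squares splits into within and between parts
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# For centered and scaled data, totss equals (rows - 1) * columns
all.equal(fit$totss, (nrow(x) - 1) * ncol(x))
```

The last identity also holds for the running back output: 32 players and 23 scaled variables give a total of 31 * 23 = 713, which matches the within-cluster sums and the 29.2 % reported above up to rounding.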

Create a data frame with the cluster as a column

# Create data frame joining cluster data to NFL data
df <- data.frame(nfl,as.factor(nfl.rb.cluster$cluster))

# Rename last column "Cluster"
names(df)[ncol(df)] <- "Cluster"

head(df)

##              Player Year Age Team Pos Number Games Starts RushAtms RushYds
## JohnsonDa JohnsonDa 2016  25  ARI  RB     31    16     16      293    1239
## FreemanDe FreemanDe 2016  24  ATL  RB     24    16     16      227    1079
## WestTe       WestTe 2016  25  BAL  RB     28    16     13      193     774
## McCoyLe     McCoyLe 2016  28  BUF  RB     25    15     15      234    1267
## StewartJo StewartJo 2016  29  CAR  RB     28    13     13      218     824
## HowardJo   HowardJo 2016  22  CHI  RB     24    15     13      252    1313
##           RushTD RushLng RushYPA RushYPG RushAtmsPG Targets Rec Yds RecYPR
## JohnsonDa     16      58     4.2    77.4       18.3     120  80 879   11.0
## FreemanDe     11      75     4.8    67.4       14.2      65  54 462    8.6
## WestTe         5      41     4.0    48.4       12.1      45  34 236    6.9
## McCoyLe       13      75     5.4    84.5       15.6      57  50 356    7.1
## StewartJo      9      47     3.8    63.4       16.8      21   8  60    7.5
## HowardJo       6      69     5.2    87.5       16.8      50  29 298   10.3
##           RecTD RecLng RecRPG RYPG Ctch. YardfmScrim TotalTD Fumbles
## JohnsonDa     4     58    5.0 54.9  66.7        2118      20       5
## FreemanDe     2     35    3.4 28.9  83.1        1541      13       1
## WestTe        1     17    2.1 14.8  75.6        1010       6       2
## McCoyLe       1     41    3.3 23.7  87.7        1623      14       3
## StewartJo     0     25    0.6  4.6  38.1         884       9       3
## HowardJo      1     34    1.9 19.9  58.0        1611       7       2
##           Cluster
## JohnsonDa       1
## FreemanDe       1
## WestTe          2
## McCoyLe         1
## StewartJo       2
## HowardJo        1

Plotting Clusters

# Load relevant libraries
library(ggplot2)
library(ggrepel)

# ggplot of df with x: RushYds, y: RushTD with coloring by cluster
ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 2")

Changing K

K = 3

nfl.rb.cluster <- kmeans(nfl.rb,centers = 3)

# Create data frame joining cluster data to NFL data
df <- data.frame(nfl,as.factor(nfl.rb.cluster$cluster))
# Rename last column "Cluster"
names(df)[ncol(df)] <- "Cluster"

# ggplot of df with x: RushYds, y: RushTD with coloring by cluster
ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 3")

K = 4

nfl.rb.cluster <- kmeans(nfl.rb,centers = 4)

# Create data frame joining cluster data to NFL data
df <- data.frame(nfl,as.factor(nfl.rb.cluster$cluster))
# Rename last column "Cluster"
names(df)[ncol(df)] <- "Cluster"

# ggplot of df with x: RushYds, y: RushTD with coloring by cluster
ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 4")

K = 5

nfl.rb.cluster <- kmeans(nfl.rb,centers = 5)

df <- data.frame(nfl,as.factor(nfl.rb.cluster$cluster))
names(df)[ncol(df)] <- "Cluster"

ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 5")
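
The fits above differ only in k. One common way to compare them, which this tutorial does not otherwise cover, is to record the total within-cluster sum of squares for each k and look for the point where adding clusters stops helping much. The sketch below assumes the built-in mtcars data as a stand-in for the scaled running back matrix, and nstart = 25 is an arbitrary choice that reruns each fit from 25 random starts and keeps the best one.

```r
# Stand-in data: scaled built-in mtcars instead of the NFL matrix
set.seed(2016)
x <- scale(mtcars)

# Fit k-means for each k and keep the total within-cluster sum of squares
ks <- 2:5
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Smaller values mean tighter clusters
data.frame(k = ks, tot.withinss = wss)
```

The value typically shrinks as k grows, so the useful signal is how quickly it shrinks rather than the smallest value.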

Discussion

Printing nfl.rb.cluster$cluster after any of the fits above shows which cluster each player belongs to. In our plots, we chose to plot rushing touchdowns against rushing yards. Notice that some color groups (clusters) are consistently near the top corner, while others are near the bottom corner. Keep in mind that the clusters were computed from all of the scaled variables, while each plot shows only two of them. In this way, clustering gives us a sense of the different tiers and types of players.
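
Reading the raw cluster vector gets tedious as k grows. A few base R calls can summarise it instead. As before, the built-in mtcars data set is an assumed stand-in for the NFL data, so the row names here are car models rather than players.

```r
# Stand-in data: scaled built-in mtcars instead of the NFL matrix
set.seed(1)
x <- scale(mtcars)
fit <- kmeans(x, centers = 3)

# How many observations landed in each cluster
table(fit$cluster)

# Row names grouped by cluster
members <- split(names(fit$cluster), fit$cluster)
members

# Average of two unscaled variables within each cluster
aggregate(mtcars[, c("mpg", "hp")], by = list(Cluster = fit$cluster), FUN = mean)
```

Comparing per-cluster averages on the original scale is one way to put a label such as "high volume" or "low usage" on each group.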

Your Turn

Using the NFL Wide Receivers 2016.csv data set, complete the following:

1. Subset your data frame to remove Player, Year, Team, and Pos, then center and scale the remaining data.
2. Reconstruct the 4 graphs above using k = 2, 3, 4, 5. However, this time put touchdowns (TD) on the y-axis and receptions (Rec) on the x-axis.

Answers

1.

# Load WR data set
nfl <- read.csv("NFL Wide Receivers 2016.csv") 

# Subset data to remove categorical variables
nfl.wr <- subset(nfl, select = -c(Player,Year,Team,Pos))

# Name rows of data set
rownames(nfl.wr) <- nfl$Player

# Center and scale data
nfl.wr <- scale(nfl.wr)

2.

# K means with 2 clusters
nfl.wr.cluster <- kmeans(nfl.wr,centers = 2)
# Data frame of original data plus cluster assignments
df <- data.frame(nfl,as.factor(nfl.wr.cluster$cluster))
# Rename last column
names(df)[ncol(df)] <- "Cluster"
# Plotting TD vs. Rec
ggplot(df, aes(Rec, TD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 2")

K = 3

# K means with 3 clusters
nfl.wr.cluster <- kmeans(nfl.wr,centers = 3)
# Data frame of original data plus cluster assignments
df <- data.frame(nfl,as.factor(nfl.wr.cluster$cluster))
# Rename last column
names(df)[ncol(df)] <- "Cluster"
# Plotting TD vs. Rec
ggplot(df, aes(Rec, TD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 3")

K = 4

nfl.wr.cluster <- kmeans(nfl.wr,centers = 4)

df <- data.frame(nfl,as.factor(nfl.wr.cluster$cluster))
names(df)[ncol(df)] <- "Cluster"

ggplot(df, aes(Rec, TD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 4")

K = 5

nfl.wr.cluster <- kmeans(nfl.wr,centers = 5)

df <- data.frame(nfl,as.factor(nfl.wr.cluster$cluster))
names(df)[ncol(df)] <- "Cluster"

ggplot(df, aes(Rec, TD, colour = Cluster)) +
  geom_text_repel(aes(label = Player)) +
  theme(legend.position="none") +
  ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 5")