Outline

  • Provide an overview of hierarchical clustering
  • K-means vs. hierarchical clustering
  • Examples

Hierarchical Clustering

This procedure starts with each point as a singleton cluster, and then repeatedly combines the two nearest clusters until a single cluster containing all points remain. This can be visualized using a Dendrogram. From the Dendrogram, we can then see the optimal clusters.

Hierarchical Clustering Example

Here we are going to perform cluster analysis on the NFL Running Backs 2016 data set. This data contain the seasonal statistics of each starting running back (32 total) for the 2016 - 2017 NFL season.

# Load NFL RB data
nfl <- read.csv("NFL Running Backs 2016.csv")

# Remove categorical data
nfl.rb <- subset(nfl, select = -c(Player,Year,Team,Pos))

# Name rows of data set
rownames(nfl.rb) <- nfl$Player

# Center and scale data. Important!!
nfl.rb <- scale(nfl.rb)

# Perform hierarchical clustering
# Need to use dist() i.e. distance matrix
nfl.rb.cluster <- hclust(dist(nfl.rb))

# Display results
nfl.rb.cluster
## 
## Call:
## hclust(d = dist(nfl.rb))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 32

Visualizing Dendrogram

# Plots Dendrogram
plot(nfl.rb.cluster, sub = "")

From this Dendrogram we can visually see each cluster.

If we want to separate this into n groups, we use the following code:

# Select 3 clusters
nfl.rb.clustercut <- cutree(nfl.rb.cluster,3)
# Sort by cluster
sort(nfl.rb.clustercut)
##    JohnsonDa     ElliotEz       BellLe    FreemanDe       WestTe 
##            1            1            1            2            2 
##      McCoyLe    StewartJo     HowardJo     MillerLa       GoreFr 
##            2            2            2            2            2 
##     GurleyTo      AjayiJa     IngramMa LeGarretteBl     MurrayLa 
##            2            2            2            2            2 
##     MelvinGo       HydeCa     MurrayDe       HillJe    JohnsonDu 
##            2            2            2            3            3 
##   AndersonCJ    RiddickTh       LacyEd      IvoryCh    CharlesJa 
##            3            3            3            3            3 
##   PetersonAd   JenningsRa      ForteMa    MathewsRy      RawlsTh 
##            3            3            3            3            3 
##     MartinDo     KelleyRo 
##            3            3

K-Means vs. Hierarchical Clustering

K-means requires the user to specify the number of clusters, k. However, finding the optimal k may be difficult in practice, but gives the user more flexibility. Hierarchical clustering determines the best clusters based upon incremental creation.

K-Means cluster

# Load data
nfl2 <- read.csv("NFL Running Backs 2016.csv")

# Remove categorical data
nfl2.rb <- subset(nfl2, select = -c(Player,Year,Team,Pos))

# Rename rows
rownames(nfl2.rb) <- nfl2$Player

# Center and scale. Important!!
nfl2.rb <- scale(nfl2.rb)

# K means with 3 clusters
nfl2.rb.cluster <- kmeans(nfl2.rb,centers = 3)

# Display clusters
sort(nfl2.rb.cluster$cluster)
##    JohnsonDu   AndersonCJ    RiddickTh       LacyEd      IvoryCh 
##            1            1            1            1            1 
##    CharlesJa   PetersonAd      RawlsTh     MartinDo    JohnsonDa 
##            1            1            1            1            2 
##    FreemanDe      McCoyLe     HowardJo     ElliotEz     IngramMa 
##            2            2            2            2            2 
##       BellLe     MelvinGo     MurrayDe       WestTe    StewartJo 
##            2            2            2            3            3 
##       HillJe     MillerLa       GoreFr     GurleyTo      AjayiJa 
##            3            3            3            3            3 
## LeGarretteBl   JenningsRa      ForteMa     MurrayLa    MathewsRy 
##            3            3            3            3            3 
##       HydeCa     KelleyRo 
##            3            3

As we can see, using 3 clusters, the results of the clusters are different using K-means and hierachical. Using specialty domain knowledge, someone may choose one clustering model over the other.

Your Turn

Using the NFL Wide Receivers 2016.csv data set, complete the following:

  1. Create different scatter plots using the hierarchical clustering results with 2,3,4, and 5 clusters. Put touchdowns (TD) on the y-axis and receptions (Rec) on the x-axis. Instead of placing points, place each player's name. Color code each name based on the cluster group.

Answers

1.

2 Clusters

# Load libraries
library(ggplot2)
library(ggrepel)

# Load NFL WR data
nfl <- read.csv("NFL Wide Receivers 2016.csv")
# Remove categorical data
nfl.wr <- subset(nfl, select = -c(Player,Year,Team,Pos))
# Rename rows of data frame
rownames(nfl.wr) <- nfl$Player
# Center and scale. Important!!
nfl.wr <- scale(nfl.wr)
# Hierachical cluster with dist() function applied
nfl.wr.cluster <- hclust(dist(nfl.wr))
# Select 2 clusters
nfl.rb.clustercut <- cutree(nfl.wr.cluster,2)
# Create data frame of NFL and cluster data
df <- data.frame(nfl,as.factor(nfl.rb.clustercut))
# Rename last column cluster
names(df)[ncol(df)] <- "Cluster"

# Create plot of TD vs. Rec
ggplot(df, aes(Rec, TD,colour = df$Cluster)) + 
  geom_text_repel(aes(label=Player))+
  theme(legend.position="none") +
  ggtitle("Receiving Touchdowns vs. Receptions Using HC - 2 Clusters")

3 Clusters

nfl.rb.clustercut <- cutree(nfl.wr.cluster,3)

df <- data.frame(nfl,as.factor(nfl.rb.clustercut))
names(df)[ncol(df)] <- "Cluster"

ggplot(df, aes(Rec, TD,colour = df$Cluster)) + 
  geom_text_repel(aes(label=Player))+
  theme(legend.position="none") +
  ggtitle("Receiving Touchdowns vs. Receptions Using HC - 3 Clusters")

4 Clusters

nfl.rb.clustercut <- cutree(nfl.wr.cluster,4)

df <- data.frame(nfl,as.factor(nfl.rb.clustercut))
names(df)[ncol(df)] <- "Cluster"

ggplot(df, aes(Rec, TD,colour = df$Cluster)) + 
  geom_text_repel(aes(label=Player))+
  theme(legend.position="none") +
  ggtitle("Receiving Touchdowns vs. Receptions Using HC - 4 Clusters")

5 Clusters

nfl.rb.clustercut <- cutree(nfl.wr.cluster,5)

df <- data.frame(nfl,as.factor(nfl.rb.clustercut))
names(df)[ncol(df)] <- "Cluster"

ggplot(df, aes(Rec, TD,colour = df$Cluster)) + 
  geom_text_repel(aes(label=Player))+
  theme(legend.position="none") +
  ggtitle("Receiving Touchdowns vs. Receptions Using HC - 5 Clusters")