- Discuss fundamentals of clustering
- Apply knowledge to simple problems
A simple way of describing cluster analysis is dividing data into groups (clusters) that are meaningful and useful, based upon properties of your data. The goal is for the objects within each cluster to be similar to one another and distinct from the objects in other clusters. The greater the homogeneity (similarity) within a cluster and the greater the difference between clusters, the more distinct the clustering.
As an example, you can cluster the teams in any sports league by skill level: teams with higher overall skill fall in one cluster, while teams with lower overall skill fall in another.
An overarching question in clustering is: how do you find similar groups? Since the task of clustering is somewhat subjective, there are many different ways of accomplishing this goal. Two of the most common types of clustering algorithms are:
K-Means Clustering
Hierarchical Clustering
K-Means is a partitional clustering technique that attempts to find a user-specified number of clusters, k. In short, the algorithm chooses k initial centroids, assigns each point to its nearest centroid, recomputes each centroid as the mean of the points assigned to it, and repeats until the assignments stop changing.
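The assign-and-update loop can be sketched in a few lines of base R. This is a minimal illustration only: `simple_kmeans()` and the `toy` data are hypothetical names made up for this example, and the loop is the textbook (Lloyd's) variant, whereas R's built-in `kmeans()` uses the Hartigan-Wong algorithm by default.

```r
# Minimal sketch of the K-Means loop (Lloyd's variant) on toy data.
# simple_kmeans() is a hypothetical helper written for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Use k distinct rows, chosen at random, as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    # Index of the nearest centroid for each point
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(1)
# Two well-separated groups of 20 points each (made-up data)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)   # number of points in each cluster
fit$centers          # final centroid coordinates
```

Because the starting centroids are random, different seeds can give different cluster labels, which is also true of `kmeans()`.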
When clustering, it is important to center and scale your data. This is a consequence of the distance metric being used. For example, in the NFL, is 1 rushing yard equivalent to 1 touchdown? Of course not! One touchdown is much more valuable than one rushing yard. Without scaling, variables measured in large units (such as yards) would dominate the distance calculation. Hence, when we center and scale our data, every variable contributes to the distance on a comparable scale.
Here we are going to perform cluster analysis on the NFL Running Backs 2016 data set. This data set contains the season statistics of each starting running back (32 total) for the 2016-2017 NFL season.
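To make the effect of scaling concrete, here is a small sketch using three made-up players (the `toy` values are hypothetical, not taken from the NFL data set). It compares Euclidean distances before and after `scale()`.

```r
# Hypothetical toy data: yards and touchdowns live on very different scales
toy <- data.frame(RushYds = c(1200, 800, 1000),
                  RushTD  = c(12, 4, 8))
rownames(toy) <- c("A", "B", "C")

# Raw Euclidean distances are driven almost entirely by RushYds
dist(toy)

# scale() subtracts each column's mean and divides by its standard deviation
toy.scaled <- scale(toy)
colMeans(toy.scaled)        # each column is now centered at 0
apply(toy.scaled, 2, sd)    # and has standard deviation 1

# After scaling, both variables contribute equally to the distance
dist(toy.scaled)
```

In the raw data, the 400-yard gap between players A and B swamps the 8-touchdown gap. After scaling, both gaps are expressed in standard deviations and carry equal weight.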
Let's see if K-Means clustering can identify meaningful groups within this data set. The structure of the K-Means call is as follows: kmeans(data set, centers = k). There are many more arguments, which can be viewed in the help menu using ?kmeans.
Note: K-Means clustering can only be applied to numerical values, so you will have to remove all categorical variables from any data set where you want to apply cluster analysis.
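As a quick illustration of the call structure and of dropping categorical columns, the sketch below uses R's built-in `iris` data as a stand-in (an assumption made so the example runs without the NFL file). The names `numeric.cols`, `iris.num`, and `iris.cluster` are made up for this example.

```r
# Sketch on the built-in iris data, which has one categorical column (Species)
numeric.cols <- sapply(iris, is.numeric)    # TRUE for the four measurement columns
iris.num <- scale(iris[, numeric.cols])     # drop Species, then center and scale

set.seed(42)                                # K-Means starts from random centroids
iris.cluster <- kmeans(iris.num, centers = 3)

iris.cluster$size             # number of rows in each cluster
head(iris.cluster$cluster)    # cluster label assigned to the first few rows
iris.cluster$centers          # cluster means, in scaled units
```

Selecting columns with `is.numeric` is one alternative to naming the categorical columns explicitly with `subset()`, and may be convenient when a data set has many columns.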
    # Load data
    nfl <- read.csv("NFL Running Backs 2016.csv")

    # Remove categorical variables
    nfl.rb <- subset(nfl, select = -c(Player, Year, Team, Pos))

    # Give our data frame row names for ease of use
    rownames(nfl.rb) <- nfl$Player

    # Scale data. This is important!!
    nfl.rb <- scale(nfl.rb)

    # K-means algorithm with 2 clusters
    nfl.rb.cluster <- kmeans(nfl.rb, centers = 2)

    # Output results
    nfl.rb.cluster
    ## K-means clustering with 2 clusters of sizes 18, 14
    ##
    ## Cluster means:
    ##          Age     Number      Games     Starts   RushAtms    RushYds
    ## 1 -0.1568003 -0.1192355  0.5318690  0.6322527  0.7117896  0.7203365
    ## 2  0.2016003  0.1533027 -0.6838316 -0.8128964 -0.9151580 -0.9261469
    ##       RushTD    RushLng    RushYPA    RushYPG RushAtmsPG    Targets
    ## 1  0.5647368  0.4677834  0.3576346  0.6127593  0.5545440  0.4111109
    ## 2 -0.7260901 -0.6014358 -0.4598160 -0.7878334 -0.7129851 -0.5285711
    ##          Rec        Yds     RecYPR      RecTD     RecLng     RecRPG
    ## 1  0.4327003  0.4018829  0.1471037  0.2699913  0.3608545  0.2849799
    ## 2 -0.5563290 -0.5167066 -0.1891334 -0.3471317 -0.4639558 -0.3664027
    ##         RYPG      Ctch. YardfmScrim    TotalTD    Fumbles
    ## 1  0.2982008  0.2743948   0.7005657  0.6033873  0.3587948
    ## 2 -0.3834010 -0.3527934  -0.9007273 -0.7757836 -0.4613076
    ##
    ## Clustering vector:
    ##    JohnsonDa    FreemanDe       WestTe      McCoyLe    StewartJo
    ##            1            1            2            1            2
    ##     HowardJo       HillJe    JohnsonDu     ElliotEz   AndersonCJ
    ##            1            1            2            1            2
    ##    RiddickTh       LacyEd     MillerLa       GoreFr      IvoryCh
    ##            2            2            1            1            2
    ##    CharlesJa     GurleyTo      AjayiJa   PetersonAd     IngramMa
    ##            2            1            1            2            1
    ## LeGarretteBl   JenningsRa      ForteMa     MurrayLa    MathewsRy
    ##            1            2            1            1            2
    ##       BellLe     MelvinGo      RawlsTh       HydeCa     MartinDo
    ##            1            1            2            1            2
    ##     MurrayDe     KelleyRo
    ##            1            2
    ##
    ## Within cluster sum of squares by cluster:
    ## [1] 276.5100 228.0134
    ##  (between_SS / total_SS =  29.2 %)
    ##
    ## Available components:
    ##
    ## [1] "cluster"      "centers"      "totss"        "withinss"
    ## [5] "tot.withinss" "betweenss"    "size"         "iter"
    ## [9] "ifault"
    # See cluster
    nfl.rb.cluster$cluster
    ##    JohnsonDa    FreemanDe       WestTe      McCoyLe    StewartJo
    ##            1            1            2            1            2
    ##     HowardJo       HillJe    JohnsonDu     ElliotEz   AndersonCJ
    ##            1            1            2            1            2
    ##    RiddickTh       LacyEd     MillerLa       GoreFr      IvoryCh
    ##            2            2            1            1            2
    ##    CharlesJa     GurleyTo      AjayiJa   PetersonAd     IngramMa
    ##            2            1            1            2            1
    ## LeGarretteBl   JenningsRa      ForteMa     MurrayLa    MathewsRy
    ##            1            2            1            1            2
    ##       BellLe     MelvinGo      RawlsTh       HydeCa     MartinDo
    ##            1            1            2            1            2
    ##     MurrayDe     KelleyRo
    ##            1            2
Create a data frame with the cluster as a column:
    # Create data frame joining cluster data to NFL data
    df <- data.frame(nfl, as.factor(nfl.rb.cluster$cluster))

    # Rename last column "Cluster"
    names(df)[ncol(df)] <- "Cluster"
    head(df)
    ##              Player Year Age Team Pos Number Games Starts RushAtms RushYds
    ## JohnsonDa JohnsonDa 2016  25  ARI  RB     31    16     16      293    1239
    ## FreemanDe FreemanDe 2016  24  ATL  RB     24    16     16      227    1079
    ## WestTe       WestTe 2016  25  BAL  RB     28    16     13      193     774
    ## McCoyLe     McCoyLe 2016  28  BUF  RB     25    15     15      234    1267
    ## StewartJo StewartJo 2016  29  CAR  RB     28    13     13      218     824
    ## HowardJo   HowardJo 2016  22  CHI  RB     24    15     13      252    1313
    ##           RushTD RushLng RushYPA RushYPG RushAtmsPG Targets Rec Yds RecYPR
    ## JohnsonDa     16      58     4.2    77.4       18.3     120  80 879   11.0
    ## FreemanDe     11      75     4.8    67.4       14.2      65  54 462    8.6
    ## WestTe         5      41     4.0    48.4       12.1      45  34 236    6.9
    ## McCoyLe       13      75     5.4    84.5       15.6      57  50 356    7.1
    ## StewartJo      9      47     3.8    63.4       16.8      21   8  60    7.5
    ## HowardJo       6      69     5.2    87.5       16.8      50  29 298   10.3
    ##           RecTD RecLng RecRPG RYPG Ctch. YardfmScrim TotalTD Fumbles
    ## JohnsonDa     4     58    5.0 54.9  66.7        2118      20       5
    ## FreemanDe     2     35    3.4 28.9  83.1        1541      13       1
    ## WestTe        1     17    2.1 14.8  75.6        1010       6       2
    ## McCoyLe       1     41    3.3 23.7  87.7        1623      14       3
    ## StewartJo     0     25    0.6  4.6  38.1         884       9       3
    ## HowardJo      1     34    1.9 19.9  58.0        1611       7       2
    ##           Cluster
    ## JohnsonDa       1
    ## FreemanDe       1
    ## WestTe          2
    ## McCoyLe         1
    ## StewartJo       2
    ## HowardJo        1
    # Load relevant libraries
    library(ggplot2)
    library(ggrepel)

    # ggplot of df with x: RushYds, y: RushTD with coloring by cluster
    ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 2")
K = 3
    nfl.rb.cluster <- kmeans(nfl.rb, centers = 3)

    # Create data frame joining cluster data to NFL data
    df <- data.frame(nfl, as.factor(nfl.rb.cluster$cluster))

    # Rename last column "Cluster"
    names(df)[ncol(df)] <- "Cluster"

    # ggplot of df with x: RushYds, y: RushTD with coloring by cluster
    ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 3")
K = 4
    nfl.rb.cluster <- kmeans(nfl.rb, centers = 4)

    # Create data frame joining cluster data to NFL data
    df <- data.frame(nfl, as.factor(nfl.rb.cluster$cluster))

    # Rename last column "Cluster"
    names(df)[ncol(df)] <- "Cluster"

    # ggplot of df with x: RushYds, y: RushTD with coloring by cluster
    ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 4")
K = 5
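    # K-means algorithm with 5 clusters
    nfl.rb.cluster <- kmeans(nfl.rb, centers = 5)

    # Join cluster assignments to the NFL data and name the new column
    df <- data.frame(nfl, as.factor(nfl.rb.cluster$cluster))
    names(df)[ncol(df)] <- "Cluster"

    # ggplot of df with x: RushYds, y: RushTD with coloring by cluster
    ggplot(df, aes(RushYds, RushTD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Rushing Touchdowns vs. Rushing Yards K = 5")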
By printing nfl.rb.cluster$cluster after any of the code above, you can see which cluster each player is in. In our plots, we chose to plot rushing touchdowns vs. rushing yards. Notice that some color groups (clusters) are consistently near the top corner, while others are near the bottom corner. By clustering, we can get a sense of the different tiers and types of players.
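The four fits above repeat the same steps with a different k each time. As a rough sketch, the repetition can be folded into a single `lapply()` call. The built-in `mtcars` data is used here as a stand-in (an assumption, so the example runs without the NFL file), and `cars.scaled`, `fits`, and `labels` are names made up for this example.

```r
# Sketch: fit K-Means for k = 2, 3, 4, 5 in one pass (mtcars is a stand-in data set)
cars.scaled <- scale(mtcars)

set.seed(2016)   # K-Means starts from random centroids
fits <- lapply(2:5, function(k) kmeans(cars.scaled, centers = k))
names(fits) <- paste0("k", 2:5)

# One column of cluster labels per k, one row per observation
labels <- sapply(fits, function(fit) fit$cluster)
head(labels)

# Cluster sizes for each k
lapply(fits, function(fit) fit$size)
```

Laying the labels side by side makes it easier to see which observations stay together as k grows and which ones split off into new clusters.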
Using the NFL Wide Receivers 2016.csv data set, complete the following:
Subset your data frame to remove Player, Year, Team, and Pos, then center and scale your data.
Reconstruct the 4 graphs above using k = 2,3,4,5. However, this time put touchdowns (TD) on the y-axis and receptions (Rec) on the x-axis.
    # Load WR data set
    nfl <- read.csv("NFL Wide Receivers 2016.csv")

    # Subset data to remove categorical variables
    nfl.wr <- subset(nfl, select = -c(Player, Year, Team, Pos))

    # Name rows of data set
    rownames(nfl.wr) <- nfl$Player

    # Center and scale data
    nfl.wr <- scale(nfl.wr)

    # K means with 2 clusters
    nfl.wr.cluster <- kmeans(nfl.wr, centers = 2)

    # Data frame of original data plus cluster assignments
    df <- data.frame(nfl, as.factor(nfl.wr.cluster$cluster))

    # Rename last column
    names(df)[ncol(df)] <- "Cluster"

    # Plotting TD vs. Rec
    ggplot(df, aes(Rec, TD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 2")
K = 3
    # K means with 3 clusters
    nfl.wr.cluster <- kmeans(nfl.wr, centers = 3)

    # Data frame of original data plus cluster assignments
    df <- data.frame(nfl, as.factor(nfl.wr.cluster$cluster))

    # Rename last column
    names(df)[ncol(df)] <- "Cluster"

    # Plotting TD vs. Rec
    ggplot(df, aes(Rec, TD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 3")
K = 4
    # K means with 4 clusters
    nfl.wr.cluster <- kmeans(nfl.wr, centers = 4)

    # Data frame of original data plus cluster assignments
    df <- data.frame(nfl, as.factor(nfl.wr.cluster$cluster))
    names(df)[ncol(df)] <- "Cluster"

    # Plotting TD vs. Rec
    ggplot(df, aes(Rec, TD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 4")
K = 5
    # K means with 5 clusters
    nfl.wr.cluster <- kmeans(nfl.wr, centers = 5)

    # Data frame of original data plus cluster assignments
    df <- data.frame(nfl, as.factor(nfl.wr.cluster$cluster))
    names(df)[ncol(df)] <- "Cluster"

    # Plotting TD vs. Rec
    ggplot(df, aes(Rec, TD, colour = Cluster)) +
      geom_text_repel(aes(label = Player)) +
      theme(legend.position = "none") +
      ggtitle("Visualization of Clusters for Receiving Touchdowns vs. Receptions K = 5")