One of the easiest and most appropriate methods for testing whether a data set contains multiple categories is k-means clustering. This technique can be supervised, in that you tell the computer how many clusters you think are in the original file. However, it is much wiser to test many k-means clusters using an unsupervised process. Here we show three of these. The The first one we will examine is the “elbow” method, runs several clusters, and produces a graph that visually lets you see what the ideal number of clusters is. You identify it by seeing the “bend” in the elbow. Here’s some code for generating a very distinct binary cluster and running the elbow test.
library(tidyverse)
library(factoextra)
library(cluster)
points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 6
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1),
Y = rnorm(points, mean = 0, sd = sd1),
Name="Group 1", col = GOLD0A,Shape=1)
G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2),
Y = rnorm(points, mean = 0, sd = sd2),
Name="Group 2", col = BLUE0A,Shape=2)
G <- bind_rows(G1,G2) p2 = length(G$X[G$Name=="Group 1" & G$X> min(G$X[G$Name=="Group 2"])])/points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2
fviz_nbclust(G[, 1:2], kmeans, method = "wss")
The second technique will tell you the answer, identifying a peak “silhouette width” with a handy dashed line.
fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")
The third shows a “gap” statistic, with the highest peak identified.
gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 25, K.max = 10, B = 50) fviz_gap_stat(gap_stat)
As you can see, all three cluster identification techniques show that the ideal number of clusters is 2. Which makes sense because that is the number we initially generated.
Here I show you what the difference between the real cluster and the estimate cluster looks like, beginning with the real cluster.
G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ",percent(as.numeric(p[1])), " : Overlap range = ", percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)
Followed by the k-means cluster.
set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm="Lloyd") binaryCluster$cluster <- as.factor(binaryCluster$cluster) binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A G$col2 = binaryCluster$color G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") + ylab("") + coord_equal(ratio=1)
Notice that the unsupervised clusering will mis-categorize some items in the cluster, but gets most of them correct.
Here we generate a binary separated by 4 standard deviations.
points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 4
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1), Y = rnorm(points, mean = 0, sd = sd1), Name="Group 1", col = GOLD0A,Shape=1)
G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2), Y = rnorm(points, mean = 0, sd = sd2), Name="Group 2", col = BLUE0A,Shape=2)
G <- bind_rows(G1,G2) p2 = length(G$X[G$Name=="Group 1" & G$X> min(G$X[G$Name=="Group 2"])])/points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2
Notice that even with 4 standard deviations separating the groups, the elbow technique still clearly diagnoses 2 clusters – a binary system.
fviz_nbclust(G[, 1:2], kmeans, method = "wss")
fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")
gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 50) fviz_gap_stat(gap_stat)
And here is the underlying cluster with overlapped entries.
G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ",percent(as.numeric(p[1])),
" : Overlap range = ",percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)
Notice that the cluster analysis misidentifies many entries – about 5% of them.
set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm="Lloyd") binaryCluster$cluster <- as.factor(binaryCluster$cluster) binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A G$col2 = binaryCluster$color G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") + ylab("") + coord_equal(ratio=1)
Lastly, here is a binary that is only separated by 2 standard deviations. A barely noticeable binary.
points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 2
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1), Y = rnorm(points, mean = 0, sd = sd1), Name="Group 1", col = GOLD0A,Shape=1)
G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2), Y = rnorm(points, mean = 0, sd = sd2), Name="Group 2", col = BLUE0A,Shape=2)
G <- bind_rows(G1,G2) p2 = length(G$X[G$Name=="Group 1" & G$X> min(G$X[G$Name=="Group 2"])])/points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2
Notice that even with 2 standard deviations separating the groups, the elbow technique DOES diagnose that this is a binary system, but barely. The silhouette and gap techniques also point to a binary.
library(factoextra)
fviz_nbclust(G[, 1:2], kmeans, method = "wss")
fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")
gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 50) fviz_gap_stat(gap_stat)
Here you can see the underlying binary division.
G %>% ggplot(aes(x = X, y = Y)) + geom_point(aes(colour = Name), show.legend = TRUE) + scale_color_manual(values=c(GOLD0A,BLUE0A)) + xlab(paste("Overlap percent = ",percent(as.numeric(p[1]))," : Overlap range = ",percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)
And as you would expect, oh boy does the k-means clustering make mistakes.
set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm="Lloyd") binaryCluster$cluster <- as.factor(binaryCluster$cluster) binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A G$col2 = binaryCluster$color G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") +
ylab("") + coord_equal(ratio=1)
However, K-means clustering can still uncover the binary.
References:
Weitzman, M. S. (1970). Measures of overlap of income distributions of white and Negro families in the United States. Washington: U.S. Bureau of the Census.