Category Archives: Tutorial

Tutorial 4: Coin-toss for Linguists (Central Limit Theorem)

Here is a basic demonstration of how randomness works, but because I am writing this for linguists rather than statisticians, I’m modifying the standard coin-toss example for speech. Imagine you have a language with words that all start with either “t” or “d”. The word means the same thing regardless, so this is a “phonetic” rather than “phonemic” difference. Imagine also that each speaker uses “t” or “d” randomly about 50% of the time. Then record four speakers saying 20 of these words 10 times each.

Now ask the question: Will some words have more “t” productions than others?

The answer is ALWAYS yes, even when different speakers produce “t” and “d” sounds as completely random choices. Let me show you:
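Before the simulation, a quick back-of-the-envelope check with plain binomial probabilities (a minimal sketch of mine, assuming fair 50/50 choices and independent productions) shows why:

# Probability that all 20 words land exactly 5 "t" and 5 "d" out of 10
dbinom(5, 10, 0.5)^20   # about 7e-13: essentially never

# Probability that a single word is strongly lopsided (8+ of one variant)
pLopsided = pbinom(2, 10, 0.5) + (1 - pbinom(7, 10, 0.5))   # about 0.11

# Probability that AT LEAST ONE of the 20 words is that lopsided
1 - (1 - pLopsided)^20   # about 0.90

So a perfectly even outcome essentially never happens, and roughly nine runs out of ten will produce at least one word that is 80% or more a single variant, purely by chance.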

As with most of these examples I provide, I begin with code for libraries, colors, and functions.

library(tidyverse)
library(factoextra)
library(cluster)

RED0 = rgb(213, 13, 11, 255, maxColorValue = 255)
BLUE0 = rgb(0, 98, 172, 255, maxColorValue = 255)
GOLD0 = rgb(172, 181, 0, 255, maxColorValue = 255)

Then I provide code for functions.

randomDistribution <- function(maxCols, maxRep, cat1)
{
  # one row per production: maxCols words (x), each produced maxRep
  # times (n), with every production initially labelled cat1
  tibble(x = rep(1:maxCols, each = maxRep),
         n = rep(1:maxRep, times = maxCols),
         y = cat1)
}

randomOrder <- function(distro)
{
  distro %>%
    # give each production a unique id, then relabel a random half as "d"
    mutate(line = row_number(),
           y = case_when(line %in% sample(line, n() / 2) ~ "d", TRUE ~ y)) %>%
    group_by(x, y) %>%                      # percentage of each variant by word (x)
    summarize(count = n(), .groups = "drop_last") %>%
    mutate(perc = count / sum(count)) %>%
    ungroup() %>%
    arrange(y, desc(perc)) %>%
    mutate(x = factor(x, levels = unique(x))) %>%
    arrange(desc(perc))
}

And now for the data itself. I build four tables with 20 words (x values) and 10 recordings (n values) each, with each production labelled in the “y” value. I start by labelling all of these “t”, and then randomly select half of the productions and relabel them “d”. I then compute the percentage of each variant by word (x).

I also combine the four speakers, and do the same for all of them.

D1 <- randomDistribution(20,10,"t")
D2 <- randomDistribution(20,10,"t")
D3 <- randomDistribution(20,10,"t")
D4 <- randomDistribution(20,10,"t")
D5 <- bind_rows(D1,D2,D3,D4)

D1 = randomOrder(D1)
D2 = randomOrder(D2)
D3 = randomOrder(D3)
D4 = randomOrder(D4)
D5 = randomOrder(D5)

Now I plot a distribution graph for each group. Note that some words are mostly one type of production (“d”) and others are mostly the other production (“t”). This inevitably occurs by random chance, and it differs by participant.

However, even when you pool all the participant data, you see the same result. This distribution is simply part of how randomization works, and needs no explanation beyond random chance itself.

D1 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 1")

D2 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 2")

D3 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 3")

D4 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 4")

D5 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("all groups")

And you can see that the combined data from all four speakers still shows some words with almost no “d” values and some words with very few “t” values.

Because a purely random distribution will generate individual words with few, or even none, of a particular variant, even across speakers, you cannot use differences in the distributions by themselves to identify meaningful patterns.
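One way to see this in the simulated data itself is to count the words whose dominant variant passes a (deliberately arbitrary) threshold; this check is mine, not part of the original post, and the exact words will differ on every run:

# per-speaker words at 75%+ one variant (D1 has 10 productions per word):
# typically a couple of words qualify
D1 %>% group_by(x) %>% summarize(maxPerc = max(perc)) %>% filter(maxPerc >= 0.75)

# pooled words (D5 has 40 productions per word) still skew past 60%:
# typically a handful of words qualify
D5 %>% group_by(x) %>% summarize(maxPerc = max(perc)) %>% filter(maxPerc >= 0.6)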

And that is the “coin toss” tutorial for linguists, also known as the central limit theorem. The main takeaway is that you need minimal pairs, or at least minimal environments, to establish evidence that a distribution of two phonetic outputs could be phonemic.

Even then, the existence of a phonemic distinction doesn’t mean it predicts very many examples in speech.

Tutorial 3: K-means clustering

One of the easiest and most appropriate methods for testing whether a data set contains multiple categories is k-means clustering. The technique can be supervised, in that you tell the computer how many clusters you think the data contains. However, it is much wiser to test many candidate cluster counts using an unsupervised process. Here we show three of these. The first is the “elbow” method, which runs k-means for a range of cluster counts and produces a graph that lets you see the ideal number visually: you identify it by finding the “bend” in the elbow. Here’s some code for generating a very distinct binary cluster and running the elbow test.
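One caveat before the code: it relies on a helper function, min.f1f2, and on two colour constants, GOLD0A and BLUE0A, that never get defined in the post. Below is a minimal reconstruction, assuming min.f1f2 is the pointwise minimum of the two normal densities (the integrand of Weitzman’s (1970) overlapping coefficient, which reproduces the overlap percentages quoted in Tutorials 1 and 2 below) and that the “A” constants are simply semi-transparent versions of the Tutorial 4 colours:

library(scales)   # for percent(), used in the axis labels below

# assumed definitions, not in the original post:
# semi-transparent variants of the Tutorial 4 colours
GOLD0A = rgb(172, 181, 0, 127, maxColorValue = 255)
BLUE0A = rgb(0, 98, 172, 127, maxColorValue = 255)

# pointwise minimum of two normal densities; integrating this over the
# real line gives Weitzman's overlapping coefficient
min.f1f2 <- function(x, mu1, mu2, sd1, sd2) {
  pmin(dnorm(x, mean = mu1, sd = sd1), dnorm(x, mean = mu2, sd = sd2))
}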

library(tidyverse)
library(factoextra)
library(cluster)
points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 6
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)

G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1),
Y = rnorm(points, mean = 0, sd = sd1),
Name="Group 1", col = GOLD0A,Shape=1)

G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2),
Y = rnorm(points, mean = 0, sd = sd2),
Name="Group 2", col = BLUE0A,Shape=2)

G <- bind_rows(G1, G2)
p2 = length(G$X[G$Name == "Group 1" & G$X > min(G$X[G$Name == "Group 2"])]) / points
p2 = p2 + length(G$X[G$Name == "Group 2" & G$X < max(G$X[G$Name == "Group 1"])]) / points
p2 = p2 / 2
fviz_nbclust(G[, 1:2], kmeans, method = "wss")

The second technique will tell you the answer, identifying a peak “silhouette width” with a handy dashed line.

fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")

The third shows a “gap” statistic, with the highest peak identified.

gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

As you can see, all three cluster-identification techniques show that the ideal number of clusters is 2, which makes sense because that is the number we generated.

Here I show you what the difference between the real clusters and the estimated clusters looks like, beginning with the real clusters.

G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ",percent(as.numeric(p[1])), " : Overlap range = ", percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)

Followed by the k-means cluster.

set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm = "Lloyd")
binaryCluster$cluster <- as.factor(binaryCluster$cluster)
binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A
binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A
G$col2 = binaryCluster$color
G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") + ylab("") + coord_equal(ratio=1)

Notice that the unsupervised clustering mis-categorizes some items in the cluster, but gets most of them correct.
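To put a number on “most of them”, here is a quick check (not part of the original post) comparing the k-means assignments to the generating group labels:

# cluster ids are arbitrary (cluster 1 may be either group),
# so take the better of the two possible labellings
agreement <- mean((binaryCluster$cluster == 1) == (G$Name == "Group 1"))
max(agreement, 1 - agreement)   # proportion of the 20,000 points correctly grouped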

Here we generate a binary separated by 4 standard deviations.

points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 4
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1),
Y = rnorm(points, mean = 0, sd = sd1),
Name = "Group 1", col = GOLD0A, Shape = 1)

G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2),
Y = rnorm(points, mean = 0, sd = sd2),
Name = "Group 2", col = BLUE0A, Shape = 2)

G <- bind_rows(G1, G2)
p2 = length(G$X[G$Name == "Group 1" & G$X > min(G$X[G$Name == "Group 2"])]) / points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2

Notice that even with 4 standard deviations separating the groups, the elbow technique still clearly diagnoses 2 clusters – a binary system.

fviz_nbclust(G[, 1:2], kmeans, method = "wss")

fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")

gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

And here are the underlying clusters, with overlapping entries.

G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ",percent(as.numeric(p[1])),
" : Overlap range = ",percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)

Notice that the cluster analysis misidentifies many entries – about 5% of them.

set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm = "Lloyd")
binaryCluster$cluster <- as.factor(binaryCluster$cluster)
binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A
binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A
G$col2 = binaryCluster$color
G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") + ylab("") + coord_equal(ratio=1)

Lastly, here is a binary that is separated by only 2 standard deviations: a barely noticeable binary.

points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 2
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1),
Y = rnorm(points, mean = 0, sd = sd1),
Name = "Group 1", col = GOLD0A, Shape = 1)

G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2),
Y = rnorm(points, mean = 0, sd = sd2),
Name = "Group 2", col = BLUE0A, Shape = 2)

G <- bind_rows(G1, G2)
p2 = length(G$X[G$Name == "Group 1" & G$X > min(G$X[G$Name == "Group 2"])]) / points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2

Notice that even with 2 standard deviations separating the groups, the elbow technique DOES diagnose that this is a binary system, but barely. The silhouette and gap techniques also point to a binary.

library(factoextra)
fviz_nbclust(G[, 1:2], kmeans, method = "wss")

fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")

gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

Here you can see the underlying binary division.

G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ", percent(as.numeric(p[1])), " : Overlap range = ", percent(p2), sep="")) + ylab("") + coord_equal(ratio=1)

And as you would expect, oh boy does the k-means clustering make mistakes.

set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm = "Lloyd")
binaryCluster$cluster <- as.factor(binaryCluster$cluster)
binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A
binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A
G$col2 = binaryCluster$color
G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") +
ylab("") + coord_equal(ratio=1)

However, k-means clustering still uncovers the binary.

References:

Weitzman, M. S. (1970). Measures of overlap of income distributions of white and Negro families in the United States. Washington: U.S. Bureau of the Census.

https://afit-r.github.io/kmeans_clustering

https://rpubs.com/williamsurles/310847

Tutorial 2: Overlapping binaries

Having previously demonstrated what two binary groupings look like when they are separated by six standard deviations, here I demonstrate what they look like when separated by 4 standard deviations. Such a binary has an overlapping coefficient of 4.55%, as seen in the output below, computed by integrating Weitzman’s (1970) overlapping distribution.
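The generating code lives in the post’s RMarkdown file rather than in the text; here is a minimal sketch that reproduces the output below, assuming the same min.f1f2 helper reconstructed in Tutorial 3:

# overlapping coefficient for two unit-variance normals 4 SDs apart
p = integrate(min.f1f2, -Inf, Inf, mu1 = 0, mu2 = 4, sd1 = 1, sd2 = 1)
p
percent(p$value, accuracy = 0.01)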

## 0.04550026 with absolute error < 3.8e-05
## [1] "4.55%"

This is what such data looks like graphed in a density curve.

The overlap range is now much larger, as can be seen in the scatterplot below.

Now let’s look at a separation of 2 standard deviations.
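Again, a minimal sketch of the corresponding computation, under the same assumptions:

p = integrate(min.f1f2, -Inf, Inf, mu1 = 0, mu2 = 2, sd1 = 1, sd2 = 1)
p
percent(p$value, accuracy = 0.01)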

## 0.3173105 with absolute error < 4.7e-05
## [1] "31.73%"

The density plot now overlaps a lot.

And this is what the scatterplot looks like.

Now look at the scatterplot without color differences. At this point there is the barest of hints that there might be a binary in this system at all.

Let us compare that to the initial binary, separated by 6 standard deviations, now in grey.


With this data, the binary remains visible and obvious even when both samples are gray.

However, even if you cannot observe categories by looking directly, there are tools that can help identify n-ary categories in what looks to us like gradient data: the tools of unsupervised cluster analysis, which I will discuss in the next tutorial.

The RMarkdown file used to generate this post can be found here. Some of the code was modified from code on this site.

References:

Weitzman, M. S. (1970). Measures of overlap of income distributions of white and Negro families in the United States. Washington: U.S. Bureau of the Census.

Tutorial 1: Gradient effects within binary systems

This post provides a visual example of gradient behaviour within a univariate binary system.

Here I demonstrate what two binary groupings look like when each group has a standard deviation of 1 on a non-dimensional scale and the two means are separated by 6 standard deviations. Such a binary has an overlapping coefficient of 0.27%, as seen in the output below, computed by integrating Weitzman’s (1970) overlapping distribution.
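A minimal sketch of that computation, assuming the min.f1f2 helper reconstructed in Tutorial 3 (the pointwise minimum of the two normal densities):

p = integrate(min.f1f2, -Inf, Inf, mu1 = 0, mu2 = 6, sd1 = 1, sd2 = 1)
percent(p$value, accuracy = 0.01)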

## [1] "0.27%"

But the overlapping coefficient hides the fact that in groups of, say, 10,000 tokens each, the absolute number of overlapping outliers can be substantial, and individual tokens sometimes look like they belong firmly in the other binary choice, like the one blue dot in the gold cloud. (Note that the y-axis is added to make the display easier to understand, but provides none of the data used in this analysis.)
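A quick expected-value check (not in the original post) puts a number on this. The midpoint between the two groups lies 3 standard deviations from each mean, so:

# expected number of tokens per group of 10,000 that land on the
# other side of the midpoint between the two means
10000 * pnorm(-3)   # about 13.5 tokens

And a few of those will stray even deeper into the other cloud.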

In short, in a binary system, individual tokens that fall thoroughly within the other category’s range will occur due to simple random variation; they are not evidence of constant gradient overlap, nor evidence against the existence of the binary. This happens whenever the two categories are close enough relative to the number of examples, with “close enough” determined by simple probability, even in a univariate system (one without outside influences).

The RMarkdown file used to generate this post can be found here. Some of the code was modified from code on this site.

References:

Weitzman, M. S. (1970). Measures of overlap of income distributions of white and Negro families in the United States. Washington: U.S. Bureau of the Census.