Tutorial 4: Coin-toss for Linguists (Central Limit Theorem)

Here is a basic demonstration of how randomness works, but because I am writing this for linguists rather than statisticians, I’m adapting the standard coin-toss example to speech. Imagine a language with words that all start with either “t” or “d”. A word means the same thing either way, so this is a “phonetic” rather than a “phonemic” difference. Imagine also that each speaker chooses “t” or “d” at random, about 50% of the time each. Now imagine recording four speakers saying 20 of these words 10 times each.

Now ask the question: Will some words have more “t” productions than others?

The answer is ALWAYS yes, even when every speaker chooses between “t” and “d” completely at random. Let me show you:
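As a quick back-of-the-envelope check on that claim, you can toss the coins for a single speaker in one line of base R with rbinom(): each of the 20 words gets 10 tosses of a fair coin, and the per-word counts of “t” already spread out quite a bit. (This is just a preview sketch; the full simulation follows below.)

rbinom(n = 20, size = 10, prob = 0.5)
# each number is how many of that word's 10 productions came out "t";
# even with a fair coin, a few words will usually land up around 7-8
# while others land down around 2-3, purely by chance

The full version below does the same thing, but keeps track of every individual production so the results can be tallied and plotted by word and by speaker.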

As with most of these examples I provide, I begin with code for libraries, colors, and functions.

library(tidyverse)
library(factoextra)
library(cluster)

RED0 = (rgb(213,13,11, 255, maxColorValue=255))
BLUE0 = (rgb(0,98,172,255, maxColorValue=255))
GOLD0 = (rgb(172,181,0,255, maxColorValue=255))

Then I provide the code for the two functions: one builds a table of productions for one speaker, and the other randomly relabels half of those productions as “d” and computes the percentage of each variant per word.

randomDistribution <- function(maxCols, maxRep, cat1)
{
  # one row per word (x) and per repetition (n); every production starts out labelled cat1
  distroTibble = tibble(x = c(1:(maxCols * maxRep)), n = 1, y = cat1)
  for (i in c(1:maxCols))
  {
    for (j in c(1:maxRep))
    {
      distroTibble$x[((i-1)*maxRep)+j] = i
      distroTibble$n[((i-1)*maxRep)+j] = j
    }
  }
  return(distroTibble)
}

randomOrder <- function(distro)
{
  distro <- distro %>%
    # give every row an index, then relabel a random half of the productions as "d"
    mutate(line = row_number()) %>%
    mutate(y = case_when(line %in% sample(line, length(line)/2) ~ "d", TRUE ~ y)) %>%
    # percentage of each variant within each word (x)
    group_by(x, y) %>% summarize(count = n(), .groups = "drop_last") %>%
    mutate(perc = count/sum(count)) %>% ungroup() %>%
    # order the words from most-"d" to least-"d" so the plots are easier to read
    arrange(y, desc(perc)) %>% mutate(x = factor(x, levels=unique(x))) %>%
    arrange(desc(perc))
  return(distro)
}

And now for the data itself. I build four tables with 20 words (x values) and 10 recordings (n values) each, with each production’s label stored in the “y” value. I start by labeling all of these “t”, and then randomly select half of the productions and relabel them “d”. I then compute the percentage of each variant by word (x).

I also combine the four speakers, and do the same for all of them.

D1 <- randomDistribution(20,10,"t")
D2 <- randomDistribution(20,10,"t")
D3 <- randomDistribution(20,10,"t")
D4 <- randomDistribution(20,10,"t")
D5 <- bind_rows(D1,D2,D3,D4)

D1 = randomOrder(D1)
D2 = randomOrder(D2)
D3 = randomOrder(D3)
D4 = randomOrder(D4)
D5 = randomOrder(D5)
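One note: every run draws a fresh set of random labels, so your tables and plots will not match mine exactly. If you want a run you can reproduce, calling set.seed() once before building the tables does the trick (an optional extra, not something the demonstration needs):

set.seed(42)   # any fixed number will do; put this before the D1-D5 lines above to make the random draws repeatable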

Now I plot a distribution graph for all of them. Note that some words are mostly one type of production (“d”), and others are mostly the other production (“t”). This inevitably occurs by random chance. And it differs by participant.

However, even when you pool all the participant data, you see the same result. This kind of spread is simply part of how randomization works; it needs no explanation beyond chance itself.

D1 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 1")

D2 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 2")

D3 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 3")

D4 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 4")

D5 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("all groups")
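These plots use ggplot’s default fill colors. If you would rather use the RED0 and BLUE0 colors defined at the top, you can add a manual fill scale to any of the plots; for example (which color goes with which variant is just a choice):

D1 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") +
  scale_fill_manual(values = c("d" = BLUE0, "t" = RED0)) +   # assign one predefined color to each variant
  scale_y_continuous(labels=scales::percent) + ggtitle("group 1")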

And you can see that even the combined data from all four speakers still shows clearly uneven distributions: some words lean heavily toward “t” and others toward “d”, even though every label was assigned at random.

Because a purely random distribution will generate individual words with few or even none of a particular variant, even across speakers, you cannot use differences in these distributions by themselves to identify any meaningful patterns.
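If you want to put a number on how often this happens, the binomial distribution gives it directly. For example, the chance that a given word comes out “t” in 8 or more of its 10 productions, and the chance that at least one of a speaker’s 20 words does so, can be computed with pbinom() (a quick check, independent of the simulation above):

# probability that one word (10 fair tosses) comes out "t" at least 8 times
p_word <- pbinom(7, size = 10, prob = 0.5, lower.tail = FALSE)
p_word                # about 0.055
# probability that at least one of a speaker's 20 words does so
1 - (1 - p_word)^20   # about 0.68, so most runs will contain such a lopsided word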

And that is the “coin toss” tutorial for Linguists – also known as the central limit theorem. The main takeaway message is that you need minimal pairs, or at least minimal environments, to establish evidence that a distribution of two phonetic outputs could be phonemic.

Even then, the existence of a phonemic distinction doesn’t mean it predicts very many examples in speech.
