Monthly Archives: July 2019

TreeForm updated to 1.1

After 11 years, I have finally written a new update to TreeForm to address the issues people have been having with incompatibility with newer operating systems. This version of TreeForm is also a LOT easier to run and install. On Windows and Linux machines with Java Installed, you can just click and run the JAR file directly – all the menus and help screens are incorporated directly, and the file runs anywhere. This helps a LOT with university computers where you often are not allowed to install software.

On Macs, I have provided a package that will install the software. It still requires permission through the “security and privacy” tab of “system preferences”, but after that it will run, even on the Mojave operating system. (If anyone knows how to get Apple to let this app install without activating their “gatekeeper” program, please let me know. I really do not enjoy Apple’s war on open-source programs.)

This version also fixes the color chooser bugs that Oracle inadvertently introduced, allowing color choices to be available again. I have also updated the help, about, and what’s new screens. Lastly, I disabled the custom look and feel – TreeForm now obtains the operating system’s look and feel, and so is slightly different on each system.

You can always find TreeForm at SourceForge, and I will soon put the new source on gitHub.

Building a cleaned dataset of aligned ultrasound, articulometry, and audio.

In 2013, I recorded 11 North American English speakers, each reading eight phrases with two flaps in two syllables (e.g “We have editor books”), and at 5 speech rates, from about 3 syllables/second to 7 syllables/second. Each recording included audio, ultrasound imaging of the tongue, and articulometry.

The dataset has taken a truly inordinate amount of time to label, transcribe (thank you Romain Fiasson), rotate, align ultrasound to audio, fit in shared time (what is known as a Procrustean fit), extract acoustic correlates, and clean from tokens that have recording or unfixable alignment errors.

It is, however, now 2019 and I have a cleaned dataset. I’ve uploaded the dataset, with data at each point of processing included, to an Open Science Framework website: I will, over the next few weeks, upload documentation on how I processed the data, as well as videos of the cleaned data showing ultrasound and EMA motion.

By September 1st, I plan on submitting a research article discussing the techniques used to build the dataset, as well as theoretically motivated subset of the articulatory to acoustic correlates within this dataset to a special issue of a journal whose name I will disclose should they accept the article for publication.

This research was funded by a Marsden Grant from New Zealand, “Saving energy vs. making yourself understood during speech production”. Thanks to Mark Tiede for writing the quaternion rotation tools needed to oriented EMA traces, and to Christian Kroos for teaching our group at Western Sydney Universiy how to implement them. Thanks to Michael Proctor for building filtering and sample repair tools for EMA traces. Thanks also to Wei-rong Chen for writing the palate estimation tool needed to replace erroneous palate traces. Special thanks to Scott Lloyd for his part in developing and building the ultrasound transducer holder prototype used in this research. Dedicated to the memory of Roman Fiasson, who completed most of the labelling and transcription for this project.

Tutorial 4: Coin-toss for Linguists (Central Limit Theorem)

Here is a basic demonstration of how randomness works, but because I am writing this for linguists rather than statisticians, I’m modifying the standard coin-toss example for speech. Imagine you have a language with words that all start with either “t” or “d”. The word means the same thing regardless, so this is a “phonetic” rather than “phonemic” difference. Imagine also that each speaker uses “t” or “d” randomly about 50% of the time. Then record four speakers saying 20 of these words 10 times each.

Now ask the question: Will some words have more “t” productions than others?

The answer is ALWAYS yes, even when different speakers produce “t” and “d” sounds as completely random choices. Let me show you:

As with most of these examples I provide, I begin with code for libraries, colors, and functions.

library(tidyverse)
library(factoextra)
library(cluster)

RED0 = (rgb(213,13,11, 255, maxColorValue=255))
BLUE0 = (rgb(0,98,172,255, maxColorValue=255))
GOLD0 = (rgb(172,181,0,255, maxColorValue=255))

Then I provide code for functions.

randomDistribution <-function(maxCols,maxRep,replaceNumber,cat1,cat2)
{
distro = tibble(x=c(1:maxCols),y=list(rep(cat1, maxRep)))
for (i in sample(1:maxCols, replaceNumber, replace=TRUE))
{
distro$y[[i]] <- tail(append(distro$y[[i]],cat2), maxRep)
}
distroTibble = tibble(x = c(1:(maxCols * maxRep)), n = 1, y = "")
for (i in c(1:maxCols))
{
for (j in c(1:maxRep))
{
distroTibble$x[((i-1)maxRep)+j] = i
distroTibble$n[((i-1)maxRep)+j] = j
distroTibble$y[((i-1)*maxRep)+j] = distro$y[[i]][j]
}
}
return(distroTibble)
}

randomOrder <- function(distro) { distro %<>% mutate(y = case_when(line %in% sample(line)[1:100] ~ "d", TRUE ~ y)) %>%
ungroup() %>% group_by(x, y) %>% summarize(count = n()) %>%
mutate(perc = count/sum(count)) %>% ungroup() %>%
arrange(y, desc(perc)) %>% mutate(x = factor(x, levels=unique(x))) %>%
arrange(desc(perc))
return(distro)
}

And now for the data itself. I build four tables with 20 words (x values) and 10 recordings (n values) each, with the recordings labelled in the “y” value. I start by labeling all these “t”, and then randomly select half of the production and call them “d” instead of “t”. I then compute the percentage of each variant by word (x)

I also combine the four speakers, and do the same for all of them.

D1 <- randomDistribution(20,10,"t")
D2 <- randomDistribution(20,10,"t")
D3 <- randomDistribution(20,10,"t")
D4 <- randomDistribution(20,10,"t")
D5 <- bind_rows(D1,D2,D3,D4)

D1 = randomOrder(D1)
D2 = randomOrder(D2)
D3 = randomOrder(D3)
D4 = randomOrder(D4)
D5 = randomOrder(D5)

Now I plot a distribution graph for all of them. Note that some words are mostly one type of production (“d”), and others are mostly the other production (“t”). This inevitably occurs by random chance. And it differs by participant.

However, even when you pool all the participant data, you see the same result. This distribution is a part of the nature of how randomization works, and needs no other explanation other than this aspect of randomization is a part of the nature of reality.

D1 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 1")

D2 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 2")

D3 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 3")

D4 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("group 4")

D5 %>% ggplot(aes(x=x, fill=y, y=perc)) + geom_bar(stat="identity") + scale_y_continuous(labels=scales::percent) + ggtitle("all groups")

And you can see that the combined data from all four speakers still shows some words that have almost no “d”, and some words have very few “t” values.

Because a purely random distribution will generate individual words with few or even none of a particular variant, even across speakers, you cannot use differences in the distributions by itself to identify any meaningful patterns.

And that is the “coin toss” tutorial for Linguists – also known as the central limit theorem. The main takeaway message is that you need minimal pairs, or at least minimal environments, to establish evidence that a distribution of two phonetic outputs could be phonemic.

Even then, the existence of a phonemic distinction doesn’t mean it predicts very many examples in speech.

Tutorial 3: K means clustering

One of the easiest and most appropriate methods for testing whether a data set contains multiple categories is k-means clustering. This technique can be supervised, in that you tell the computer how many clusters you think are in the original file. However, it is much wiser to test many k-means clusters using an unsupervised process. Here we show three of these. The The first one we will examine is the “elbow” method, runs several clusters, and produces a graph that visually lets you see what the ideal number of clusters is. You identify it by seeing the “bend” in the elbow. Here’s some code for generating a very distinct binary cluster and running the elbow test.

library(tidyverse)
library(factoextra)
library(cluster)
points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 6
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)

G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1),
Y = rnorm(points, mean = 0, sd = sd1),
Name="Group 1", col = GOLD0A,Shape=1)

G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2),
Y = rnorm(points, mean = 0, sd = sd2),
Name="Group 2", col = BLUE0A,Shape=2)

G <- bind_rows(G1,G2) p2 = length(G$X[G$Name=="Group 1" & G$X> min(G$X[G$Name=="Group 2"])])/points

p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2
fviz_nbclust(G[, 1:2], kmeans, method = "wss")

The second technique will tell you the answer, identifying a peak “silhouette width” with a handy dashed line.

fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")

The third shows a “gap” statistic, with the highest peak identified.

gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 25, K.max = 10, B = 50) fviz_gap_stat(gap_stat)

As you can see, all three cluster identification techniques show that the ideal number of clusters is 2. Which makes sense because that is the number we initially generated.

Here I show you what the difference between the real cluster and the estimate cluster looks like, beginning with the real cluster.

G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ",percent(as.numeric(p[1])), " : Overlap range = ", percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)

Followed by the k-means cluster.

set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm="Lloyd") binaryCluster$cluster <- as.factor(binaryCluster$cluster) binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A G$col2 = binaryCluster$color G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") + ylab("") + coord_equal(ratio=1)

Notice that the unsupervised clusering will mis-categorize some items in the cluster, but gets most of them correct.

Here we generate a binary separated by 4 standard deviations.

points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 4
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1), Y = rnorm(points, mean = 0, sd = sd1), Name="Group 1", col = GOLD0A,Shape=1)

G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2), Y = rnorm(points, mean = 0, sd = sd2), Name="Group 2", col = BLUE0A,Shape=2)

G <- bind_rows(G1,G2) p2 = length(G$X[G$Name=="Group 1" & G$X> min(G$X[G$Name=="Group 2"])])/points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2

Notice that even with 4 standard deviations separating the groups, the elbow technique still clearly diagnoses 2 clusters – a binary system.

fviz_nbclust(G[, 1:2], kmeans, method = "wss")

fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")

gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 50) fviz_gap_stat(gap_stat)

And here is the underlying cluster with overlapped entries.

G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(colour = Name), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab(paste("Overlap percent = ",percent(as.numeric(p[1])),
" : Overlap range = ",percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)

Notice that the cluster analysis misidentifies many entries – about 5% of them.

set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm="Lloyd") binaryCluster$cluster <- as.factor(binaryCluster$cluster) binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A G$col2 = binaryCluster$color G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") + ylab("") + coord_equal(ratio=1)

Lastly, here is a binary that is only separated by 2 standard deviations. A barely noticeable binary.

points = 10000
sd1 = 1
sd2 = 1
mu1 = 0
mu2 = 2
p=integrate(min.f1f2, -Inf, Inf, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2)
G1 <- tibble(X = rnorm(points, mean = mu1, sd = sd1), Y = rnorm(points, mean = 0, sd = sd1), Name="Group 1", col = GOLD0A,Shape=1)

G2 <- tibble(X = rnorm(points, mean = mu2, sd = sd2), Y = rnorm(points, mean = 0, sd = sd2), Name="Group 2", col = BLUE0A,Shape=2)

G <- bind_rows(G1,G2) p2 = length(G$X[G$Name=="Group 1" & G$X> min(G$X[G$Name=="Group 2"])])/points
p2 = p2 + length(G$X[G$Name=="Group 2" & G$X< max(G$X[G$Name=="Group 1"])])/points
p2 = p2/2

Notice that even with 2 standard deviations separating the groups, the elbow technique DOES diagnose that this is a binary system, but barely. The silhouette and gap techniques also point to a binary.

library(factoextra)
fviz_nbclust(G[, 1:2], kmeans, method = "wss")

fviz_nbclust(G[, 1:2], kmeans, method = "silhouette")

gap_stat <- clusGap(G[, 1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 50) fviz_gap_stat(gap_stat)

Here you can see the underlying binary division.

G %>% ggplot(aes(x = X, y = Y)) + geom_point(aes(colour = Name), show.legend = TRUE) + scale_color_manual(values=c(GOLD0A,BLUE0A)) + xlab(paste("Overlap percent = ",percent(as.numeric(p[1]))," : Overlap range = ",percent(p2),sep="")) + ylab("") + coord_equal(ratio=1)

And as you would expect, oh boy does the k-means clustering make mistakes.

set.seed(20)
binaryCluster <- kmeans(G[, 1:2], 2, nstart = 10, algorithm="Lloyd") binaryCluster$cluster <- as.factor(binaryCluster$cluster) binaryCluster$color[binaryCluster$cluster == 1] = GOLD0A binaryCluster$color[binaryCluster$cluster == 2] = BLUE0A G$col2 = binaryCluster$color G %>% ggplot(aes(x = X, y = Y)) +
geom_point(aes(color = col2), show.legend = TRUE) +
scale_color_manual(values=c(GOLD0A,BLUE0A)) +
xlab("Unsupervised binary separation") +
ylab("") + coord_equal(ratio=1)

However, K-means clustering can still uncover the binary.

References:

Weitzman, M. S. (1970). Measures of overlap of income distributions of white and Negro families in the United States. Washington: U.S. Bureau of the Census.

https://afit-r.github.io/kmeans_clustering

https://rpubs.com/williamsurles/310847

University of Canterbury Open Day

The University of Canterbury held this year’s Open Day on Thursday, July 11, 2019. It was a chance for high-school students to look at possible majors at our University. This year I had the chance to showcase UC Linguistics, and I brought along our ultrasound machine to show people images of my tongue in motion, and let them see their tongues. A few were intimidated by the idea of seeing their own tongues on a machine, but lots of young students participated, and hopefully got a bit of a taste for Linguistics and especially phonetic research.

However, next year I will try to build more materials to address all the ways linguistics can be useful to students. I like the fact that Linguistics is both arts and science at the same time. You learn to write, you learn numeracy, you learn statistics, and you learn how to do experiments. And on top of that, our students learn how to speak in public and speak well. These are exceedingly useful skills, and have led students to continue in research, get positions with Stats NZ, build up computer research in local companies, and so much more.