Category Archives: Theoretical

These posts are dedicated to individual and joint theoretical research.

Uniformity in speech: The economy of reuse and adaptation across contexts

Together with Connor Mayer and Bryan Gick, I recently published “Uniformity in speech: The economy of reuse and adaptation across contexts” in Glossa. The article compares how Kiwis and North Americans produce flap sequences, such as “editor” in North America or “added a” in New Zealand. Kiwis produce these similarly during slow and fast speech, whereas North Americans often have two different methods, one for slow and one for fast speech. We show that this difference likely stems from the extreme variability built into the “r”s of rhotic English dialects spreading to flaps through the reuse and adaptation of motor “chunks”.

To illustrate our claim: the image below shows tongue tip frontness for the second vowel in three-vowel sequences in words like “editor”. For faster speech (6-7 syllables/second), there is a jump where the tongue tip is not nearly as fronted, but only for North American English (NAE) non-rhotic vowels, not for New Zealand English (NZE) vowels or NAE rhotic vowels. Here the high variability intrinsic to NAE rhotic vowels (and commonly seen in other contexts) is visible in adjacent NAE non-rhotic vowels. NZE has no rhotic vowels at all, so its non-rhotic vowels have no source of such motor control variability, even though that variability would provide a mechanical advantage.

The abstract for this article, which explains the result in more technical but also more accurate terms, is below:

“North American English (NAE) flaps/taps and rhotic vowels have been shown to exhibit extreme variability that can be categorized into subphonemic variants. This variability provides known mechanical benefits in NAE speech production. However, we also know languages reuse gestures for maximum efficiency during speech production; this uniformity of behavior reduces gestural variability. Here we test two conflicting hypotheses: Under a uniformity hypothesis in which extreme variability is inherent to rhotic vowels only, that variability can still transfer to flaps/taps and non-rhotic vowels due to adaptation across similar speech contexts. But because of the underlying reliance on extreme variability from rhotic vowels, this uniformity hypothesis does not predict extreme variability in flaps/taps within non-rhotic English dialects. Under a mechanical hypothesis in which extreme variability is inherent to all segments where it would provide mechanical advantage, including flaps/taps, such variability would appear across all English dialects with flaps/taps, affecting adjacent non-rhotic vowels through coarticulation whenever doing so would provide mechanical advantage. We test these two hypotheses by comparing speech-rate-varying NAE sequences with and without rhotic vowels to sequences from New Zealand English (NZE), which has flaps/taps, but no rhotic vowels at all. We find that NZE speakers all use similar tongue-tip motion patterns for flaps/taps across both slow and fast speech, unlike NAE speakers who sometimes use two different stable patterns, one for slow and another fast speech. Results show extreme variability is not inherent to flaps/taps across English dialects, supporting the uniformity hypothesis.”

Hearing, seeing, and feeling speech: the neurophysiological correlates of trimodal speech perception

Doreen Hansmann, myself, and Catherine Theys recently published a partially null-result article on the neurophysiological correlates of trimodal speech in Frontiers in Human Neuroscience: Hearing: Speech and Language. The short form is that while we saw behavioural differences showing integration of audio, visual, and tactile speech in closed-choice experiments, we could not extend that result to show an influence of tactile speech on brain activity – the effect is just too small:

Figure 3. Accuracy data for syllable /pa/ for auditory-only (A), audio-visual (AV), audio-tactile (AT), and audio-visual-tactile (AVT) conditions at each SNR level (–8, –14, –20 dB). Error bars are based on Binomial confidence intervals (95%).
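A quick note on the error bars mentioned in the caption above: a binomial confidence interval can be computed directly from the number of correct responses and the number of trials. Below is a minimal Python sketch using statsmodels; the counts are invented for illustration, and the specific interval method used in the paper is not restated here.

```python
# Hypothetical example of a 95% binomial confidence interval for
# syllable-identification accuracy. The counts are invented, not data
# from the study.
from statsmodels.stats.proportion import proportion_confint

correct = 42   # correctly identified /pa/ tokens (invented)
trials = 60    # total trials in that condition (invented)

# method="beta" gives the Clopper-Pearson ("exact" binomial) interval;
# "wilson" is another common choice.
low, high = proportion_confint(correct, trials, alpha=0.05, method="beta")
print(f"accuracy = {correct / trials:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```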

The abstract for this article is below:

Introduction: To perceive speech, our brains process information from different sensory modalities. Previous electroencephalography (EEG) research has established that audio-visual information provides an advantage compared to auditory-only information during early auditory processing. In addition, behavioral research showed that auditory speech perception is not only enhanced by visual information but also by tactile information, transmitted by puffs of air arriving at the skin and aligned with speech. The current EEG study aimed to investigate whether the behavioral benefits of bimodal audio-aerotactile and trimodal audio-visual-aerotactile speech presentation are reflected in cortical auditory event-related neurophysiological responses.

Methods: To examine the influence of multimodal information on speech perception, 20 listeners conducted a two-alternative forced-choice syllable identification task at three different signal-to-noise levels.

Results: Behavioral results showed increased syllable identification accuracy when auditory information was complemented with visual information, but did not show the same effect for the addition of tactile information. Similarly, EEG results showed an amplitude suppression for the auditory N1 and P2 event-related potentials for the audio-visual and audio-visual-aerotactile modalities compared to auditory and audio-aerotactile presentations of the syllable /pa/. No statistically significant difference was present between audio-aerotactile and auditory-only modalities.

Discussion: Current findings are consistent with past EEG research showing a visually induced amplitude suppression during early auditory processing. In addition, the significant neurophysiological effect of audio-visual but not audio-aerotactile presentation is in line with the large benefit of visual information but comparatively much smaller effect of aerotactile information on auditory speech perception previously identified in behavioral research.

Gait Change in Tongue Movement

Bryan Gick and I recently published an article on “Gait Change in Tongue Movement” in Scientific Reports (Nature Publishing Group). Below is the abstract, with images alongside. If you want an easy-to-follow walkthrough of the paper, I have also published a video about it on my YouTube channel, Maps of Speech.

During locomotion, humans switch gaits from walking to running, and horses from walking to trotting to cantering to galloping, as they increase their movement rate. It is unknown whether gait change leading to a wider movement rate range is limited to locomotive-type behaviours, or instead is a general property of any rate-varying motor system. The tongue during speech provides a motor system that can address this gap. In controlled speech experiments, using phrases containing complex tongue-movement sequences, we demonstrate distinct gaits in tongue movement at different speech rates. As speakers widen their tongue-front displacement range, they gain access to wider speech-rate ranges.

At the widest displacement ranges, speakers also produce categorically different patterns for their slowest and fastest speech. Speakers with the narrowest tongue-front displacement ranges show one stable speech-gait pattern, and speakers with the widest ranges show two. Critical fluctuation analysis of tongue motion over the time-course of speech revealed these speakers used greater effort at the beginning of phrases – such end-state-comfort effects indicate speech planning.

Based on these findings, we expect that categorical motion solutions may emerge in any motor system, providing that system with access to wider movement-rate ranges.

Evidence for active control of tongue lateralization in Australian English /l/

Jia Ying, Jason A. Shaw, Christopher Carignan, Michael Proctor, myself, and Catherine T. Best just published “Evidence for active control of tongue lateralization in Australian English /l/”. Most research on /l/ articulation has looked at motion timing along the midline, or midsagittal plane. This study compares that information to motion on the sides of the tongue. It focuses on Australian English (AusE), using three-dimensional electromagnetic articulography (3D EMA).

Fig. 11. Temporal dynamics of tongue curvature in the coronal plane over the entire V-/l/ interval. The brackets indicate onset (red) and coda (blue) /l/ intervals. Each bracket extends from the /l/ onset to its peak. For onset /l/s, the peak occurs earlier (at about 200 ms) than coda /l/s (at about 450 ms). A time of zero indicates the vowel onset. The 800-interval window captures the entire V-/l/ articulation in every token.

The articulatory analyses show three things. 1) Consistent with past work, the timing lag between midsagittal tongue tip and tongue body gestures differs for syllable onsets and codas, and for different vowels.

2) The lateral channel is formed by tilting the tongue to the left or right side of the oral cavity, as opposed to curving the tongue within the coronal plane.

3) The timing of lateral channel formation relative to the tongue body gesture is consistent across syllable positions and vowel contexts, even though the temporal lag between tongue tip and tongue body gestures varies.

This last result suggests that lateral channel formation is actively controlled rather than arising as a passive consequence of tongue stretching. These results are interpreted as evidence that the formation of the lateral channel is a primary articulatory goal of /l/ production in AusE.

Locating de-lateralization in the pathway of sound changes affecting coda /l/

Patrycja Strycharczuk, Jason Shaw, and I just published “Locating de-lateralization in the pathway of sound changes affecting coda /l/”, in which we analyze New Zealand English /l/ using ultrasound and articulometry. You can find the article here. Put in the simplest English terms, the article shows the process by which /l/ sounds can change over time from a light /l/ (like the first /l/ in ‘lull’) to a darker /l/ (like the second /l/ in ‘lull’). This darkening is the result of the upper back of the tongue, or dorsum, moving closer to the back of the throat. This motion in turn reduces lateralization, or the lowering of the sides of the tongue away from the upper teeth. Over time, this is followed by the tongue tip no longer connecting to the front of the hard palate – the /l/ becomes a back vowel, or vocalizes.

Two subcategories identified in the distribution of tongue tip (TT) raising for the Vl#C context. Red = vocalized.

If you want a more technical description, here is the abstract:

‘Vocalization’ is a label commonly used to describe an ongoing change in progress affecting coda /l/ in multiple accents of English. The label is directly linked to the loss of consonantal constriction observed in this process, but it also implicitly signals a specific type of change affecting manner of articulation from consonant to vowel, which involves loss of tongue lateralization, the defining property of lateral sounds. In this study, we consider two potential diachronic pathways of change: an abrupt loss of lateralization which follows from the loss of apical constriction, versus slower gradual loss of lateralization that tracks the articulatory changes to the dorsal component of /l/. We present articulatory data from seven speakers of New Zealand English, acquired using a combination of midsagittal and lateral EMA, as well as midsagittal ultrasound. Different stages of sound change are reconstructed through synchronic variation between light, dark, and vocalized /l/, induced by systematic manipulation of the segmental and morphosyntactic environment, and complemented by comparison of different individual articulatory strategies. Our data show a systematic reduction in lateralization that is conditioned by increasing degrees of /l/-darkening and /l/-vocalization. This observation supports the idea of a gradual diachronic shift and the following pathway of change: /l/-darkening, driven by the dorsal gesture, precipitates some loss of lateralization, which is followed by loss of the apical gesture. This pathway indicates that loss of lateralization is an integral component in the changes in manner of articulation of /l/ from consonantal to vocalic.

Building a cleaned dataset of aligned ultrasound, articulometry, and audio

In 2013, I recorded 11 North American English speakers, each reading eight phrases containing two flaps in two syllables (e.g., “We have editor books”) at five speech rates, from about 3 syllables/second to 7 syllables/second. Each recording included audio, ultrasound imaging of the tongue, and articulometry.

The dataset has taken a truly inordinate amount of time to label, transcribe (thank you Romain Fiasson), rotate, align ultrasound to audio, fit in shared time (what is known as a Procrustes fit), extract acoustic correlates from, and clean of tokens with recording errors or unfixable alignment errors.
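For readers unfamiliar with Procrustes fitting, the sketch below shows the general idea on two invented 2-D point sets, using scipy. It is only an illustration of the technique named above, not the actual alignment procedure used on this dataset.

```python
# Generic Procrustes alignment of two invented 2-D traces (e.g., sensor
# positions over time). Illustrates the idea only; not the pipeline used
# for the actual dataset.
import numpy as np
from scipy.spatial import procrustes

t = np.linspace(0, 1, 100)
reference = np.column_stack([t, np.sin(2 * np.pi * t)])   # invented reference trace
rotation = np.array([[0.98, -0.17],
                     [0.17,  0.98]])                       # small rotation
measured = 1.3 * reference @ rotation + 0.05               # scaled, rotated, shifted copy

# procrustes() standardizes both sets and finds the best translation,
# scaling, and rotation; `disparity` is the remaining sum of squared error.
ref_std, aligned, disparity = procrustes(reference, measured)
print(f"residual disparity after alignment: {disparity:.4f}")
```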

It is, however, now 2019 and I have a cleaned dataset. I have uploaded the dataset, with data from each stage of processing included, to an Open Science Framework website. Over the next few weeks I will upload documentation on how I processed the data, as well as videos of the cleaned data showing ultrasound and EMA motion.

By September 1st, I plan to submit a research article to a special issue of a journal (whose name I will disclose should they accept the article for publication) discussing the techniques used to build the dataset, as well as a theoretically motivated subset of the articulatory-to-acoustic correlates within it.

This research was funded by a Marsden Grant from New Zealand, “Saving energy vs. making yourself understood during speech production”. Thanks to Mark Tiede for writing the quaternion rotation tools needed to orient EMA traces, and to Christian Kroos for teaching our group at Western Sydney University how to implement them. Thanks to Michael Proctor for building filtering and sample repair tools for EMA traces. Thanks also to Wei-rong Chen for writing the palate estimation tool needed to replace erroneous palate traces. Special thanks to Scott Lloyd for his part in developing and building the ultrasound transducer holder prototype used in this research. Dedicated to the memory of Romain Fiasson, who completed most of the labelling and transcription for this project.

Visual-tactile Speech Perception and the Autism Quotient

Katie Bicevskis, Bryan Gick, and I recently published “Visual-tactile Speech Perception and the Autism Quotient” in Frontiers in Communication: Language Sciences. In this article, we demonstrated that the more people self-describe as having autistic-spectrum traits, the more temporal separation they tolerate between air flow hitting the skin and the lip opening in a video of someone saying an ambiguous “ba” or “pa”, when identifying a syllable they saw and felt, but did not hear.

In an earlier publication (Bicevskis et al., 2016), we showed that visual-tactile speech integration depends on this alignment of lip opening and airflow, and that this is evidence of modality-neutral speech primitives: we use whatever information we have during speech perception, regardless of whether we see, feel, or hear it.

Summary results from Bicevskis et al. (2016), as seen in Derrick et al. (2019).

This result is best illustrated with the image above. The image shows a kind of topographical map, where white represents the “mountaintop” of people identifying the ambiguous visual-tactile syllable as “pa”, and green represents the “valley” of people identifying it as “ba”. On the x-axis is the alignment of the onset of air-flow release and lip opening. On the y-axis is the participants’ Autism-spectrum Quotient (AQ); lower numbers represent people who describe themselves as having the fewest autistic-like traits, i.e. the most neurotypical. At the bottom of the scale, perceivers identified the ambiguous syllables as “pa” with as much as 70-75% likelihood when the air flow arrived 100-150 milliseconds after lip opening – about when it would arrive if a speaker stood 30-45 cm away from the perceiver. Deviations led to steep dropoffs: perceivers identified the syllable as “pa” only 20-30% of the time if the air flow arrived 300 milliseconds before the lip opening. In contrast, at the top of the AQ scale, perceivers reported perceiving “pa” only about 5% more often when visual-tactile alignment was closer to that experienced in typical speech.

Interaction between visual-tactile alignment and Autism-spectrum Quotient.
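For readers who want a sense of how an interaction like this can be modelled, the sketch below fits a logistic regression of “pa” responses on air-puff/lip-opening asynchrony, AQ, and their interaction. The data frame and variable names are hypothetical and the data are simulated; this is not the analysis reported in the paper, just an illustration of the structure of the interaction.

```python
# Hypothetical sketch: probability of a "pa" response as a function of
# puff/lip-opening asynchrony, Autism-spectrum Quotient (AQ), and their
# interaction. Data are simulated; not the analysis from Derrick et al. (2019).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
asynchrony = rng.uniform(-300, 300, n)   # puff onset minus lip opening, in ms
AQ = rng.uniform(10, 40, n)              # self-reported Autism-spectrum Quotient

# Simulated tendency: "pa" responses peak near small positive asynchronies,
# and that peak flattens out as AQ increases.
logit_p = 1.0 - (np.abs(asynchrony - 75) / 150) * (45 - AQ) / 35
pa = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

df = pd.DataFrame({"pa": pa, "asynchrony": asynchrony, "AQ": AQ})
model = smf.logit("pa ~ asynchrony * AQ", data=df).fit()
print(model.summary())
```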

These results are very similar to what happens with audio-visual speech for people who are on the autism spectrum. Autists listen to speech with their ears more than they look with their eyes, showing weak multisensory coherence during perceptual tasks (Happé and Frith, 2006). Our results suggest such weak coherence extends into the neurotypical population, and can be measured in tasks where the sensory modalities are well balanced (which is easier to do in speech when audio is removed).

References:

Bicevskis, K., Derrick, D., and Gick, B. (2016). Visual-tactile integration in speech perception: Evidence for modality neutral speech primitives. Journal of the Acoustical Society of America, 140(5):3531–3539

Derrick, D., Bicevskis, K., and Gick, B. (2019). Visual-tactile speech perception and the autism quotient. Frontiers in Communication – Language Sciences, 3(61):1–11

Derrick, D., Anderson, P., Gick, B., and Green, S. (2009). Characteristics of air puffs produced in English ‘pa’: Experiments and simulations. Journal of the Acoustical Society of America, 125(4):2272–2281

Happé, F., and Frith, U. (2006). The weak coherence account: Detail-focused cognitive style in autism spectrum disorders. Journal of Autism and Developmental Disorders, 36(1):5–25

Preliminary Report: Visual-tactile Speech Perception and the Autism Quotient

Katie Bicevskis, Bryan Gick, and I just had “Visual-tactile Speech Perception and the Autism Quotient” – our reexamination and expansion of our evidence for ecologically valid visual-tactile speech perception – accepted to Frontiers in Communication: Language Sciences. Right now only the abstract and introductory parts are online, but the whole article will be up soon. The major contribution of this article is that speech perceivers integrate air-flow information during visual speech perception with greater reliance upon event-related accuracy the more they self-describe as neurotypical. This behaviour supports the Happé and Frith (2006) weak coherence account of Autism Spectrum Disorder (ASD). Put very simply, neurotypical people perceive whole events, but people with ASD perceive uni-sensory parts of events, often with greater detail than their neurotypical counterparts. This account partially explains how autists can have deficiencies in imagination and social skills, but also be extremely capable in other areas of inquiry. Previous models of ASD offered an explanation of disability; Happé and Frith offer an explanation of different ability.

I will be expanding on this discussion, with a plain English explanation of the results, once the article is fully published. For now, the article abstract is re-posted here:

“Multisensory information is integrated asymmetrically in speech perception: An audio signal can follow video by 240 milliseconds, but can precede video by only 60 ms, without disrupting the sense of synchronicity (Munhall et al., 1996). Similarly, air flow can follow either audio (Gick et al., 2010) or video (Bicevskis et al., 2016) by a much larger margin than it can precede either while remaining perceptually synchronous. These asymmetric windows of integration have been attributed to the physical properties of the signals; light travels faster than sound (Munhall et al., 1996), and sound travels faster than air flow (Gick et al., 2010). Perceptual windows of integration narrow during development (Hillock-Dunn and Wallace, 2012), but remain wider among people with autism (Wallace and Stevenson, 2014). Here we show that, even among neurotypical adult perceivers, visual-tactile windows of integration are wider and flatter the higher the participant’s Autism Quotient (AQ) (Baron-Cohen et al, 2001), a self-report screening test for Autism Spectrum Disorder (ASD). As ‘pa’ is produced with a tiny burst of aspiration (Derrick et al., 2009), we applied light and inaudible air puffs to participants’ necks while they watched silent videos of a person saying ‘ba’ or ‘pa’, with puffs presented both synchronously and at varying degrees of asynchrony relative to the recorded plosive release burst, which itself is time-aligned to visible lip opening. All syllables seen along with cutaneous air puffs were more likely to be perceived as ‘pa’. Syllables were perceived as ‘pa’ most often when the air puff occurred 50-100 ms after lip opening, with decaying probability as asynchrony increased. Integration was less dependent on time-alignment the higher the participant’s AQ. Perceivers integrate event-relevant tactile information in visual speech perception with greater reliance upon event-related accuracy the more they self-describe as neurotypical, supporting the Happé & Frith (2006) weak coherence account of ASD.”

The articulation of /ɹ/ in New Zealand English

Matthias Heyne, Xuan Wang, myself (Donald Derrick), Kieran Dorreen, and Kevin Watson recently published an article documenting the articulation of /ɹ/ in New Zealand English.

This work is in part a follow-up to some of my co-authored research into biomechanical modelling of English /ɹ/ variants, which indicates that vocalic context influences variation through muscle stress, strain, and displacement. It is, by these three measures, “easier” to move from an /i/ to a tip-down /ɹ/, but easier to move from an /a/ to a tip-up /ɹ/.

In this study, speakers who vary at all (some produce only tip-up or only tip-down /ɹ/) are most likely to produce tip-up /ɹ/ in the following contexts, ordered from most to least likely:

back vowel > low central vowel > high front vowel

initial /ɹ/ > intervocalic /ɹ/ > following a coronal (“dr”) > following a velar (“cr”)

The results show that allophonic variation of NZE /ɹ/ is similar to that in American English, indicating that the variation is caused by similar constraints. The results support theories of locally optimized modular speech motor control and a mechanical model of rhotic variation.

The abstract is repeated below, with links to articles contained within:

This paper investigates the articulation of approximant /ɹ/ in New Zealand English (NZE), and tests whether the patterns documented for rhotic varieties of English hold in a non-rhotic dialect. Midsagittal ultrasound data for 62 speakers producing 13 tokens of /ɹ/ in various phonetic environments were categorized according to the taxonomy by Delattre & Freeman (1968), and semi-automatically traced and quantified using the AAA software (Articulate Instruments Ltd. 2012) and a Modified Curvature Index (MCI; Dawson, Tiede & Whalen 2016). Twenty-five NZE speakers produced tip-down /ɹ/ exclusively, 12 tip-up /ɹ/ exclusively, and 25 produced both, partially depending on context. Those speakers who produced both variants used the most tip-down /ɹ/ in front vowel contexts, the most tip-up /ɹ/ in back vowel contexts, and varying rates in low central vowel contexts. The NZE speakers produced tip-up /ɹ/ most often in word-initial position, followed by intervocalic, then coronal, and least often in velar contexts. The results indicate that the allophonic variation patterns of /ɹ/ in NZE are similar to those of American English (Mielke, Baker & Archangeli 2010, 2016). We show that MCI values can be used to facilitate /ɹ/ gesture classification; linear mixed-effects models fit on the MCI values of manually categorized tongue contours show significant differences between all but two of Delattre & Freeman’s (1968) tongue types. Overall, the results support theories of modular speech motor control with articulation strategies evolving from local rather than global optimization processes, and a mechanical model of rhotic variation (see Stavness et al. 2012).
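As a rough illustration of the kind of linear mixed-effects model mentioned in the abstract, here is a minimal Python sketch that fits MCI values with tongue-shape category as a fixed effect and a by-speaker random intercept. The data, category labels, and effect sizes are all invented; the authors’ actual model specification may differ.

```python
# Hypothetical sketch of a linear mixed-effects model on Modified Curvature
# Index (MCI) values: fixed effect of tongue-shape category, random intercept
# per speaker. All data below are simulated; this is not the authors' model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
speakers = [f"s{i:02d}" for i in range(20)]
categories = {"tip_down": 0.4, "tip_up": 0.9, "bunched": 0.6}  # invented labels/means

rows = []
for spk in speakers:
    spk_offset = rng.normal(0, 0.1)           # by-speaker random intercept
    for cat, base in categories.items():
        for _ in range(5):                    # five invented tokens per category
            rows.append({"speaker": spk, "category": cat,
                         "MCI": base + spk_offset + rng.normal(0, 0.15)})
df = pd.DataFrame(rows)

model = smf.mixedlm("MCI ~ category", data=df, groups=df["speaker"]).fit()
print(model.summary())
```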

Ultrasound/EMA guide

This is a guide to the use of ultrasound and EMA in combination. It is a bit out of date, and probably needs a day or two of work to make fully correct, but it describes the techniques I use when working with a team of three researchers. Of course, I wrote this years ago, and now I can run an ultrasound/EMA experiment by myself if I need to.