Phonological optimization for sight and sound: Disentangling visual-articulatory and auditory-acoustic factors in phonetic enhancement and hyperarticulation
General Research Fund Award (GRF), 2025-28
Principal investigator: Jonathan Havenhill
Amount: 941,184 HKD
Abstract
Sound is arguably the primary (and often only) medium by which spoken language is conveyed. This allows communication to proceed when the speaker is obscured, whether over the phone, in the dark, at a distance, or when wearing a face mask. At the same time, vision and other types of non-auditory perception are also important. Spoken language is often accompanied by facial expressions and manual gestures, and the ability to see a speaker’s face and mouth is known to influence how speech sounds are recognized. Sign languages, moreover, are transmitted through vision alone. This demonstrates that neither sight nor sound is strictly required for language, nor is one modality linguistically superior. Rather, human communication is inherently multimodal: Speakers and listeners maximally exploit auditory and visual information to perceive and produce language, enhancing its robustness. Determining how these cues trade off against or reinforce one another is essential for understanding how language is optimized for efficient communication, how speech sounds are organized in the mind, and how phonological systems change over time.
Prior audiovisual speech research has focused mostly on the listener, testing how non-auditory information influences the identification of speech sounds, and has often relied on incongruous perceptual illusions that cannot occur in actual speech. The novel approach taken here is to examine how speakers actively modify their speech to enhance its visibility and how the resulting array of (real) visual speech cues affects listener perception. Three experiments explore the production and perception of English sounds that are optionally produced with visible tongue or lip gestures, including extreme variants of /l/ that have not been systematically studied. In a production experiment, speakers interact with a virtual speaking partner in clear, audio-degraded, and video-degraded conditions, testing whether hyperarticulated speech arises specifically to benefit the listener or instead reflects greater overall speaking effort. Simultaneous acoustic, ultrasound tongue imaging, and 3D motion capture data will be collected to understand how the auditory-acoustic and visual-articulatory characteristics of these sounds are altered by audiovisual enhancement. Participants in two types of perception experiment will then identify sounds produced with varying types of visible and non-visible gestures, in clear and noisy conditions. In one experiment, eye-tracking will reveal whether listeners anticipate visible gestures by attending more closely to mouth movements for potentially confusable sounds. Together, these experiments will inform theories of clear speech and hyperarticulation, sound change and the maintenance of phonological contrast, and adaptive communication.