Sensory Neuroscience: Hearing and speech/Speech

Speech perception is an active research area because it is critically important to human communication. It is also the subject of some controversial claims. For example, it may be the case that we have two auditory systems: one for speech and one for non-speech.

The nature of speech sounds

The stream of speech sounds is relatively unbroken during normal, fluent speech. There is no space between the words - detecting word boundaries is a major problem for psycholinguistics.

The International Phonetic Alphabet (IPA) is an alphabet that identifies every speech sound uniquely with a single character. IPA symbols are used extensively in this section; you would do well to familiarize yourself with the symbols and their use.

Definitions

phoneme: The smallest unit of linguistically distinctive sound. Phonemes are not the same thing as letters, morphemes, or syllables.
letter: An element of written language. It is critically important to note that there is no one-to-one relationship between the letters of the alphabet and the phonemes of the language.
morpheme: The smallest linguistic unit that has semantic meaning. Work is a morpheme; -ed is another morpheme. Worked is two morphemes.
utterance: A complete unit of spoken language.
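
The lack of a one-to-one relationship between letters and phonemes can be illustrated with a small sketch. The word list and the broad transcriptions below are assumptions chosen for illustration (roughly General American pronunciation) and are not taken from the definitions above; `letter_phoneme_counts` is a hypothetical helper name.

```python
# Illustrative words with broad phonemic transcriptions.
# These transcriptions are assumptions for this sketch; dialects differ.
WORDS = {
    "though": ["ð", "oʊ"],        # six letters, two phonemes
    "ship": ["ʃ", "ɪ", "p"],      # "sh" is a single phoneme
    "box": ["b", "ɑ", "k", "s"],  # "x" stands for two phonemes
    "cat": ["k", "æ", "t"],       # here the counts happen to match
}


def letter_phoneme_counts(word, phonemes):
    """Return (number of letters, number of phonemes) for one word."""
    return len(word), len(phonemes)


for word, phonemes in WORDS.items():
    n_letters, n_phonemes = letter_phoneme_counts(word, phonemes)
    print(f"{word}: {n_letters} letters, {n_phonemes} phonemes /{''.join(phonemes)}/")
```

Words can have more letters than phonemes ("though"), fewer ("box"), or the same number ("cat"), which is why phonemes must be counted from the transcription and not from the spelling.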

Structure of language

Language can be analyzed at roughly four levels. Beginning from the highest level:

Syntactic - the utterance has a grammatical structure
Semantic - a string of phonemes has a particular meaning
Phonetic - a sound is a string of phonemes
Acoustic - a sound has certain properties

While the bottom-up processes are obvious, there is significant top-down influence as well. For example, the phonemic restoration effect is very robust, and similar processes can repair input at the level of individual phonemes, words, and even sentences.

If you replace a single phoneme in a sentence with white noise, or a cough, or silence, listeners can still identify the word.

In the sentence "I scream, you scream, we all scream for ice cream." there is no difference between the /aɪ/ in "I scream" and in "ice cream" - the sound is exactly the same in both cases, yet the word boundary is perceived to be in a different location based solely on context effects.

If you insert a semantically unrelated sentence into a paragraph-length discourse, listeners will fail to comprehend it (but will resume normal perception after the sentence).

Measuring speech

Speech spectrogram

A speech spectrogram of "I owe you."

A speech spectrogram (careful with terminology here!) plots frequency against time, with intensity (amplitude, loudness) coded by darkness. This shows qualitatively how speech sounds unfold over time. Note that consonants are noisy (not loud, but broadband in frequency). Vowels, in contrast, have prototypical horizontal bands of energy. These bands are called formants. Near consonants, one or more of the formants curves; this is a formant transition. Most vowels are steady-state; they do not change over time (monophthongs). Some vowel sounds are diphthongs: they transition from one spectrum to another over the course of the vowel.

However, the speech spectrogram is of limited use for measurement because it is not quantitative. For that, we can use a speech spectrum (careful with terminology!).

Speech spectrum

A speech spectrum plots energy in dB SPL against frequency for a short period of time, on the order of 40 ms. This is essentially a cross-sectional slice of a speech spectrogram. But now you can't see how the sound unfolds over time!
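
As a rough illustration of the idea, the sketch below computes a spectrum for a single 40 ms frame using a plain discrete Fourier transform. Everything specific here is an assumption made for the example: the 8 kHz sample rate, the Hann window, the two sinusoids standing in for a vowel (at arbitrary frequencies), and the helper names `synth_vowel` and `spectrum_db`. The levels are in dB relative to a full-scale sinusoid, not calibrated dB SPL, because calibration requires a reference measurement of the recording chain.

```python
import cmath
import math

SAMPLE_RATE = 8000     # Hz; assumed for this sketch
FRAME_SECONDS = 0.040  # one short analysis frame of about 40 ms


def synth_vowel(n_samples, sample_rate, components):
    """Sum of sinusoids as a crude vowel stand-in.

    components is a list of (frequency_hz, amplitude) pairs.
    """
    return [
        sum(a * math.sin(2 * math.pi * f * n / sample_rate) for f, a in components)
        for n in range(n_samples)
    ]


def spectrum_db(frame, sample_rate):
    """Return [(frequency_hz, level_db), ...] for one frame.

    Uses a Hann-windowed DFT. Levels are dB relative to a sinusoid of
    amplitude 1.0 (uncalibrated, so not dB SPL).
    """
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    windowed = [s * w for s, w in zip(frame, window)]
    window_sum = sum(window)
    result = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to half the sample rate
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)
        )
        amplitude = 2 * abs(coeff) / window_sum
        level = 20 * math.log10(max(amplitude, 1e-12))  # floor avoids log(0)
        result.append((k * sample_rate / n, level))
    return result


frame_len = round(FRAME_SECONDS * SAMPLE_RATE)  # 320 samples
frame = synth_vowel(frame_len, SAMPLE_RATE, [(700, 1.0), (1200, 0.5)])
spectrum = spectrum_db(frame, SAMPLE_RATE)
peak_freq, peak_level = max(spectrum, key=lambda pair: pair[1])
print(f"strongest component near {peak_freq:.0f} Hz")
```

With a 40 ms frame the frequency bins are 25 Hz apart (the reciprocal of the frame duration), which is the usual trade-off: a longer frame gives finer frequency resolution but blurs changes over time.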

Waterfall plot

A waterfall plot is essentially a series of speech spectra over time, laid out in 3D. This overcomes the limitation of speech spectrograms (which show speech unfolding over time, but are not quantitative) and also the limitation of speech spectra (which are highly quantitative, but show only a short snippet of the speech stream).
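
The data behind a waterfall plot can be sketched as a list of short-time spectra, one per time step. The example below is a minimal illustration under stated assumptions: an 8 kHz sample rate, 40 ms frames taken every 20 ms, a Hann window, and a synthetic test signal that jumps from 500 Hz to 1500 Hz halfway through (a loose stand-in for a sound whose spectrum changes over time). The names `frame_levels_db` and `waterfall` are hypothetical, and levels are uncalibrated dB, not dB SPL. Drawing the slices in 3D is left to a plotting library.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; assumed
FRAME_LEN = 320     # 40 ms at 8 kHz
HOP = 160           # 20 ms between successive slices; assumed


def frame_levels_db(frame):
    """Levels in dB (relative to amplitude 1.0) for each DFT bin of one frame."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    scale = 2 / sum(window)
    levels = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * w * cmath.exp(-2j * math.pi * k * i / n)
            for i, (x, w) in enumerate(zip(frame, window))
        )
        levels.append(20 * math.log10(max(scale * abs(coeff), 1e-12)))
    return levels


def waterfall(signal, frame_len=FRAME_LEN, hop=HOP, sample_rate=SAMPLE_RATE):
    """Return a list of (start_time_s, levels_db) slices, one per frame."""
    slices = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        slices.append((start / sample_rate, frame_levels_db(frame)))
    return slices


# Synthetic 0.2 s signal: 500 Hz for the first half, 1500 Hz for the second.
signal = [
    math.sin(2 * math.pi * (500 if n < 800 else 1500) * n / SAMPLE_RATE)
    for n in range(1600)
]

slices = waterfall(signal)
bin_hz = SAMPLE_RATE / FRAME_LEN  # 25 Hz per bin
for start_time, levels in slices:
    peak_bin = max(range(len(levels)), key=levels.__getitem__)
    print(f"t = {start_time * 1000:5.0f} ms  peak near {peak_bin * bin_hz:6.0f} Hz")
```

Each slice is a full quantitative spectrum, and reading down the list of slices shows how the energy moves from one frequency to another over time, which is the combination of properties the waterfall plot is meant to provide.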