Figure 6.18.
The cochlea as a piano soundboard.
The cochlear nucleus
The cochlear nucleus is not in the cochlea. It is the first auditory processing
substation along the auditory nerve. In the cochlear nucleus, different cell types
and networks process the auditory signal in at least four distinctive ways
(Harrison and Howe 1974; Kelly 1985).
First, large spherical neurons in the cochlear nucleus not only relay the
hair cells’ signals but do so tonotopically. That is, they preserve the cochlea’s regular, keyboardlike mapping of different frequencies. Each spherical cell responds
to only one narrow frequency range because each is innervated by only a few
hair cells. Like narrowband spectrograms, these neurons transmit periodic,
vocalic information to higher processing centers.
Second, octopus cells in the cochlear nucleus (figure 6.20) respond to signals from wide frequency ranges of hair cells. Because they sample a broad
Figure 6.19.
Central auditory pathways. (Manter 1975. Reprinted by permission of
F. A. Davis Company.)
Figure 6.20.
Octopus cells of the cochlear nucleus.
spectrum, octopus cells are well designed to function like sound meters, measuring the overall intensity of a complex sound.
Third, because octopus cells receive inputs from many hair cells at many
frequencies at once, they are capable of quickly depolarizing in response to
brief stimuli like the many frequencies in a plosive burst. By contrast, it takes
a relatively long time for the few hair cells in a single narrow frequency band
to depolarize a spherical cell.
Fourth, many octopus cells’ dendrites are arrayed to receive inputs either
from high to low or from low to high (figure 6.20). Thus, these cells are differentially sensitive to brief events such as the rising and falling formant transitions which mark place of articulation (figure 6.14; Kelly 1985).
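The timing difference can be made concrete with a toy calculation. The sketch below is not a biophysical model: the threshold, the per-channel drive, and the channel counts are invented for illustration. It shows only why a cell that sums many frequency channels at once (octopus-like) reaches its firing threshold much sooner after a brief broadband burst than a cell fed by a single narrow channel (spherical-like).

```python
# Toy illustration (invented numbers): a cell summing many frequency
# channels (octopus-like) crosses threshold faster on a brief broadband
# burst than a cell fed by one narrow channel (spherical-like).

THRESHOLD = 1.0           # arbitrary depolarization threshold
PER_CHANNEL_DRIVE = 0.05  # excitation contributed by one channel per time step

def time_to_threshold(n_channels, max_steps=100):
    """Return the first time step at which the summed input crosses threshold."""
    v = 0.0
    for t in range(1, max_steps + 1):
        v += n_channels * PER_CHANNEL_DRIVE  # integrate input each step
        if v >= THRESHOLD:
            return t
    return None

print("1 channel (spherical-like): ", time_to_threshold(1), "steps")   # 20 steps
print("20 channels (octopus-like): ", time_to_threshold(20), "steps")  # 1 step
```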
From the cochlear nucleus afferent auditory axons enter the trapezoid body
and cross over to the superior olivary nucleus (superior olive) on the contralateral side of the brain stem. However, unlike vision and touch, not all auditory
pathways cross. An ipsilateral (same-side) pathway also arises (i.e., from left ear
to left cerebral hemisphere and from right ear to right cerebral hemisphere).
Thus, each side of the brain can compare inputs from both ears. Sounds coming from the left or right side reach the left and right ears at slightly different
times, allowing the brain to identify the direction of sounds. Bernard Kripkee
has pointed out to me how remarkable the ability to localize sound is. We
learned in chapter 3 that brain cells fire at a maximum rate of about 400 spikes
per second. Nevertheless, the human ear can readily distinguish sound sources
separated by only a few degrees of arc. At the speed of sound, this translates
into time differences on the order of 0.00004 s. This means the auditory system
as a whole can respond about 100 times faster than any single neuron in it.
Instead of the Von Neumannesque, all-or-nothing, digital response of a single
neuron, many neurons working together in parallel produce a nearly analogue
response with a hundredfold improvement in sensory resolution.
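A back-of-the-envelope calculation shows where such tiny intervals come from. The sketch below assumes the usual far-field approximation for the interaural time difference, ITD ≈ d · sin(θ) / c, with an ear separation of roughly 0.2 m and a speed of sound of about 343 m/s; these values are textbook physics, not figures taken from this book.

```python
# Back-of-the-envelope interaural time difference (ITD), assuming the
# far-field approximation ITD ≈ d * sin(theta) / c with an ear separation
# of about 0.2 m and a speed of sound in air of about 343 m/s.
import math

EAR_SEPARATION_M = 0.2
SPEED_OF_SOUND_M_S = 343.0

def itd_seconds(azimuth_degrees):
    """Approximate arrival-time difference between the two ears."""
    return EAR_SEPARATION_M * math.sin(math.radians(azimuth_degrees)) / SPEED_OF_SOUND_M_S

for angle in (2, 5, 90):
    print(f"{angle:>2} degrees: {itd_seconds(angle) * 1000:.3f} ms")
# 2 degrees -> ~0.02 ms, i.e. a few hundredths of a millisecond
```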
Ultimately, pathways from both the cochlear nucleus and the superior
olive combine to form the lateral lemniscus, an axon bundle which ascends to
the inferior colliculus. The inferior colliculus (1) relays signals to higher brain
centers (especially the medial geniculate nucleus of the thalamus, MGN) and (2) in
doing so preserves the tonotopic organization of the cochlea. That is, the
topographic arrangement by frequency which is found in the cochlea is repeated
in the inferior colliculus. (And indeed is repeated all the way up to the cerebrum!) The inferior colliculus has been clearly implicated in sound localization, but little is known about the functions of the inferior colliculus with respect
to speech. However, it is noteworthy that reciprocal connections exist both from
the medial geniculate nucleus (MGN) and from the cerebral cortex back to
the inferior colliculus, and these pathways will prove important to one of our
first adaptive grammar models in chapter 7.
Like the inferior colliculus, the medial geniculate nucleus relays afferent
auditory signals to auditory cortex, retaining tonotopic organization. It also
receives reciprocal signals from auditory cortex and sends reciprocal signals
back to the inferior colliculus. Unlike the inferior colliculus, the medial geniculate nucleus also exchanges information with thalamic centers for other
senses, especially vision. As a result, some cross-modal information arises from
thalamus to cortex. Moreover, much of this information seems to be processed
in cerebrum-like on-center off-surround anatomies.
Medial geniculate nucleus projections of the auditory nerve erupt into the
cerebrum in the (auditory) koniocortex on the inner surface of the superior
temporal gyrus, inside the Sylvian fissure. Perhaps because this area is relatively
inaccessible to preoperative brain probes, it has been relatively little studied
in the human case. Nevertheless, it can be inferred from dissection, as well as
from many studies of mammals and primates, that koniocortex is characterized by tonotopic neuron arrays (tonotopic maps) which still reflect the tonotopic
organization first created by the cochlea. Considerable research on tonotopic
mapping has been done on mammalian brains ranging from bats (Suga 1990)
to monkeys (Rauschecker et al. 1995). In fact, these brains tend to exhibit three,
four, five, and more cerebral tonotopic maps.
Since the same on-center off-surround architecture characterizes both visual and auditory cortex, the same minimal visual anatomies from which adaptive resonance theory was derived can be used to explain sound perception. We will first consider how auditory contrast enhancement may be said to occur, and then we will consider auditory noise suppression.
Auditory contrast enhancement
At night, a dripping faucet starts as a nearly inaudible sound and little by little
builds until it sounds like Niagara Falls, thundering out all possibility of sleep.
Figure 6.21 models this common phenomenon. Field F¹ (remember that fields are superscripted while formants are subscripted) models a tonotopic, cochlear nucleus array in which the drip is perceived as a single, quiet note activating cell x subliminally (“below the limen,” the threshold of perception). For concreteness, the higher field F² may be associated with the inferior colliculus or the medial geniculate nucleus. The response at F¹ also graphs the (subliminal) response at t₁, while the response at F² also graphs the (supraliminal) response at some later t₂. At t₁ the response at cell x, stimulated only by the faucet drip, barely rises above the surrounding stillness of the night and does not cross the
Figure 6.21.
Auditory contrast enhancement.
threshold of audibility. However, cell x is stimulated both by the drip and by
resonant stimulation. At the same time, the surrounding cells, . . . , x – 2, x – 1
and x + 1, x + 2, . . . , become inhibited by cell x, and as they become inhibited,
they also disinhibit cell x, adding further to the on-center excitation of x. This
process (“the rich get richer and the poor get poorer”) continues until finally,
at tₙ, cell x stands out as loudly as any sound can. The contrast between the
drip and the nighttime silence has become enhanced.
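The “rich get richer” dynamic can be sketched in a few lines of code. The following is a deliberately crude, discrete-time caricature of a recurrent on-center off-surround field, not the shunting equations of adaptive resonance theory; the gains, the ceiling, and the five-cell field are arbitrary. It shows only that a cell with a barely perceptible advantage, like cell x driven by the drip, ends up dominating the entire field.

```python
# Crude discrete-time caricature of a recurrent on-center off-surround
# field (arbitrary gains, not the book's equations).  Cell x (index 2)
# starts with a barely perceptible advantage and ends up alone at the
# ceiling while every other cell is driven to zero.

EXCITE = 1.2   # self-excitatory (on-center) feedback gain
INHIBIT = 0.2  # inhibition each cell receives from every other cell

def step(activity):
    total = sum(activity)
    new = []
    for a in activity:
        others = total - a
        updated = EXCITE * a - INHIBIT * others
        new.append(max(0.0, min(updated, 1.0)))  # clip to [0, 1]
    return new

field = [0.10, 0.10, 0.12, 0.10, 0.10]  # the faucet drip at cell x
for _ in range(40):
    field = step(field)
print([round(a, 3) for a in field])      # [0.0, 0.0, 1.0, 0.0, 0.0]
```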
Auditory noise suppression and edge detection
Auditory noise suppression is closely related to contrast enhancement since
both are caused by the dynamics of on-center off-surround neural anatomies.
Noise suppression and an interesting, related case of spurious edge detection
are illustrated in figure 6.22.
Figure 6.22.
White noise is suppressed; band-limited noise is not.
108 •
HOW THE BRAIN EVOLVED LANGUAGE
In World War II communications research, it was found that white noise
interfered with spoken communication less than band-limited noise. Figure
6.22 presents a vowel spectrum (solid line) under conditions of both white and
band-limited noise (dotted lines). At the same amplitude, when noise is limited to a relatively narrow frequency band, there is “less” of it than there is of white noise, which covers the entire frequency range. Nevertheless, the band-limited noise interferes more with speech communication. This occurs because, under white noise, the on-center off-surround neural filter causes the noise to be uniformly suppressed across the entire spectrum. At the same time, contrast enhancement picks out the formant peaks and emphasizes them above the background noise. Thus, in figure 6.22, under the white-noise condition, the perceived formant spectrum is preserved at time c, after neural processing. The band-limited noise, however, introduces perceptual edges which, like
the edges in figure 5.4b, become enhanced. By time f (figure 6.22), the speech
spectrum has been grossly distorted: the perceptually critical second formant
peak has been completely suppressed and replaced by two spurious formant
peaks introduced at the edges of the band-limited noise. This further illustrates
the perceptual phenomenon of edge detection, which occurs because in on-center
off-surround processing, the middle of the band-limited noise is laterally inhibited and suppressed from both sides, while the edges are suppressed only
from one side.
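The asymmetry between white and band-limited noise can be sketched numerically as well. Below, a toy feedforward center-surround filter subtracts half the sum of each channel's neighbors from that channel; the nine-channel "spectra" and the surround weight are invented for illustration. Uniform noise cancels itself everywhere, while the edges of the band-limited noise survive as spurious peaks, as in figure 6.22.

```python
# Toy feedforward center-surround filter: each output channel is its
# input minus half the sum of its neighbors, floored at zero.  Invented
# nine-channel "spectra", not measured data.

def center_surround(spectrum):
    out = []
    for i, x in enumerate(spectrum):
        left = spectrum[i - 1] if i > 0 else x                   # edge channels see
        right = spectrum[i + 1] if i < len(spectrum) - 1 else x   # themselves as neighbors
        out.append(max(0.0, x - 0.5 * (left + right)))
    return [round(v, 2) for v in out]

white_noise  = [0.3] * 9                                      # flat across all channels
band_limited = [0.0, 0.0, 0.6, 0.6, 0.6, 0.6, 0.6, 0.0, 0.0]  # confined to channels 2-6

print(center_surround(white_noise))    # all zeros: white noise is uniformly suppressed
print(center_surround(band_limited))   # peaks survive only at the band's edges
```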
In this chapter, we have looked at the basic human motor-speech and auditory
systems, and we have seen how these systems produce, sample, sense, and process sound and perceive it as the most fundamental elements of speech. In particular, we examined basic examples of noise suppression, contrast enhancement, and edge detection in on-center off-surround ART networks. In chapter 7 we will extend these findings from these atomic, phonetic speech sounds
and phenomena to the phonemic categories of speech.
• SEVEN •
Speech Perception
In chapter 6 we described the nature of the speech signal and how its image is
sensed and presented to the cerebrum. We noted that because no two vocal
tracts are exactly alike, your pronunciation will differ subtly but certainly from
my pronunciation. To express these subtle, phonetic differences, linguists invented the International Phonetic Alphabet (IPA). In the fifteenth century, the
Great English Vowel Shift caused the writing system of English to deviate from
continental European systems, so IPA looks more like French or Italian spelling than English. Thus, when you say beet, we might write it in IPA as [bit], and if I pronounce my [i] a little further forward in my mouth, we could capture this detail of my pronunciation in IPA as [bi̟t]. The sounds of the letters of IPA correspond to phones, and the brackets tell us we are attempting to capture pronunciation phonetically, that is, as accurately as possible.
But even though you say [bit] and I say [bi̟t], we both perceive the word
beet. This is a small miracle, but it is very significant. In one sense, language
itself is nothing but a million such small miracles strung end to end. No two
oak trees are exactly alike either, but 99% of the time you and I would agree
on what is an oak and what isn’t an oak. This ability to suppress irrelevant detail and place the objects and events of life into the categories we call words is
near to the essence of cognition. In linguistics, categorically perceived sounds
are called phonemes, and phonemic categories are distinguished from phonetic
instances by enclosing phonemes in slashes. Thus, for example, you say [bit] and I say [bi̟t], but we both perceive /bit/.
Whereas in chapter 6 we found much to marvel at in the neural production
of phones, we turn now to the still more marvelous phenomena of categorical
neural perception of phonemes. To understand speech perception, and ultimately
cognition, we now begin to study how minimal cerebral anatomies process the
speech signal. The result of this exercise will be the first collection of hypotheses that define adaptive grammar.
Voice Onset Time Perception
The words beet and heat differ by only their initial phonemes: /bit/ versus
/hit/. Linguists call such pairs of words minimal pairs. More minimally still, beet
and peat only differ on a single feature. The phonemes /b/ and /p/ are both
bilabial plosive (or stop) consonants. The only difference is that /b/ is a voiced
consonant, and /p/ is an unvoiced consonant. The words /bit/ and /pit/ differ only on the manner feature of voicing. Such minimal pairs isolate the categorical building blocks of language and provide an excellent laboratory in
which to begin the study of cognition.
The difference between voiced and unvoiced sounds was long said to be
that the vocal cords vibrated during production of voiced sounds but not during the production of unvoiced sounds. This is true enough, but as we have seen, the production of speech sounds is only one-half of the language equation. Following the invention of the sound spectrograph in the late 1940s it
became possible for researchers to study the other half, the perception of speech
sounds. In a landmark study, Liberman et al. (1952) used spectrography to
measure the plosive voicing contrast against how it is perceived by listeners. For
example, spectrograms of /p/ and /b/ in paid and bade are given in figure 7.1.
The spectrograms for both /p/ and /b/ begin with a dark, vertical band which
marks the initial, plosive burst of these consonants. (We will examine the third
spectrogram in figure 7.1 a little later in this chapter.) These are followed by the
dark, horizontal bands of the formants of /e/. Finally, each spectrogram ends
with another burst marking the final /d/. It is difficult to find much to say about
nothingness, so one might believe, as linguists long did, that the most significant
difference between /p/ and /b/ is the aspiration following /p/. This is the high-frequency sound in figure 7.1, appearing after the burst in paid. From a listener’s
perspective, however, such aspiration falls outside the tuning curve of the ear canal
and is too faint to be reliably heard. It might as well be silence. And indeed, in
1957 Liberman et al. found that it was the silence following a plosive burst which
distinguished /p/ and /b/. They called this silent interval voice onset time (VOT).
Figure 7.1.
Spectrograms of [bed], [ped], and [ᵐbed] (Spanish-like prevoicing).
Marked VOT in figure 7.1, voice onset time is usually measured from the
burst to the beginning of the lowest dark band on the spectrogram, the voicing
bar. Once researchers recognized that silence could be a highly contrastive
speech cue, it was easy to see from spectrograms how VOT could be a highly
salient feature of speech. Using synthesized speech, numerous studies quickly
verified that VOT was the primary feature distinguishing voiced and unvoiced
consonants and that this distinction applied universally across languages (Lisker
and Abramson 1964).
It was soon discovered that these perceptual distinctions were also categorical. In 1957, Liberman et al. presented listeners with a series of syllables which varied in VOT between 0 and 50 ms (in IPA we might represent the stimuli as [ba], [b⁺a], [b⁺⁺a], etc.). They asked listeners to identify these syllables as either /ba/ or /pa/. As figure 7.2 shows, they found an abrupt, categorical shift
toward the identification of /pa/ when VOT reached 25 ms. Initial plosives
with VOTs under 25 ms were perceived as voiced, while those with longer VOTs
were perceived as unvoiced. It was as if a binary switch flipped when VOT
crossed the 25 ms boundary.
This metaphor of a binary switch was particularly attractive to generative
philosophers, who viewed language as the product of a computational mind,
and the metaphor took on the further appearance of reality when Eimas,
et al. (1971) demonstrated that even extraordinarily young infants perceived
the voiced-voiceless distinction categorically. In a series of ingenious studies,
Eimas and his coworkers repeated synthetic VOT stimuli to infants as young as
one month.
In these experiments, the infants were set to sucking on an electronically monitored pacifier (figure 7.3). At first, the synthetic speech sound [ba]₀ (i.e., VOT = 0) would startle the neonates, and they would begin sucking at an elevated rate. The [ba]₀ was then repeated, synchronized with the infant’s sucking rate, until a stable, baseline rate was reached: the babies became habituated to (or bored with) [ba]₀.
Figure 7.2.
Categorical perception.
Figure 7.3.
Eimas et al.’s (1971) “conjugate sucking” paradigm.
Then the stimulus was changed. If it was changed to [ba]₃₀, the infants were startled again. They perceived something new and different, and their sucking rate increased. If, however, the new stimulus was [ba]₁₀ or [ba]₂₀ and did not cross the 25 ms VOT boundary, the babies remained bored. They perceived nothing new, and they continued to suck at their baseline rate.
This study was replicated many times, and the conclusion seemed inescapable: Chomsky’s conjecture on the innateness of language had been experimentally proved. Neonates had the innate capacity to distinguish so subtle and language-specific a feature as phonemic voicing! But then the study was replicated once too often. In 1975, Kuhl and Miller replicated the Eimas study—but with chinchillas! Obviously, categorical perception of VOT by neonates was
not evidence of an innate, distinctively human, linguistic endowment.
Figure 7.4 explains both infants’ and chinchillas’ categorical perception of the voiced-voiceless contrast as the result of species-nonspecific dipole competition. In figure 7.4a, the left pole responds to an aperiodic plosive burst at t = 0 ms. Despite the brevity of the burst, feedback from F² to F¹ causes site u₁ to become persistently activated. This persistent activation also begins lateral inhibition of v₁ via iᵤᵥ. When the right pole is later activated at v₀ by the periodic inputs of the vowel (voice onset at t > 25 ms), inhibition has already been established. Because v₁ cannot fire, v₂ cannot fire. Only the unvoiced percept from u₂ occurs at F².
In figure 7.4b, on the other hand, voice onset occurs at t < 25 ms. In this case, v₁ reaches threshold, fires, and establishes feedback to itself via v₂ before iᵤᵥ can inhibit v₁. Now, driven by both v₂–v₁ feedback and v₀–v₁ feedforward inputs, iᵥᵤ can inhibit u₁, and a voiced percept results at v₂.
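The logic of figure 7.4 can be caricatured as a race to resonance. The sketch below is not the book's equations or any published dipole model; the feedback gain and firing threshold are tuned by hand so that the crossover falls near 25 ms. The burst gives the unvoiced pole a head start that grows slowly on feedback alone, sustained voicing drives the voiced pole up quickly, and whichever pole establishes itself first suppresses its rival.

```python
# Race-to-resonance caricature of the VOT dipole in figure 7.4.  The
# gain and threshold are hand-tuned so the category boundary falls near
# 25 ms; they are illustrative, not derived from data.

def percept(vot_ms, threshold=5.0, feedback=0.06):
    """First pole to reach threshold establishes resonance and wins."""
    u = v = 0.0
    for t in range(200):                       # 1 step = 1 ms
        burst = 1.0 if t == 0 else 0.0         # brief aperiodic burst drives u
        voicing = 1.0 if t >= vot_ms else 0.0  # sustained periodic input drives v
        u = u * (1 + feedback) + burst         # u grows on feedback alone after the burst
        v = v * (1 + feedback) + voicing
        if u >= threshold or v >= threshold:
            return "unvoiced" if u >= v else "voiced"
    return "neither"

for vot in (0, 10, 20, 30, 40):
    print(f"VOT = {vot:2d} ms -> {percept(vot)}")
# 0, 10, 20 ms -> voiced; 30, 40 ms -> unvoiced
```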
Figure 7.4 also explains more subtle aspects of English voicing. For example,
the [t] in step is perceived as an unvoiced consonant, but acoustically, this [t]
is more like a /d/: it is never aspirated, and its following VOT is usually less
than 25 ms. How then is it perceived as a /t/? In this case, figure 7.4 suggests
Figure 7.4.
A VOT dipole. (a) Unvoiced percept. (b) Voiced percept.
that the preceding /s/ segment excites the unvoiced pole of figure 7.4, so it
can establish persistent inhibition of the voiced pole without 25 ms of silence.
It predicts that if the preceding /s/ segment is synthetically shortened to less
than 25 ms, the [t] segment will then be heard as a voiced /d/.
The differentiation of wideband and narrowband perception has a plausible
macroanatomy and a plausible evolutionary explanation. Figure 7.4 models a
cerebral dipole, but dipoles also exist ubiquitously in the thalamus and other
subcerebral structures—wherever inhibitory interneurons occur. In this case, the
unvoiced pole must respond to a brief burst stimulus with a broadband spectrum.
It is therefore plausible to associate u₀ with the octopus cells of the cochlear nucleus since, as we saw in chapter 6, this is exactly the type of signal to which they respond. Similarly, we associate v₀ with the tonotopic spherical cells of the cochlear nucleus. We associate F¹ of figure 7.4 with the inferior colliculus and medial geniculate nucleus. It is known that the octopus cells and spherical cells send separate pathways to these subcortical structures. Lateral competition between these pathways at F¹ is more speculative. The inferior colliculus has been mostly studied as a site computing interaural timing and sound localization (Hattori and Suga 1997). However, both excitatory and inhibitory cell types are present, and it is probable that such lateral inhibition does occur at the inferior colliculus (Pierson and Snyder-Keller 1994). Lateral competition is a well-established process at the levels of the medial geniculate nucleus, the thalamic reticular formation, and cerebral cortex (grouped as F² in figure
7.4; Suga et al. 1997). For simplicity, however, we diagram lateral competition only at F¹ in figure 7.4. Likewise, reciprocal feedforward-feedback loops like u₁–u₂–u₁ and v₁–v₂–v₁ are found at all levels of auditory pathways, but figure 7.4 emphasizes feedback loops from u₂ to u₁ following Suga and his colleagues (Ohlemiller et al. 1996; Yan and Suga 1996; Zhang et al. 1997), who have identified what is presumably homologous “FM” feedback circuitry from cerebrum to inferior colliculus in bats. Finally, at typical central nervous system (CNS) signal velocities of 1 mm/ms, note that the circuits of figure 7.4 are also reasonably scaled for categorical perception centered around a VOT of 25 ms.
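The scaling claim amounts to a single multiplication, made explicit below; the 1 mm/ms velocity is the figure quoted above, and the resulting path length is only a rough plausibility check, not a measured anatomical distance.

```python
# Rough plausibility check of the scaling claim: at about 1 mm/ms, how
# much conduction path corresponds to a 25 ms VOT boundary?
SIGNAL_VELOCITY_MM_PER_MS = 1.0
VOT_BOUNDARY_MS = 25.0
print(VOT_BOUNDARY_MS * SIGNAL_VELOCITY_MM_PER_MS, "mm of conduction path")  # 25.0 mm
```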
Some languages, like Thai and Bengali, map prevoiced (VOT < 0 ms), voiced (VOT ≈ 0 ms), and unvoiced plosive phones (VOT > 25 ms) to three different phonemic categories, and replications of the Eimas study using this wider range of stimuli suggest that neonates can also perceive VOT in three categories. Nevertheless, most languages, including the European languages, divide the VOT continuum into only two phonemic categories: voiced and unvoiced. The problem is that these languages divide the continuum in different places, so before the sound spectrograph came along, this situation confused even trained linguists. For example, whereas English and Chinese locate the voicing crossover at 25 ms, so that “voiced” /b/ < 25 ms < “unvoiced” /p/, Spanish locates its voicing crossover at 0 ms, so that “voiced” /b/ < 0 ms < “unvoiced” /p/. That is, Spanish /b/ is prevoiced, as in figure 7.1.
As one result, when Spanish and Portuguese missionaries first described
the Chinese language, they said it lacked voiced consonants, but if these same
missionaries had gone to England instead, they might well have said that English lacked voiced consonants. Subsequently, this Hispanic description of
Chinese became adopted even by English linguists. In the Wade-Giles system
for writing Chinese in Roman letters, /di/ was written ti and /ti/ was written
t’i, and generations of English learners of Chinese have learned to mispronounce Chinese accordingly, even though standard English orthography and
pronunciation would have captured the Chinese voiced/voiceless distinction
perfectly.
Because of its species nonspecificity, subcerebral origins, and simple dipole
mechanics, categorical VOT perception probably developed quite early in vertebrate phylogeny. It is quite easy to imagine that the ability to discriminate between a narrowband, periodic birdsong and the wideband, aperiodic snapping noise of a predator stepping on a twig had survival value even before the evolution of mammalian life. Still, it is not perfectly clear how figure 7.4 applies to the perception of Spanish or Bengali. As surely as there are octopus
cells in the cochlear nucleus, the XOR information of the voicing dipole is
present, but how it is used remains an issue for further research.
Phoneme Learning by Vowel Polypoles
A few speech features such as VOT may be determined by subcortical processing, but most speech and language features must be processed as higher
cognitive functions. Most of these features begin to be processed in primary auditory cortex, where projections from the medial geniculate nucleus erupt into the temporal lobe of the cerebrum. Although a few of these projections might be random or diffuse signal pathways, a large and significant number are coherently organized into tonotopic maps. That is, these projections are spatially
organized so as to preserve the frequency ordering of sound sensation that was
first encoded at the cochlea.
PET scans and MRI scans have so far lacked sufficient detail to study human tonotopic maps, so direct evidence of tonotopic organization in humans
is sparse. Animal studies of bats and primates, however, have revealed that the
typical mammalian brain contains, not one, but many tonotopic maps. The bat,
for example, exhibits as many as five or six such maps (Suga 1990).
It is rather impressive that this tonotopic order is maintained all the way
from the cochlea to the cerebrum, for although this distance is only a few
centimeters, some half-dozen midbrain synapses may be involved along some
half-dozen distinct pathways. Moreover, no fewer than three of these pathways cross hemispheres, yet all the signals reach the medial geniculate nucleus more or less in synchrony and project from there into the primary auditory cortex, still maintaining tonotopic organization. In humans, tonotopic organization implies that the formant patterns of vowels, which are produced in
the vocal tract and recorded at the cochlea, are faithfully reproduced in the
cerebrum. The general structure of these formant patterns was presented in
chapter 5. What the cerebrum does with tonotopic formant patterns is our
next topic.
To model how phones like [i] become phonemes like /i/, we return to the
on-center off-surround anatomy, which we now call a polypole for simplicity. For
concreteness, imagine an infant learning Spanish (which has a simpler vowel
system than English), and consider how the formant pattern of an [i] is projected
from the cochlea onto the polypoles of primary auditory cortex (A1) and tertiary auditory cortex (A3) in figure 7.5. (We will discuss A2 at the end of this chapter.) If we take the infant’s primary auditory cortex to be a tabula rasa at birth, then its vector of cortical long-term memory traces, drawn in figure 7.5 as modifiable synaptic knobs at A3, is essentially uniform. That is, z₁ = z₂ = . . . = zₙ.
When the vowel [i] is sensed at the cochlea and presented to polypoles A1 and A3, the formants of the [i] map themselves onto the long-term memory traces zᵢ between A1 and A3.
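This learning step can be sketched with a simple instar-style rule, in which each trace moves toward its input whenever the target cell is active. The rule, the learning rate, the eight-channel "tonotopic" vector, and the toy [i] pattern below are all illustrative assumptions rather than the book's exact equations.

```python
# Instar-style sketch (illustrative rule and numbers, not the book's
# equations): initially uniform long-term-memory traces z_i move toward
# the formant pattern each time [i] is presented.

LEARNING_RATE = 0.3

def learn(traces, input_pattern):
    """Nudge each trace z_i toward the corresponding input activity."""
    return [z + LEARNING_RATE * (x - z) for z, x in zip(traces, input_pattern)]

# Toy tonotopic activity for [i]: peaks at a low F1 channel and a high F2 channel.
formant_pattern_i = [0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1]
traces = [0.5] * 8                    # "tabula rasa": all traces equal at birth

for _ in range(10):                   # repeated presentations of [i]
    traces = learn(traces, formant_pattern_i)

print([round(z, 2) for z in traces])  # the traces now mirror the formant peaks
```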
Feature Filling and Phonemic Normalization
In figure 7.5, lateral inhibition across A1 and A3 contrast-enhances the phoneme /i/. Thus, the formant peaks at A3 become more exaggerated and better defined than the original input pattern. This has the benefit of allowing learned, expectancy feedback signals from the idealized phoneme pattern across A3 to deform the various [i]s of different speakers to a common (and, thus, phonemic) pattern for categorical matching and recognition.
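A companion sketch shows the normalization step: a learned, contrast-enhanced /i/ template is fed back and blended with each speaker's slightly different [i], pulling both toward one categorical pattern. The template, the toy spectra, and the feedback weight are illustrative assumptions, not measured values.

```python
# Illustrative normalization step: blend each speaker's [i] with the
# learned, contrast-enhanced /i/ template (top-down expectancy feedback).
# Template, spectra, and feedback weight are invented for illustration.

TEMPLATE_I = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]  # idealized /i/ after contrast enhancement

def normalize(input_pattern, template, feedback_weight=0.6):
    """Mix bottom-up input with top-down expectancy feedback."""
    return [round((1 - feedback_weight) * x + feedback_weight * t, 2)
            for x, t in zip(input_pattern, template)]

your_i = [0.1, 0.9, 0.1, 0.1, 0.1, 0.2, 0.8, 0.1]  # your [i]
my_i   = [0.1, 0.8, 0.2, 0.1, 0.1, 0.1, 0.9, 0.1]  # my slightly fronted [i]

print(normalize(your_i, TEMPLATE_I))  # both land close to the same /i/ pattern
print(normalize(my_i, TEMPLATE_I))
```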