Figure 6.2. The first harmonic of a vibrating string and its waveform.
Now things become complex. It happens that a string can vibrate in many
segments at once. In figure 6.4, our string is shown vibrating as a whole to the
right. At the same time, each third is also vibrating, the top and bottom thirds
to the right, and the middle third to the left. Each third vibrates at three times
the frequency of the whole. A string’s vibration can be (and usually is) more
complex still.

Figure 6.3. A string vibrating in halves generates the second harmonic.

Figure 6.4. H1 (dotted line) and H3 (dashed line) in complex vibration.

A classic example occurs if a complex wave is composed of successive odd harmonics (e.g., the first, third, and fifth). In this case, the composite complex wave approaches a square wave in shape.
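The square-wave claim is easy to verify numerically. Below is a minimal sketch in Python (my illustration, not the book's; it assumes the standard 1/n amplitude fall-off of a square wave's Fourier components):

import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)   # one period of a 1 Hz wave
wave = np.zeros_like(t)
for n in (1, 3, 5):                           # first, third, fifth harmonics
    wave += np.sin(2 * np.pi * n * t) / n     # each higher harmonic is weaker
# 'wave' now swings steeply between two nearly flat plateaus; adding more
# odd harmonics (7, 9, ...) makes it approach a true square wave.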
When two or more harmonics are present, they can reinforce each other or interfere with each other. When they reinforce each other, their peaks align and become higher; the harmonics are said to resonate, and the sound becomes louder. For example, when the dotted-line and dashed-line harmonics in figure 6.4 combine, the first peak of the resulting solid-line waveform is more intense than either of the two other components alone. In highly resonant systems, a waveform can actually feed back into itself, resonating with itself and becoming louder and louder. This is what happens, for example, when a public address system microphone is placed too close to its speakers. The output feeds back into the microphone, reinforcing and resonating with itself until a fuse blows. When sound waves interfere with one another, their energy dissipates, and the vibration of the string is described as damped: once plucked, it does not continue to vibrate.
This is the situation with the vocal cords. Their vibratory patterns are very
complex and not very resonant. The resulting sound wave—the glottal pulse—
is a highly damped waveform (figure 6.5). A single glottal pulse sounds much
more like a quick, dull thud than a ringing bell.
Each time a puff of air passes from the lungs through the larynx, one glottal pulse like figure 6.5 occurs. The vocal folds then spring closed again until subglottal air pressure builds to emit another puff. If, while you were talking, we could somehow unscrew your head just above your larynx, your larynx would sound like a “Bronx cheer” (a bilabial trill). This is approximately the sound of a trumpet mouthpiece without the trumpet. What turns such a buzzing sound into melodious, voiced speech?
Figure 6.5. (a) The glottal pulse has a highly damped waveform. (b) Continuous periodic speech is based on a train of glottal pulses.

A trumpet makes its buzzing mouthpiece harmonious by attaching it to a tube. If we put a pulse train like figure 6.5 into a tube, a chain reaction of high- and low-pressure areas is set up in the tube. Like strings, tubes also have harmonic frequencies at which they resonate. Like strings, the shorter the tube, the higher the fundamental frequency. If we now screwed your head back onto your larynx, your normal, beautiful, melodious voice would be restored. Your mouth and throat function like the tubes of a trumpet to transform a lowly Bronx cheer into vowels. To gain a deeper appreciation of how this happens, we need to visualize sound waves in the frequency domain.
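The "shorter tube, higher fundamental" rule can be made concrete with the standard quarter-wave resonator formula of elementary acoustics (not spelled out in the book): a tube closed at one end and open at the other resonates at F_n = (2n − 1)c / 4L. A sketch, assuming a speed of sound of 343 m/s:

SPEED_OF_SOUND = 343.0          # m/s in air at room temperature (assumption)

def resonances(tube_length_m, n=3):
    """First n resonant frequencies of a closed-open tube, in Hz."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * tube_length_m)
            for k in range(1, n + 1)]

print(resonances(0.175))   # ~17.5 cm, vocal-tract-like: 490, 1470, 2450 Hz
print(resonances(0.0875))  # half as long: 980, 2940, 4900 Hz — all doubled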
In figures 6.2–6.4, we observed that a string could simultaneously vibrate at successively higher harmonic frequencies, each of successively lower amplitude. Figure 6.6 captures these facts in a display known as a power spectrum. Each vertical line in figure 6.6a represents a harmonic in figure 6.4, and the height of each line represents the amplitude of the harmonic. The amplitudes decrease as the harmonics increase in frequency. Notice that figure 6.6a represents essentially the same physical facts as figure 6.4: a complex wave with f0 = 1 Hz and H3 = 3 Hz.
Figure 6.6. Power spectrum of (a) figure 6.4 and (b) the glottal pulse.
Figure 6.4, however, plots time on the x-axis, and is therefore sometimes called a
time domain plot. Figure 6.6 plots frequency on the x-axis, and so is called a frequency
domain plot. Figure 6.6b presents a fuller picture of the frequency domain of the
glottal pulse, whose many harmonics fall off in amplitude as frequency increases.
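The two views are connected by the Fourier transform. A minimal demonstration (my sketch, not the book's): take the 1 Hz plus 3 Hz complex wave of figure 6.4, with the third harmonic at half amplitude per the text's observation that higher harmonics are weaker, and recover its power spectrum with an FFT.

import numpy as np

fs = 64                                    # samples per second
t = np.arange(0, 4, 1 / fs)                # four seconds of signal
wave = np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 3 * t)

spectrum = np.abs(np.fft.rfft(wave)) / len(t)
freqs = np.fft.rfftfreq(len(t), d=1 / fs)
# 'spectrum' now shows two vertical lines: a taller one at 1 Hz (f0) and
# a shorter one at 3 Hz (H3), exactly as a power spectrum plots them.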
Figure 6.7 explains the operation of a trumpet starting with a frequency domain plot of the buzzing input at the mouthpiece (the vertical lines). Let the fundamental frequency of figure 6.7 be a train of pulses at 200 Hz. Then higher harmonics are generated at integer multiples of this f0, that is, at 400 Hz, 600 Hz, 800 Hz, and so on. The tube of the trumpet, however, imposes a further filter that “envelops” this input. With some specific combination of valves, the tube of the trumpet takes on a length which resonates optimally at some note at the center frequency of the bell curve in figure 6.7. The surrounding frequencies resonate less and are attenuated. Most of them are completely filtered out and never escape the trumpet. On the other hand, the tube can actually amplify the resonant frequencies at the center of the curve. When the trumpeter changes the valves on a trumpet, tubes of different lengths are coupled together. These successive tubes of differing length are resonant frequency filters which select the successive musical notes the trumpeter is playing.

Figure 6.7. Idealized trumpet filter centered at 1.2 kHz.
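The filtering just described is simply multiplication in the frequency domain. Here is a minimal sketch (my construction; the Gaussian shape and width of the bell-curve filter of figure 6.7 are assumptions):

import numpy as np

f0 = 200.0                          # fundamental of the mouthpiece buzz (Hz)
n = np.arange(1, 21)
harmonics = f0 * n                  # 200, 400, ..., 4000 Hz
source = 1.0 / n                    # assume amplitudes fall off as 1/n
center, width = 1200.0, 300.0       # assumed peak and spread of the filter
gain = np.exp(-((harmonics - center) ** 2) / (2 * width ** 2))
output = source * gain              # filtering = multiplying the spectrum
# Harmonics near 1200 Hz survive almost untouched; those far from the
# center frequency are attenuated and "never escape the trumpet."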
The human voice works rather like a trumpet trio. The glottal pulse train
feeds into three connected tubes of variable length. These resonate at three
different, variable frequencies and produce the various three-note, chordlike
sounds, which are called vowels. These three “tubes” can be seen in figure 6.8,
which shows the articulatory position of the vocal tract for the vowel /i/, as in
bead.
Letters enclosed in brackets and slashes, like [i] and /e/, are letters of the International Phonetic Alphabet, or IPA, which is useful for distinguishing sounds like the vowels of bad and bade. IPA departs from normal English spelling in several respects, most notably in that /i/ represents the sound of French si or Chinese bi, but English deed, and /e/ represents the sound of French les or German Schnee, but English bade. Letters enclosed in brackets are phones, and letters enclosed in slashes are phonemes. Phones are sounds-in-the-air, and phonemes are sounds-in-the-mind.
The phoneme /i/ is often described as a high, front vowel, meaning the tongue is high and to the front. This position of the tongue divides the vocal tract into three subtubes (figure 6.8): the labiodental cavity, between the teeth and the lips (III); the oral cavity, in the middle (II); and the pharyngeal cavity at the rear (I). The labiodental cavity resonates at a frequency that defines the highest note, or formant (F3), of the vocalic “chord.”³ The oral cavity is larger. When articulating /i/, the oral cavity is bounded at the rear by the tongue and at the front by the teeth. Together with the labiodental cavity, the oral cavity forms a longer, larger subtube which resonates at a lower frequency, the second formant, F2, of the vowel. The F1 formant is the lowest in frequency. It is created by the pharyngeal, oral, and labiodental cavities resonating together as one large, long tube.

Figure 6.8. Articulatory position of /i/.
For comparison, figure 6.9 shows the articulatory position for the vowel /u/, as in rude. Here, the tongue is up and back, making the oral cavity bigger. The lips are also rounded, making the labiodental cavity bigger. Correspondingly, F2 and F3 are both lower than in /i/. F1 is low as in /i/ because the overall length and volume of the vocal tract remain large.

Figure 6.9. Articulatory position for /u/.
Figure 6.10 summarizes the main acoustic effects of the articulatory configurations in figures 6.8 and 6.9, superimposing filter functions (or vowel spectra) on a male glottal spectrum. In figure 6.10a, the spectrum for the vowel /i/ displays typical formant peaks at 300, 2200, and 2900 Hz. For this speaker, these are the three “notes” of the vocalic “chord” for /i/, and they correspond to the resonant frequencies of the three subtubes of the vocal tract in figure 6.8. Corresponding typical formant values for /u/ of 300, 750, and 2300 Hz are plotted in figure 6.10b, values corresponding to the three subtubes of the vocal tract in figure 6.9.
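These numbers are enough to caricature a vowel in software. The sketch below (my illustration, not the author's model; the Q values are guesses) feeds a glottal-like pulse train through three resonators in parallel, one per formant of /i/ from figure 6.10a:

import numpy as np
from scipy import signal

fs = 16000                             # sample rate (Hz)
f0 = 100                               # glottal pulse rate (Hz)
source = np.zeros(fs)                  # one second of silence...
source[::fs // f0] = 1.0               # ...with a pulse every 10 ms

vowel = np.zeros(fs)
for formant in (300, 2200, 2900):      # /i/ formant peaks from the text
    b, a = signal.iirpeak(formant, Q=8, fs=fs)   # narrow resonant filter
    vowel += signal.lfilter(b, a, source)        # parallel formant branches
# 'vowel' now concentrates energy near 300, 2200, and 2900 Hz; written to
# a WAV file, it sounds like a buzzy approximation of /i/.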
The preceding three-tube model is adequate for our purposes, but it is also a considerable oversimplification. The acoustics of coupled tubes is actually much more complex. Retroflex consonants like American /r/ divide the oral cavity into two resonant chambers, and lateral semivowels like /l/ define the oral cavity with two apertures, one on each side of the tongue. Nasal speech sounds resonate in complex nasal and sinus cavities and introduce antiresonances which cannot be accounted for in our simple model. Because everyone’s vocal tract is different, there is also endless variation among individual voices.

Figure 6.10. Glottal spectra and vocal tract filters for (a) /i/ and (b) /u/ (fast Fourier transform analyses).
One interesting source of individual variation is illustrated in figure 6.11. Males have lower voices than women and children, so a male with a fundamental frequency of 100 Hz will have glottal source harmonics at 200, 300, 400 Hz, and so on (all vertical lines). However, a female with a fundamental frequency of 200 Hz will have only half as many harmonics, at 400, 600, 800 Hz, and so on (dotted and solid vertical lines, dotted formant envelope). Since the vocal tract can only amplify harmonics that are physically present in the source spectrum, the formants of female speech are less well defined than the formants of male speech. This becomes especially apparent if the female fundamental frequency in figure 6.11 is raised an octave to 400 Hz. Now only the solid-line harmonics are present in the source. There is no harmonic at 1400 Hz, and the second formant, which the vocal tract locates at that frequency, barely resonates (solid-line formant envelope). This makes one wonder why mommies, rather than daddies, teach language to children. We will explain later how mommies compensate for this apparent disadvantage.

Figure 6.11. A high voice decreases formant definition.
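The arithmetic behind this argument is simple enough to check mechanically. A small sketch (mine, not the book's), with an F2 assumed at 1400 Hz as in figure 6.11:

formant = 1400                          # assumed F2 target (Hz)
for f0 in (100, 200, 400):              # male, female, raised female voice
    harmonics = [f0 * n for n in range(1, 40)]
    nearest = min(harmonics, key=lambda h: abs(h - formant))
    print(f0, nearest, abs(nearest - formant))
# f0 = 100 and 200 Hz each have a harmonic exactly at 1400 Hz, but at
# f0 = 400 Hz the nearest harmonics are 1200 and 1600 Hz, 200 Hz away,
# so a formant centered at 1400 Hz barely resonates.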
A child’s vocal tract is shorter still, so its formants are higher, and this compensates somewhat for the child’s higher f0. But as a result of its shorter vocal tract, all the child’s vowel formants fall at frequencies different from mommy’s formants (not to mention daddy’s and everybody else’s). This “lack of invariance” has been an enormous frustration to attempts at speech recognition by computer, and it makes one wonder that children can learn language at all. Later we will also see how the child’s brain normalizes this variance to make language learning possible.
A device known as the “sound spectrograph” can make time-frequency domain plots like figures 6.12 and 6.13 directly from speech and expose some of this variation. In a standard spectrogram, the x-axis is the time axis. Frequency is plotted on the y-axis. Amplitude is plotted “coming at you” on the z-axis, with louder sounds darker.
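Today the same plot is computed digitally rather than with an analog spectrograph. A minimal sketch (my example; the 100 Hz buzz merely stands in for speech):

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16000
t = np.arange(fs) / fs
x = np.sign(np.sin(2 * np.pi * 100 * t))     # harmonic-rich 100 Hz buzz

# A long analysis window (512 samples = 32 ms) resolves individual
# harmonics, giving a narrowband spectrogram like figure 6.12; a short
# window would give a wideband one instead.
f, tt, Sxx = signal.spectrogram(x, fs=fs, nperseg=512)
plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12), cmap="gray_r")
plt.xlabel("time (s)")                        # x-axis: time
plt.ylabel("frequency (Hz)")                  # y-axis: frequency
plt.show()                                    # darker = louder (z-axis)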
Figure 6.12a is a narrowband spectrogram of the vowel /i/ (as in deed). The harmonics of the glottal source appear as narrow bands running the length of the spectrogram. The three characteristic formants of /i/ appear as dark bands of harmonics which are resonantly amplified by the vocal tract filter. (Sometimes higher fourth and fifth formants appear, as in figure 6.10, but they carry little speech information.) Figure 6.12b is a spectrogram of the vowel /u/ (as in crude). The phoneme /u/ differs from /i/ in that the second formant is much lower, at 800 Hz.

Figure 6.12. Narrowband spectrograms of (a) /i/ and (b) /u/.
Aperiodic sounds
Aperiodic sounds are sounds that do not have a fundamental frequency or harmonic structure. For example, figure 6.13 shows a spectrogram of the aperiodic sound /s/, as in sassy. All aperiodic sounds in speech can be said to be consonants, but not all consonants are aperiodic. For example, nasal sounds (e.g., English /m/, /n/, /ŋ/) and certain “semivowels” (e.g., English /l/, /r/, /w/) are actually periodic, and the argument can be made that such “consonants” should properly be called vowels. In any event, it follows that “consonant” is something of a catchall term, and the features of consonants cannot be summarized as succinctly as we have summarized the articulatory and acoustic structure of vowels. Unlike vowels, however, many consonants are quick sounds, which makes them especially interesting examples of the motor control of speech.

Figure 6.13. Spectrogram of /s/.
Motor control of speech
Figure 6.15 identifies key articulatory points in the linguistic dimension of place. For example, /b/ and /p/ are articulated at the lips; their place is said to be labial. English /d/ and /t/ are said to be alveolar; they are articulated by placing the tongue tip at the alveolus, the gum ridge behind the upper incisors. By contrast, /g/ is velar. It is produced by placing the back of the tongue against the velum, the soft palate at the back of the roof of the mouth. Acoustically, these consonants are identified by formant transitions, changes in the following (or preceding) vowel’s formants, which disclose how the articulators have moved and the vocal tract has changed to produce the consonant.⁴ Figure 6.14 illustrates place acoustically with spectrograms of [bɑb], [dɑd], and [gɑg].
Initially, [b], [d], and [g] are all plosive consonants, which begin with a vertical line representing a damped and brief plosive burst. This is most clearly visible for [dɑd] in figure 6.14. Bursts are damped and brief because they have competing, not-very-harmonic “harmonics” at many frequencies. These bursts are followed by the formant transitions, which may last only 50 ms. The transitions appear as often-faint slopes leading into or out of the steady-state vowel formants. They are traced in white in figure 6.14. The transitions begin (for initial consonants) or end (for postvocalic consonants) at frequencies which roughly correspond to the consonant’s place of articulation (Halle et al. 1957). For example, [b] is a labial consonant, meaning it is articulated at the lips. Since the lips are closed when articulation of [bɑb] begins, high frequencies are especially muted and all formant transitions appear to begin from the fundamental frequency (figure 6.14). The transitions of [d] begin at frequencies near those for the vowel [i] (figure 6.14), because [i] is a “high, front” vowel, with the tongue tip near the alveolar ridge, and [d] is articulated with the tip of the tongue at the alveolar ridge. Accordingly, [d] is given the place feature alveolar. Similarly, [g] begins its formant transitions near the frequencies for the vowel [u] (figure 6.14). The vowel [u] is produced with the body of the tongue raised back toward the velum, and [g] begins its articulation with the tongue raised back in contact with the velum. Therefore, [g] is classed as a velar consonant.⁵

Figure 6.14. Spectrograms of [bɑb], [dɑd], and [gɑg].
The different ways sounds can be articulated at any given place are described by the articulatory feature of manner. Thus, before vowels, [d] and [g] are called plosive consonants. They display their common manner of articulation in figure 6.14 by beginning with an explosive burst. But after vowels, their manner of articulation is somewhat different, and they are called stops. (In older literature the distinction between stops and plosives was not appreciated, and both were often simply called stops.)
The fricative consonant [s] results from an intermediate gesture. The tongue is placed too close to the alveolus to produce [i] but not close enough for a stop. Instead, the tongue produces friction against the outgoing airstream. This friction causes a turbulent airflow and the aperiodic sound [s], which is classed as +fricative. In spectrograms, [s] can exhibit formant transitions from and to its surrounding vowels, but its aperiodic sound and fricative manner produce a more distinctive and salient band of wideband “noise” (figure 6.13). Fricative sounds articulated further forward in the mouth with a smaller frontal cavity, like [f] and [θ] (as in thin), generate a higher frequency noise spectrum. Sounds generated further back in the mouth with a larger frontal cavity, like the alveopalatal [ʃ] in shoot, generate a lower frequency noise spectrum.
As the preceding examples should illustrate, very delicate movements of tongue, lips, jaw, velum, and glottis must be coordinated to produce any single speech sound. In fluent speech, these movements must also be quick enough to produce as many as twenty distinct phones per second, and at such high rates of speed, articulation can become imprecise. Fortunately, the vocal tract has optimal locations, certain places of articulation, like the alveolus and velum (figure 6.15), that are highly fault-tolerant. At these locations, articulation can be maximally sloppy yet minimally distort the resulting speech sound. It is not surprising that sounds produced at these “quantal” locations, including the “cardinal vowels” [i], [a], [u], occur in virtually all human languages (Stevens 1972).
This quantal advantage is leveraged by the upright posture of Homo sapiens (Lieberman 1968). Animals that walk on all fours, apes, and even slouched Neanderthals all have short, relatively straight vocal tracts. When humans started walking erect, the vocal tract lengthened and bent below the velum, creating separate pharyngeal and oral cavities. This added an additional formant to human calls, essentially doubling the number of speech sounds humans could produce. The already highly innervated and developed nervous system for eating was then recruited to help control this expanded inventory, and human speech evolved.
Figure 6.15. Optimal places of articulation.
Hearing
The ear
The human ear (figure 6.16) can be divided into three parts: the outer ear, including the ear canal, or meatus; the middle ear; and the inner ear. The middle ear is composed of three small bones: the “hammer” (malleus), the “anvil” (incus), and the “stirrup” (stapes). The inner ear is composed of the vestibular system, which senses up and down, and the cochlea, the primary organ for hearing.

Figure 6.16. The ear.
As we saw in figure 6.10, vowel formants are mostly between 300 and 3000 Hz. Most information is conveyed around 1800 Hz, a typical mean frequency for F2. Having described the speech system as a kind of trumpet, we can now see the outer ear to be a kind of ear trumpet (figure 6.16). This outer-ear assembly has evolved so that it has a tuning curve (figure 6.17) centered on a resonant frequency of 1800 Hz. Natural selection has tuned the outer ear to speech (or at least to the range of human “calls”: ape ears are very similar even though we might not wish to say that apes “speak”). One reason the ear must be carefully tuned to speech is that sound waves are feeble. The sound pressure of whispered speech might be only 0.002 dynes/cm². The ear must transform this weak physical signal into something palpable and sensible. What is equally remarkable is that the ear can also respond to sound pressure levels up to 140 dB, a 10,000,000-fold increase over the minimum audible sound.

Figure 6.17. The tuning curve of the outer ear.
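The "10,000,000-fold" figure follows directly from the definition of the decibel: sound pressure level in dB is 20·log10(p/p_ref), so 140 dB is a pressure ratio of 10^(140/20) = 10^7. In code (my check, not the book's):

def pressure_ratio(spl_db):
    """Convert a sound pressure level in dB to a pressure ratio."""
    return 10 ** (spl_db / 20)

print(pressure_ratio(140))   # 10000000.0 — ten million times the reference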
Sound waves cause the eardrum to vibrate. In turn, the three small bones
of the middle ear act as levers which amplify and transmit these vibrations to
the oval window of the cochlea. More important, the oval window is much
smaller than the eardrum, further amplifying the pressure delivered to the
cochlea. With this middle-ear assembly, we are able to hear sounds 1000 times
fainter than we could otherwise. (It also contains small muscles which protect
us from sounds that are too loud.)
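A back-of-the-envelope sketch of where this gain comes from (my illustration; the areas and lever ratio below are typical textbook values, not figures from this book): force collected over the large eardrum is concentrated onto the much smaller oval window.

eardrum_area_mm2 = 55.0      # effective eardrum area (assumed)
oval_window_area_mm2 = 3.2   # oval window area (assumed)
lever_ratio = 1.3            # mechanical advantage of the ossicles (assumed)

pressure_gain = (eardrum_area_mm2 / oval_window_area_mm2) * lever_ratio
print(round(pressure_gain))  # ~22x gain in pressure
# Intensity scales as pressure squared, so ~22x in pressure is roughly
# 500x in intensity — the same order of magnitude as the factor of 1000
# quoted above.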
The oval window is a membrane “opening into” the cochlea. The cochlea is an elongated cone, coiled up like a snail shell (and so called cochlea, from the Greek kochlias, “snail”). In 1961, Georg von Békésy received the Nobel Prize for describing how these vibrations from the eardrum are ultimately sensed by the cochlea (Békésy 1960). The cochlea is fluid-filled, and the vibrations the middle ear transmits to this fluid are transferred to the basilar membrane, which runs inside the cochlea from the oval window to its tip. As the basilar membrane vibrates, it produces a shearing action against hair cells, which are the end-organ cells of the auditory nervous system. Each hair cell emits a nerve signal (a spike) when its hair is stimulated by the shearing action of the basilar membrane.
Figure 6.18 schematically uncoils the cochlea and exposes the basilar membrane. We can envision the hair cells arrayed along this membrane like the keys of a mirror-image piano, high notes to the left. (Unlike a piano, the broad end of whose soundboard resonates to low frequencies, the membrane’s thin, flexible tip resonates to low frequencies, while the more rigid broad end resonates to high frequencies.) When the spectrum of a vowel like [a] stimulates the basilar membrane, it sounds like a chord played on the hair cells. Signals from the activated hair cells then propagate along the auditory nerve: from the cochlea along the cochlear nerve to the cochlear nucleus, the inferior colliculus, the medial geniculate body of the thalamus, and the transverse temporal gyrus of the cerebrum (koniocortex), as illustrated in figure 6.19.
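The mapping from position to frequency along the membrane is commonly approximated by the Greenwood function; the sketch below uses the standard human parameters (Greenwood's, not this book's):

def greenwood_hz(x):
    """Best frequency (Hz) at relative position x along the basilar
    membrane, from the flexible apex (x = 0) to the rigid base (x = 1)."""
    return 165.4 * (10 ** (2.1 * x) - 0.88)

for x in (0.0, 0.5, 1.0):
    print(round(greenwood_hz(x)))   # ~20 Hz, ~1710 Hz, ~20700 Hz
# Low frequencies map to the thin apical tip and high frequencies to the
# broad basal end — the mirror-image piano of figure 6.18.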