Figure 6.2. The first harmonic of a vibrating string and its waveform.
Now things become complex. It happens that a string can vibrate in many 
segments at once. In figure 6.4, our string is shown vibrating as a whole to the 
right. At the same time, each third is also vibrating, the top and bottom thirds 
to the right, and the middle third to the left. Each third vibrates at three times 
the frequency of the whole. A string’s vibration can be (and usually is) more 
complex still. A classic example occurs if a complex wave is composed of
successive odd harmonics (e.g., the first, third, and fifth). In this case, the
composite complex wave approaches a square wave in shape.

Figure 6.3. A string vibrating in halves generates the second harmonic.

Figure 6.4. H₁ (dotted line) and H₂ (dashed line) in complex vibration.
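A brief numerical sketch makes the square-wave claim concrete. The 1/n
amplitudes are assumed here as the standard Fourier-series weights for a
square wave; they are not read off the figure.

```python
import numpy as np

# Sum successive odd harmonics of a 1 Hz fundamental. With the classic
# Fourier-series weights 1/n (an assumption here, not a value from the
# figure), the sum visibly approaches a square wave as terms are added.
t = np.linspace(0.0, 2.0, 2000)          # two cycles of the fundamental
wave = np.zeros_like(t)
for n in (1, 3, 5, 7):                   # first, third, fifth, seventh harmonics
    wave += np.sin(2 * np.pi * n * t) / n
```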
When two or more harmonics are present, they can reinforce each other or 
interfere with each other. When they reinforce each other, their peaks align and 
become higher; the harmonics are said to resonate and the sound becomes louder. 
For example, when the dotted-line and dashed-line harmonics in figure 6.4 com­
bine, the first peak of the resulting solid-line waveform is more intense than ei­
ther of the two other components alone. In highly resonant systems, a waveform 
can actually feed back into itself, resonating with itself and becoming louder and 
louder. This is what happens, for example, when a public address system micro­
phone is placed too close to its speakers. The output feeds back into the micro­
phone, reinforcing and resonating with itself until a fuse blows. When sound 
waves interfere with one another, their energy dissipates and the vibration of 
the string is described as damped: once plucked, it does not continue to vibrate. 
This is the situation with the vocal cords. Their vibratory patterns are very 
complex and not very resonant. The resulting sound wave—the glottal pulse— 
is a highly damped waveform (figure 6.5). A single glottal pulse sounds much 
more like a quick, dull thud than a ringing bell. 
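A minimal sketch of such a damped waveform, assuming an illustrative
frequency and decay constant rather than measured glottal values:

```python
import numpy as np

# Illustrative damped waveform: a sinusoid under an exponentially decaying
# envelope. Frequency and decay constant are display values only, not
# measured glottal parameters.
t = np.linspace(0.0, 0.02, 1000)         # 20 ms
f = 150.0                                # assumed oscillation frequency (Hz)
alpha = 300.0                            # assumed decay constant (1/s)
pulse = np.exp(-alpha * t) * np.sin(2 * np.pi * f * t)
```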
Each time a puff of air passes from the lungs through the larynx, one glottal 
pulse like figure 6.5 occurs. The vocal folds then spring closed again until subglottal 
air pressure builds to emit another puff. If, while you were talking, we could some­
how unscrew your head just above your larynx, your larynx would sound like a 
“Bronx cheer” (a bilabial trill). This is approximately the sound of a trumpet 
mouthpiece without the trumpet. What turns such a buzzing sound into melodi­
ous, voiced speech? 
A trumpet makes its buzzing mouthpiece harmonious by attaching it to a 
tube. If we put a pulse train like figure 6.5 into a tube, a chain reaction of high-
and low-pressure areas is set up in the tube. Like strings, tubes also have har­
monic frequencies at which they resonate. Like strings, the shorter the tube,
the higher the fundamental frequency.

Figure 6.5. (a) The glottal pulse has a highly damped waveform. (b) Continuous
periodic speech is based on a train of glottal pulses.

If we now screwed your head back onto
your larynx, your normal, beautiful, melodious voice would be restored. Your 
mouth and throat function like the tubes of a trumpet to transform a lowly 
Bronx cheer into vowels. To gain a deeper appreciation of how this happens, 
we need to visualize sound waves in the frequency domain. 
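A standard first approximation, not given in the text, treats the vocal tract
as a uniform tube closed at the glottis and open at the lips; such a tube
resonates at odd quarter-wavelength frequencies, F_n = (2n - 1)c/4L:

```python
def tube_resonances(length_m, n=3, c=343.0):
    """First n resonances (Hz) of a uniform tube closed at one end."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

# An assumed 17.5 cm adult vocal tract gives roughly 490, 1470, 2450 Hz,
# close to the textbook "neutral vowel" formants near 500, 1500, 2500 Hz.
print(tube_resonances(0.175))
```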
In figures 6.2–6.4, we observed that a string could simultaneously vibrate at 
successively higher harmonic frequencies, each of successively lower amplitude. 
Figure 6.6 captures these facts in a display known as a power spectrum. Each vertical 
line in figure 6.6a represents a harmonic in figure 6.4, and the height of each 
line represents the amplitude of the harmonic. The amplitudes decrease as the 
harmonics increase in frequency. Notice that figure 6.6a represents essentially the 
same physical facts as figure 6.4: a complex wave with f₀ = 1 Hz and H₂ = 3 Hz.

Figure 6.6. Power spectrum of (a) figure 6.4 and (b) the glottal pulse.
Figure 6.4, however, plots time on the x-axis, and is therefore sometimes called a 
time domain plot. Figure 6.6 plots frequency on the x-axis, and so is called a frequency 
domain plot. Figure 6.6b presents a fuller picture of the frequency domain of the 
glottal pulse, whose many harmonics fall off in amplitude as frequency increases. 
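In computational terms, the move from the time domain to the frequency domain
is a Fourier transform. A short sketch, with assumed sampling parameters:

```python
import numpy as np

# Move a signal from the time domain to the frequency domain with an FFT.
# The spectrum of a wave built from 1, 3, and 5 Hz components shows three
# lines at those frequencies with decreasing height, as in figure 6.6a.
fs = 100                                  # assumed sampling rate (Hz)
t = np.arange(0, 10.0, 1.0 / fs)          # 10 s of signal, 0.1 Hz resolution
wave = sum(np.sin(2 * np.pi * f * t) / f for f in (1.0, 3.0, 5.0))
amps = np.abs(np.fft.rfft(wave)) * 2 / len(wave)
freqs = np.fft.rfftfreq(len(wave), 1.0 / fs)
```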
Figure 6.7 explains the operation of a trumpet starting with a frequency 
domain plot of the buzzing input at the mouthpiece (the vertical lines). Let the 
fundamental frequency of figure 6.7 be a train of pulses at 200 Hz. Then higher 
harmonics are generated at integer multiples of this f₀, that is, at 400 Hz, 600
Hz, 800 Hz, and so on. The tube of the trumpet, however, imposes a further filter 
that “envelops” this input. With some specific combination of valves, the tube
of the trumpet takes on a length which resonates optimally at some note at the 
center frequency of the bell curve in figure 6.7. The surrounding frequencies 
resonate less and are attenuated. Most of them are completely filtered out and 
never escape the trumpet. On the other hand, the tube can actually amplify the 
resonant frequencies at the center of the curve. When the trumpeter changes 
the valves on a trumpet, tubes of different lengths are coupled together. These 
successive tubes of differing length are resonant frequency filters which select 
the successive musical notes the trumpeter is playing. 
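The same source-filter idea can be sketched numerically. The 200 Hz pulse
train and the 1.2 kHz center come from figure 6.7; the bell curve's bandwidth
is an assumed value:

```python
import numpy as np

# Source-filter sketch of figure 6.7. The tube passes harmonics near its
# resonance and attenuates the rest; the bandwidth below is assumed.
f0 = 200.0
harmonics = f0 * np.arange(1, 21)         # 200, 400, ..., 4000 Hz
source = 1.0 / np.arange(1, 21)           # amplitudes fall with frequency
center, bw = 1200.0, 300.0                # center from the text; bw assumed
envelope = np.exp(-0.5 * ((harmonics - center) / bw) ** 2)
output = source * envelope                # what escapes the trumpet
```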
The human voice works rather like a trumpet trio. The glottal pulse train 
feeds into three connected tubes of variable length. These resonate at three 
different, variable frequencies and produce the various three-note, chordlike 
sounds, which are called vowels. These three “tubes” can be seen in figure 6.8, 
which shows the articulatory position of the vocal tract for the vowel /i/, as in 
bead. Letters enclosed in brackets and slashes, like [i] and /e/, are letters of
the International Phonetic Alphabet, or IPA, which is useful for distinguish­
ing sounds like the vowels of bad and bade. IPA departs from normal English 
spelling in several respects, most notably in that /i/ represents the sound of 
French si or Chinese bi, but English deed, and /e/ represents the sound of 
French les or German Schnee, but English bade. Letters enclosed in brackets are 
phones, and letters enclosed in slashes are phonemes. Phones are sounds-in-the-
air, and phonemes are sounds-in-the-mind. 
The phoneme /i/ is often described as a high, front vowel, meaning the 
tongue is high and to the front.

Figure 6.7. Idealized trumpet filter centered at 1.2 kHz.

Figure 6.8. Articulatory position of /i/.

This position of the tongue divides the vocal
tract into three subtubes (figure 6.8): the labiodental cavity, between the teeth 
and the lips (III); the oral cavity, in the middle (II); and the pharyngeal cav­
ity at the rear (I). The dental cavity resonates at a frequency that defines the 
highest note, or formant (F₃), of the vocalic “chord.”³ The oral cavity is larger.
When articulating /i/, the oral cavity is bounded at the rear by the tongue
and at the front by the teeth. Together with the labiodental cavity, the oral
cavity forms a longer, larger subtube which resonates at a lower frequency,
the second formant, F₂, of the vowel. The F₁ formant is the lowest in frequency.
It is created by the pharyngeal, oral, and labiodental cavities resonating to­
gether as one large, long tube. 
For comparison, figure 6.9 shows the articulatory position for the vowel 
/u/, as in rude. Here, the tongue is up and back, making the oral cavity bigger. 
The lips are also rounded, making the labiodental cavity bigger. Correspondingly,
F₂ and F₃ are both lower than in /i/. F₁ is low as in /i/ because the overall
length and volume of the vocal tract remain large.

Figure 6.9. Articulatory position for /u/.
Figure 6.10 summarizes the main acoustic effects of the articulatory con­
figurations in figures 6.8 and 6.9, superimposing filter functions (or vowel spec-
tra) on a male glottal spectrum. In figure 6.10a, the spectrum for the vowel 
/i/ displays typical formant peaks at 300, 2200, and 2900 Hz. For this speaker, 
these are the three “notes” of the vocalic “chord” for /i/, and they correspond 
to the resonant frequencies of the three subtubes of the vocal tract in figure 
6.8. Corresponding typical formant values for /u/ of 300, 750, and 2300 Hz
are plotted in figure 6.10b, values corresponding to the three subtubes of the 
vocal tract in figure 6.9. 
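For reference, the formant values quoted above can be tabulated; the grouping
is ours, but the numbers are the ones given for this speaker:

```python
# Typical formant frequencies (Hz) quoted in the text for one male speaker.
# F1 reflects the whole tract resonating; F2 and F3 reflect the smaller
# front cavities, so they move the most between vowels.
FORMANTS = {
    "i": (300, 2200, 2900),   # as in "bead": small front cavities, high F2, F3
    "u": (300, 750, 2300),    # as in "rude": larger cavities, lower F2, F3
}
```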
The preceding three-tube model is adequate for our purposes, but it is also 
a considerable oversimplification. The acoustics of coupled tubes is actually much 
more complex. Retroflex consonants like American /r/ divide the oral cavity 
into two resonant chambers, and lateral semivowels like /l/ define the oral cav­
ity with two apertures, one on each side of the tongue. Nasal speech sounds
resonate in complex nasal and sinus cavities and introduce antiresonances which
cannot be accounted for in our simple model. Because everyone’s vocal tract is
different, there is also endless variation among individual voices.

Figure 6.10. Glottal spectra and vocal tract filters for (a) /i/ and (b) /u/
(fast Fourier transform analyses).
One interesting source of individual variation is illustrated in figure 6.11. 
Males have lower voices than women and children, so a male with a fundamental 
frequency of 100 Hz will have glottal source harmonics at 200, 300, 400 Hz, 
and so on (all vertical lines). However, a female with a fundamental frequency 
of 200 Hz will have only half as many harmonics, at 400, 600, 800 Hz, and so 
on (dotted and solid vertical lines, dotted formant envelope). Since the vocal 
tract can only amplify harmonics that are physically present in the source spec­
trum, the formants of female speech are less well defined than the formants 
of male speech. This becomes especially apparent if the female fundamental 
frequency in figure 6.11 is raised an octave to 400 Hz. Now only the solid-line 
harmonics are present in the source. There is no harmonic at 1400 Hz, and 
the second formant, which the vocal tract locates at that frequency, barely reso­
nates (solid-line formant envelope). This makes one wonder why mommies, 
rather than daddies, teach language to children. We will explain later how 
mommies compensate for this apparent disadvantage. 
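A worked example makes the sampling problem explicit. The 1400 Hz second
formant is the value from figure 6.11; the helper function is hypothetical:

```python
# A formant is audible only through harmonics that land near it. With a
# second formant at 1400 Hz, f0 = 100 or 200 Hz puts a harmonic exactly
# on the formant, but f0 = 400 Hz has neighbors only at 1200 and 1600 Hz,
# so the formant barely resonates.
def nearest_harmonic(f0, formant_hz):
    n = max(1, round(formant_hz / f0))
    return n * f0

for f0 in (100, 200, 400):
    print(f0, nearest_harmonic(f0, 1400))   # 1400, 1400, 1600
```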
A child’s vocal tract is shorter still, so its formants are higher, and this 
compensates somewhat for the child’s higher f₀. But as a result of its shorter
vocal tract, all the child’s vowel formants fall at frequencies different from 
mommy’s formants (not to mention daddy’s and everybody else’s). This “lack 
of invariance” has been an enormous frustration to attempts at speech rec­
ognition by computer, and it makes one wonder that children can learn lan­
guage at all. Later we will also see how the child’s brain normalizes this variance 
to make language learning possible. 
A device known as the “sound spectrograph” can make time-frequency do­
main plots like figures 6.12 and 6.13 directly from speech and expose some of 
this variation. In a standard spectrogram, the x-axis is the time axis. Frequency 
is plotted on the y-axis. Amplitude is plotted “coming at you” on the z-axis, with 
louder sounds darker. 
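A modern digital equivalent of the sound spectrograph can be sketched in a few
lines; the sampling rate, synthetic voice, and window length are assumed values:

```python
import numpy as np
from scipy.signal import spectrogram

# Sketch of a digital narrowband spectrogram (the book's spectrograph was
# analog). A long analysis window gives frequency resolution fine enough
# to resolve individual glottal harmonics as narrow horizontal bands.
fs = 16000                                # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
voice = sum(np.sin(2 * np.pi * n * 100.0 * t) / n for n in range(1, 30))
f, times, Sxx = spectrogram(voice, fs, nperseg=2048)  # long window = narrowband
```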
Figure 6.11. A high voice decreases formant definition.

Figure 6.12. Narrowband spectrograms of (a) /i/ and (b) /u/.

Figure 6.12a is a narrowband spectrogram of the vowel /i/ (as in deed).
The harmonics of the glottal source appear as narrow bands running the length
of the spectrogram. The three characteristic formants of /i/ appear as dark 
bands of harmonics which are resonantly amplified by the vocal tract filter. 
(Sometimes higher fourth and fifth formants appear, as in figure 6.10, but they 
carry little speech information.) Figure 6.12b is a spectrogram of the vowel 
/u/ (as in crude). The phoneme /u/ differs from /i/ in that the second formant 
is much lower, at 800 Hz. 
Aperiodic sounds 
Aperiodic sounds are sounds that do not have a fundamental frequency or 
harmonic structure. For example, figure 6.13 shows a spectrogram of the ape­
riodic sound /s/, as in sassy. All aperiodic sounds in speech can be said to be 
consonants, but not all consonants are aperiodic. For example, nasal sounds 
(e.g., English /m/, /n/, /ŋ/) and certain “semivowels” (e.g., English /l/, /r/,
/w/) are actually periodic, and the argument can be made that such “consonants”
should properly be called vowels. In any event, it follows that “consonant” is
something of a catchall term, and the features of consonants cannot be summarized
as succinctly as we have summarized the articulatory and acoustic structure of
vowels. Unlike vowels, however, many consonants are quick sounds, which makes
them especially interesting examples of the motor control of speech.

Figure 6.13. Spectrogram of /s/.
Motor control of speech 
Figure 6.15 identifies key articulatory points in the linguistic dimension of place.
For example, /b/ and /p/ are articulated at the lips; their place is said to be 
labial. English /d/ and /t/ are said to be alveolar; they are articulated by plac­
ing the tongue tip at the alveolus, the gum ridge, behind the upper incisors. By 
contrast, /g/ is velar. It is produced by placing the back of the tongue against 
the velum, the soft palate at the back of the roof of the mouth. Acoustically, 
these consonants are identified by formant transitions, changes in the following 
(or preceding) vowel’s formants, which disclose how the articulators have 
moved and the vocal tract has changed to produce the consonant.⁴ Figure 6.14
illustrates place acoustically with spectrograms of [bɑb], [dɑd], and [gɑg].
Initially, [b], [d], and [g] are all plosive consonants, which begin with a 
vertical line representing a damped and brief plosive burst. This is most clearly 
visible for [dɑd] in figure 6.14. Bursts are damped and brief because they have
competing, not-very-harmonic “harmonics” at many frequencies. These bursts 
are followed by the formant transitions, which may last only 50 ms. The transi­
tions appear as often-faint slopes leading into or out of the steady-state vowel 
formants. They are traced in white in figure 6.14. The transitions begin (for 
initial consonants) or end (for postvocalic consonants) at frequencies which 
roughly correspond to the consonant’s place of articulation (Halle et al. 1957). 
For example, [b] is a labial consonant, meaning it is articulated at the lips. 
Since the lips are closed when articulation of [bɑb] begins, high frequencies
are especially muted and all formant transitions appear to begin from the
fundamental frequency (figure 6.14).

Figure 6.14. Spectrograms of [bɑb], [dɑd], [gɑg].

The transitions of [d] begin at frequencies
near those for the vowel [i] (figure 6.14), because [i] is a “high, front” vowel, 
with the tongue tip near the alveolar ridge, and [d] is articulated with the tip 
of the tongue at the alveolar ridge. Accordingly, [d] is given the place feature 
alveolar. Similarly, [g] begins its formant transitions near the frequencies for 
the vowel [u] (figure 6.14). The vowel [u] is produced with the body of the 
tongue raised back toward the velum, and [g] begins its articulation with the 
tongue raised back in contact with the velum. Therefore, [g] is classed as a 
velar consonant.

The different ways sounds can be articulated at any given place are described
by the articulatory feature of manner. Thus, before vowels, [d] and [g] are called
plosive consonants. They display their common manner of articulation in fig­
ure 6.14 by beginning with an explosive burst. But after vowels, their manner 
of articulation is somewhat different and they are called stops. (In older litera­
ture the distinction between stops and plosives was not appreciated, and both 
were often simply called stops.) 
The fricative consonant [s] results from an intermediate gesture. The 
tongue is placed too close to the alveolus to produce [i] but not close enough 
for a stop. Instead, the tongue produces friction against the outgoing airstream. 
This friction causes a turbulent airflow and the aperiodic sound [s], which is 
classed as +fricative. In spectrograms, [s] can exhibit formant transitions from 
and to its surrounding vowels, but its aperiodic sound and fricative manner 
produce a more distinctive and salient band of wideband “noise” (figure 6.13). 
Fricative sounds articulated further forward in the mouth with a smaller frontal
cavity, like [f] and [θ] (as in thin), generate a higher frequency noise spectrum.
Sounds generated further back in the mouth with a larger frontal cavity,
like the alveopalatal [ʃ] in shoot, generate a lower frequency noise spectrum.
As the preceding examples should illustrate, very delicate movements of 
tongue, lips, jaw, velum, and glottis must be coordinated to produce any single 
speech sound. In fluent speech, these movements must also be quick enough to 
produce as many as twenty distinct phones per second, and at such high rates 
of speed, articulation can become imprecise. Fortunately, the vocal tract has 
optimal locations, certain places of articulation, like the alveolus and velum 
(figure 6.15), that are highly fault-tolerant. At these locations, articulation can 
be maximally sloppy, yet minimally distort the resulting speech sound. It is not 
surprising that sounds produced at these “quantal” locations, including the “car­
dinal vowels” [i], [a], [u], occur in virtually all human languages (Stevens 1972). 
This quantal advantage is leveraged by the upright posture of Homo sapiens 
(Lieberman 1968). Animals that walk on all fours, apes, and even slouched Ne­
anderthals all have short, relatively straight vocal tracts. When humans started 
walking erect, the vocal tract lengthened and bent below the velum, creating 
separate pharyngeal and oral cavities. This added an additional formant to 
human calls, essentially doubling the number of speech sounds humans could 
produce. The already highly innervated and developed nervous system for eat­
ing was then recruited to help control this expanded inventory, and human 
speech evolved. 

Figure 6.15. Optimal places of articulation.
Hearing 
The ear 
The human ear (figure 6.16) can be divided into three parts: the outer ear, 
including the ear canal, or meatus; the middle ear; and the inner ear. The middle 
ear is composed of three small bones: the “hammer” (malleus), the “anvil” 
(incus), and the “stirrup” (stapes). The inner ear is composed of the vestibu­
lar system, which senses up and down, and the cochlea, the primary organ for 
hearing. 
As we saw in figure 6.10, vowel formants are mostly between 300 and 3000 
Hz. Most information is conveyed around 1800 Hz, a typical mean frequency 
for F₂. Having described the speech system as a kind of trumpet, we can now
see the outer ear to be a kind of ear trumpet (figure 6.16). This outer-ear as­
sembly has evolved so that it has a tuning curve (figure 6.17) centered on a 
resonant frequency of 1800 Hz. Natural selection has tuned the outer ear to 
speech (or at least to the range of human “calls”: ape ears are very similar even 
though we might not wish to say that apes “speak”).

Figure 6.16. The ear.

Figure 6.17. The tuning curve of the outer ear.

One reason the ear must be carefully tuned to speech is that sound waves are
feeble. The sound pressure of whispered speech might be only 0.002 dynes/cm².
The ear must transform this weak physical signal into something palpable and
sensible. What is
equally remarkable is that the ear can also respond to sound pressure levels 
up to 140 dB, a 10,000,000-fold increase over the minimum audible sound. 
Sound waves cause the eardrum to vibrate. In turn, the three small bones 
of the middle ear act as levers which amplify and transmit these vibrations to 
the oval window of the cochlea. More important, the oval window is much 
smaller than the eardrum, further amplifying the pressure delivered to the 
cochlea. With this middle-ear assembly, we are able to hear sounds 1000 times 
fainter than we could otherwise. (It also contains small muscles which protect 
us from sounds that are too loud.) 
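The amplification can be estimated with typical textbook values for the area
ratio and lever advantage; the text itself gives only the qualitative account:

```python
# Hedged arithmetic with typical textbook values: pressure gain comes mostly
# from focusing the eardrum's force onto the much smaller oval window, plus
# a small lever advantage from the middle-ear bones.
eardrum_area = 55.0        # mm^2, assumed typical value
oval_window_area = 3.2     # mm^2, assumed typical value
lever_ratio = 1.3          # assumed ossicular lever advantage
gain = (eardrum_area / oval_window_area) * lever_ratio
print(round(gain, 1))      # ~22x in pressure, roughly 20*log10(22) = 27 dB
```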
The oval window is a membrane “opening into” the cochlea. The cochlea 
is an elongated cone, coiled up like a snail shell (and so called cochlea, from 
the Greek kochlias, “snail”). In 1961, Georg von Békésy received the Nobel Prize 
for describing how these vibrations from the eardrum are ultimately sensed by 
the cochlea (Békésy 1960). The cochlea is fluid-filled, and the vibrations the 
middle ear transmits to this fluid are transferred to the basilar membrane, which 
runs inside the cochlea from the oval window to its tip. As the basilar mem­
brane vibrates, it produces a shearing action against hair cells, which are the 
end-organ cells of the auditory nervous system. Each hair cell emits a nerve 
signal (a spike) when its hair is stimulated by the shearing action of the basilar 
membrane. 
Figure 6.18 schematically uncoils the cochlea and exposes the basilar 
membrane. We can envision the hair cells arrayed along this membrane like 
the keys of a mirror-image piano, high notes to the left. (Unlike a piano, the 
broad end of whose soundboard resonates to low frequencies, the membrane’s 
thin, flexible tip resonates to low frequencies, while the more rigid broad end 
resonates to high frequencies.) When the spectrum of a vowel like [a] stimu­
lates the basilar membrane, it sounds like a chord played on the hair cells. 
Signals from the activated hair cells then propagate along the auditory nerve
from the cochlea along the cochlear nerve to the cochlear nucleus, the infe­
rior colliculus, the medial geniculate body of the thalamus, and the transverse 
temporal gyrus of the cerebrum (koniocortex), as illustrated in figure 6.19. 
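The membrane's place-frequency mapping is often summarized by Greenwood's
function, which is not given in the text but fits the piano image well; the
constants below are the published values for the human cochlea:

```python
# Hedged sketch using Greenwood's place-frequency function for the human
# cochlea: f = A * (10**(a * x) - k), with x the fractional distance from
# the apex (the thin, low-note tip) to the rigid, high-note base.
def greenwood_hz(x, A=165.4, a=2.1, k=0.88):
    """Characteristic frequency (Hz) at relative basilar-membrane position x."""
    return A * (10 ** (a * x) - k)

print(round(greenwood_hz(0.0)))   # apex: ~20 Hz
print(round(greenwood_hz(1.0)))   # base: ~20,700 Hz
```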
