Table 1. A tabulation of the speaker recognition tasks required by various applications: security applications (physical entry, database access, telephone transactions), reconnaissance, and forensic applications.
Note that almost all applications are classified as verification tasks in Table 1. Even the reconnaissance task can be viewed as a set of verification decisions. It is difficult for me to visualize a real operational application of speaker identification. Yet the identification task formulation remains popular in laboratory evaluations.
Inherent Factors which Limit the Recognizability of Speakers
What is it about the speech signal that conveys information about the speaker's identity? There are, of course, many different sources of speaker-identifying information, including high-level information such as dialect, subject matter or context, and style of speech (including lexical and syntactical patterns of usage). This high-level information is certainly valuable as an aid to recognition of speakers by human listeners, but it has not been used in automatic recognition systems because of practical difficulties in acquiring and using such information. Rather, automatic techniques focus on "low-level" acoustical features. These low-level features include such characteristics of the speech signal as spectral amplitudes, voice pitch frequency, formant frequencies and bandwidths, and characteristic voicing aperiodicities. (See [22] for a good review of such features.) These variables may be measured as functions of time, or their long-term average statistics may be used as recognition variables. But the real question, the essence of the problem, is this: How stable are these speaker-discriminating features? Given a speech signal, is the identity of the speaker uniquely decodable?
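Before turning to that question, the distinction just drawn between time-varying measurements and long-term average statistics can be made concrete with a minimal sketch in Python with NumPy. The inputs (a matrix of short-time log-spectral amplitudes and a pitch track) and the particular statistics chosen are hypothetical examples for illustration, not features prescribed by the paper or by any specific system.

```python
import numpy as np

def long_term_features(frame_spectra_db, pitch_hz):
    """Collapse frame-by-frame measurements into a single long-term feature vector.

    Hypothetical inputs for illustration only:
      frame_spectra_db -- 2-D array, one row per analysis frame, one column per
                          frequency bin (short-time log-spectral amplitudes)
      pitch_hz         -- 1-D array of pitch estimates for the voiced frames
    """
    lta_spectrum = np.mean(frame_spectra_db, axis=0)        # long-term average spectrum
    pitch_stats = [np.mean(pitch_hz), np.std(pitch_hz)]     # pitch statistics
    # Time-varying detail is discarded; what remains can be compared across
    # utterances regardless of what text was spoken.
    return np.concatenate([lta_spectrum, pitch_stats])
```

Averaging of this kind trades away the timing detail that a text-dependent system could exploit, in exchange for features that can be compared across utterances of different text.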
The fact is that the speech signal is a complex function of the speaker and his environment. It is an acoustic signal generated by the speaker, and it does not convey detailed anatomical information, at least not in any explicit manner. This distinguishes voice recognition from fingerprint identification, since fingerprint recognition (along with recognition based on other physical attributes such as hand geometry and retinal patterns) uses fixed, static, physical characteristics, while speaker recognition (along with signature recognition) uses dynamic "performance" features that depend upon an act.
Thus there exist inherent limitations in performance which are attributable to the nature of the speech signal and its relationship to the signal generator (the speaker). To appreciate these limits we must understand the source of speaker-discriminating information and how it is encoded in the speech signal. The speech signal, being a consequence of articulation, is determined by the vocal apparatus and its neural control. Thus there are two possible sources of speaker information; namely, the physical and structural characteristics of the vocal tract and the controlling information from the brain and articulatory musculature. This information is imparted to the speech signal during articulation along with all the other information sources. These other sources include not only the linguistic message but also the speech effort level (loud, soft), emotional state (e.g., anger, fear, urgency), health, age, and so on.
The characteristics of the speech signal are determined primarily by the linguistic message, via control of the vocal tract musculature and the resulting articulation of the vocal cords, jaw, tongue, lips, and velum (which controls coupling to the nasal cavity). This articulation, in turn, produces the speech signal as a complex function of the articulatory parameters. The secondary speech messages, including speaker discriminants, are encoded as nonlinguistic articulatory variations of the basic linguistic message. Thus the information useful for identifying the speaker is carried indirectly in the speech signal, a side effect of the articulatory process, and the speaker information may be viewed as "noise" applied to the basic linguistic message. The problem for speaker recognition, then, is that there are no known speech features or feature transformations which are dedicated solely to carrying speaker-discriminating information, and further that the speaker-discriminating information is a second-order effect in the speech features.
The fact is, however, that different individuals typically exhibit speech signal characteristics that are strikingly individualistic. We know that people sound different from each other, but the differences become visually apparent when comparing spectrograms from different individuals. The spectrogram is by far the most popular and generally informative tool available for phonetic analysis of speech signals. The spectrogram is a running display of the spectral amplitude of a short-time spectrum as a function of frequency and time. The amplitude is only rather crudely plotted as the level of darkness, but the resonant frequencies of the vocal tract are usually clearly represented in the spectrogram. Fig. 2 demonstrates the degree of difference between spectrograms of five different men saying "Berlin Forest." Note the differences between the individual renditions of this linguistic message. Segment durations, formant frequencies and formant frequency transitions, pitch and pitch dynamics, and formant amplitudes all exhibit gross differences from speaker to speaker. Thus these speakers would be very easy to discriminate by visual inspection of their spectrograms. This is especially impressive in view of the fact that two of the speakers illustrated in Fig. 2 are identical twins who sound quite similar to each other. Yet their spectrograms look very different.
But there are problems with this appealing notion of spectrographic differences. The primary difficulty lies not with the similarity between different speakers. Speakers usually sound very different from each other, and, in fact, the spectrograms in Fig. 2 show large differences between speakers. The real problem is that a single speaker also often sounds (and looks, spectrographically) very different from time to time. The problem is not so much telling people apart. Rather, the problem is that people sometimes are just not themselves! We call this phenomenon "intraspeaker variability." This is illustrated in Fig. 3, which displays the spectrograms of five different tokens (of the same words, "Berlin Forest") for one single speaker. (This speaker is distinct from those of Fig. 2.) Note that the differences between spectrogram (a) and spectrogram (b) are quite small, but that for other conditions the differences between spectrograms are quite marked. Even under the same experimental conditions, a significant difference between spectrograms may be observed for speech data collected at different times. Differences attributable to significant changes in either the transmission path (a carbon button microphone substituted for the linear dynamic microphone) or speaker variation (sotto voce or shouted speech) may become so large as to render any speaker recognition decision completely unreliable.

Fig. 2. This figure exhibits five different spectrograms, one each from five different men. The spectrogram is a display of the amplitude of a speech signal as a function of frequency and time. The spectral amplitude is computed as a short-time Fourier transform of the speech signal and is plotted on the z-axis, with greater spectral amplitude being depicted as a darker marking. The running window used in the spectral analysis is 6 ms long and the signal is weighted by a Hamming window. The abscissa is the time axis, with 1 s being displayed in this figure. The ordinate is the frequency axis, which spans the range from 0 Hz to 4 kHz in this figure. Although the spectrogram represents amplitude only imprecisely, a great deal of phonetic information may be decoded from the spectrogram by an expert phonetician. Indeed, many studies have shown that the energy loci in frequency and time are perceptually more important than exact calibration of the spectral amplitudes. (These energy loci are usually referred to as "formant" frequencies by phoneticians.) The utterance spoken in each of the five speech spectrograms displayed in this figure is "Berlin Forest." Note that although there are general similarities in the spectrograms as dictated by the linguistic message, the speaker differences are striking. So striking, in fact, that one might be tempted to question the sameness of the phonetic transcription. Two of the speakers represented in this figure are identical twins, and their voices do sound quite similar. These are speakers (b) and (c). Even for these twins, the spectrograms are unquestionably different.

Fig. 3. This figure exhibits five different speech spectrograms, all from the same speaker, one who is not represented in Fig. 2. The utterance spoken in these spectrograms is the same as in Fig. 2. Note that the variation in spectrograms produced by the same person can be extremely great, even greater than the differences between speakers seen in Fig. 2. Examples (a) and (b) are for speech under nominal conditions, but taken from two different recording sessions. There are some significant amplitude differences, particularly above 2 kHz, but the formant frequencies remain nearly identical in frequency and time. Example (c) is for speech collected through a carbon button microphone. Notice that the nonlinearities of this microphone create significant spectral distortions and that the weaker formant frequencies above 2 kHz are largely obscured. Example (d) is for very softly spoken speech (about 20 dB below a normal comfortable level). Notice several changes: First, the spectrum falls off more rapidly with frequency, with most of the signal energy appearing below 1 kHz. Second, the spectrum appears "noisy," which is largely attributable to irregular voiced excitation. These changes also make the formant frequencies much less distinct. Example (e) is for very loudly spoken speech (about 20 dB above a normal comfortable level). The spectrogram of this speech signal bears little resemblance to that in (a) (at least relative to the expected between-speaker differences exhibited in Fig. 2), with a much higher pitch frequency and with relatively more energy at the higher frequencies. Indeed, the two speech signals do not sound much like the same person, either.
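To make the short-time analysis described in these captions concrete, here is a minimal sketch, in Python with NumPy, of a wideband spectrogram computation. The 6 ms Hamming window follows the figure captions; the function name, the 8 kHz sampling rate (giving a 0 Hz to 4 kHz band), the 1 ms hop, and the 512-point FFT are assumptions made for this example, not values taken from the paper.

```python
import numpy as np

def wideband_spectrogram(signal, fs=8000, win_ms=6.0, hop_ms=1.0, nfft=512):
    """Short-time spectral amplitude (in dB) as a function of time and frequency.

    Illustrative sketch: the 6 ms Hamming window follows the figure captions;
    the 8 kHz sampling rate, 1 ms hop, and 512-point FFT are assumptions.
    """
    win_len = int(round(win_ms * 1e-3 * fs))           # 48 samples at 8 kHz
    hop = int(round(hop_ms * 1e-3 * fs))               # frame advance in samples
    window = np.hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len, hop):
        frame = signal[start:start + win_len] * window
        magnitude = np.abs(np.fft.rfft(frame, nfft))    # short-time Fourier transform
        frames.append(20.0 * np.log10(magnitude + 1e-10))
    # Rows are time frames; columns are frequency bins from 0 Hz to fs/2.
    # Plotting larger values as darker markings reproduces the conventional display.
    return np.array(frames)
```

The short analysis window gives an analysis bandwidth of a few hundred hertz, broad enough to smear the individual pitch harmonics so that the formant resonances appear as the dark bands discussed above.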
The Acoustical Bases for Speaker Recognition
Before addressing the various speaker recognition technologies and systems, it is appropriate to review the acoustical bases for speaker recognition. An excellent exposition on this subject is presented in a book by Francis Nolan [36]; see also [12]. Note that from introspection of our own personal experience we can appreciate a wide variety of nonacoustic cues that may be used unconsciously to aid the identification of a speaker. Although we usually think of speaker information as comprising acoustical attributes and related perceptual features such as "breathiness" or "nasality," there are other, perhaps more useful, characteristics that we as listeners use all the time. Such sources include speaker dialect, style of speech (for example, energetic, sexy, sarcastic, witty), and particular unique verbal mannerisms (for example, use of particular words and idioms, or a particular kind of laugh). Our interest here, however, is principally in the low-level acoustical characteristics of the speech signal which support speaker recognition, rather than the higher level features mentioned above. Lack of interest in these higher level sources of information stems from a variety of issues, including the need for extensive training (personal knowledge of the speaker), difficulty in quantifying these features, and the resultant difficulty in controlling and evaluating performance as a function of this information.
One seemingly reasonable way to approach the task of determining speaker-characterizing acoustical features is to determine the perceptual bases for speaker recognition and then to determine the acoustical correlates of these perceptual features [5], [10]. Unfortunately, this method has not been productive for a variety of reasons. First, it is difficult for listeners to analyze their discriminatory powers and describe quantitatively the important speaker-discriminating features. Voiers [5] approached this problem by creating a list of 49 candidate differential factors, asking listeners to classify speakers according to these factors, then performing a factor analysis to determine the important speaker-discriminating features. Second, knowledge of these perceptual bases provides precious little insight toward the determination of productive acoustical features. In the Voiers study, the most significant speaker-discriminating features were "clarity," "roughness," "magnitude," and "animation," none of which can be related to acoustical parameters of the speech signal in a quantitative way. Finally, the direct use of these speaker-discriminating perceptual factors for speaker recognition is not very effective when compared with judgements made by actually listening to the speech data [10]. Thus the value of these specific perceptual features for speaker recognition is questionable.
Another approach to determining the potential for discrimination between speakers is to examine and statistically characterize the inventory of acoustical measurements of the speech signal as a function of speaker. One notable study of this sort attempted to assess discrimination potential as a function of phonetic class after first manually locating speech events within utterances [13]. Useful measures for discriminating among speakers were found to include voice-pitch frequency, the amplitude spectra of vowels and nasals, slope of the glottal source spectrum, word duration, and voice onset time. Of these, the voice pitch frequency exhibited the greatest speaker discrimination, as measured by the F-ratio of between- to within-speaker variance. Unfortunately, voice pitch is rather susceptible to change over time and it covaries strongly with factors such as speech effort level and emotional state.
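As an illustration of the F-ratio criterion mentioned above, the following Python sketch computes the ratio of between-speaker to within-speaker variance for a single scalar feature. The function name and the simple unweighted averaging over speakers are assumptions of this sketch; the study in [13] may have weighted speakers and samples differently.

```python
import numpy as np

def f_ratio(samples_by_speaker):
    """Ratio of between-speaker to within-speaker variance for one feature.

    samples_by_speaker: list of 1-D arrays, one per speaker, each holding
    repeated measurements of the feature (e.g., voice-pitch estimates from
    several recording sessions). Unweighted averaging is an assumption of
    this sketch, not necessarily the computation used in the cited study.
    """
    groups = [np.asarray(g, dtype=float) for g in samples_by_speaker]
    speaker_means = np.array([g.mean() for g in groups])
    grand_mean = np.concatenate(groups).mean()
    between = np.mean((speaker_means - grand_mean) ** 2)   # variance of speaker means
    within = np.mean([g.var() for g in groups])            # mean within-speaker variance
    return between / within
```

A feature with a large F-ratio, such as voice pitch in the study cited, separates speakers well on average; the ratio by itself, however, says nothing about how the feature drifts with effort level, emotional state, or the passage of time.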
Subjective Recognition of Speakers
Speaker recognition performance now becomes the major issue of this paper, including the level of performance that may be achieved under various experimental paradigms and conditions. First the subjective performance of humans is reviewed, followed by computer techniques. The subjective performance of humans usually relates to how reliably one may identify a person by listening to his voice. In addition, however, a great deal of interest has arisen during the past two decades regarding the visual identification of voice spectrograms for use in the courtroom.