Probably the most significant paper on speaker recognition, judged by the amount of further research it has stimulated, is Kersta's paper introducing the spectrogram as a means of personal identification [3]. The term "voiceprint" was introduced in this paper, which reported 99-percent correct identification based upon visual comparison of these voiceprints (spectrograms) in an identification task using 12 reference speakers.
The use of the term "voiceprint" has probably contributed to the popularity of voiceprint identification through its analogy to the term "fingerprint." In fact, Kersta himself, in his 1962 paper, disingenuously states: "Closely analogous to fingerprint identification, which uses the unique features found in people's fingerprints, voiceprint identification uses the unique features found in their utterances." Of course, as we have seen, the spectrogram is a function of the speech signal, not of the physical anatomy of the speaker, and it depends far more upon what the speaker does than upon what he "is." What the speaker does, in turn, is an indescribably complex function of many factors.
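As a concrete illustration, a "voiceprint" is nothing more than a short-time spectral analysis of the waveform. The sketch below is a minimal example, not Kersta's procedure; the file name and analysis parameters are assumptions chosen only to show the kind of wide-band spectrogram an examiner would compare visually.

```python
# Illustrative sketch, not Kersta's procedure: computing the wide-band
# spectrogram that a "voiceprint" examiner would inspect visually.
# The file name and analysis parameters below are assumptions for the example.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, speech = wavfile.read("utterance.wav")   # hypothetical input recording
speech = speech.astype(np.float64)

# Wide-band analysis: a short (~3-ms) window smears individual harmonics and
# emphasizes formant structure, as in classical spectrogram reading.
win_len = int(0.003 * fs)
freqs, times, power = spectrogram(speech, fs=fs, window="hamming",
                                  nperseg=win_len, noverlap=win_len // 2)

# The log-magnitude array (frequency x time) is the "voiceprint" pattern.
voiceprint = 10.0 * np.log10(power + 1e-12)
print(voiceprint.shape)
```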
Unfortunately, the good performance reported in Kersta's paper has not been observed in subsequent evaluations simulating real-life conditions. In the largest evaluation of voiceprints ever conducted, under the direction of Professor Oscar Tosi at Michigan State University [14], 0.5-percent identification error was achieved using voiceprints for nine clue words under the restrictive conditions of isolated word utterances, closed trials, and contemporary speech. That is, the unknown speech tokens were spoken in isolation, the unknown speaker was known to be represented in the set of reference speakers, the unknown speech tokens were collected during the same data collection session as the reference tokens, and identification decisions were based upon comparison of spectrograms for all nine clue words. Unfortunately, these conditions do not fit any type of realistic voice identification scenario, particularly any type of forensic model, which is the major application of voiceprint identification. In an attempt to assess the performance of voiceprint identification in a more realistic environment, Tosi discovered that under conditions of nonisolated words, open trials, and noncontemporary speech the identification error rate escalated to 18 percent. Curiously, Tosi concludes that the voiceprint identification technique "could yield a negligible error" provided that more knowledgeable examiners are selected and that they make decisions only when they are "absolutely certain" [16]. (Almost two thirds of the false identifications were judged as "uncertain.")
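To make the distinction between closed and open trials concrete, the sketch below (an illustration using assumed similarity scores and an assumed threshold, not Tosi's protocol) contrasts the two decision rules: a closed trial must name one of the reference speakers, whereas an open trial may also reject the unknown speaker as absent from the reference set.

```python
# Minimal sketch contrasting closed-trial and open-trial decision rules.
# The similarity scores and threshold are hypothetical; this is not Tosi's protocol.
from typing import Dict, Optional

def closed_trial(scores: Dict[str, float]) -> str:
    """Closed trial: the unknown speaker is known to be among the references,
    so the decision is simply the best-matching reference speaker."""
    return max(scores, key=scores.get)

def open_trial(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Open trial: the unknown speaker may be absent from the reference set,
    so the best match is accepted only if its score clears a threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None   # None = "none of the above"

similarities = {"spk1": 0.62, "spk2": 0.87, "spk3": 0.55}   # hypothetical scores
print(closed_trial(similarities))       # -> spk2
print(open_trial(similarities, 0.90))   # -> None (rejected as out of set)
```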
Speech scientists have tended to be critical of the voiceprint speaker recognition technique for a number of reasons [11]: the technique is not a well-defined objective procedure (art rather than science); the identification performance is strongly influenced by specific conditions which, without an underlying model, cannot be adequately forecast; and various evaluations of identification accuracy have been equivocal. A scientific committee sponsored by the National Research Council subsequently concluded that the experimental conditions covered by available evaluations do not constitute an adequate basis for making judgements of the reliability of voice identification in forensic applications [29]. For example, the speaker's emotional state, which can have dramatic impact upon the speech signal [17] and which is often an active element in the forensic model, has not been included in experimental conditions for controlled evaluations of voiceprint identification. In fact, there is evidence that the use of voiceprint identification has been extended far beyond its domain of usefulness in voice identification in actual criminal trials [20]. Thus, without an objective means of calibration, the use of the voiceprint technique is dangerously susceptible to misuse.
The recognition reliability of voiceprints, relative to that of a listener's judgement, is also an important consideration in the use of voiceprints (and in weighing voiceprint evidence in the courtroom). In previous studies comparing the performance of voiceprint identification with aural speaker discrimination by human listeners, the error rates for aural discrimination have always been smaller [6], [8], [10]. In the 1968 study by Stevens, for example, a closed-set identification test using a homogeneous group of eight reference speakers yielded 6-percent error for listening and 21-percent error for voiceprints. (In addition, the best-performing voiceprint examiner still exhibited a higher error rate than the poorest-performing listener.)
Thus the reliability of the voiceprint technique for speaker identification is clearly fragile, because identification performance is sensitive to many acoustic, environmental, and speaker conditions. Furthermore, the use of the voiceprint technique is highly questionable, because better performance can likely be obtained through a listener's judgement. This raises an important perspective on the development and evaluation of speaker recognition technology in general; namely, the comparative performance of a putative technique with respect to some generally accepted benchmark. Such a performance comparison seems to be a valuable step toward calibrating the absolute performance of any speaker recognition technique, be it a subjective one such as voiceprint examination or an objective one using computer-based speaker recognition algorithms.
Computer Recognition of Speakers
Let us now turn our attention from speaker recognition by listeners to speaker recognition by computers. This technology has been an active research area for over twenty years, but with limited application success to date. Two excellent reviews of computer recognition of speakers were published in a previous special issue of these Proceedings [22], [23] and serve as an appropriate starting point for this review. In particular, Atal presents a general and reasonably comprehensive view of the selection of speech parameters for speaker recognition. This paper discusses computer recognition of speakers in the context of two general application areas: text-independent speaker recognition, in which the speaker is noncooperative, and text-dependent speaker recognition, in which the speaker is cooperative and interacts directly with the computer.
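To make the computational framing concrete, the following minimal sketch casts closed-set speaker identification as matching features of an unknown utterance against stored reference templates. The long-term average spectrum features, the Euclidean distance, and the synthetic test signals are illustrative assumptions only, not a specific published system or the parameter choices discussed by Atal.

```python
# Minimal closed-set speaker identification sketch (illustrative assumptions:
# long-term average spectrum features and Euclidean distance; not a specific
# published system).
import numpy as np
from scipy.signal import spectrogram

def long_term_spectrum(speech: np.ndarray, fs: int) -> np.ndarray:
    """Average log power spectrum over the utterance: a crude, text-independent
    characterization of the speaker."""
    _, _, power = spectrogram(speech, fs=fs, nperseg=256, noverlap=128)
    return np.log(power + 1e-12).mean(axis=1)

def identify(unknown: np.ndarray, references: dict, fs: int) -> str:
    """Closed-set decision: pick the reference speaker whose template is
    nearest (Euclidean distance) to the unknown utterance's features."""
    x = long_term_spectrum(unknown, fs)
    return min(references,
               key=lambda spk: np.linalg.norm(x - long_term_spectrum(references[spk], fs)))

# Hypothetical usage with synthetic signals standing in for real speech:
fs = 8000
rng = np.random.default_rng(0)
refs = {"spk_a": rng.standard_normal(fs), "spk_b": rng.standard_normal(fs)}
print(identify(refs["spk_a"] + 0.1 * rng.standard_normal(fs), refs, fs))
```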