Speaker Recognition-Identifying People by Their Voices
George R. Doddington, Member, IEEE




Visual Identification of Speakers from Spectrograms

Probably the most significant paper on speaker recognition, as judged by the amount of further research it has stimulated, was a paper by Kersta introducing the spectrogram as a means of personal identification [3]. The term "voiceprint" was introduced in this paper, and 99-percent correct identification performance based upon visual comparison of these voiceprints (spectrograms) was reported in a voiceprint identification task using 12 reference speakers.
The use of the term "voiceprint" has probably contributed to the popularity of voiceprint identification by analogy to the term "fingerprint." In fact, Kersta himself, in his 1962 paper, disingenuously states: "Closely analogous to fingerprint identification, which uses the unique features found in people's fingerprints, voiceprint identification uses the unique features found in their utterances." Of course, as we have seen, the spectrogram is a function of the speech signal, not of the physical anatomy of the speaker, and it depends far more upon what the speaker does than upon what he "is." What the speaker does, in turn, is an indescribably complex function of many factors.
Unfortunately, the good performance reported in Kersta's paper has not been observed in subsequent evaluations simulating real-life conditions. In the largest evaluation of voiceprints ever conducted, under the direction of Professor Oscar Tosi at Michigan State University [14], a 0.5-percent identification error was achieved using voiceprints for nine clue words under the restrictive conditions of isolated word utterances, closed trials, and contemporary speech. That is, the unknown speech tokens were spoken in isolation, the unknown speaker was known to be represented in the set of reference speakers, the unknown speech tokens were collected during the same data collection session as the reference tokens, and identification decisions were based upon comparison of spectrograms for all nine clue words. Unfortunately, these conditions do not fit any type of realistic voice identification scenario, particularly any type of forensic model, which is the major application of voiceprint identification. In an attempt to assess the performance of voiceprint identification in a more realistic environment, Tosi discovered that under conditions of nonisolated words, open trials, and noncontemporary speech the identification error rate escalated to 18 percent. Curiously, Tosi concludes that the voiceprint voice identification technique "could yield a negligible error" provided that more knowledgeable examiners are selected and that they make decisions only when they are "absolutely certain" [16]. (Almost two thirds of the false identifications were judged as "uncertain.")
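The distinction between closed and open trials can be made concrete with a small scoring sketch. The code below is only an illustration under stated assumptions: the similarity scores, the speaker labels, the rejection threshold, and the function names `identify` and `error_rate` are all hypothetical and are not drawn from the Tosi study or any other evaluation. In a closed trial the decision must name one of the reference speakers; in an open trial the unknown speaker may be absent from the reference set, so the decision may also be "none of them."

```python
from typing import Dict, List, Optional, Tuple

Trial = Tuple[Dict[str, float], Optional[str]]


def identify(scores: Dict[str, float], threshold: Optional[float] = None) -> Optional[str]:
    """Return the best-matching reference speaker, or None to reject.

    scores: hypothetical similarity of the unknown token to each reference
    speaker (higher means more similar).
    threshold: None models a closed trial (a speaker must be chosen);
    a number models an open trial (a weak best match is rejected).
    """
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None
    return best


def error_rate(trials: List[Trial], threshold: Optional[float] = None) -> float:
    """Fraction of trials whose decision differs from the true answer.

    Each trial is (scores, truth); truth is None when the unknown speaker
    is not among the references, which can occur only in open trials.
    """
    errors = sum(identify(scores, threshold) != truth for scores, truth in trials)
    return errors / len(trials)


# Made-up illustrative data, not taken from any study.
trials: List[Trial] = [
    ({"A": 0.9, "B": 0.2, "C": 0.1}, "A"),   # clear match
    ({"A": 0.4, "B": 0.5, "C": 0.3}, "B"),   # weak but correct match
    ({"A": 0.3, "B": 0.6, "C": 0.2}, None),  # speaker not in the reference set
    ({"A": 0.7, "B": 0.8, "C": 0.1}, "A"),   # confusable pair
]

# Closed trials: only unknowns that are guaranteed to be in the reference set.
closed_trials = [t for t in trials if t[1] is not None]

closed_error = error_rate(closed_trials)         # 1 error in 3 trials
open_error = error_rate(trials, threshold=0.55)  # 3 errors in 4 trials
print(f"closed: {closed_error:.2f}, open: {open_error:.2f}")
```

With this toy data the open-trial error is higher for two reasons: the absent speaker is falsely identified, and the threshold that is needed to reject absent speakers also rejects a weak but correct match. The numbers themselves carry no empirical meaning.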
Speech scientists have tended to be critical of the voiceprint speaker recognition technique for a number of reasons [11]: the technique is not a well-defined objective procedure (art rather than science), the identification performance is strongly influenced by specific conditions which, without an underlying model, cannot be adequately forecast, and various evaluations of identification accuracy have been equivocal. A scientific committee sponsored by the National Research Council subsequently concluded that the experimental conditions covered by available evaluations do not constitute an adequate basis for making judgements of the reliability of voice identification in forensic applications [29]. For example, the speaker's emotional state, which can have dramatic impact upon the speech signal [17] and which is often an active element in the forensic model, has not been included in experimental conditions for controlled evaluations of voiceprint identification. In fact, there is evidence that the use of voiceprint identification has been extended far beyond its domain of usefulness in voice identification in actual criminal trials [20]. Thus, without an objective means of calibration, the use of the voiceprint technique is dangerously susceptible to misuse.
The recognition reliability of voiceprints, relative to the reliability of a listener's judgement, is also an important consideration in the use of voiceprints (and in weighing voiceprint evidence in the courtroom). In previous studies comparing the performance of voiceprint identification with aural speaker discrimination by human listeners, the error rates for aural discrimination have always been smaller [6], [8], [10]. In the 1968 study by Stevens, for example, a closed-set identification test using a homogeneous group of eight reference speakers yielded 6-percent error for listening and 21-percent error for voiceprints. (In addition, the error rate for the voiceprint examiner with the best performance was still higher than that for the listener with the poorest listening performance.)
Thus the reliability of the voiceprint technique for speaker identification is clearly a fragile issue, because identification performance is sensitive to many acoustic, environmental, and speaker conditions. Furthermore, the use of the voiceprint technique is highly questionable, because better performance can likely be obtained through a listener's judgement. This brings up an important perspective on the development and evaluation of speaker recognition technology in general; namely, the comparative performance of a putative technique with respect to some generally accepted benchmark. Such a performance comparison seems to be a valuable step toward calibration of the absolute performance of any speaker recognition technique, be it a subjective one such as voiceprint examination or an objective one using computer-based speaker recognition algorithms.

Computer Recognition of Speakers

Let us now turn our attention from speaker recognition by listeners to speaker recognition by computers. This technology has been an active research area for over twenty years, but with limited application success to date. Two excellent reviews of computer recognition of speakers were published in a previous special issue of these Proceedings [22], [23], and serve as an appropriate starting point for this review. In particular, Atal presents a general and reasonably comprehensive view of the selection of speech parameters for speaker recognition. This paper will discuss computer recognition of speakers in the context of two general application areas. These areas of interest are text-independent speaker recognition, in which the speaker is noncooperative, and text-dependent speaker recognition, in which the speaker is cooperative and interacts directly with the computer.

