Table 2  Verification Phrase Construction for the TI Operational Voice Verification System

GOOD   | BEN   | SWAM   | NEAR
PROUD  | BRUCE | CALLED | HARD
STRONG | JEAN  | SERVED | HIGH
YOUNG  | JOYCE | CAME   | NORTH
the utterance in the same way as it is prompted. We are concerned about maintaining highly consistent and reliable user speech input and relatively unconcerned about possible help to the impostor through this voice prompt template.)
Each reference word is represented by a template of six frames in which the reference frames are spaced 20 ms apart. Each of the 14 normalized filter amplitudes is represented as a 3-bit number, but the reference data are stored as 8-bit numbers, with 5 fractional bits allocated to accommodate adaptive updating of the reference templates. Thus the basic reference speech data storage requirement is 1350 bytes per speaker. The processing requirements of this system are also reasonably modest. During verification, the processing is dominated by Euclidean distance computations. Each input frame must be compared with every active reference frame, and with an input frame period of 10 ms (the input frame period is half of the reference frame period) the basic multiply-accumulate rate is 34000 multiply-accumulates per second.
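As a rough check of these figures, the following sketch recomputes the storage and multiply-accumulate estimates from the parameters given above. The assumption that all four words of the currently prompted phrase are simultaneously active is mine, and the exact bookkeeping in the TI system may differ slightly.

```python
# Back-of-the-envelope check of the storage and processing figures quoted above.

N_WORDS = 16            # reference vocabulary (Table 2)
FRAMES_PER_WORD = 6     # reference frames per word, spaced 20 ms apart
N_FILTERS = 14          # normalized filter-bank amplitudes per frame
BYTES_PER_VALUE = 1     # each amplitude stored as an 8-bit number

storage_bytes = N_WORDS * FRAMES_PER_WORD * N_FILTERS * BYTES_PER_VALUE
print(f"reference storage ~ {storage_bytes} bytes/speaker")   # 1344, close to the ~1350 quoted

WORDS_PER_PHRASE = 4
INPUT_FRAMES_PER_SEC = 100                                     # 10-ms input frame period
active_ref_frames = WORDS_PER_PHRASE * FRAMES_PER_WORD         # assume 24 active reference frames
mac_rate = INPUT_FRAMES_PER_SEC * active_ref_frames * N_FILTERS
print(f"multiply-accumulate rate ~ {mac_rate} per second")     # 33600, close to the ~34000 quoted
```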
A single utterance is inadequate to provide the high level of verification performance desired. Therefore, a sequential decision strategy using multiple verification phrases is employed, which provides less than 1-percent rejection of users and less than 1-percent acceptance of impostors. Four phrases are constructed randomly at the outset of a verification attempt so that all sixteen reference words are used, and all four phrases may be used if required during the attempt. In fact, phrases that have already been used may be reprompted and reused, which is sometimes necessary because of occasional mispronunciations. Thus up to seven user utterances may be solicited in the course of a verification attempt, although the number measured in operational usage averages only 1.6.
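The phrase construction can be made concrete with a short sketch. The column grouping is taken directly from Table 2, with one word per column forming a phrase; the particular shuffling scheme is an assumption shown only to illustrate the idea.

```python
import random

# Sketch: build four four-word phrases at the start of a verification attempt
# so that all sixteen reference words of Table 2 are used exactly once.
WORD_COLUMNS = [
    ["GOOD", "PROUD", "STRONG", "YOUNG"],
    ["BEN", "BRUCE", "JEAN", "JOYCE"],
    ["SWAM", "CALLED", "SERVED", "CAME"],
    ["NEAR", "HARD", "HIGH", "NORTH"],
]

def build_verification_phrases():
    """Return four phrases, one word per column, covering all sixteen words."""
    shuffled = [random.sample(column, len(column)) for column in WORD_COLUMNS]
    return [" ".join(words) for words in zip(*shuffled)]

if __name__ == "__main__":
    for phrase in build_verification_phrases():
        print(phrase)    # e.g. "PROUD JEAN CAME HIGH"
```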
The gross rejection rate of the operational system has been measured as 0.9 percent, with a casual impostor acceptance rate of 0.7 percent. The term "gross rejection rate" is used rather than user rejection rate, because the system is clearly justified in rejecting some of the entrants. For example, a special rule states that if the entrant makes no response to two consecutive prompts, then the verification is aborted. Twenty percent of all rejections (0.2 percent of all entry attempts) are attributable to this special rule. Another interesting correlation with rejection rate is the number of people in the entry booth. Since the floor of the booth is a weight scale, it is possible to detect multiple entrants: the weight of each user is recorded during enrollment, and the number of people in the booth is then counted during verification. It is curious to note that the rejection rate is much lower with only one person in the booth (0.5 percent) than with more than one person in the booth (1.8 percent). A similarly curious correlation exists between rejection rate and time of day, with the rejection rate between 9 am and 3 pm being four times lower than that between 9 pm and 3 am. Indeed, recordings of booth activity for multiple entrants during the night shift are at times rather entertaining!
Another important factor in operational speaker verification is enrollment. Specifically, enrollment is a problem because the initial estimate of a person's speech characteristics, taken at a single session, is likely to be biased by the speaker's momentary speech qualities, so that a relatively poor match is sometimes encountered during the first few sessions after enrollment. Perhaps an even more troublesome problem is the dramatic change that often occurs in a user's voice during the first few entry attempts. During enrollment the typical new user is intimidated by the system, and this intimidation can easily affect the user's voice, for example through changes in loudness, speaking rate, and pitch frequency. Although the reference data are updated by averaging with the input data after each successful verification, rapid changes, such as those during this initial learning period or at the onset of respiratory ailments, may result in serious rejection problems. This is demonstrated by observing the user rejection rate as a function of user experience. Specifically, during the first four verification attempts the user rejection rate has been measured as just under 10 percent! Immediately after this four-session orientation period, however, the rejection rate drops to 1 percent and then remains uniformly less than 1 percent. Out to about 1000 verification trials the rejection rate hovers at slightly below 1 percent; it then gradually declines to less than one quarter of a percent for users with more than 10000 verifications.
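The reference updating can be pictured with a minimal sketch, assuming a simple weighted average. The text states only that the references are averaged with the input after each successful verification; the weight ALPHA used here is an assumed value (the 5 fractional bits mentioned earlier are what make such running averages representable).

```python
import numpy as np

ALPHA = 0.125  # assumed weight given to the new utterance; not specified in the text

def update_reference(reference: np.ndarray, aligned_input: np.ndarray) -> np.ndarray:
    """Blend time-aligned input frames (frames x 14 amplitudes) into the stored template."""
    return (1.0 - ALPHA) * reference + ALPHA * aligned_input
```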
Another difficulty in achieving satisfactory total system performance is the inhomogeneity in performance across a population of users. Specifically, it is generally observed that verification performance is very good for most users, but markedly worse for a small percentage of users. This has been observed for the TI operational system by measuring the distribution of rejection rate across users. This distribution reveals that a typical user (i.e., one having the median rejection rate) exhibits a rejection rate of only one half the average rejection rate. This suggests that most of the rejections are clustered in a small portion of the population (which we refer to as "goats," in contrast with the better performing "sheep"). Indeed, it is observed that only one quarter of the population exhibit rejection rates greater than the average rejection rate.
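The sheep-and-goats effect can be illustrated with an entirely synthetic example. The distribution and its parameters below are assumptions chosen only to reproduce the qualitative behavior (median well below the mean, errors concentrated in a minority); they are not measurements from the TI system.

```python
import numpy as np

# Synthetic per-user rejection rates drawn from a right-skewed distribution,
# so that a few "goats" account for most of the rejections.
rng = np.random.default_rng(0)
per_user_rate = rng.gamma(shape=0.5, scale=0.018, size=200)

print(f"mean rejection rate   : {per_user_rate.mean():.3%}")
print(f"median rejection rate : {np.median(per_user_rate):.3%}")
print(f"fraction above mean   : {(per_user_rate > per_user_rate.mean()).mean():.0%}")
```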
Speaker Recognition in the Future
Fifteen years ago, when I first became involved in speech technology, I was frankly not very optimistic about the prospects for commercial application of automatic speech recognition and speaker recognition technology. This was not so much because the technology was deficient, but because economical solutions were beyond imagination. Now, of course, computers weigh grams rather than tons, and the cost has gone down almost proportionally. In fact, the cost problem has been virtually eliminated. We have finally reached the point, with the advent of high-speed monolithic digital signal processor circuits, where the cost of the speech processing, per se, is potentially a small fraction of the cost of the complete system. Yet we still see very little commercial activity in the area of speaker recognition. Why is this?
The most obvious application of speaker recognition technology is in security systems, for physical entry control or for various business transactions conducted over the telephone. But product development activities in these areas actually seem to have subsided during the past five years. There are a number of possible factors that have contributed to this decline. First, this lack of business interest in speaker recognition is surely partially attributable to remaining inadequacies in speaker recognition performance. As in speech recognition, machine performance for speaker recognition tends to be rather "fragile," that is, sensitive to distortions and variations in the speech signal that are innocuous or even imperceptible to human listeners. This lack of "robustness" results in errors, and worse, these errors are difficult for the user to "understand" and may therefore leave the user indignant, with an exaggerated perception of the offense.
Other factors not directly related to recognition performance may play even more important roles in hindering the adoption of speaker recognition technology. Most of these are system-complexity factors that are almost a natural consequence of using a high-performance man/machine interface. Such factors have also limited the growth of speech recognition products. For example, the interaction between the system and the user must be given special consideration in all phases of system development, which has a major impact on overall system design. (The speech subsystem is not just something that is neatly grafted onto an otherwise self-contained system.) This interdependence between the speech technology and the application host extends to many major design decisions, such as:
* Where should the reference speech data be stored? Centrally at the application host, peripherally at the terminal, or perhaps with the user on a magnetic stripe card? (Speaker recognition typically requires a large amount of reference data per user, perhaps 1 kbyte/user or more for a high-performance system.) The handling of this issue bears on the cost, the security, and the usage protocol of the system.
* Where will the users be enrolled?
* How will information be communicated between the security terminal and the central host? (Audio lines may be preferable for short links, but terminal-based digital signal processing may be mandatory for long links where line degradation might be prohibitive.)
* How will the system handle user rejections? (This question must be answered in operational systems in such a way that the security of the system is maintained while the cost and feasibility are kept under control.)
Finally, perhaps the most important factor which is inhibiting the widespread adoption of speaker recognition technology for security applications is the lack of a truly compelling need. Although speaker recognition is an appealing technology for enhancing security in physical entry control and data access control applications, the benefits afforded by voice apparently have not yet outweighed the burden and expense of installing such a system.
It may be that voice verification will never catch on as a premium method of identity validation. However, there are indications, as our society edges forward into the abyss of total automation and computerization, that the need for
personal identity validation will become more acute. This need may be even more important when considered in the context of consumer business telecommunications. Specifically, the time, effort, and cost required for physical transportation to conduct personal business become more prohibitive in proportion to the richness of our lives and our desire to enjoy every minute of them. So there has arisen a vast array of services that are now available over the telephone: airline reservations, bank-by-phone, and telephone-order catalog services. These automatic telephone transaction systems would likely be facilitated if a more secure means of personal authentication were available. And of course, over the telephone, the use of the voice signal seems an ideal choice.
In the course of speaker recognition technology development it is important to bear in mind the nature of the problem. Specifically, it is not reasonable to expect that an arbitrarily high level of performance can be achieved, limited only by improvements in feature extraction and algorithm development. Rather, it should be clear from the preceding discussion and illustrations in this paper that the performance of a speaker recognition system depends upon the amount of control that can be exerted over the operational conditions. And of greatest importance among all the variables is the speaker. For well-controlled conditions with a cooperative user, operationally satisfactory performance might be obtained in a security system. On the other hand, for uncontrolled conditions and uncooperative speakers, it may be impossible to assure any particular level of performance.
The fact that speaker recognition performance achievable on any particular task is a sensitive function of the data as well as the task definition strongly suggests the need for benchmark databases. The use of benchmark databases for speech recognition evaluations has grown in popularity during the last five years, in response to the need for meaningful comparative evaluation of systems and in view of the extreme performance sensitivity to particular databases. The need for such benchmarking databases for speaker recognition technology is acute for research into automatic recognition technology as well as for evaluation of proposed systems. This need is particularly acute for forensic applications of speaker identification. In this difficult application environment it is critically important to assess and compare the performance of the various techniques that are currently being used and that are being proposed. How do lay human listener judgements compare with voiceprint examination? Can speaker recognition performance be enhanced by combining listening and visual examination of spectrograms? What are reasonable bounds on speaker verification performance as a function of realistic task conditions? These questions remain largely unanswered.
Performance evaluation is also a difficult issue for security applications. These systems are invariably text-dependent, and the text definition is often considered to be part of the system definition, which makes comparative evaluation of these systems difficult. The database definition for free-text systems, on the other hand, is somewhat easier, since the text is unconstrained and cannot be specified by the application. Of course, there exist endless varieties of degrading conditions that might be defined as part of an evaluation database, but it appears quite feasible and desirable to establish a benchmark database for the evaluation of free-text speaker recognition technology. In defining these evaluation databases, emphasis should be placed on enriching them with as many difficult conditions as possible. This provides two key benefits. First, technology development is facilitated by focusing attention and research on the important problem areas. Second, research efficiency is increased because a given level of statistical confidence in the measured performance (i.e., a given number of observed errors) can be attained with a significantly smaller database and correspondingly less computational and storage burden.
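The efficiency argument can be made concrete with a small sketch. The target error count and the two error rates below are assumed values chosen purely for illustration.

```python
# Sketch: statistical confidence in a measured error rate is governed largely
# by the number of errors actually observed, so an enriched (harder) database
# that provokes errors more often reaches a target error count in fewer trials.
TARGET_ERRORS = 30   # assumed number of observed errors needed for useful confidence

for label, error_rate in [("easy database", 0.005), ("enriched database", 0.05)]:
    trials_needed = TARGET_ERRORS / error_rate
    print(f"{label:18s}: ~{trials_needed:,.0f} trials to observe {TARGET_ERRORS} errors")
```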
Finally, I would recommend that research be directed more seriously toward the problem of achieving high-performance speaker verification over standard telephone circuits. This is a difficult problem because of the variable carbon-button microphone transducer and transmission channel, but a satisfactory solution would facilitate telephone business services and could help to revolutionize the way people do their personal business. As a first step, it might be advisable to calibrate the performance of human listeners carefully and determine whether an adequate level of listener verification performance is feasible with an appropriate choice of speech material, and then to attempt to achieve this level of performance by machine. What is an "adequate level" of performance? That is a most difficult question, and one that can probably be answered only by a complete system definition and perhaps then by trial and error.