The usefulness of identifying a person from the characteristics of his voice is increasing with the growing importance of automatic information processing and telecommunications. This paper reviews the voice characteristics and identification techniques used in recognizing people by their voices. A discussion of inherent performance limitations, along with a review of the performance achieved by listening, visual examination of spectrograms, and automatic computer techniques, attempts to provide a perspective with which to evaluate the potential of speaker recognition and productive directions for research into and application of speaker recognition technology.
Introduction
The human ear is a marvelous organ. Beyond our uniquely human ability to receive and decode spoken language, the ear supplies us with the ability to perform many diverse functions. These include, for example, localization of objects, enjoyment of music, and the identification of people by their voices. Currently, along with efforts to develop computer procedures that understand spoken messages, there is also considerable interest in developing procedures that identify people from their voices. The purpose of this paper is to review this speaker recognition problem and the technology being developed and applied to solve it. First, however, it might be appropriate to discuss the motivation for such study. Why develop a speaker recognition machine?
Speaker recognition is an example of biometric personal identification. This term is used to differentiate techniques that base identification on certain intrinsic characteristics of the person (such as voice, fingerprints, retinal patterns, or genetic structure) from those that use artifacts for identification (such as keys, badges, magnetic cards, or memorized passwords). This distinction confers upon biometric techniques the implication of greater identification reliability, perhaps even infallibility, because the intrinsic biometrics are presumed to be more reliable than artifacts, perhaps even unique. Thus a prime motivation for studying speaker recognition is to achieve more reliable personal identification. This is particularly true for security applications, such as physical access control (a voice-actuated door lock for your home or ignition switch for your automobile), computer data access control, or automatic telephone transaction control (airline reservations or bank-by-phone). Convenience is another benefit which accrues to a biometric system, since biometric attributes cannot be lost or forgotten and thus need not be remembered.
Manuscript received March 20, 1985; revised July 18, 1985.
The author is with Speech Research, Computer Science Laboratories, Texas Instruments Inc., Dallas, TX 75265, USA.
Applications also exist which depend uniquely upon the identification of a person by his voice. Such applications include forensic science and the automated processing of reconnaissance information. For example, 32 channels of enemy air-to-ground telecommunications are being monitored to detect activities of the Red Baron. Is he in the air now? Or an axe murderer telephones the location of his victim's body to the police. Does the suspect's voice match the murderer's? The identification problems posed by these applications can only be solved by speaker recognition technology. So far, all of the applications that have been mentioned fall into a category which may be called voice verification. But there are several different speaker recognition task definitions, with different performance characteristics for each. These will now be described.
Types of Speaker Recognition Tasks and Applications
Speaker recognition is a generic term which refers to any task which discriminates between people based upon their voice characteristics. Within this general task description there are two specific tasks that have been studied extensively. These are referred to as speaker identification and speaker verification. (Sometimes the term "voice" or "talker" is substituted for "speaker," and sometimes the term "authentication" is substituted for "verification." Thus, for example, speaker verification and voice authentication refer to the same task.) The distinction between identification and verification is simple: the speaker identification task is to classify an unlabeled voice token as belonging to (having been spoken by) one of a set of N reference speakers (N possible outcomes), whereas the speaker verification task is to decide whether or not an unlabeled voice token belongs to a specific reference speaker (2 possible outcomes: the token is either accepted as belonging to the reference speaker or is rejected as belonging to an impostor). Note that the information in bits, denoted $I$, to be gained from the identification task is in general greater than that to be gained from the verification task:

$I_{\text{ident}} = \log_2(N)$ (assuming equal a priori probability of occurrence for all reference speakers)

$I_{\text{ver}} = 1$ (assuming a priori probability of occurrence of the reference speaker = 0.5).
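The two quantities above can be checked numerically. The following is a minimal Python sketch, not part of the original paper: it treats the information to be gained as the Shannon entropy of the outcome distribution, and the function names (`entropy_bits`, `identification_information`, `verification_information`) are illustrative choices made for this example.

```python
import math


def entropy_bits(priors):
    """Shannon entropy, in bits, of a discrete distribution of outcomes."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in priors if p > 0)


def identification_information(n_speakers):
    """Information gained when one of N equally likely speakers is identified."""
    return entropy_bits([1.0 / n_speakers] * n_speakers)


def verification_information(p_reference=0.5):
    """Information gained from a binary accept/reject decision.

    p_reference is the assumed a priori probability that the token
    really belongs to the reference speaker.
    """
    return entropy_bits([p_reference, 1.0 - p_reference])


if __name__ == "__main__":
    for n in (2, 8, 64):
        # Uniform priors reproduce log2(N).
        print(n, identification_information(n), math.log2(n))
    # A prior of 0.5 gives 1 bit; any other prior gives less.
    print(verification_information(0.5))
    print(verification_information(0.9))
```

Under these assumptions the two tasks carry the same information only when N = 2. For larger N the identification task must resolve more bits from the same voice token.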
It is natural then to expect that, all other factors being equal, recognition performance (i.e., probability of error) will be better for the verification task than for the identification task. An example of this contrast is shown in Fig. 1