During the past few years, text-independent (or "free-text") speaker recognition has become an increasingly popular area of research, with a broad spectrum of potential applications. The free-text speaker recognition task definition is highly variable, from an acoustically clean and prescribed task description to environments where not only is the speech linguistically unconstrained but also the acoustic environment is extremely adverse. Possible applications include forensic use, automatic sorting and classification of intelligence data, and passive security applications through monitoring of voice circuits. In general, applications for free-text speaker recognition have limited control of the conditions which influence system performance. Indeed, the definition of the task as "free-text" connotes a lack of complete control. (It may be assumed that a fixed text would be used if feasible, because better performance is possible if the text is known and calibrated beforehand.) This lack of control leads to corruption of the speech signal and consequently to degraded recognition performance. Corruption of the speech signal occurs in a number of ways, including distortions in the communication channel, additive acoustical noise, and, probably most importantly, through increased variability in the speech signal itself. (The speech signal may be expected to vary greatly under operational conditions in which the speaker may be absorbed in a task or involved in an emotionally charged situation.) Thus the free-text recognition task typically confers upon the researcher multiple problems: namely, that the input speech is unconstrained, that the speaker is uncooperative, and that the environmental parameters are uncontrolled.
Research into speaker characteristics and free-text recognition algorithms seems more appealing than fixed-text speaker recognition in a sense, because emphasis is on the search for features and characteristics unique to the individual rather than on artifactual differences that co-vary with particular phonetic environments. Nonetheless, performance of free-text speaker recognition has never approached that achievable within a controlled fixed-text task definition. Perhaps as a result of this, interest in and research on free-text recognition have historically lagged behind fixed-text work. During the last five years, however, research in free-text speaker recognition has matured greatly and interest in the free-text task is now quite high, judging by the relative amount of work in the area [33]-[35]. Focus has shifted from highly controlled databases and laboratory experiments to the processing of actual operational data. One consequence of this realism is that the level of recognition performance achieved lately has, unfortunately, deteriorated, with speaker recognition error rates not infrequently in excess of 20 percent [41].
One of the key issues in developing a text-independent speaker recognition system is to identify appropriate features and measures which will support good recognition performance. Use of the long-term average spectrum as a feature vector was discovered to have potential for free-text recognition during initial exploratory studies of fixed-text recognition using spectral pattern matching techniques [4]. In the Pruzansky study [4], speaker recognition error rate was found to remain undegraded (at 11 percent) even after averaging spectral amplitudes over all frames of speech data into a single reference spectral amplitude vector for each talker. To illustrate this feature vector, the long-term amplitude spectra for the different-speaker utterances shown in Fig. 2 are displayed in Fig. 4, and the long-term spectra for the same-speaker utterances of Fig. 3 are displayed in Fig. 5. Unfortunately, the long-term spectrum is not a stable feature vector for speaker recognition. It is obviously sensitive to changes in the spectral response of any interposed communications channel. More important, we have seen that the long-term spectrum is not particularly stable across variations in the speaker's speech effort level. A number of increasingly sophisticated approaches have been developed to overcome some of the more fundamental limitations of a simple Euclidean distance measure on a simple spectral amplitude vector [26], [31], [34]. These approaches typically attempt to stabilize and statistically characterize features which represent the speech spectrum. These features include statistically orthogonal spectral vector combinations, cepstral coefficients, and a variety of LPC-based parameters. Surprisingly, the primary measure of choice remains the spectral amplitude vector, and very little effort has been devoted to the development of other measures such as pitch, formant frequencies, or statistical time functions. One reason for selecting the spectral amplitude vector is that it has typically produced performance superior to other features such as voice-pitch frequency [25].
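As an illustration of the simplest scheme discussed above (a long-term average spectrum compared by Euclidean distance), the following Python sketch may be helpful. It is not the procedure of [4] or of any cited study. The frame length, hop size, Hanning window, FFT-based spectrum, and decibel scaling are all assumptions made for the example, and the two "talkers" are synthetic filtered-noise signals rather than speech. The function names `long_term_spectrum`, `identify`, and `synthetic_talker` are hypothetical.

```python
import numpy as np


def long_term_spectrum(signal, frame_len=256, hop=128):
    """Average the per-frame log-magnitude spectrum over a whole utterance.

    Assumed analysis settings (not from the source): Hanning window,
    256-sample frames, 50 percent overlap, magnitudes in decibels.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    log_spectra = 20.0 * np.log10(magnitudes + 1e-10)  # small floor avoids log(0)
    return log_spectra.mean(axis=0)  # one feature vector per utterance


def identify(test_vector, references):
    """Return the reference talker nearest in Euclidean distance, plus all distances."""
    distances = {
        name: float(np.linalg.norm(test_vector - ref))
        for name, ref in references.items()
    }
    return min(distances, key=distances.get), distances


def synthetic_talker(taps, seed, n=8000):
    """Stand-in for speech: white noise shaped by a short FIR filter (hypothetical)."""
    rng = np.random.default_rng(seed)
    return np.convolve(rng.standard_normal(n), taps, mode="same")


# One reference vector per talker, each built from a single training utterance.
references = {
    "A": long_term_spectrum(synthetic_talker([1.0, 0.9], seed=0)),   # low-pass tilt
    "B": long_term_spectrum(synthetic_talker([1.0, -0.9], seed=1)),  # high-pass tilt
}

# A new utterance with talker A's spectral shape but different content.
test_vector = long_term_spectrum(synthetic_talker([1.0, 0.9], seed=2))
best, distances = identify(test_vector, references)
print(best, distances)
```

Because the feature is a plain average of the spectrum, any fixed filtering applied to the test signal alone (for example, a different communications channel) shifts the test vector directly, which is the sensitivity noted in the text.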
Another key issue in free-text speaker recognition is the general strategy used to make a recognition decision. There have been two distinct approaches to this problem. First is the use of long-term averages. That is, certain features of the speech signal are computed for each incoming frame and are then averaged over a complete segment of speech. A recognition decision is made by computing the statistical