Fig. 1. This figure illustrates the relative performance for speaker verification and speaker identification for a hypothetical task in which the feature vector for speaker i is distributed as a multivariate Gaussian with mean vector μ(i) and an identity covariance matrix, and further in which the speakers are chosen randomly so that the mean vector for speaker i is itself distributed as a multivariate Gaussian with zero mean and a covariance matrix equal to a constant S times the identity matrix [19]. For the identification task, the decision is to choose that speaker for which the likelihood of the observed feature vector is maximum. For the verification task there are three possible decision rules. First, the optimum decision is to choose that hypothesis for which the likelihood of the observed feature vector is greatest, given knowledge of the impostor distributions. Second, since it is unlikely that there will be knowledge of the impostors in any real application, the impostor distribution may instead be estimated by assuming an indefinitely large impostor population. Finally, the easiest and often a completely satisfactory approach is to assume that the probability distribution for impostors is extremely diffuse and therefore to assume a constant likelihood function for impostors. The results of a computer simulation of this example are plotted against population size for three values of speech feature vector dimensionality, namely m = 1, 2, and 4. The variance of the impostor mean features, S, was chosen to provide a suboptimum verification error rate of 2.5 percent (S = 2557 for m = 1, S = 72.4 for m = 2, and S = 11.0 for m = 4). The various recognition hypotheses were assigned equal a priori likelihood, and the error rates shown were obtained by averaging over a very large number of randomly chosen populations. Note that optimum verification always yields better error rates than optimum identification, except when the population size is 2, in which case verification and identification are the same task. The most important result illustrated by this example is that verification performance, even for suboptimum decision rules, remains satisfactory as the population size becomes large, whereas the performance for speaker identification continues to degrade, with the identification error rate approaching 100 percent for sufficiently large populations.
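Stated in symbols (the notation below, with μ_i for the speaker means, x for the observed feature vector, and c for the claimed identity, is chosen here for clarity and is not taken from the figure), the model and decision rules described in the caption are

\[
\mathbf{x}\mid i \;\sim\; \mathcal{N}(\boldsymbol{\mu}_i,\ \mathbf{I}_m),
\qquad
\boldsymbol{\mu}_i \;\sim\; \mathcal{N}(\mathbf{0},\ S\,\mathbf{I}_m),
\]
so that closed-set identification selects
\[
\hat{i} \;=\; \arg\max_{1\le i\le N}\, p(\mathbf{x}\mid \boldsymbol{\mu}_i)
        \;=\; \arg\min_{1\le i\le N}\, \lVert \mathbf{x}-\boldsymbol{\mu}_i\rVert^{2},
\]
while the optimum verification rule accepts the claimed identity $c$ when
\[
p(\mathbf{x}\mid \boldsymbol{\mu}_c) \;>\; \frac{1}{N-1}\sum_{i\ne c} p(\mathbf{x}\mid \boldsymbol{\mu}_i).
\]
The second rule replaces the right-hand side with its expectation over an indefinitely large impostor population, and the third (diffuse-impostor) rule replaces it with a constant.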
Fig. 1 compares the performance of speaker identification with that of speaker verification, as a function of population size, for optimum decision rules applied to a multivariate Gaussian model. In fact, population size is a critical performance parameter for speaker identification, with the probability of error approaching 1 for indefinitely large populations, whereas the performance of speaker verification is unaffected by population size. Although verification performance is stable as the speaker population grows, there is one difficulty with verification that is not present with identification: a much more comprehensive grasp of the variability of the speech features used for discrimination is required in the verification task. Thus, while determining which reference is "closest" to the input token may serve the identification task reasonably well without statistical calibration, the verification task demands a proper statistical characterization of the verification features in order to judge what is "close enough."
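The following short Monte Carlo sketch (ours, not the paper's simulation program) reproduces the flavor of this comparison for the known-impostor verification rule; the parameter values, the equal-prior claim protocol, and the function name simulate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def simulate(N, m=2, S=72.4, trials=20000):
    """Estimate identification and verification error rates for population size N."""
    id_errors = ver_errors = 0
    for _ in range(trials):
        means = rng.normal(0.0, np.sqrt(S), size=(N, m))   # speaker means ~ N(0, S*I)
        true = int(rng.integers(N))                        # the actual talker
        x = means[true] + rng.normal(size=m)               # observed feature ~ N(mu_true, I)

        # Identification: maximum likelihood = minimum squared distance (identity covariance).
        d2 = np.sum((x - means) ** 2, axis=1)
        if int(np.argmin(d2)) != true:
            id_errors += 1

        # Verification: true and impostor identity claims equally likely a priori.
        if rng.random() < 0.5:
            claim = true
        else:
            claim = int((true + 1 + rng.integers(N - 1)) % N)   # some speaker other than the talker
        lik = np.exp(-0.5 * d2)
        # Optimum rule with known impostors: accept if the claimed speaker's likelihood
        # exceeds the average likelihood over the remaining N-1 (potential impostor) speakers.
        accept = lik[claim] > np.delete(lik, claim).mean()
        if accept != (claim == true):
            ver_errors += 1
    return id_errors / trials, ver_errors / trials

for N in (2, 4, 16, 64):
    p_id, p_ver = simulate(N)
    print(f"N={N:3d}  identification error={p_id:.3f}  verification error={p_ver:.3f}")

Run with the caption's m = 2, S = 72.4 values, the identification error should grow toward 1 as N increases while the verification error stays roughly flat, which is the behavior Fig. 1 describes.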
Speaker identification as defined above is also sometimes called "closed-set" identification, as opposed to "open-set" identification. In open-set identification the possibility exists that the unknown voice token does not belong to any of the reference speakers. The number of possible decisions is then N + 1, which includes the option to declare that the unknown token belongs to none of the reference speakers. Thus open-set identification is a combination of the identification and verification tasks, and it combines the worst of both: performance is degraded by the complexity of the identification task, and the rejection option requires good characterization of speech feature statistics.
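As an illustration of why the rejection option demands calibrated feature statistics, here is a minimal open-set decision sketch under the same identity-covariance Gaussian model; the threshold value and the function name open_set_decide are assumptions for illustration, not from the paper.

import numpy as np

def open_set_decide(x, reference_means, reject_threshold=9.0):
    """Return the index of the recognized reference speaker, or None for 'no match'."""
    d2 = np.sum((x - reference_means) ** 2, axis=1)   # squared distance to each reference mean
    best = int(np.argmin(d2))
    # The rejection decision is where feature statistics matter: with unit-variance
    # features, the squared distance to the true speaker's mean is chi-squared with
    # m degrees of freedom, so the threshold can be calibrated from that distribution.
    return None if d2[best] > reject_threshold else best

# Example with four reference speakers in a 2-dimensional feature space.
refs = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
print(open_set_decide(np.array([0.3, -0.2]), refs))    # close to speaker 0 -> 0
print(open_set_decide(np.array([20.0, 20.0]), refs))   # far from every reference -> None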
The degree of control over the generation of the speech token is another important speaker recognition parameter. Fixed-text or text-dependent speech tokens are used in applications in which the unknown speaker wishes to be recognized and is therefore cooperative. Free-text or text-independent speech tokens are required in those applications where such control cannot be maintained, either because the speaker is not cooperative or because the recognition must be done unobtrusively. Generally speaking, recognition performance in fixed-text applications will be better than in free-text applications, both because better calibration of the input speech token is possible when the reference speech material is identical and because the ability to control the text of the input speech will often extend also to control of the speaker and his environment. Offsetting this somewhat, free-text applications often provide longer samples of speech for recognition, which tends to improve recognition performance.
Armed with these task definitions, we can now tabulate the recognition tasks required by the various applications; this is done in Table 1. Of these applications, security applications will exhibit the best speaker recognition performance because of the cooperative user and controlled conditions. The most difficult problem in the forensic and reconnaissance applications is to establish a valid statistical model upon which verification decisions may be based; such a model is hard to establish in this environment because of the lack of control over the speech signal and the speaker, and because of the difficulty of predicting acoustical and transmission conditions. Even with an adequate statistical model for making verification decisions, the lack of control in text-independent applications invariably results in much poorer recognition performance than in those applications in which control is exercised over all aspects of the speaker verification task.