Speaker recognition by listeners has been studied broadly, typically with the motivation of learning something about how listeners recognize speakers. That is, studies have been conducted to investigate the variables that affect listener performance and to understand the perceptual bases of speaker recognition. In certain cases, however, it becomes important to know how WELL listeners can recognize speakers. In forensic applications particularly, it is important to be able to assess correctly the probative value of a listener's judgement regarding the identity of a speaker. It has frequently been stated that the judgement of listeners is not very reliable [9]. There is support, however, for the notion that listener performance compares favorably with that of other methods [8].
That performance is strongly influenced by acoustic and speaker conditions is intuitively obvious when considering the listening task. But the definition of the listener's task probably has an even stronger impact on recognition performance for listeners than for machines. This is because the listening domain allows a wider selection of recognition environment and strategy. For example, listeners can recognize speakers without an explicit comparison of two speech samples, based upon acquaintance with the reference voices. This is, of course, the natural form of speaker recognition with which we are all familiar. Such decisions are likely to use a good deal of high-level idiosyncratic (including nonphysiological) information about the speaker, deriving from the listener's extensive reference knowledge of the speaker. Thus listener performance in this task may be even better than in cases where the decision is based upon contemporaneous comparison with reference voices. (I know of no good comparative study, however.)
A recent interesting study of speaker recognition by listeners reports good performance without explicit reference speech tokens [40]. In this study, speech samples from 24 people were collected during the course of playing a battleship game requiring communication between participants. Coworkers (in a work unit of about 40 people) were then tested on these samples to measure the listeners' speaker recognition performance. Measured performance was quite good, with an identification error rate of only 12 percent. This study is particularly interesting in that speaker recognizability was also measured using speech tokens processed through a narrow-band (2.4-kbit/s) LPC vocoder. In this case, the identification error rate rose to 31 percent.
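The identification error rate reported in such closed-set experiments is simply the fraction of trials on which the listener names the wrong speaker. A minimal scoring sketch follows; the trial data are invented for illustration and are not taken from [40]:

```python
# Closed-set speaker identification: each trial pairs the true speaker
# with the label the listener chose.
# Error rate = wrongly labeled trials / all trials.
# The speaker names and outcomes below are hypothetical.
trials = [
    ("alice", "alice"),
    ("bob",   "bob"),
    ("carol", "bob"),    # confusion: carol misidentified as bob
    ("dave",  "dave"),
]

def identification_error_rate(trials):
    wrong = sum(1 for truth, guess in trials if truth != guess)
    return wrong / len(trials)

print(identification_error_rate(trials))  # 1 wrong of 4 trials -> 0.25
```

Under this scoring, the 12-percent figure above means roughly one closed-set trial in eight was answered with the wrong coworker's name.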
A different task, related to the listening task stated above, is the task of listening to a set of reference samples and matching one of these samples to an unknown sample of speech heard at (and remembered from) some past time. This task is similar in form to the recognition of familiar speakers, except that in this case the unknown and reference samples are presented in reverse order (unknown first) and the speakers are not familiar to the listener. The reference samples are also typically limited to a short duration. This task is often a reasonable model for forensic applications in which the voice of the person in question is not recorded, thus preventing detailed voice comparisons. It is intuitive that the performance of the listener will degrade with the time span between hearing the unknown sample and the references, and this has been borne out in one well-known study. In this study [1], listeners first listened to an unknown speaker read a paragraph, then at a later session they listened to five reference speakers (including the unknown) read the same paragraph. The speaker identification accuracy of the listeners in this experiment varied strongly as a function of the time interval between sessions, with accuracy better than 80 percent for intervals of less than one week, dropping to chance performance for an interval of half a year.
We have seen that our understanding of the perceptual bases of speaker recognition does not support a useful model of our listening ability [5]. Perhaps a more rewarding study is the calibration of listener performance as a function of acoustic variables. Such variables include signal-to-noise ratio, speech bandwidth, amount of speech material used, and various speech transmission and coding systems. Such studies may also have a significance beyond the mere calibration of human performance. Specifically, if we assume that human listeners do very well (i.e., close to optimum) in speaker recognition tasks, then listener performance evaluations can give us good bounds on what is achievable. This information can be used, for example, to assess the potential for studies of machine recognition or to test the ultimate feasibility of practical applications.
Human listeners have proven themselves to be robust speaker recognizers when presented with degraded speech, with unflagging performance under significant spectral distortion and noise. For example, although it has been determined that the octave band of frequencies from 1 to 2 kHz is most useful to listeners [2], listeners have demonstrated only a modest doubling of speaker recognition error when the speech signal is severely high-passed at 2 kHz or low-passed at 1 kHz [7]. In this same study, the degrading effect of additive white noise was measured. Recognition performance remained high until the noise power exceeded the speech signal power. Although the performance of the listeners in this task domain was modest (20-percent error), the robustness of their performance under severe signal degradation should serve to inspire scientists striving to model the speaker recognition process.
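The severe band-limiting used in such experiments can be reproduced with a simple windowed-sinc filter. The sketch below (the sampling rate, filter length, and test-tone frequencies are illustrative assumptions, not parameters from [7]) low-passes at 1 kHz and confirms that energy well above the cutoff is almost entirely removed:

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, numtaps=255):
    """Windowed-sinc low-pass FIR filter (Hamming window)."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2 * cutoff_hz / fs * n) * np.hamming(numtaps)
    return h / h.sum()  # normalize for unity gain at DC

fs = 8000                     # sampling rate (Hz), assumed for this sketch
t = np.arange(fs) / fs        # one second of samples
low_tone  = np.sin(2 * np.pi * 500  * t)   # inside the 1-kHz passband
high_tone = np.sin(2 * np.pi * 3000 * t)   # well above the cutoff

h = lowpass_fir(1000, fs)
rms = lambda x: np.sqrt(np.mean(x ** 2))
passed  = rms(np.convolve(low_tone,  h, mode="same"))
blocked = rms(np.convolve(high_tone, h, mode="same"))
print(passed, blocked)  # the 3-kHz tone is strongly attenuated
```

A high-pass at 2 kHz can be obtained analogously (e.g., by spectral inversion of the low-pass kernel); the point of the sketch is only to make concrete how drastic these band restrictions are relative to the full speech band.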
More complex distortions of the speech signal have also been studied in a variety of contexts. In one study admitting of high performance (fixed text, contemporaneous comparison of unknown with reference), the effect on speaker verification performance of two speech coding systems (a 24-kbit/s ADPCM coder and an LPC pitch-excited vocoder) was measured [27]. In this study, several sentences were read by subjects and then processed through the ADPCM and LPC systems. Listeners, presented with two different tokens and asked to declare whether the speakers were the same or different, performed with relatively little degradation when the tokens were processed by different coding systems. As expected, the best performance was attained for the "no processing" control task (15 percent rejection of valid matches and 7 percent acceptance of invalid matches). But listener performance degraded surprisingly little when one of the two tokens was processed through either the ADPCM or LPC coding system. In this case, the likelihood of rejecting a correct match doubled while the likelihood of accepting an incorrect match increased very little.
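Verification experiments of this kind are scored by the two error rates quoted above: false rejection (a valid match declared "different") and false acceptance (an invalid match declared "same"). A minimal scoring sketch, with invented trial data rather than the actual responses from [27]:

```python
# Each trial: (same_speaker, listener_said_same), both booleans.
# The outcomes below are hypothetical, chosen only to exercise the code.
trials = [
    (True,  True),  (True,  False),   # one valid match rejected
    (True,  True),  (True,  True),
    (False, False), (False, False),
    (False, True),  (False, False),   # one invalid match accepted
]

def verification_error_rates(trials):
    valid   = [said for same, said in trials if same]
    invalid = [said for same, said in trials if not same]
    false_reject = valid.count(False) / len(valid)     # valid matches rejected
    false_accept = invalid.count(True) / len(invalid)  # invalid matches accepted
    return false_reject, false_accept

fr, fa = verification_error_rates(trials)
print(fr, fa)  # 0.25 0.25
```

Keeping the two rates separate matters here: the coding distortions in [27] doubled the false-rejection rate while leaving false acceptance nearly unchanged, a distinction a single combined error figure would hide.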
Another very similar experiment, however, used different texts for comparison [39] and evaluated performance under degradations of telephone transmission and 2.4-kbit/s LPC coding. A speaker recognition task was evaluated, using five reference speakers and with the speakers reading phonetically balanced sentences. In all the tests, the speech of the reference speakers was presented in unprocessed form. Unprocessed tokens were misrecognized 14 percent of the time when they were taken from the same session as the reference speech and over 20 percent of the time when they were taken from a different session. Further, a significant increase in error rate was observed when the unknown tokens were processed through either the LPC system or the telephone (including transduction using a carbon button microphone), with the identification error rate for these two distortions being almost 50 percent.
Thus we see that, depending on the task definition and recognition conditions, listeners exhibit a wide range of speaker recognition performance. While it is clear that there are many sources of knowledge that are useful in the speaker recognition task, it is not clear which knowledge sources are most important for the achievement of high recognition performance. It is likely that the various sources of knowledge contribute in varying ways to speaker recognition, providing weak, moderate, or high discrimination power and being more or less robust against various signal degradations and in various task definitions. More knowledge about these relationships would surely be helpful both in the development of automatic speaker recognition systems and in the assessment of speaker recognition by listeners in untested conditions.