This figure exhibits five different spectrograms, one each from five different men. The spectrogram is a display of the amplitude of a speech signal as a function of frequency and time. The spectral amplitude is computed as a short-time Fourier transform of the speech signal and is plotted on the z-axis, with greater spectral amplitude being depicted as a darker marking. The running window used in the spectral analysis is 6 ms long, and the signal is weighted by a Hamming window. The abscissa is the time axis, with 1 s being displayed in this figure. The ordinate is the frequency axis, which spans the range from 0 Hz to 4 kHz in this figure. Although the spectrogram represents amplitude only imprecisely, a great deal of phonetic information may be decoded from the spectrogram by an expert phonetician. Indeed, many studies have shown that the energy loci in frequency and time are perceptually more important than exact calibration of the spectral amplitudes. (These energy loci are usually referred to as "formant" frequencies by phoneticians.) The utterance spoken in each of the five speech spectrograms displayed in this figure is "Berlin Forest." Note that although there are general similarities in the spectrograms as dictated by the linguistic message, the speaker differences are striking. They are so striking, in fact, that one might be tempted to question the sameness of the phonetic transcription. Two of the speakers represented in this figure, speakers (b) and (c), are identical twins, and their voices do sound quite similar. Even for these twins, the spectrograms are unquestionably different.
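The analysis described in this caption, a short-time Fourier transform over a running 6 ms Hamming-weighted window, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used to produce the figure: the 8 kHz sampling rate (chosen to match the 0 Hz to 4 kHz display range), the 1 ms hop between frames, the 64-point transform length, and the function names are all hypothetical and do not come from the caption.

```python
import cmath
import math


def hamming(n):
    """Return an n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitude(signal, fs, win_ms=6.0, hop_ms=1.0, nfft=64):
    """Short-time Fourier transform magnitudes.

    Returns a list of frames; each frame holds nfft // 2 + 1 magnitudes
    covering 0 Hz up to fs / 2. The hop size and nfft are assumptions;
    nfft must be at least the window length in samples.
    """
    win_len = int(round(fs * win_ms / 1000.0))  # 6 ms -> 48 samples at 8 kHz
    hop = int(round(fs * hop_ms / 1000.0))
    window = hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        # Weight the current segment by the Hamming window.
        seg = [signal[start + i] * window[i] for i in range(win_len)]
        spectrum = []
        for k in range(nfft // 2 + 1):
            # Direct (slow) DFT of the windowed, implicitly zero-padded segment.
            acc = sum(
                seg[i] * cmath.exp(-2j * math.pi * k * i / nfft)
                for i in range(win_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def to_db(magnitude, floor=1e-12):
    """Convert a magnitude to decibels; larger values would be drawn darker."""
    return 20.0 * math.log10(max(magnitude, floor))


if __name__ == "__main__":
    fs = 8000  # assumed sampling rate in Hz
    tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
    spec = stft_magnitude(tone, fs)
    peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
    print(len(spec), "frames; peak at", peak_bin * fs / 64, "Hz")
```

With these assumed settings, bin k corresponds to a frequency of k * fs / nfft, so a 1 kHz test tone peaks in bin 8. A window as short as 6 ms gives coarse frequency resolution (a few hundred hertz), which is why a display of this kind shows broad energy bands rather than individual pitch harmonics.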
This figure exhibits five different speech spectrograms, all from the same speaker, one who is not represented in Fig. 2. The utterance spoken in these spectrograms is the same as in Fig. 2. Note that the variation in spectrograms produced by the same person can be extremely great, even greater than the differences between speakers seen in Fig. 2. Examples (a) and (b) are for speech under nominal conditions, but taken from two different recording sessions. There are some significant amplitude differences, particularly above 2 kHz, but the formants remain nearly identical in frequency and time. Example (c) is for speech collected through a carbon button microphone. Notice that the nonlinearities of this microphone create significant spectral distortions and that the weaker formants above 2 kHz are largely obscured. Example (d) is for very softly spoken speech (about 20 dB below a normal comfortable level). Notice several changes. First, the spectrum falls off more rapidly with frequency, with most of the signal energy appearing below 1 kHz. Second, the spectrum appears "noisy," which is largely attributable to irregular voiced excitation. These changes also make the formant frequencies much less distinct. Example (e) is for very loudly spoken speech (about 20 dB above a normal comfortable level). The spectrogram of this speech signal bears little resemblance to that in (a) (at least relative to the expected between-speaker differences exhibited in Fig. 2), with a much higher pitch frequency and with relatively more energy at the higher frequencies. Indeed, the two speech signals do not sound much like the same person, either.
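To give a sense of scale for the level differences quoted for examples (d) and (e), the short sketch below computes the difference in RMS level between two signals in decibels. The function names are hypothetical, and the test signal is a synthetic tone rather than speech. A 20 dB difference corresponds to a factor of 10 in amplitude (a factor of 100 in power). As the caption makes clear, soft and loud speech differ in much more than overall gain, so a simple scaling like this illustrates only the meaning of the decibel figure.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def level_difference_db(samples, reference):
    """Level of `samples` relative to `reference`, in decibels."""
    return 20.0 * math.log10(rms(samples) / rms(reference))


if __name__ == "__main__":
    # Synthetic stand-in for a "normal level" signal (not real speech).
    normal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
    soft = [0.1 * v for v in normal]   # amplitude reduced by a factor of 10
    loud = [10.0 * v for v in normal]  # amplitude raised by a factor of 10
    print(round(level_difference_db(soft, normal), 6))
    print(round(level_difference_db(loud, normal), 6))
```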