A. Formant Transitions and Burst Spectrum

The role of the formant transitions and the burst spectrum, and
their relative importance, has been extensively researched and
debated in the literature. Despite this wealth of information,
however, a considerable amount of ambiguity and contradiction
also exists. A comprehensive survey
of previous research [5] illustrates that the burst spectrum and
the formant transitions are important for the place of articulation
detection. Their role, however, in the voicing detection seems
to be secondary. These two features are actually closely related,
functionally equivalent and complementary [20]. Their perceptual
weight seems to depend on the degree of their salience: the
formant transitions become more significant when they are sharp
and clear, and their perceptual weight becomes negligible when
the transition is slight and ambiguous.
It is also clear after this long history of research that absolute
acoustic invariance is not possible for stop place detection.
Relational invariance, where the feature depends on the neighboring
vowel in a well-defined manner, on the other hand, seems
to be a more plausible and useful approach.
B. Burst Amplitude

Previous research found that labial stops are usually weaker
than alveolars and velars [21], [23], [45]. Perceptual
experiments on syllable-initial alveolar and labial stop consonants
also showed that the relative amplitude of the burst can
influence the identification of alveolar and labial place of
articulation [30]. This influence is more profound for voiceless
than for voiced stops and is evident only for stops which have
ambiguous spectra. This indicates that burst amplitude does play
a role in place detection. Its role, however, seems to be
secondary, since the burst spectrum tends to override its effect.
C. Durations and Voicing

The stop consonant consists of a closure interval, a release
(transient, frication, and aspiration) interval, and a transition
interval (from voicing onset to the vowel's nucleus). The durations
of these different segments, and of the stops as a whole, were in-
vestigated by many researchers [6], [14], [21], [27], [42], [43],
[45]. It was found that the mean voice onset time (VOT) for
voiceless stops is longer than that for their voiced cognates. It was also
clear that the VOT could play a major role in the voicing detec-
tion but not in the place of articulation detection, in which its
role is secondary at best. Moreover, in spite of its importance
in voicing detection, it is not expected to be able to perform the
task alone. Other features are needed to resolve the significant
overlap that exists between voiced and unvoiced VOT distribu-
tions especially for stops in different contexts.
Another feature that was investigated by many researchers
is the presence of voicing during the stop closure (prevoicing).
This feature is found to be a sufficient, but not necessary, con-
dition for voicing.
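A minimal sketch of how these two duration cues might combine into a voicing decision. The threshold value and function names are illustrative assumptions, not values taken from this paper:

```python
def classify_voicing(prevoiced: bool, vot_ms: float,
                     vot_threshold_ms: float = 35.0) -> str:
    """Toy voicing decision from prevoicing and VOT.

    Prevoicing is treated as a sufficient (but not necessary) cue for
    voicing; otherwise a simple VOT threshold is applied. The 35 ms
    default is an illustrative figure for English stops, not this
    paper's value.
    """
    if prevoiced:
        return "voiced"
    return "voiced" if vot_ms < vot_threshold_ms else "voiceless"
```

In practice the voiced and voiceless VOT distributions overlap, so a single threshold of this kind would misclassify some tokens; that overlap is precisely why additional features are needed.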
III. ACOUSTIC–PHONETIC CLASSIFICATION
In this section, the experiments performed on the stop con-
sonants for the place of articulation and voicing detection are
discussed. We made use of the wealth of information that exists
in the literature, our own acoustic and spectrogram reading
knowledge and different statistical tools to build the resulting
“knowledge-based” system. Statistical discriminant analysis,
histogram analysis, information transmission analysis and
decision trees are some of the statistical tools that helped design
the system (i.e., to determine thresholds, evaluate features,
combine features, etc.).
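As one concrete illustration of the threshold-setting such tools support, the following sketch picks the scalar decision threshold that minimizes training error between two classes. The feature values are hypothetical, and this brute-force search is a stand-in for the histogram and discriminant analyses described above, not the paper's actual procedure:

```python
def best_threshold(class_a, class_b):
    """Exhaustively pick the scalar threshold that best separates two
    1-D feature samples (class_a expected below, class_b above).
    Returns (threshold, training error rate)."""
    candidates = sorted(set(class_a) | set(class_b))
    best_t, best_err = None, 1.0
    n = len(class_a) + len(class_b)
    for t in candidates:
        errors = sum(x > t for x in class_a) + sum(x <= t for x in class_b)
        if errors / n < best_err:
            best_t, best_err = t, errors / n
    return best_t, best_err

# Hypothetical burst-frequency values (kHz) for two places of articulation
labial = [1.1, 1.4, 1.2, 1.6]
alveolar = [3.8, 4.2, 3.5, 4.0]
t, err = best_threshold(labial, alveolar)
```

A decision tree generalizes this idea by choosing such thresholds recursively over several features.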
This system is designed using continuous speech from ten
speakers (five males and five females) from the TIMIT data-
base and then tested on 1200 stops extracted from continuous
speech of 60 different speakers (not used in the design phase)
from seven different dialects of American English in the TIMIT
database. We will concentrate here on place and voicing
detection; the stop manner of articulation is detected by another
system developed to segment and categorize the phonemes in
an utterance [5]. Since the experiments discussed in this work
report the classification results, errors obtained in the detection
and segmentation were excluded from the results. The output
of the segmentation system marks the closure and release seg-
ments of the stop. It also marks the point of voicing onset as
evidenced by the presence of low frequency energy in the F0
and F1 regions. This is explained schematically in Fig. 1.
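The landmarks produced by the segmentation stage can be sketched as a simple record; the field names below are illustrative (the paper does not specify a data format), but the derived durations follow the definitions in the text:

```python
from dataclasses import dataclass

@dataclass
class StopSegment:
    """Landmarks marked by the segmentation system (times in seconds).
    Field names are illustrative, not taken from the paper."""
    closure_start: float
    release: float        # burst onset, i.e., end of the closure
    voicing_onset: float  # first low-frequency energy in the F0/F1 region

    @property
    def closure_duration(self) -> float:
        return self.release - self.closure_start

    @property
    def vot(self) -> float:
        # VOT: time from the release burst to the onset of voicing
        return self.voicing_onset - self.release
```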
The front-end signal processing system that is used in
our experiments is an auditory-based Bark-scaled filter-bank
system. It is a modification to the system developed by Seneff
and described in detail in [36], [37]. The block diagram is given
in Fig. 2. The filter bank used is a bank of 36 critical-band
filters (Bark scale) with the distribution given by Zwicker [46].
It is preceded by a 20 dB/decade high-frequency pre-emphasis.
This, and other, auditory-based distributions have proved to
yield better results in ASR applications [7], [15], [24], [25],
[35], [38]. The system gives two outputs, the mean rate and
the synchrony output. The synchrony describes the temporal
pattern and is extracted using the average localized synchrony
detector (ALSD) [3], [5]. This is a modification to Seneff’s
generalized synchrony detector (GSD) [36], [37], developed
by the authors to alleviate some of its limitations. Mainly
it employs a novel spatial averaging technique to enhance
its formant extraction ability while suppressing spurious
peaks. The synchrony is used for its superior formant extraction
ability, higher response to periodic signals and higher immunity
to noise, while the mean rate is used for its higher sensitivity
and better ability in describing the overall spectral shape. This
is in agreement with auditory neurobiology, where the average
response and the temporal pattern of the neural firings play
complementary roles similar to the roles they are given in
this work [17], [18].
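For reference, the Bark-scaled front end can be sketched as follows. The Zwicker–Terhardt closed-form approximation maps frequency to critical-band rate, and a first-difference filter gives roughly +6 dB/octave (about 20 dB/decade) of high-frequency pre-emphasis; the coefficient 0.97 is a common default, not the paper's stated value, and the exact filter design here follows [46] and Seneff's model rather than this sketch:

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Zwicker & Terhardt closed-form approximation of the
    critical-band (Bark) rate for a frequency in Hz."""
    return 13.0 * math.atan(0.00076 * f_hz) + \
           3.5 * math.atan((f_hz / 7500.0) ** 2)

def preemphasize(x, coeff=0.97):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1],
    boosting high frequencies before the filter bank."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]
```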
ALI et al.: ACOUSTIC–PHONETIC FEATURES FOR THE AUTOMATIC CLASSIFICATION OF STOP CONSONANTS
Fig. 1. Block diagram of the stop recognition system used.