University of Pennsylvania ScholarlyCommons

Acoustic phonetic features for the autom

invariance is employed using the second formant location of
the following vowel/semivowel (VF2) at the vowel onset point.
The points of taking the measurements are determined using the
segmentation and categorization program [5]. If there is no fol-
lowing vowel or if the second formant is not clear enough to
be extracted, a value of zero is assigned to VF2. Using these
two features (BF and VF2), a preliminary place detection is performed
using the regions shown in Fig. 4. These regions were
designed with the help of unsupervised clustering and Bayesian
decision algorithms which showed clear clusters (especially for

836
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 8, NOVEMBER 2001
Fig. 2. Block diagram of an auditory-based front-end system.
Fig. 3. Algorithm for voicing detection of stop consonants.
TABLE I
CONFUSION MATRIX FOR VOICING DETECTION ON 1200 STOPS. ACCURACY IS 96%
alveolars and velars) of different places of articulation in the
shown regions.
The results of using the BF alone in the place detection are
shown in Table II, while Table III shows the result of using both
BF and VF2 as shown in Fig. 4. The results are for 270 stops
spoken by six different speakers from the TIMIT database. Adding
VF2 raises the accuracy by 10 percentage points, from 72% to 82%. This
indicates the importance of the vowel context and verifies the
concept of relational invariance in the place recognition.
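The idea of deciding place from the pair (BF, VF2) rather than from BF alone can be sketched as follows. This is only an illustration: the actual regions of Fig. 4 are not reproduced here, a Gaussian Bayes rule with diagonal covariance is used as a stand-in for the authors' region design, and every number in `TOY_SAMPLES` is invented for the example. The function names are hypothetical.

```python
import math
from collections import defaultdict

# Invented (BF, VF2) pairs in Hz; VF2 = 0 encodes "no usable following vowel".
TOY_SAMPLES = [
    ((3900.0, 1800.0), "alveolar"), ((4200.0, 1600.0), "alveolar"), ((3700.0, 2000.0), "alveolar"),
    ((1900.0, 1700.0), "velar"), ((2400.0, 2100.0), "velar"), ((1500.0, 1200.0), "velar"),
    ((900.0, 1500.0), "labial"), ((1100.0, 0.0), "labial"), ((700.0, 1100.0), "labial"),
]

def fit_gaussians(samples):
    """Estimate per-class mean, variance and prior for the two features."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    model = {}
    for label, points in grouped.items():
        n = len(points)
        means = [sum(p[d] for p in points) / n for d in range(2)]
        # Variance floor (assumption) keeps the likelihood finite for tiny classes.
        variances = [max(sum((p[d] - means[d]) ** 2 for p in points) / n, 1e4)
                     for d in range(2)]
        model[label] = (means, variances, n / len(samples))
    return model

def classify_place(model, bf_hz, vf2_hz):
    """Return the class with the highest log posterior for (BF, VF2)."""
    x = (bf_hz, vf2_hz)
    best_label, best_score = None, -math.inf
    for label, (means, variances, prior) in model.items():
        score = math.log(prior)
        for d in range(2):
            score -= 0.5 * math.log(2 * math.pi * variances[d])
            score -= (x[d] - means[d]) ** 2 / (2 * variances[d])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit_gaussians(TOY_SAMPLES)
```

With the toy data, `classify_place(model, 2000.0, 1800.0)` returns `"velar"` because the burst frequency lies close to the vowel's second formant, which is the kind of relational cue that BF alone cannot express.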
Accounting for the context dependence using Fig. 4 in
the classification process also helps normalize for speaker
variability. Variations due to speaker gender or dialect are
expected to affect the neighboring vowel besides affecting the
stop consonant itself. Therefore, the relation between BF and
VF2 is less speaker-dependent than BF alone, and hence yields
better multispeaker classification results.
It is obvious from Tables II and III and Fig. 4 that the labials
are the most missed class when using the burst frequency and the
vowel formant. This is in agreement with previous researchers
who noted the absence of a prominent peak in labials [23], [45].
They are characterized by a “flat” and weak release spectrum,
which is due to the absence of any resonant cavities in their ar-
ticulation. To improve the detection of labials, the properties of
flatness and weakness of their release spectra need to be ex-
tracted. The authors developed a new feature called the max-
imum normalized spectral slope (MNSS). This feature was used
in the fricative detection and it proved to be very useful in de-
tecting the dentals [1], [4]. It is defined as
(2)
where
is the th filter mean-rate (envelope) output at the
th instant, while
is a difference function which approx-

ALI et al.: ACOUSTIC–PHONETIC FEATURES FOR THE AUTOMATIC CLASSIFICATION OF STOP CONSONANTS
837
Fig. 4. Two-dimensional preliminary classification regions for (a) unvoiced stops and (b) voiced stops. Zero VF2 corresponds to the absence of a following
vowel’s second formant. Alveolars (+), velars (*), and labials (o). Alveolars and velars show better clustering than labials.
TABLE II
CONFUSION MATRIX FOR PRELIMINARY STOP PLACE DETECTION USING THE BURST FREQUENCY (BF) ALONE. TOTAL NUMBER OF STOPS IS 270 FROM SIX DIFFERENT SPEAKERS. ACCURACY IS 72%
TABLE III
CONFUSION MATRIX FOR PRELIMINARY STOP PLACE DETECTION USING THE BURST FREQUENCY (BF) AND VOWEL SECOND FORMANT (VF2). TOTAL NUMBER OF STOPS IS 270 FROM SIX DIFFERENT SPEAKERS. ACCURACY IS 82%
imates the derivative with respect to frequency. It could be as
simple as the difference between two neighboring filters, i.e.,
(3)
It is found that a low value of MNSS is a sufficient, but not
necessary, condition for labials. The threshold was statistically
found (using histogram analysis) to depend on the voicing status
of the stop and to be close to the threshold value used in the
fricatives, which indicates that this indeed is a characteristic of
the labial place of articulation. Stops followed by silences or
fricatives, however, do not follow this rule. Those are detected
by the segmentation block, and the MNSS is not used with
them.
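A minimal sketch of a slope measure in the spirit of the MNSS is given below. It uses the simple neighboring-filter difference mentioned above; normalizing by the peak envelope value and taking the maximum over filters and frames are assumptions made for illustration, since the exact form of the definition is not reproduced here. The function names and the threshold argument are hypothetical.

```python
def mnss(frame):
    """Maximum neighboring-filter slope of one frame, normalized by its peak.

    frame: mean-rate (envelope) filter outputs at one instant, ordered from
    the lowest to the highest filter.
    """
    peak = max(frame)
    if peak <= 0:
        return 0.0
    # Difference between neighboring filters approximates d/d(frequency).
    slopes = [frame[i + 1] - frame[i] for i in range(len(frame) - 1)]
    return max(slopes) / peak

def burst_mnss(frames):
    """Largest per-frame value over the frames of the release segment."""
    return max(mnss(frame) for frame in frames)

def labial_by_mnss(value, threshold):
    """Low MNSS is sufficient but not necessary for a labial.

    Returns True when the value is below the (voicing-dependent) threshold,
    and None otherwise, meaning this feature alone leaves the place undecided.
    """
    return True if value < threshold else None
```

A flat release spectrum such as `[1.0, 1.0, 1.0, 1.0]` gives a value of 0.0, whereas a spectrum with a sharp rise toward a peak gives a larger value. Returning `None` rather than `False` reflects that a high value does not rule out a labial.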
Another aspect of the burst spectrum is the burst frequency
prominence. This feature is helpful in discriminating between
velars and alveolars. Based on our experiments and comparative
analysis of numerous features, two features were developed to
describe this property. They are a) the difference between the
most dominant peak (MDP) and the energy of the three highest
filters, i.e., dominance relative to the highest filters (DRHF) and
b) the MDP laterally inhibited by the ten filters above it, call it
LINP. The DRHF is defined as
(4)
Alveolars are usually characterized by a low value of DRHF due
to their high frequency content, and the proximity of their MDP
to the highest filters. Therefore, a small DRHF is a necessary
condition for an alveolar. The other parameter, LINP, is defined

as
where:
and:
and for
if
then
where:
(5)
This parameter is used to detect the prominence of the BF peak
compared to the filters above it. Large values of LINP were
found to indicate a velar, small values of LINP indicate a non-
velar, and moderate values of LINP are ambiguous.
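The two prominence measures can be illustrated with a short sketch. For DRHF, "the energy of the three highest filters" is read here as the mean output of the three highest-frequency filters, which is an assumption. For LINP, only the three-way interpretation of its value is shown; the LINP computation itself is taken as an input. The lower threshold name `th_lo` is hypothetical, while the upper threshold corresponds to LINP_THHI in Fig. 5.

```python
def drhf(frame):
    """Dominance of the most dominant peak relative to the highest filters.

    frame: filter outputs for the burst, ordered from the lowest to the
    highest filter. Alveolars are expected to give small values.
    """
    mdp = max(frame)              # amplitude of the most dominant peak
    top_three = frame[-3:]        # three highest-frequency filters
    return mdp - sum(top_three) / len(top_three)

def linp_vote(linp, th_lo, th_hi):
    """Map a LINP value to 'velar', 'nonvelar' or 'ambiguous'."""
    if linp > th_hi:
        return "velar"
    if linp < th_lo:
        return "nonvelar"
    return "ambiguous"
```

For example, a burst whose peak sits in the middle of the filter bank, `[1.0, 2.0, 9.0, 3.0, 2.0, 1.0]`, gives a DRHF of 7.0, while one whose peak lies among the top filters, `[1.0, 1.0, 2.0, 8.0, 9.0, 7.0]`, gives 1.0, consistent with a small DRHF being necessary for an alveolar.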
The last feature needed for the place of articulation detection is
the formant transitions before and after the stop. These transitions
are only applicable if the stop is preceded or followed by a sono-
rant, as detected by the segmentation and categorization program.
Their role was found to depend on whether there is a release or not.
For released stops (i.e., stops with a release (burst) segment that
is evident in the spectrogram), the formant transitions play only
an auxiliary role, while in unreleased stops, their role is primary.
In the case of released stops, only salient transitions are consid-
ered. For a transition to be salient, it has to be of significant slope
that exceeds a certain threshold and continuous without sudden
jumps or anomalies. Three cases are considered.
1) Clear F2 upward transition to the following sonorant or
downward transition from the preceding sonorant. Then,
the stop is accepted as labial regardless of the other fea-
tures.
2) Clear F2 downward transition to the following sonorant
or upward transition from the preceding sonorant. Then
the stop is accepted as nonlabial regardless of the other
features. It is decided whether it is alveolar or velar based
on the other features, namely the BF, VF2, DRHF, and
LINP.
3) F2 and F3 move away from each other to the following
sonorant, or toward each other from the preceding sono-
rant (velar pinch). In this case, the stop is detected as velar,
regardless of the other features involved.
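The three cases above can be sketched as a small rule function. Formant tracks are assumed to be ordered from the stop boundary outward into the sonorant, so a track taken from a preceding sonorant is reversed in time first; under that convention the "from the preceding sonorant" half of each case maps onto the same test as the "to the following sonorant" half. The slope and jump thresholds, the least-squares slope estimate, and the choice to test the velar pinch before the F2-only cases are all assumptions for illustration.

```python
def slope(track):
    """Least-squares slope of an equally spaced track (Hz per frame)."""
    n = len(track)
    mean_t = (n - 1) / 2
    mean_f = sum(track) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in enumerate(track))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def is_salient(track, min_slope, max_jump):
    """Salient = steep enough and continuous, with no sudden jumps."""
    if len(track) < 2:
        return False
    steps = [abs(b - a) for a, b in zip(track, track[1:])]
    return abs(slope(track)) >= min_slope and max(steps) <= max_jump

def place_from_transition(f2, f3, min_slope=30.0, max_jump=200.0):
    """Apply the three cases; return None when no transition is salient."""
    f2_ok = is_salient(f2, min_slope, max_jump)
    f3_ok = is_salient(f3, min_slope, max_jump)
    # Case 3: F2 falls and F3 rises away from the stop (velar pinch).
    if f2_ok and f3_ok and slope(f2) < 0 and slope(f3) > 0:
        return "velar"
    if f2_ok:
        # Case 1: F2 rises away from the stop. Case 2: F2 falls.
        return "labial" if slope(f2) > 0 else "nonlabial"
    return None
```

A result of `"nonlabial"` still leaves the alveolar/velar choice to the BF, VF2, DRHF, and LINP features, and `None` means the transitions are not used at all for that stop.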
For unreleased stops, the formant transitions (usually pre-
ceding the stop) are the only available place cue. Some de-
tailed context-dependent rules were developed to handle those
stops [5]. It was not possible, however, to test the correctness of
those detailed rules reliably due to the relatively small number
of unreleased stops in the database, most of which tend to have
clear and strong transitions as explained before and hence do
not need the detailed, sonorant-dependent rules. Nevertheless,
in the cases tested, the algorithm achieved good accuracy as will
be explained later.
This approach is in line with Dorman et al. [20], who found
that the significance of the transitions in the human percep-
tion process was dependent on their clarity and slope. It also
has a practical advantage. Formant transitions are very difficult
to measure accurately. Therefore, restricting their use to cases
where they are clear, salient and accurately measurable, leads to
an improvement in the place detection.
An algorithm, developed to detect the place of articulation
using the features and techniques detailed above, is shown in
Fig. 5. It gave an accuracy of 90% as shown in Table IV. Per-
forming the same experiment without using the formant tran-
sitions causes a 4% drop in accuracy from 90% to 86%. Com-
bining the voicing detection and the place of articulation detec-
tion into one system, we obtain a stop classification system. The
overall classification accuracy is 86% as shown in Table V.
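As a rough consistency check on these figures, and under the assumption (not stated in the text) that voicing errors and place errors occur independently, the combined accuracy would be approximately the product of the two component accuracies:

```latex
P(\text{stop correct}) \approx P(\text{voicing correct}) \cdot P(\text{place correct})
  = 0.96 \times 0.90 = 0.864 \approx 86\%
```

This agrees with the reported overall accuracy, although the agreement by itself does not establish that the two error sources are in fact independent.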
IV. DISCUSSION
In this work, we developed a new feature-based stop
classification system using an auditory-based front-end. The
feature-extraction system makes use of both the synchrony and
mean-rate outputs. It was clear from our results that the method
used in translating the acoustic abstract feature into a measur-
able parameter has a clear impact on the overall performance.
The synchrony is preferred in formant/peak extraction (such as
the BF), while the mean-rate is used for spectral shapes and
amplitudes (such as the MNSS). A new synchrony detector
(ALSD) is used to enhance the formant and peak extraction
ability. Its ability to detect periodicity and extract dominant
peaks accurately is superior to that of the mean-rate envelope
detector (an improvement of 5%), and to other synchrony
detectors [5]. Repeating the above experiments using the GSD
(instead of the ALSD) showed a consistent deterioration of
3% in the place detection on clean and noisy speech. This is
attributed to the ALSD’s ability to robustly extract the formants
while suppressing the spurious peaks [3].
Various acoustic-phonetic features are evaluated for their in-
formation content individually and in combination with other
features. Some new features were also proposed to describe var-
ious aspects of the release spectrum, such as the burst frequency,
the spectral flatness, amplitude, compactness, etc. New knowl-
edge-based algorithms were developed to combine the chosen
features in the decision making process. These algorithms are
designed using a relatively small database (ten speakers) and
tested on a much larger database (60 speakers) that was not used
in the design process, which demonstrates good generalization
ability. They are similar to decision trees, but describe complex
interactions between the various features that may be multiple
dimensional at some nodes and may depend on the salience of
the feature at other nodes. Unlike data-driven approaches, these
statistically guided knowledge-based algorithms help improve
our understanding of the acoustic–phonetic characteristics of
the stop consonants and the complex relation and interaction
among various features.
To put the obtained results in perspective, we had to com-
pare them with data-driven systems that rely on huge training
databases. Since the databases used in the experiments are dif-
ferent, caution should be exercised when interpreting these com-
parisons especially when the difference in accuracy is small.
Searle et al. [34], in one of the most successful stop con-
sonant recognition experiments, used an auditory-based filter
bank and statistical discriminant analysis to detect the place
of articulation. They obtained an accuracy of 77% on 148

Fig. 5. Hard-decision algorithm for the place of articulation detection of stops. Condition A in the figure is (LINP > LINP_THHI), condition B is (LINP
TABLE IV
CONFUSION MATRIX FOR THE PLACE OF ARTICULATION DETECTION ON 1200 STOPS. ACCURACY IS 90%
stops. In our experiments, we obtained an accuracy of 90%
on 1200 stops. Bush et al. [12] obtained classification results
ranging between 72% and 81% on 216 stops in syllable initial
positions for three male and three female speakers. The results
obtained in our experiments show a clear improvement for
a much larger database using continuous speech in various
syllable positions.
De Mori and Flammia [16] performed phoneme recognition
experiments on stops and nasals using back propagation neural
networks as classifiers. The stop classification performance was
about 82%. This is comparable to the 86% we obtained using a
knowledge-based approach.
Nathan and Silverman [29] used time-varying features in a
statistical framework to perform place of articulation detection.
Their results ranged from 72.3% to 89.1%. On the other
hand, Rangoussi and Delopoulos [32] obtained results ranging
between 90% and 94% for the place of articulation detection on
a smaller testing data set using time-frequency analysis and the

TABLE V
CONFUSION MATRIX FOR THE CLASSIFICATION OF 1200 STOPS. OVERALL ACCURACY IS 86%
LVQ classifier. Both results are comparable to the 90% obtained
in our work.
Samuelian [33] performed phoneme-level recognition of
stops, nasals and liquids using decision trees. He obtained an
83%-90% accuracy for recognition of stops on three speakers.
This is comparable to the 86% obtained in this work on a
larger number of speakers (60 from seven different dialects).
He used statistical tools (namely the C4.5 inductive inference
algorithm) to build a decision-tree system. His system, however,
suffered from the inherent traditional limitations of the decision
tree algorithms, especially their limited ability to capture
multidimensional complex interactions among features like
the ones described previously in the place detection algorithm.
Moreover, his frame-level recognition did not use the context
information as was performed in this work.
V. CONCLUSION
In this work, we investigated the acoustic–phonetic feature-
based classification of stop consonants in speaker-independent
continuous speech. We used a new auditory-based front-end
processing system to generate a dual mean-rate and synchrony
representation that combines the advantages of both outputs.
Based on the previous research and our own statistical anal-
ysis and spectrogram reading experiments, we created a new
set of static and dynamic features that are rich in their infor-
mation content and useful in specific classification tasks. New
knowledge-based algorithms were developed to extract the ar-
ticulatory gestures from these features. Classification experi-
ments were performed on stop consonants extracted from the
continuous speech of 60 speakers from seven different dialects
of American English in the TIMIT database. The experiments yielded
accuracies of 96% and 90% for voicing and place of articulation detection,
respectively. The overall stop classification had an accuracy
of 86%. These results demonstrate the importance of using
multiple interacting features, context dependence, and relational
invariance of features (as opposed to absolute invariance), and
emphasize the significance of developing new parameters and
algorithms to account for speech variability.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for
their valuable comments and insightful suggestions.
REFERENCES
[1] A. M. A. Ali, et al., “An acoustic-phonetic feature-based system for
the automatic recognition of fricative consonants,” in Proc. IEEE
