University of Pennsylvania ScholarlyCommons

Acoustic phonetic features for the autom

A. Voicing Detection
Three main features were found to be useful for voicing de-
tection:
1) voicing during closure (prevoicing);
2) voicing onset time (VOT);
3) closure duration.
These three features are combined in the algorithm shown in
Fig. 3 to generate a voicing decision. Prevoicing is found to be
a sufficient, yet not necessary, condition for voicing. Prevoicing
is detected by measuring the ratio between the low-frequency
mean-rate energy (up to 450 Hz) in the last 20 ms of the closure
interval and its maximum value over the whole utterance. If this
ratio exceeds a threshold (obtained statistically using histogram
analysis), the stop is considered prevoiced. The durations (i.e.,
the VOT and the closure duration) are measured using the boundaries
generated by the segmentation and categorization system [5] to mark
the various segments of the stop consonant.
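The prevoicing test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the per-frame list input, the 1 ms frame step, the use of the mean energy over the 20 ms window, and the threshold value are all assumptions made for the example (the paper obtains its threshold from histogram analysis and does not state it here).

```python
def is_prevoiced(low_freq_energy, closure_start, closure_end,
                 frame_ms=1.0, window_ms=20.0, threshold=0.1):
    """Return True if the closure looks prevoiced.

    low_freq_energy: per-frame low-frequency (up to 450 Hz) mean-rate
        energy for the whole utterance (hypothetical input format).
    closure_start, closure_end: frame indices of the closure interval
        (end exclusive), e.g. taken from a segmentation stage.
    threshold: placeholder value, NOT the paper's tuned threshold.
    """
    utterance_max = max(low_freq_energy)
    if utterance_max <= 0:
        return False  # no low-frequency energy anywhere in the utterance

    # Number of frames covering the last `window_ms` of the closure.
    n_window = max(1, round(window_ms / frame_ms))
    start = max(closure_start, closure_end - n_window)
    window = low_freq_energy[start:closure_end]
    if not window:
        return False  # empty closure interval

    # Assumption: summarize the window by its mean energy.
    mean_energy = sum(window) / len(window)
    return mean_energy / utterance_max > threshold
```

Normalizing by the utterance maximum makes the decision independent of the overall recording level, which is presumably why a ratio rather than an absolute energy is used.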
As shown in Fig. 3, prevoicing is the only voicing detection
feature used for stops that are followed by silences or fricatives
(as detected by the segmentation block). For the remaining stops,
it is used as a sufficient condition for voicing. The VOT, in turn,
is usually longer for voiceless stops than for voiced ones.
Histogram analysis showed that two threshold values are needed for
accurate voicing detection using the VOT; the choice between them
depends on the closure duration, as shown in Fig. 3. All thresholds
used for the closure duration and VOT were statistically optimized
during the design phase, using histogram analysis and information
transmission analysis, to minimize the probability of error.
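The decision logic described above might be sketched as follows. Fig. 3 is not reproduced in this text, so the exact structure is an assumption: in particular, which VOT threshold is paired with which closure-duration range, and every numeric value, are hypothetical placeholders rather than the paper's optimized thresholds.

```python
def voicing_decision(prevoiced, vot_ms, closure_ms,
                     followed_by_silence_or_fricative,
                     closure_split_ms=50.0,     # placeholder, not from the paper
                     vot_thresh_short_ms=25.0,  # placeholder, not from the paper
                     vot_thresh_long_ms=35.0):  # placeholder, not from the paper
    """Return "voiced" or "voiceless" for one stop consonant."""
    # Prevoicing is a sufficient condition for voicing in every context.
    if prevoiced:
        return "voiced"

    # Before silences or fricatives, prevoicing is the only feature used,
    # so a stop that is not prevoiced is labeled voiceless.
    if followed_by_silence_or_fricative:
        return "voiceless"

    # Otherwise pick one of two VOT thresholds based on closure duration.
    # Assumption: shorter closures use the lower threshold.
    if closure_ms < closure_split_ms:
        vot_threshold = vot_thresh_short_ms
    else:
        vot_threshold = vot_thresh_long_ms

    # VOT is usually longer for voiceless stops than for voiced ones.
    return "voiceless" if vot_ms > vot_threshold else "voiced"
```

For example, with these placeholder values a stop that is not prevoiced and has a 30 ms VOT is labeled voiceless after a 40 ms closure but voiced after a 60 ms closure, which illustrates how the closure duration can matter indirectly, through the choice of threshold.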
Using the above algorithm for voicing detection yielded an
accuracy of 96%, as shown in the confusion matrix of Table I.
The role played by the closure duration is worth noting. Though
it does not play a direct role in detecting voicing (i.e., voiced
stops did not show systematic closure duration variation relative
to unvoiced stops), its indirect role is significant: attempting
voicing detection without the closure duration caused a drop in
accuracy from 96% to 90%.
B. Place of Articulation Detection
The first step in the place of articulation detection is to extract
the flaps. The flap is an allophone of /t/ and /d/ that is used in
some dialects in certain contexts (like “matter,” “better,” etc.).
Flaps are characterized by a very short drop in the total energy
between two sonorants that contains phonation and is not followed by
a release burst. The duration of a flap has to be less than or equal
to 32 ms. Using these criteria, flaps were recognized correctly with
an accuracy of 94%.
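The flap criteria listed above can be expressed as a simple predicate. This is only a sketch: the `Segment` record and its field names are hypothetical, and it assumes an earlier stage has already decided whether the neighbors are sonorants and whether a burst and phonation are present. Only the 32 ms limit comes from the text.

```python
from dataclasses import dataclass

MAX_FLAP_MS = 32.0  # duration limit stated in the text


@dataclass
class Segment:
    """Hypothetical description of a short energy dip found by segmentation."""
    duration_ms: float
    prev_is_sonorant: bool
    next_is_sonorant: bool
    has_release_burst: bool
    has_phonation: bool


def is_flap(seg: Segment) -> bool:
    """Apply the four flap criteria: sonorant context on both sides,
    no release burst, phonation present, and duration <= 32 ms."""
    return (
        seg.prev_is_sonorant
        and seg.next_is_sonorant
        and not seg.has_release_burst
        and seg.has_phonation
        and seg.duration_ms <= MAX_FLAP_MS
    )
```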
The following features are used in the place detection of the
remaining stops (/b/, /d/, /g/, /p/, /t/, /k/):
1) burst frequency (BF);
2) second formant (F2) of the following vowel (VF2);
3) maximum normalized spectral slope (MNSS);
4) burst frequency prominence (DRHF and LINP);
5) formant transitions before and after the stop;
6) voicing decision (using the algorithm of the previous section).
The burst frequency was statistically found (using information
transmission and statistical discriminant analyses [5]) to be
the most important feature for the place detection from the in-
formation content standpoint. It is defined as the most promi-
nent peak in the synchrony output during the stop release. The
synchrony output is used, as opposed to the mean-rate output,
for peak extraction because of its superior ability to extract for-
mants and dominant peaks accurately and its lower sensitivity
to noise. The BF for the whole release was taken to be the min-
imum frequency of the previously mentioned peaks along the
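whole release duration. This is defined as follows:
$$\mathrm{BF} = \min_{n \in R} f_p(n) \tag{1}$$
where:
$$f_p(n) = \arg\max_{f} S(f, n)$$
and: $S(f, n)$ is the synchrony output at frequency $f$ and frame $n$, and $R$ is the set of frames in the stop release.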
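The burst-frequency definition given in the text (the most prominent synchrony peak in each frame, then the minimum of those peak frequencies over the release) can be sketched as follows. The function name and the list-of-lists input format are assumptions, and "most prominent peak" is simplified here to the largest value in each frame; a real peak picker may apply additional prominence criteria.

```python
def burst_frequency(synchrony_frames, freqs_hz):
    """Estimate the burst frequency (BF) of a stop release.

    synchrony_frames: list of frames covering the release; each frame is
        a list of synchrony-output values, one per channel
        (hypothetical input format).
    freqs_hz: center frequency in Hz of each channel, same length as a frame.
    """
    if not synchrony_frames:
        raise ValueError("release must contain at least one frame")

    peak_freqs = []
    for frame in synchrony_frames:
        # Simplification: take the channel with the largest value as the
        # most prominent peak of this frame.
        peak_idx = max(range(len(frame)), key=frame.__getitem__)
        peak_freqs.append(freqs_hz[peak_idx])

    # BF for the whole release is the minimum of the per-frame peaks.
    return min(peak_freqs)
```

A short usage example with made-up values:

```python
freqs = [500, 1000, 2000, 4000]
release = [
    [0.1, 0.2, 0.9, 0.3],  # peak at 2000 Hz
    [0.1, 0.8, 0.2, 0.3],  # peak at 1000 Hz
    [0.0, 0.1, 0.2, 0.7],  # peak at 4000 Hz
]
bf = burst_frequency(release, freqs)  # minimum of the per-frame peaks
```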
It was found, however, that the burst frequency is highly con-
text dependent. This variability can be significantly reduced by
taking the next vowel height into consideration. This relational
