
4.2. Levenshtein distance


Using the Levenshtein distance, two dialects are compared by comparing the pronunciation of a word in the first dialect with the pronunciation of the same word in the second. The algorithm determines how one pronunciation is changed into the other by inserting, deleting or substituting sounds. Weights are assigned to these three operations. In the simplest form of the algorithm, all operations have the same cost, e.g. 1. Assume afternoon is pronounced as [tnn] in the dialect of Savannah, Georgia, and as [] in the dialect of Lancaster, Pennsylvania15. Changing one pronunciation into the other can be done as in table 1 (ignoring suprasegmentals and diacritics for the moment)16:

Table 1. Changing one pronunciation into another using a minimal set of operations.
tnn delete  1

tnn insert r 1

trnn subst. / 1






3
In fact, many sequences of operations map [tnn] to []. The power of the Levenshtein algorithm is that it always finds the cost of the cheapest mapping.
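As a sketch of the idea (not the authors' implementation), the cheapest mapping can be found with standard dynamic programming. The example words below are ordinary ASCII strings standing in for the phonetic transcriptions:

```python
def levenshtein(a, b):
    """Minimal total cost of turning string a into string b,
    with insertions, deletions and substitutions all costing 1."""
    m, n = len(a), len(b)
    # prev[j] holds the cost for the prefixes a[:i-1] and b[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1    # matches are free
            curr[j] = min(prev[j - 1] + sub,          # substitution / match
                          prev[j] + 1,                # deletion
                          curr[j - 1] + 1)            # insertion
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))   # 3: two substitutions, one insertion
```

However many operation sequences connect the two strings, the table-filling order guarantees that the returned value is the cost of the cheapest one.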

When pronunciations are compared in this way, the distance between longer pronunciations will generally be greater than the distance between shorter pronunciations: the longer the pronunciation, the greater the chance of differences with respect to the corresponding pronunciation in another variety. Because this does not accord with the idea that words are linguistic units, the sum of the operation costs is divided by the length of the longest alignment that yields the minimum cost. The longest alignment has the greatest number of matches. In our example we have the following alignment:

Table 2. Alignment which gives the minimal cost. The alignment corresponds with table 1.
   t  n  n

       



1 1 1
The total cost of 3 (1+1+1) is now divided by the alignment length of 9. This gives a word distance of 0.33, or 33%.
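The normalisation step can be sketched by extending the dynamic program to also track the length of the longest alignment achieving the minimal cost (tuples are minimised lexicographically, so cost comes first and, among equal-cost alignments, the longest wins). The word pair is again a hypothetical ASCII stand-in:

```python
def levenshtein_with_alignment_length(a, b):
    """Return (cost, length): the minimal total cost and the length of the
    longest alignment that achieves it. Cells store (cost, -length) so that
    taking the minimum prefers low cost first, then long alignments."""
    m, n = len(a), len(b)
    dp = [[(0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, -i)                    # i deletions
    for j in range(1, n + 1):
        dp[0][j] = (j, -j)                    # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            c, l = dp[i - 1][j - 1]
            best = (c + sub, l - 1)           # substitution or match
            c, l = dp[i - 1][j]
            best = min(best, (c + 1, l - 1))  # deletion
            c, l = dp[i][j - 1]
            best = min(best, (c + 1, l - 1))  # insertion
            dp[i][j] = best
    cost, neg_len = dp[m][n]
    return cost, -neg_len

def word_distance(a, b):
    cost, length = levenshtein_with_alignment_length(a, b)
    return cost / length
```

For "kitten" vs "sitting" this gives a cost of 3 over an alignment of length 7, i.e. a word distance of 3/7; in the afternoon example above the same computation gives 3/9 = 0.33.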

In Section 3.1.3 we explained how distances between segments can be found using spectrograms. This makes it possible to refine the Levenshtein algorithm by using the spectrogram distances as operation weights. The cost of insertions, deletions and substitutions is then no longer always equal to 1 but varies: it is equal to the spectrogram distance between the segment and ‘silence’ (for insertions and deletions) or between the two segments (for substitutions).
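A graded-weight variant can be sketched as follows. The seg_dist parameter stands in for a table of spectrogram distances, which this sketch does not reproduce; any function returning 0 for identical segments will do:

```python
def weighted_levenshtein(a, b, seg_dist, silence="-"):
    """Levenshtein distance with graded operation weights: substitutions
    cost the acoustic distance between the two segments, insertions and
    deletions the distance between a segment and silence."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + seg_dist(a[i - 1], silence)
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + seg_dist(silence, b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + seg_dist(a[i - 1], b[j - 1]),  # substitution
                dp[i - 1][j] + seg_dist(a[i - 1], silence),       # deletion
                dp[i][j - 1] + seg_dist(silence, b[j - 1]),       # insertion
            )
    return dp[m][n]

# With a flat 0/1 distance the graded version reduces to the unit-cost one.
unit = lambda x, y: 0.0 if x == y else 1.0
```

Plugging in actual spectrogram distances only changes the cost function, not the dynamic program itself.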

To take syllabification in words into account, the Levenshtein algorithm is adapted so that only a vowel may match with a vowel, a consonant with a consonant, [j] or [w] with a vowel (or vice versa), [i] or [u] with a consonant (or vice versa), and a central vowel (in our research only the schwa) with a sonorant (or vice versa). In this way unlikely matches (e.g. a [p] with an [a]) are prevented.
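The matching restriction can be sketched as a predicate over a toy segment inventory (the sets below are illustrative assumptions, not the paper's inventory); the aligner would then assign a prohibitive cost to any pair for which this returns False:

```python
VOWELS = set("aeiou\u0259")    # toy inventory; \u0259 is the schwa
GLIDES = set("jw")
SONORANTS = set("mnlr")

def may_align(x, y):
    """True if segments x and y may be matched: vowel with vowel, consonant
    with consonant, [j]/[w] with a vowel, [i]/[u] with a consonant, and the
    schwa with a sonorant (each also in the opposite direction)."""
    x_vowel, y_vowel = x in VOWELS, y in VOWELS
    if x_vowel == y_vowel:
        return True                       # vowel-vowel or consonant-consonant
    v, c = (x, y) if x_vowel else (y, x)  # v is the vowel, c the consonant
    if c in GLIDES:                       # [j] or [w] may match a vowel
        return True
    if v in "iu":                         # [i] or [u] may match a consonant
        return True
    if v == "\u0259" and c in SONORANTS:  # schwa may match a sonorant
        return True
    return False
```

Insertions and deletions are unaffected; the predicate only constrains which substitutions (matches) the aligner may consider.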

In our research we used 58 different words. When a word occurred more than once in the text, the mean over its different pronunciations was used. So when comparing two dialects we obtain 58 Levenshtein distances, and the dialect distance is equal to the sum of these 58 distances divided by 58. When the word distances are expressed as percentages, the dialect distance is also a percentage. All distances between the 15 language varieties are arranged in a 15 × 15 matrix.
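The aggregation step can be sketched as follows (the word lists and variety names are hypothetical; the paper uses 58 words and 15 varieties):

```python
from itertools import combinations

def dialect_distance(words_a, words_b, word_dist):
    """Mean word distance over a shared word list (58 words in the paper).
    words_a[k] and words_b[k] are the two dialects' forms of word k."""
    assert len(words_a) == len(words_b)
    return sum(word_dist(a, b) for a, b in zip(words_a, words_b)) / len(words_a)

def distance_matrix(dialects, word_dist):
    """Symmetric matrix of pairwise dialect distances, zero on the
    diagonal (15 x 15 in the paper); dialects maps a variety name
    to its word list."""
    names = list(dialects)
    mat = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = dialect_distance(dialects[a], dialects[b], word_dist)
        mat[a][b] = mat[b][a] = d
    return mat
```

Any word-level distance can be passed in as word_dist, so the same aggregation works for the unit-cost, normalised and spectrogram-weighted variants alike.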

