3.3. Difficulty
One way to characterize the complexity of the training set is to compute the entropy of the distribution of successors for every available left context. The entropy of a language L viewed as a stochastic process measures the average surprise value associated with each element (Mitchell, 1997). In our case, the language is a set of words and the elements are phonemes, so the appropriate entropy measures the average surprise value for a phoneme c preceded by a context s. Entropy is measured over a given distribution, which in our case is the distribution over all possible successors of a context. We compute the entropy Entr(s) for a given context s with (1):
Entr(s) = -\sum_{c \in \alpha} p(c \mid s) \log_2 p(c \mid s)    (1)
where α is the alphabet of segment symbols, and p(c | s) is the probability of successor c given the context s. The average entropy over all available contexts s, weighted by their frequencies, is then our measure of the complexity of the words: the smaller this measure, the less difficult the words. The maximal possible value for one context would be log2(45), that is, 5.49, and this would only obtain in the unlikely case that every phoneme was equally likely in that context. The actual average entropy measured for the Dutch monosyllables is 2.24, σ = 1.32; the minimal value was 0.0 and the maximal value was 3.96. These values may be interpreted as follows. The minimal value of 0.0 means that there are left contexts with only one possible successor (log2(1) = 0). The maximal value of 3.96 means that there is one context which is as unpredictable as a context in which 2^3.96 ≈ 16 successors were equally likely. The mean entropy of 2.24 means that, on average, a left context is as unpredictable as one with 2^2.24 ≈ 4.7 equally likely successors.
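The computation described above can be sketched as follows. This is a minimal illustration, not the original study's code: the toy word list, the `#` boundary marker, and all function names are assumptions for the example; the real measure was computed over the full set of Dutch monosyllables.

```python
from collections import Counter, defaultdict
from math import log2

# Toy word list standing in for the Dutch monosyllables (hypothetical data).
words = ["kat", "kar", "kam", "bal", "bak"]

def successor_counts(words):
    """For every left context s, count how often each successor c follows it.
    A '#' boundary marker is prepended so word-initial phonemes also have a context."""
    counts = defaultdict(Counter)
    for w in words:
        w = "#" + w
        for i in range(1, len(w)):
            counts[w[:i]][w[i]] += 1
    return counts

def entropy(counter):
    """Entr(s) = -sum_c p(c|s) * log2 p(c|s), per equation (1)."""
    total = sum(counter.values())
    return -sum((n / total) * log2(n / total) for n in counter.values())

counts = successor_counts(words)

# Average entropy over all contexts, weighted by context frequency.
total_obs = sum(sum(c.values()) for c in counts.values())
avg_entropy = sum(sum(c.values()) * entropy(c)
                  for c in counts.values()) / total_obs
```

A context with a single possible successor (here, "#k" is always followed by "a") contributes an entropy of 0.0, while the context "#ka", with three equally likely successors, contributes log2(3) ≈ 1.585, mirroring the interpretation of the minimal and maximal values in the text.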