3.4. Negative Data
We noted above that negative data is also necessary for evaluation. Since we are interested in models that discriminate the strings of L (the Dutch syllables) more precisely, the negative data for the following experiments will be biased toward L.
Three groups of negative test sets were generated and used. The first is a set $R_M$ containing strings with the syllabic form $[C]^{0\ldots3}\,V\,[C]^{0\ldots4}$, based on the empirical observation that Dutch monosyllables have up to three onset (word-initial) consonants and up to four coda (word-final) consonants. The second group consists of three subsets of $R_M$: $\{R_M^{1}, R_M^{2}, R_M^{3+}\}$, whose random strings lie at a fixed distance of 1, 2, and 3 or more phonemes, respectively, from any existing Dutch word (measured by edit distance (Nerbonne, Heeringa & Kleiweg, 1999)). Controlling for the distance to any training word allows us to assess the performance of the model more precisely. The third group consists of random strings built by concatenating n-grams picked randomly from Dutch monosyllables. In particular, two sets, $R_N^{2}$ and $R_N^{3}$, were generated at random from bigrams and trigrams, respectively.
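The construction of the template-based set and its distance-controlled subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the phoneme inventory and the lexicon are toy placeholders, each phoneme is written as one character, and plain unit-cost Levenshtein distance stands in for the edit distance of Nerbonne, Heeringa & Kleiweg (1999). The names `random_syllable`, `distance_to_lexicon`, and `build_sets` are hypothetical.

```python
import random

# Toy placeholders (assumptions): one character per phoneme, tiny lexicon.
CONSONANTS = "ptkbdfsxmnlr"
VOWELS = "aeiou"
LEXICON = {"kat", "pot", "blik", "strop", "markt"}


def random_syllable(rng):
    """Draw one string of the form [C]0..3 V [C]0..4."""
    onset = "".join(rng.choice(CONSONANTS) for _ in range(rng.randint(0, 3)))
    coda = "".join(rng.choice(CONSONANTS) for _ in range(rng.randint(0, 4)))
    return onset + rng.choice(VOWELS) + coda


def edit_distance(a, b):
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def distance_to_lexicon(s, lexicon):
    """Distance from s to its nearest word in the lexicon."""
    return min(edit_distance(s, w) for w in lexicon)


def build_sets(n, lexicon, seed=0):
    """Return n random non-words and their split into distance bins."""
    rng = random.Random(seed)
    all_strings = []
    bins = {"dist1": [], "dist2": [], "dist3plus": []}
    while len(all_strings) < n:
        s = random_syllable(rng)
        d = distance_to_lexicon(s, lexicon)
        if d == 0:
            continue  # an existing word is not negative data
        all_strings.append(s)
        key = "dist1" if d == 1 else "dist2" if d == 2 else "dist3plus"
        bins[key].append(s)
    return all_strings, bins
```

Calling `build_sets(1000, LEXICON)` returns the full template-based sample together with its three bins. With a realistic lexicon, the nearest-word search would dominate the running time, and an indexed lookup would be preferable to the linear scan used here.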
The latter groups are the most "difficult" ones, especially $R_N^{3}$, because it consists of the strings that are closest to Dutch. They are also useful for comparing SRN methods with n-gram modeling. The corresponding n-gram models will always wrongly accept these random strings as words of the language. Wherever the connectionist predictor recognizes them as non-words, it outperforms the corresponding n-gram models, which are regarded as benchmark models for prediction tasks such as phonotactics learning.
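The claim that an n-gram model cannot reject strings assembled from its own n-grams can be illustrated with a short sketch. It assumes one particular reading of the construction: n-grams are chained with an overlap of n-1 symbols, with `#` as a boundary marker, so that every n-gram in the resulting string is attested in the training words. It also treats the n-gram model as a simple acceptor that accepts a string when all of its n-grams are attested. The toy lexicon and the function names are hypothetical.

```python
import random

# Toy training words (assumption): one character per phoneme.
LEXICON = ["kat", "pot", "blik", "strop", "markt", "krat"]


def ngrams(word, n):
    """All n-grams of a word padded with '#' boundary markers."""
    padded = "#" * (n - 1) + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(lexicon, n):
    """The 'model' is simply the set of attested n-grams."""
    return {g for w in lexicon for g in ngrams(w, n)}


def accepts(model, s, n):
    """Accept s if and only if every n-gram in it is attested."""
    return all(g in model for g in ngrams(s, n))


def generate(model, n, rng, max_len=8):
    """Chain randomly picked attested n-grams, overlapping by n-1 symbols."""
    while True:  # retry if a walk does not end within max_len symbols
        context, out = "#" * (n - 1), ""
        while len(out) <= max_len:
            # Attested n-grams that continue the current context.
            options = sorted(g for g in model if g[:-1] == context)
            g = rng.choice(options)
            if g[-1] == "#":
                return out  # reached a word boundary
            out += g[-1]
            context = (context + g[-1])[-(n - 1):]
```

Under these assumptions, `accepts(train(LEXICON, 3), s, 3)` holds for every `s` returned by `generate(train(LEXICON, 3), 3, rng)`, including strings that are not in `LEXICON`. Those are exactly the cases in which a predictor that rejects the string does better than the n-gram acceptor.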