3.2. Linguistic Data - Dutch syllables

A database LM of all Dutch monosyllables (5,580 words) was extracted from the CELEX (1993) lexical database. CELEX is a difficult data source because its approximately 350,000 Dutch lexical entries include many rare and foreign words, which further complicate the learning task. Filtering out atypical words is a formidable task and one that might introduce experimenter bias; therefore all monosyllables were used. The monosyllables have a mean length of 4.1 tokens (σ = 0.94; min = 2; max = 8) and are built from a set of 44 phonemes plus one extra symbol (#), representing a space, which serves as a filler marking the end of a word.

The main dataset is split into a training database (L1) and a testing database (L2) in a proportion of approximately 85% to 15%. The training database will be used to train a Simple Recurrent Network, and the testing database will be used to evaluate the success of word recognition. Negative data will also be created for testing purposes. The complete database LM will be used for some parts of the evaluation.
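As an illustration only, a split of this kind could be sketched as follows. The text does not say how items were assigned to L1 and L2, so the random shuffle, the fixed seed, the function name `split_lexicon`, and the placeholder word list are all assumptions made for this sketch.

```python
import random


def split_lexicon(words, train_fraction=0.85, seed=0):
    """Shuffle a copy of `words` and split it into (train, test) lists.

    Random assignment and the fixed seed are assumptions of this sketch;
    the source only gives the approximate 85%/15% proportion.
    """
    rng = random.Random(seed)
    shuffled = list(words)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder items standing in for the 5,580 monosyllables of LM.
lexicon = [f"word{i}" for i in range(5580)]
train, test = split_lexicon(lexicon)
```

Here `train` plays the role of L1 and `test` the role of L2; together they cover the whole of LM without overlap.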

In language modeling it is important to take into account the frequencies of word occurrence, which naturally bias humans' linguistic performance. Training a model on data in proportion to their empirical frequencies focuses learning on the more frequent words and thus improves the performance of the model. It also makes it feasible to compare the model's performance with that of humans performing various linguistic tasks, such as a lexical decision task. For these reasons, we used the word frequencies given in the CELEX database. Because the frequencies vary greatly ([0...100,000]), we presented training items in proportion to the natural logarithms of their frequencies, in accordance with standard practice (Plaut, McClelland, Seidenberg & Patterson, 1996). This approach resulted in frequencies in the range [1...12].
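The logarithmic presentation scheme can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the text does not specify how zero counts or rounding were handled to arrive at the range [1...12], so the floor of 1 used here is a guess, and the names `log_weight` and `sample_training_items` and the toy counts are hypothetical.

```python
import math
import random


def log_weight(raw_frequency):
    """Natural-log weight for a raw frequency count.

    Assumption: counts whose logarithm would fall below 1 (including a
    count of 0, where the logarithm is undefined) receive weight 1.
    """
    if raw_frequency <= 0:
        return 1.0
    return max(1.0, math.log(raw_frequency))


def sample_training_items(freqs, n_items, seed=0):
    """Draw `n_items` words with probability proportional to log_weight."""
    rng = random.Random(seed)
    words = list(freqs)
    weights = [log_weight(freqs[w]) for w in words]
    return rng.choices(words, weights=weights, k=n_items)


# Hypothetical counts, not taken from CELEX.
toy_freqs = {"w_high": 100_000, "w_mid": 150, "w_zero": 0}
batch = sample_training_items(toy_freqs, n_items=1000)
```

With these toy counts the most frequent item has a weight of about 11.5 and the unattested item a weight of 1, so the former is presented roughly eleven times as often as the latter, rather than 100,000 times as often.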

