A Course in Modern English Lexicology
by R. S. Ginzburg, S. S. Khidekel et al.


§ 2. Statistical Analysis


A branch of linguistics which has been making progress during the last few decades is the quantitative study of language phenomena and the application of statistical methods in linguistic analysis.
Statistical linguistics is nowadays generally recognised as one of the major branches of linguistics. Statistical inquiries have considerable importance not only because of their precision but also because of their relevance to certain problems of communication engineering and information theory.
Probably one of the most important things for modern linguistics was the realisation of the fact that non-formalised statements are as a matter of fact unverifiable, whereas any scientific method of cognition presupposes verification of the data obtained. The value of statistical methods as a means of verification is beyond dispute.
Though statistical linguistics has a wide field of application here we shall discuss mainly the statistical approach to vocabulary.
The statistical approach has proved essential in the selection of vocabulary items of a foreign language for teaching purposes.
It is common knowledge that very few people know more than 10% of the words of their mother tongue. It follows that if we do not wish to waste time on committing to memory vocabulary items which are never likely to be useful to the learner, we have to select only lexical units that are commonly used by native speakers. Out of about 500,000 words listed in the OED the “passive” vocabulary of an educated Englishman comprises no more than 30,000 words and of these 4,000 — 5,000
are presumed to be amply sufficient for the daily needs of an average member of the English speech community. Thus it is evident that the problem of selection of teaching vocabulary is of vital importance.1 It is also evident that by far the most reliable single criterion is that of frequency as presumably the most useful items are those that occur most frequently in our language use.
As far back as 1927, recognising the need for information on word frequency for sound teaching materials, E. L. Thorndike brought out a list of the 10,000 words occurring most frequently in a corpus of five million running words from forty-one different sources. In 1944 the list was extended to 30,000 words.2
Statistical techniques have been successfully applied in the analysis of various linguistic phenomena: different structural types of words, affixes, the vocabularies of great writers and poets and even in the study of some problems of historical lexicology.
Statistical regularities, however, can be observed only if the phenomena under analysis are sufficiently numerous and their occurrence very frequent. Thus the first requirement of any statistical investigation is the evaluation of the size of the sample necessary for the analysis.
To illustrate this statement we may consider the frequency of word occurrences.
It is common knowledge that a comparatively small group of words makes up the bulk of any text.3 It was found that approximately the 1,300-1,500 most frequent words make up 85% of all words occurring in a text.
If, however, we analyse a sample of 60 words, it is hard to predict the number of occurrences of the most frequent words: as the sample is so small, it may contain comparatively very few or very many such words. The size of the sample sufficient for reliable information on the frequency of the items under analysis is determined by mathematical statistics by means of certain formulas.
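The claim that a small core of frequent words covers most of any text can be illustrated with a toy frequency count. The two-sentence "corpus" below is invented for the sketch, so the exact coverage figure is illustrative only; real counts require millions of running words.

```python
from collections import Counter

# An invented miniature corpus; real word counts need millions of running words.
text = ("the cat sat on the mat and the dog sat by the door "
        "while the cat and the dog watched the rain")
tokens = text.split()
counts = Counter(tokens)

# Share of the running text covered by the three most frequent word forms.
top3 = counts.most_common(3)
coverage = sum(n for _, n in top3) / len(tokens)
print(top3)
print(coverage)  # the three commonest forms already cover half the sample
```

Even in this tiny sample the single word the accounts for roughly a third of all tokens, which is the pattern the text describes on a corpus scale.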
It goes without saying that to be useful in teaching, statistics should deal with meanings as well as sound-forms, as not all word-meanings are equally frequent. Besides, the number of meanings exceeds by far the number of words. The total number of different meanings recorded and illustrated in the OED for the first 500 words of the Thorndike Word List is 14,070; for the first thousand it is nearly 25,000. Naturally not all the meanings should be included in the list of the first two thousand most commonly used words. Statistical analysis of meaning frequencies resulted in the compilation of A General Service List of English Words with Semantic Frequencies. The semantic count is a count of the frequency of occurrence of the various senses of the 2,000 most frequent words as found in a study of five million running words. The semantic count is based on the differentiation of the meanings in the OED, and the frequencies are expressed as percentages, so that the teacher and textbook writer may find it easier to understand and use the list. An example will make the procedure clear.

1 See ‘Various Aspects ...’, § 14, p. 197; ‘Fundamentals of English Lexicography’, § 6, p. 216.
2 The Teacher’s Word Book of 30,000 Words by Edward L. Thorndike and Irving Lorge. N. Y., 1963. See also M. West. A General Service List of English Words. L., 1959, pp. V-VI.
3 See ‘Various Aspects ...’, § 14, p. 197.
room
(’space’) takes less room, not enough room to turn round; (in) make room for; (figurative) room for improvement: 12%
(’part of a house’) come to my room, bedroom, sitting room, drawing room, bathroom: 83%
(plural = ’suite, lodgings’) my rooms in college, to let rooms: 2%
It can be easily observed from the semantic count above that the meaning ‘part of a house’ (sitting room, drawing room, etc.) makes up 83% of all occurrences of the word room and should be included in the list of meanings to be learned by the beginners, whereas the meaning ’suite, lodgings’ is not essential and makes up only 2% of all occurrences of this word.
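A semantic count of this kind is simply a table of relative sense frequencies. A minimal sketch of the arithmetic follows; the raw counts below are invented so as to reproduce the proportions quoted for room, and are not the actual figures behind the General Service List.

```python
# Invented sense-tagged occurrence counts for "room"; only proportions matter.
sense_counts = {
    "space": 120,            # takes less room, room for improvement
    "part of a house": 830,  # sitting room, drawing room, bathroom
    "suite, lodgings": 20,   # my rooms in college, to let rooms
    "other": 30,             # residual minor uses
}
total = sum(sense_counts.values())
# Express each sense's share of all occurrences as a whole-number percentage.
percentages = {sense: round(100 * n / total) for sense, n in sense_counts.items()}
print(percentages)
```

The resulting 12% / 83% / 2% split is exactly the information a textbook writer reads off the published list when deciding which senses to teach first.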
Statistical methods have also been applied to various theoretical problems of meaning. An interesting attempt was made by G. K. Zipf to study the relation between polysemy and word frequency by statistical methods.
Having discovered that there is a direct relationship between the number of different meanings of a word and its relative frequency of occurrence, Zipf proceeded to find a mathematical formula for this correlation. He came to the conclusion that the number of different meanings of a word will tend to be equal to the square root of its relative frequency (with the possible exception of the few dozen most frequent words). This was summed up in the formula m = F^(1/2), where m stands for the number of meanings and F for relative frequency. This formula is known as Zipf’s law.
Though numerous corrections to this law have been suggested, still there is no reason to doubt the principle itself, namely, that the more frequent a word is, the more meanings it is likely to have.
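The formula m = F^(1/2) implies, for instance, that a hundredfold rise in a word's relative frequency should bring only a tenfold rise in its number of meanings. A one-function sketch of that arithmetic (the frequency values passed in are arbitrary illustrations):

```python
import math

def predicted_meanings(relative_frequency):
    # Zipf's law: the number of meanings m tends to equal the square
    # root of the word's relative frequency F, i.e. m = F ** 0.5.
    return math.sqrt(relative_frequency)

# A word 100 times as frequent as another is predicted to have only
# 10 times as many meanings, not 100 times as many.
ratio = predicted_meanings(400) / predicted_meanings(4)
print(ratio)  # 10.0
```

This sub-linear growth is why even the commonest words, though highly polysemantic, do not have hundreds of senses apiece.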
One of the most promising trends in statistical enquiries is the analysis of collocability of words. It is observed that words are joined together according to certain rules. The linguistic structure of any string of words may be described as a network of grammatical and lexical restrictions.1
The set of lexical restrictions is very complex. On the standard probability scale the possibility of a combination of lexical units ranges from zero (impossibility) to one (certainty).
Of considerable significance in this respect is the fact that a high frequency value of individual lexical items does not forecast a high frequency of the word-group formed by these items. Thus, e.g., the adjective able and the noun man are both included in the list of the 2,000 most frequent words; the word-group an able man, however, is very rarely used.
1 See ‘Word-Groups and Phraseological Units’, §§ 1, 2, pp. 64, 66.
The importance of frequency analysis of word-groups is indisputable as in speech we actually deal not with isolated words but with word-groups. Recently attempts have been made to elucidate this problem in different languages both on the level of theoretical and applied lexicology and lexicography.
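The point about able and man can be put numerically: under the (false) assumption that words combine at random, the expected frequency of a word-group is its members' probabilities multiplied together and scaled by corpus size, and even that tiny figure can overstate the observed frequency. The corpus figures below are invented for illustration.

```python
# Invented corpus figures; only the reasoning is of interest.
corpus_size = 1_000_000   # running words in the sample
freq_able = 200           # occurrences of "able"
freq_man = 2_000          # occurrences of "man"

# If "able" and "man" combined purely at random, the expected number of
# adjacent "able man" pairs would be N * P(able) * P(man).
p_able = freq_able / corpus_size
p_man = freq_man / corpus_size
expected_bigram = corpus_size * p_able * p_man
print(expected_bigram)  # well under one occurrence per million words
```

Comparing such an expected count with the observed count is the basic move behind modern collocation statistics: a word-group is interesting precisely when the two figures diverge sharply.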
It should be pointed out, however, that the statistical study of vocabulary has some inherent limitations.
Firstly, the statistical approach is purely quantitative, whereas most linguistic problems are essentially qualitative. To put it in simpler terms, quantitative research implies that one knows what to count, and this knowledge is reached only through a long period of qualitative research carried on upon the basis of certain theoretical assumptions.
For example, even simple numerical word counts presuppose a qualitative definition of the lexical items to be counted. Different questions arise in this connection, e.g. is the orthographical unit work to be considered as one word or as two different words: work n — (to) work v?
Are all word-groups to be viewed as consisting of so many words or are some of them to be counted as single, self-contained lexical units? We know that in some dictionaries word-groups of the type by chance, at large, in the long run, etc. are counted as one item though they consist of at least two words, in others they are not counted at all but viewed as peculiar cases of usage of the notional words chance, large, run, etc. Naturally the results of the word counts largely depend on the basic theoretical assumption, i.e. on the definition of the lexical item.1
We also need a qualitative description of the language in deciding whether we deal with one item or more than one, e.g. in distinguishing between two homonymous words and different meanings of one word.2 It follows that before counting homonyms one must have a clear idea of what difference in meaning is indicative of homonymy. From the discussion of the linguistic problems above we may conclude that an exact and exhaustive definition of the qualitative linguistic aspects of the items under consideration must precede the statistical analysis.
Secondly, we must admit that not all linguists have the mathematical equipment necessary for applying statistical methods. In fact, what is often referred to as statistical analysis is a purely numerical count of this or that linguistic phenomenon, not involving the use of any mathematical formula, which in some cases may be misleading.
Thus, statistical analysis is applied in different branches of linguistics including lexicology as a means of verification and as a reliable criterion for the selection of the language data provided qualitative description of lexical items is available.
1 See also ‘Various Aspects ...’, § 12, p. 195.
2 See ‘Semasiology’, §§ 37, 38, pp. 43, 44.

§ 3. Immediate Constituents Analysis

The theory of Immediate Constituents (IC) was originally elaborated as an attempt to determine the ways in which lexical units are relevantly related to one another. It was discovered that combinations of such units are usually structured into hierarchically arranged sets of binary constructions. For example, in the word-group a black dress in severe style we do not relate a to black, black to dress, dress to in, etc., but set up a structure which may be represented as a black dress / in severe style. Thus the fundamental aim of IC
analysis is to segment a set of lexical units into two maximally independent sequences or ICs thus revealing the hierarchical structure of this set. Successive segmentation results in Ultimate Constituents (UC), i.e. two-facet units that cannot be segmented into smaller units having both sound-form and meaning. The Ultimate Constituents of the word-group analysed above are: a | black | dress | in | severe | style.
The meaning of the sentence, word-group, etc. and the IC binary segmentation are interdependent. For example, fat major’s wife may mean either that ‘the major is fat’ or that ‘his wife is fat’. The former semantic interpretation presupposes the IC analysis into fat major’s | wife, whereas the latter reflects a different segmentation into ICs, namely fat | major’s wife.
It must be admitted that this kind of analysis is arrived at by reference to intuition and it should be regarded as an attempt to formalise one’s semantic intuition.
It is mainly to discover the derivational structure of words that IC
analysis is used in lexicological investigations. For example, the verb denationalise has both a prefix de- and a suffix -ise (-ize). To decide whether this word is a prefixal or a suffixal derivative we must apply IC
analysis.1 The binary segmentation of the string of morphemes making up the word shows that *denation or *denational cannot be considered independent sequences as there is no direct link between the prefix de- and nation or national. In fact no such sound-forms function as independent units in modern English. The only possible binary segmentation is de | nationalise, therefore we may conclude that the word is a prefixal derivative.
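The successive binary cuts of denationalise can be pictured as nested pairs, and flattening that structure yields the Ultimate Constituents. A small sketch follows; the tuple encoding is an ad hoc device for illustration, not a standard linguistic notation.

```python
# Binary IC cuts of "denationalise": de- | nationalise, then
# nationalise = national + -ise, then national = nation + -al.
# Each node is either a single morpheme (a string) or a binary pair.
ic_tree = ("de-", (("nation", "-al"), "-ise"))

def ultimate_constituents(node):
    # Recursively flatten the binary IC tree into its morphemes (UCs).
    if isinstance(node, str):
        return [node]
    left, right = node
    return ultimate_constituents(left) + ultimate_constituents(right)

print(ultimate_constituents(ic_tree))  # ['de-', 'nation', '-al', '-ise']
```

Note that the top-level pair directly encodes the conclusion in the text: the first cut is de- | nationalise, so the word is a prefixal derivative, while the morphemes recovered by flattening give its morphemic structure.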
There are also numerous cases when the identical morphemic structure of different words is insufficient proof of an identical pattern of their derivative structure, which can be revealed only by IC analysis. Thus, comparing, e.g., snow-covered and blue-eyed we observe that both words contain two root-morphemes and one derivational morpheme. IC analysis, however, shows that whereas snow-covered may be treated as a compound consisting of two stems snow + covered, blue-eyed is a suffixal derivative, as the underlying structure shown by IC analysis is different, i.e. (blue + eye) + -ed.
It may be inferred from the examples discussed above that ICs represent the word-formation structure while the UCs show the morphemic structure of polymorphic words.
Distributional analysis in its various forms is