Information extraction from the web using a search engine Citation for published version (apa)


$\sum_{t \in T} f(t)$ gives the sum of all hits.
After having computed the scores for each of the subcategories, we assign the
subcategory $r_{\max}$ with the highest score to v. As a term can be within
multiple categories, we also add the subcategories with at least half the score
of $r_{\max}$. Hence, we add each subcategory $r_i$ for which the following holds:

$$s_v(r_i) \geq 0.5 \cdot s_v(r_{\max}) \tag{5.2}$$
We will use the subcategories in the mapping techniques described in the next
three subsections.
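
The selection rule of (5.2) can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `select_subcategories`, the dictionary layout, and the example scores are assumptions made for this sketch.

```python
def select_subcategories(scores, threshold=0.5):
    """Return r_max plus every subcategory scoring at least threshold * s_v(r_max).

    `scores` maps a subcategory name to its score s_v(r) (hypothetical layout).
    """
    if not scores:
        return []
    best = max(scores.values())
    # Keep r_i whenever s_v(r_i) >= 0.5 * s_v(r_max), cf. (5.2).
    kept = [r for r, s in scores.items() if s >= threshold * best]
    # Present the highest-scoring subcategory first.
    return sorted(kept, key=lambda r: -scores[r])


# Illustrative scores only; they are not taken from the experiments.
example = {"birds": 8.0, "mammals": 4.0, "fish": 1.5}
selected = select_subcategories(example)
```

With these made-up scores, "mammals" is kept because its score equals exactly half of the best score, while "fish" is dropped.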
5.1.4 Term Mapping using Hyponym Patterns
We assume a set of patterns to be given that relate Dutch terms to their
hypernyms [Hearst, 1992]. IJzereef [2004] manually constructed such a set.
Having such a set of patterns, we combine the term v with each of the patterns
into queries. We query an expression (e.g. "such as puffins") and scan the
returned snippets for terms in T preceding the search term. Hence, the aim is
to find phrases (like "seabirds such as puffins") to determine broader terms for v.
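
As a rough sketch of this step, the Python fragment below builds a phrase query from a pattern and counts the thesaurus terms that directly precede the queried expression in a list of snippets. The helper names, the English pattern, and the two snippets are assumptions for illustration; retrieving snippets from a search engine is left out entirely.

```python
import re


def build_query(pattern, term):
    """Combine a pattern and the term v into a quoted phrase query."""
    return f'"{pattern} {term}"'


def count_preceding_terms(snippets, pattern, term, thesaurus):
    """Count occurrences of '<t> <pattern> <term>' for each thesaurus term t."""
    phrase = re.escape(f"{pattern} {term}")
    counts = {}
    for t in thesaurus:
        # Word boundaries keep 'birds' from matching inside 'seabirds'.
        regex = re.compile(r"\b" + re.escape(t) + r"\s+" + phrase + r"\b",
                           re.IGNORECASE)
        n = sum(len(regex.findall(s)) for s in snippets)
        if n:
            counts[t] = n
    return counts


# Hypothetical snippets standing in for search-engine results.
snippets = [
    "Seabirds such as puffins nest on steep cliffs.",
    "Many birds such as puffins dive for fish.",
]
found = count_preceding_terms(snippets, "such as", "puffins",
                              ["seabirds", "birds", "fish"])
```

The counts obtained this way could serve as the occurrence counts used when scoring candidate terms.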
For a term t ∈ T found within the snippets for query term v, we compute its
score $s_v(t)$ as follows.

$$s_v(t) = q(t, v) \cdot \mathit{oc}(t) \cdot \log \frac{C}{f(t)} \tag{5.3}$$
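We use q(t, v) as a penalty score for terms outside the subcategories found in
the previous section.

$$q(t, v) = \begin{cases}
1.0 & \text{if } t \text{ and } v \text{ share a subcategory} \\
0.3 & \text{if } t \text{ and } v \text{ share a main category} \\
0.1 & \text{if } t \text{ and } v \text{ share no category}
\end{cases}$$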
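
The score of (5.3) and the penalty q(t, v) might be computed as in the following sketch. It assumes that categories are available as Python sets, that oc(t) is an occurrence count in the snippets, and that the logarithm is the natural one; the source does not state the base, and all numbers below are invented.

```python
import math


def penalty(t_subcats, t_maincats, v_subcats, v_maincats):
    """q(t, v): 1.0 for a shared subcategory, 0.3 for a shared main category, else 0.1."""
    if t_subcats & v_subcats:
        return 1.0
    if t_maincats & v_maincats:
        return 0.3
    return 0.1


def hypernym_score(q, oc, f, C):
    """s_v(t) = q(t, v) * oc(t) * log(C / f(t)); natural log is an assumption."""
    return q * oc * math.log(C / f)


# Invented values: t shares a subcategory with v and occurs 3 times.
q = penalty({"birds"}, {"animals"}, {"birds"}, {"animals"})
score = hypernym_score(q, oc=3, f=1_000, C=1_000_000)
```

Because the factor log(C / f(t)) shrinks as f(t) grows, a very frequent candidate term needs more occurrences in the snippets to reach the same score as a rarer one.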

5.1 Improving the Accessibility of a Thesaurus-Based Catalog
The values for q(t, v) are chosen in a somewhat arbitrary way. We will return
to these choices when discussing the experimental results.
Using the scores, we compute a ranked list for the potential hypernym terms
for v found using this method.
5.1.5 Term Mapping using Enumeration Patterns
Snow et al. [2006; 2005] observe that related terms (or siblings) tend to
co-occur in enumerations. We can thus state that enumerated items share a
broader term. Hence, if we can observe which terms within T are siblings of v,
we can use the structure of the thesaurus to compute the broader term for v.
Similar to the approach described in the previous section, we select a number
of patterns expressing the RT relation. Again, we scan the snippets for terms
within the thesaurus. However, we do not score the terms found, but (all) their
broader terms. Hence, the presence of the term aalscholvers (cormorants)
contributes to the scores for watervogels (water birds), vogels (birds), and
dieren (animals).
A term t is hence scored using the presence of all its narrower terms
$NT^*(t)$ in the snippets.

$$s_v(t) = \sum_{s \in NT^*(t)} q(s, v) \cdot \mathit{oc}(s) \cdot \log \frac{C}{f(s)} \tag{5.4}$$
We assume that the broadest concepts (e.g. dieren, animals, with 26,000,000
hits) are in general more present on the web than narrower concepts (e.g.
watervogels, water birds, with 230,000 hits). Hence, we do take the distance of
s to t into account as the factor $C / f(s)$ penalizes common concepts. Again,
we compute a ranked list of potential hypernym terms using this
enumeration-based approach.
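
One possible way to evaluate (5.4) is to let every sibling s found in the snippets add its contribution to all of its direct and indirect broader terms, which is equivalent to summing over $NT^*(t)$ for each t. The sketch below assumes the thesaurus is given as a mapping from a term to its direct broader terms, that $NT^*(t)$ does not include t itself, and that the logarithm is natural; the hit counts are invented.

```python
import math


def ancestors(term, broader):
    """All direct and indirect broader terms of `term` (term itself excluded)."""
    seen, stack = set(), list(broader.get(term, ()))
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(broader.get(b, ()))
    return seen


def enumeration_scores(oc, f, C, broader, q):
    """Score each broader term t by summing over the found narrower terms s."""
    scores = {}
    for s, n in oc.items():
        contribution = q(s) * n * math.log(C / f[s])
        for t in ancestors(s, broader):
            scores[t] = scores.get(t, 0.0) + contribution
    return scores


# Hypothetical thesaurus fragment and counts.
broader = {"aalscholvers": ["watervogels"],
           "watervogels": ["vogels"],
           "vogels": ["dieren"]}
scores = enumeration_scores(oc={"aalscholvers": 2}, f={"aalscholvers": 100},
                            C=10_000, broader=broader, q=lambda s: 1.0)
```

Here `q` is passed as a function of s only, with v held fixed; this is a simplification of q(s, v) chosen for the sketch.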
5.1.6 Term Mapping using a Lexical Approach
We observe that hyponym-hypernym pairs that are lexically similar (e.g.
dienstverlenende beroepen and beroepen, earthworms and worms) occur
infrequently within the same sentence. In addition to the two approaches based
on web information extraction, we therefore adopt an approach using the
morphology of the terms.
If some term t in T is a suffix of v, then t may be a hypernym of v (e.g. if v
contains a preceding adjective). However, not all t that match a suffix of v
are indeed hypernyms of v. For example, the GTAA term ogen (eyes) is a suffix
of psychologen (psychologists).
However, if the computed categories for v do not overlap with the categories for
suffix t, it is not likely that the two are related. We therefore use the subcategories
as computed in Section 5.1.3 to filter out erroneous lexical mappings.
We construct a list of thesaurus terms that are suffixes of v and share a
subcategory with v. If no such terms exist, we create such a list of terms that
share a main category with v. The list is sorted by increasing length.
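
The lexical filter can be illustrated as follows. The function name, the set-based category lookups, and the category labels are assumptions for this sketch; only the suffix test, the subcategory-then-main-category fallback, and the ordering by increasing length come from the description above.

```python
def lexical_candidates(v, v_subcats, v_maincats, thesaurus, subcats, maincats):
    """Thesaurus terms that are suffixes of v and share a category with v."""
    suffixes = [t for t in thesaurus if t != v and v.endswith(t)]
    shared = [t for t in suffixes if subcats.get(t, set()) & v_subcats]
    if not shared:
        # Fall back to main categories only if no subcategory match exists.
        shared = [t for t in suffixes if maincats.get(t, set()) & v_maincats]
    # Sort by increasing length; ties are broken alphabetically.
    return sorted(shared, key=lambda t: (len(t), t))


# Hypothetical category assignments.
thesaurus = ["beroepen", "ogen"]
subcats = {"beroepen": {"work"}, "ogen": {"body"}}
maincats = {"beroepen": {"society"}, "ogen": {"nature"}}

kept = lexical_candidates("dienstverlenende beroepen", {"work"}, {"society"},
                          thesaurus, subcats, maincats)
rejected = lexical_candidates("psychologen", {"work"}, {"society"},
                              thesaurus, subcats, maincats)
```

With these invented categories, beroepen is kept as a candidate for dienstverlenende beroepen, whereas ogen is filtered out for psychologen because the two share neither a subcategory nor a main category.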
5.1.7 Presenting the Results
Having independently found three lists of potentially relevant terms for the query
