Information extraction from the web using a search engine Citation for published version (apa)


$c_a$ and $c_g$, with sets of instances $I_a$ of size $n$ and $I_g$ of size $m$. We will again
use the shorthand notation $I_a$ and $I_g$ to refer to these sets. The class $c_a$ consists
of objects or concepts $i$ (pop artists, books, painters), while the instances $j$ of $c_g$
are relevant descriptions (labels, tags, genres). We are interested in the perceived
relatedness between the instances in $I_a$, as well as the applicability of the tags in
$I_g$ to these instances.
In this chapter, we focus on the following three problems.
The Instance Relatedness Problem. Given $O = (\{c_a\}, \{r\})$, where $r$ expresses the
"is related to" relation with $c_a$ as both subject and object class, populate $O$, where
for each relation instance pair $(i, i') \in I_a \times I_a$ we are interested in the degree
of relatedness $t(i, i')$ of instance $i'$ with respect to $i$. □
Definition. Given a set of tags $I_g$ and an instance $i \in I_a$, we call a tag $j \in I_g$
most appropriate for $i$ if a domain expert would select $j$ from the set $I_g$ as the label
best applicable to $i$. □
The Instance Categorization Problem. Given $O = (\{c_a, c_g\}, \{r\})$, where $r$ expresses
the applicability relation with subject class $c_a$ and object class $c_g$, populate $O$,
where for each instance $i \in I_a$ we are interested in the most appropriate tag $m(i)$
in $I_g$. □
The Instance Tagging Problem. Given $O = (\{c_a, c_g\}, \{r\})$, where $r$ expresses the
relation between the two classes, populate $O$, where for each relation instance pair
$(i, j) \in I_a \times I_g$ we are interested in the degree of applicability $p(i, j)$ of
tag $j$ with respect to instance $i$. □
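The three problems differ mainly in the shape of the output they ask for. The following is a minimal sketch of those shapes, not the method of this thesis. The instance sets are toy examples, the scores in `p` are made up, and the helper names (`relatedness_pairs`, `tagging_pairs`, `categorize`) are hypothetical. In particular, taking the highest-scoring tag as a candidate for $m(i)$ is an assumption made for illustration only.

```python
from itertools import product

# Toy instance sets; sizes n = 2 and m = 2 (illustrative only).
I_a = ["Johnny Cash", "U2"]   # instances i of class c_a
I_g = ["country", "rock"]     # instances (tags) j of class c_g


def relatedness_pairs(instances):
    """All ordered pairs (i, i') in I_a x I_a for which t(i, i') is asked."""
    return list(product(instances, repeat=2))


def tagging_pairs(instances, tags):
    """All pairs (i, j) in I_a x I_g for which p(i, j) is asked."""
    return list(product(instances, tags))


def categorize(p, instances, tags):
    """Map each instance i to one tag, as a candidate for m(i).

    Assumption: the tag with the highest applicability score p(i, j)
    is used as a stand-in for the tag a domain expert would select.
    """
    return {i: max(tags, key=lambda j: p[(i, j)]) for i in instances}


# Made-up applicability scores, purely for illustration.
p = {
    ("Johnny Cash", "country"): 0.9,
    ("Johnny Cash", "rock"): 0.4,
    ("U2", "country"): 0.1,
    ("U2", "rock"): 0.8,
}

m = categorize(p, I_a, I_g)
```

Relatedness and tagging each ask for one score per pair ($n^2$ and $n \cdot m$ values respectively), whereas categorization asks for a single tag per instance.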
We first present two alternative approaches to extract relation instances on the
web. In the next section, we present methods to address each of the three problems
described above.
6.1.2 Two Alternatives to the Pattern-based Approach
We present two methods to extract information from the web, as alternatives to the one
described in Chapters 2 and 3. In Section 5.2 we have already seen that the pattern-based
approach does not suit all information demands. As the relation between an instance in
$I_a$ and a tag is not always expressed within a sentence, we compare our pattern-based
approach with two alternatives.
We base these methods on co-occurrences on the web. If the instances Johnny Cash and U2
are often mentioned in the same context, we can conclude that these instances are related
in some sense. The co-occurrences of instances on the web form the basis of our approach
to the problems defined in this chapter. After discussing the alternatives, Section 6.2
handles the processing of these co-occurrences to solve the problems addressed in this
chapter.
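The notion of co-occurrence can be illustrated with a small sketch. It assumes that the "context" is a short text (for example a page or a snippet) and that an instance is mentioned when its name appears as a case-insensitive substring. The function name `cooccurrence_counts`, the placeholder instance "Some Other Artist" and the example texts are hypothetical and not taken from this thesis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(instances, documents):
    """Count, per unordered pair of instances, the documents mentioning both.

    Assumption: a mention is a case-insensitive substring match, which is
    a simplification that can produce false matches for short names.
    """
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        present = [i for i in instances if i.lower() in text]
        # Sort so that each unordered pair has one canonical key.
        for a, b in combinations(sorted(present), 2):
            counts[(a, b)] += 1
    return counts


# Invented example texts standing in for web documents.
documents = [
    "A text that mentions Johnny Cash and U2 together.",
    "Another text about U2 that also names Johnny Cash.",
    "A text that only mentions Some Other Artist.",
]

counts = cooccurrence_counts(
    ["Johnny Cash", "U2", "Some Other Artist"], documents
)
```

Here the pair (Johnny Cash, U2) is counted twice, while pairs involving the third instance are never counted. How such counts are turned into scores is the subject of Section 6.2.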
Pattern-based Method (PM). The Pattern-based Method (PM) is based on the methods
described in the first part of this thesis. To find occurrences of relation instances
$(i, j)$ in I
