\[
F_\alpha(c_j) = \frac{(1 + \alpha) \cdot \mathrm{precision}(c_j) \cdot \mathrm{recall}(c_j)}{\alpha \cdot \mathrm{precision}(c_j) + \mathrm{recall}(c_j)} \tag{2.2}
\]
The F-measures for evaluating the populated relations are formulated similarly.
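As a minimal sketch of how the F-measure of Eq. (2.2) combines the two scores, the function below takes a precision and a recall value that have already been computed for a class. The function name f_measure and the convention of returning 0 when the denominator is zero are assumptions made for this illustration; they are not part of the original text.

```python
def f_measure(precision: float, recall: float, alpha: float = 1.0) -> float:
    """Weighted F-measure following Eq. (2.2).

    precision, recall: values in [0, 1] for a class c_j.
    alpha: weighting parameter; alpha = 1 gives the harmonic mean.
    """
    denominator = alpha * precision + recall
    if denominator == 0:
        # Assumed convention: both scores are zero, so report 0.
        return 0.0
    return (1 + alpha) * precision * recall / denominator


if __name__ == "__main__":
    # Illustrative values only, not results from the text.
    print(f_measure(0.8, 0.4))        # alpha = 1: harmonic mean
    print(f_measure(0.8, 0.4, 0.0))   # alpha = 0: reduces to precision
```

With alpha = 1 the expression reduces to the harmonic mean of precision and recall; with alpha = 0 it reduces to the precision alone, so smaller values of alpha put more weight on precision.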
As discussed, measuring precision and recall requires a ground truth ontology.
For some information demands, we cannot expect such an ontology or any
other form of structured data to exist. Moreover, information extraction tasks with
a known ground truth are not very interesting from an application point of view.
In cases where no ground truth is available, precision is typically estimated by
manually inspecting a sample subset of the instances found. Recall is estimated
using an (incomplete) set of known instances of the class. For example, if we are
interested in an ontology with musical artists, a complete list of such artists is not
likely to be known. However, we can take a set of known or
relevant instances (e.g. famous musical artists extracted from structured sources
such as Last.fm or Wikipedia) and express the recall relative to this list.
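The estimation procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code behind the text: the function names, the fixed random seed, and the use of exact string matching between extracted and known instances are all hypothetical choices, and the artist names are made-up example data.

```python
import random


def sample_for_inspection(extracted, sample_size, seed=0):
    """Draw a random subset of the extracted instances for manual judging."""
    items = sorted(extracted)          # fixed order so the sample is reproducible
    rng = random.Random(seed)
    return rng.sample(items, min(sample_size, len(items)))


def estimated_precision(judgements):
    """judgements maps each sampled instance to True (correct) or False."""
    if not judgements:
        return 0.0
    return sum(judgements.values()) / len(judgements)


def estimated_recall(extracted, known_instances):
    """Fraction of the (incomplete) list of known instances that was found.

    Assumes exact string matching; real data would likely need
    name normalization first.
    """
    known = set(known_instances)
    if not known:
        return 0.0
    return len(known & set(extracted)) / len(known)


if __name__ == "__main__":
    extracted = {"The Beatles", "Miles Davis", "Radiohead", "Amsterdam"}
    known = {"The Beatles", "Miles Davis", "Nina Simone", "Radiohead"}

    sample = sample_for_inspection(extracted, sample_size=2)
    # In practice a human assessor supplies these verdicts for the sample.
    verdicts = {name: name != "Amsterdam" for name in sample}

    print(estimated_precision(verdicts))
    print(estimated_recall(extracted, known))
```

Because the list of known instances is incomplete, the recall obtained this way is relative to that list only, and the precision estimate depends on the size of the inspected sample.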
A separate aspect of the evaluation is the notion of correctness. We cannot
assume that all correctly extracted statements are indeed true. However, based on
the expected redundancy of information on the web, we expect factual information
to be identifiable.
More complex to evaluate are subjective relations, such as the relation between
a musical artist and a genre as regarded by the web community. Nevertheless,
web information extraction techniques may be valuable for such information
demands, as subjective information is less likely to be represented in a structured
manner. We return to this topic in Chapter 6.
2.2 Extracting Information from the Web using Patterns
The ontology population problem can be split into two concerns.