to be given. We are interested in exploiting methods to learn new terms for the set I_g.
This can for instance be done with the tf·idf approach [Knees et al., 2004; Manning
& Schütze, 1999].
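As a minimal sketch of how tf·idf can surface new characteristic terms, the following ranks the terms of each document by their tf·idf score; high-scoring terms are candidates for addition to the term set. The documents and the cutoff are hypothetical, and the exact weighting variant used in the cited work may differ.

```python
import math
from collections import Counter

def tf_idf_terms(docs, top_n=3):
    """Rank the terms of each document by tf-idf score."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    ranked = []
    for toks in tokenized:
        tf = Counter(toks)
        # tf-idf: relative term frequency times inverse document frequency
        scores = {t: (count / len(toks)) * math.log(n / df[t])
                  for t, count in tf.items()}
        ranked.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return ranked

docs = [
    "guitar rock band guitar solo",
    "piano jazz trio piano improvisation",
    "rock band tour dates",
]
print(tf_idf_terms(docs))
```

Terms that are frequent in one document but rare in the collection (here, for instance, "guitar" in the first document) receive the highest scores and are thus proposed as new terms.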
Figure 6.18. Average precision and recall for the top 25 tags for the books in
LibraryThing (curves shown for the top 100, top 250, and top 500).
6.5 Conclusions
In this chapter we have focused on the extraction of information from the web
that is not present as such. By combining information from various sources, we
have created characterizations of instances based on the collective knowledge on
the web. We have argued that such collaborative characterizations are often more
informative than the ones provided by experts.
We presented a simple method to identify relatedness among instances in one
class (e.g. artists) using co-occurrences found with web information extraction
methods. With similar techniques we find the most applicable category (e.g. a
genre) for each instance in a given set. The last problem addressed focuses on
the identification of a ranked list of tags applicable to an instance, based on the
information available on the web.
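The relatedness of two instances can, for example, be estimated from co-occurrence counts. The following is a sketch under assumed hit counts: given the number of pages mentioning each instance and the number mentioning both, a Jaccard-style score rewards instances that frequently appear together. The concrete measure used in the thesis may differ.

```python
def relatedness(hits_a, hits_b, hits_ab):
    """Symmetric co-occurrence score between two instances, based on
    hypothetical search-engine hit counts: pages mentioning instance a,
    instance b, and both together (a Jaccard-style measure)."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union else 0.0

# toy counts: pages mentioning each of two artists, and both together
print(relatedness(1000, 800, 300))  # 300 / 1500 = 0.2
```

Pairs with a high score relative to other pairs are then considered perceived as related by the web community.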
The experimental results for the identification of the relatedness among in-
stances and the categorization of instances are both convincing. We have shown
that the use of the identified relatedness improves the categorization.
In the last part of this chapter, we focused on a novel task in web informa-
tion extraction. We identified an ordered list of tags for a given set of instances.
The computed lists were compared with a ground truth for a social website (e.g.
Last.fm). Although the results of both tagging experiments are modest, we are
encouraged given the difficulty of the task. As we have sketched directions for
improvement, we hope that this work inspires continued research on the tagging
of instances using web information extraction.
To evaluate the methods presented, we compared the output of the method with
benchmarks composed by experts as well as sets collected from social web sites.
Although the use of these benchmark sets gives valuable insights in the quality of
the output, it will be interesting to analyze the use of extracted community data in
applications. For example, the questions remain how the extracted information can
contribute to a recommender system, and how the performance compares with the
use of information gathered from other sources.
7 Conclusions
Intelligent applications can benefit from the collective knowledge of the internet
community as found on the web. However, the vast majority of the information
on the web is represented in a human-friendly format using natural language
texts. Such information is not machine interpretable.
In this thesis, we presented approaches to find, extract and structure informa-
tion from natural language texts on the web. Such structured information can be
machine interpreted and hence be used in intelligent applications.
Information extraction is the task of identifying instances of classes and their
relations in a text corpus. We adopted the concept of ontology to model the infor-
mation demand. The information extraction problem is translated into an ontology
population problem.
We proposed a simple ontology population method using patterns. Patterns
are commonly occurring phrases that are typically used to express a given relation.
Patterns are accompanied by placeholders for instances, for example [City]
is the capital of [Country]. We combine such patterns and known instances into
search engine queries (e.g. Amsterdam is the capital of). Subsequently, we extract
instances and relations from the retrieved documents. The use of the constructed
queries serves two goals: on the one hand, it proves to be an effective mechanism
to access highly relevant texts, while on the other hand we can identify relations
between instances.
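The pattern-plus-instance querying can be sketched as follows. Since no live search engine is queried here, the snippets are hypothetical stand-ins for search results, and the regular expression for the object slot is a simplified assumption.

```python
import re

# Hypothetical snippets, as a search engine might return them for the
# query "Amsterdam is the capital of".
snippets = [
    "Amsterdam is the capital of the Netherlands.",
    "It is well known that Amsterdam is the capital of the Netherlands",
    "Paris is the capital of France.",
]

def extract_objects(pattern, subject, texts):
    """Fill the subject into the pattern (e.g. '[City] is the capital of')
    and extract the capitalized word that follows it in each text."""
    query = pattern.replace("[City]", subject)  # "Amsterdam is the capital of"
    regex = re.compile(re.escape(query) + r"\s+(the\s+)?([A-Z]\w*)")
    return [m.group(2) for t in texts for m in [regex.search(t)] if m]

print(extract_objects("[City] is the capital of", "Amsterdam", snippets))
# ['Netherlands', 'Netherlands']
```

Each match yields a candidate relation instance (Amsterdam, capital-of, Netherlands); snippets that do not contain the filled-in pattern contribute nothing.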
State-of-the-art search engines restrict the number of automated queries they
accept per day. We hence analyze our methods by their Google Complexity.
After having discussed a general approach to populate an ontology in Chap-
ter 2, we focused on two subproblems: the identification of effective patterns and
the recognition of the instances of the defined classes in texts. The presented ap-
proach contains bootstrapping mechanisms, as learned instances and patterns are
used to formulate new search engine queries.
We make use of the redundancy of information on the web. Many statements
(i.e. subject – relation – object triples) can be found on various pages using diverse
formulations. We use this characteristic of the web as a corpus to filter out
erroneously extracted data. The more often a statement is identified on the web,
the higher the confidence in its correctness.
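This redundancy-based filtering can be illustrated with a small sketch: triples extracted from different pages are counted, and only those seen in a sufficient fraction of the extractions are kept. The triples and the threshold are illustrative assumptions, not the thesis's exact parameters.

```python
from collections import Counter

# (subject, relation, object) triples as extracted from different pages;
# duplicates reflect the redundancy of the web.
extracted = [
    ("Amsterdam", "capital_of", "Netherlands"),
    ("Amsterdam", "capital_of", "Netherlands"),
    ("Amsterdam", "capital_of", "Netherlands"),
    ("Amsterdam", "capital_of", "Europe"),  # erroneous extraction
]

counts = Counter(extracted)
total = len(extracted)
# keep only statements supported by at least half of the extractions
# (the 0.5 threshold is an assumption for this sketch)
accepted = [t for t, c in counts.items() if c / total >= 0.5]
print(accepted)
```

The frequently observed triple survives the filter, while the one-off erroneous extraction is discarded.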
We have argued that precision is not the only criterion for a pattern to be
effective. The patterns identified in the case-studies are recognizable
formulations of the corresponding relation.
To recognize instances in web texts, and more specifically in snippets, we pre-
sented two alternative approaches. On the one hand, we can identify instances
using a knowledge-oriented approach, where regular expressions are created to
match instances of a given class. On the other hand, we presented a data-oriented
approach. Given a set of known instances, a collection of texts is annotated. A clas-
sifier uses the annotated texts as training set in order to recognize new instances in
the other texts.
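The knowledge-oriented approach can be illustrated with a hand-written regular expression that matches instances of a class in a snippet. The class (years in a plausible historical range) and the snippet are hypothetical examples, not taken from the thesis's case-studies.

```python
import re

# Knowledge-oriented sketch: a regular expression crafted to match
# instances of a class, here years between 1800 and 2029.
year_pattern = re.compile(r"\b(1[89]\d{2}|20[0-2]\d)\b")

snippet = "Van Gogh was born in 1853 and died in 1890."
print(year_pattern.findall(snippet))  # ['1853', '1890']
```

The data-oriented alternative replaces such hand-crafted rules with a classifier trained on texts in which known instances have been annotated.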
The methods discussed are illustrated with several case-studies. In the thesis,
we focused on three tasks in web information extraction. In order to benchmark
our method, we extract
facts from the web. The second part focuses on two ap-
plications of web information extraction, while the last part of the thesis focuses
on the discovery of information. By combining content of multiple documents,
we create community-based descriptions for instances such as books, painters and
popular artists.
In the case-studies in Chapter 4 we show that we can precisely identify relation
instances using the pattern-based approach. Good results were achieved in the
studied cases, both with manually constructed patterns and with learned patterns.
The use of the pattern-instance combinations in queries is an effective approach
to access relevant search results. We have shown that the redundancy of
information on the web enables us to precisely identify instances using the rule-based
approach. For the data-oriented approach, the use of a large and representative set
of known instances is crucial.
In Chapter 5, we discussed two applications of web information extraction. In
the first part of the chapter, we presented a method to map arbitrary terms to a
semantically related term in a given thesaurus. As the thesaurus is used to index a
collection, the access to this collection is improved.
For those interested in the retrieval of the lyrics of a given song, we developed
an application that extracts versions of the lyrics from the web and combines them
into a most plausible version. The results of the experiments are convincing.
Chapter 6 focuses on community-based data. We are interested in the
identification of characterizations of instances like musical artists and painters, based
on the wisdom of the crowds as expressed on the web. We have discussed three tasks
in the identification of community-based data: the identification of the perceived
relatedness between instances, the categorization of instances and the tagging of
instances. The experimental results for the identification of the relatedness among
instances and the categorization of instances are both convincing. We have shown
that the use of the identified relatedness improves the categorization. With respect
to the tagging of instances using texts on the web, no comparable previous work
is known. Although the results of both tagging experiments are modest, we are
encouraged given the difficulty of the task. We have developed an algorithm to
generate a dynamic ground truth to evaluate the tagging of instances, which
facilitates this challenging research beyond the categorization of instances.
In this thesis we have shown that we can extract information from the web in
a simple and efficient manner. By combining and structuring information from the
web, we create a valuable surplus to the knowledge already available. The iden-
tification of collective knowledge and opinions is perhaps more interesting than
collecting plain facts, which often can be mined from semi-structured sources.
As the web is an ever growing corpus of texts, and intelligent applications can
benefit from the extracted information, the future for web information extraction is
bright and promising.
Bibliography
Abney, S., Collins, M., & Singhal, A. [2000]. Answer extraction. In Proceedings