Information extraction from the web using a search engine


on music information retrieval (ismir’07) (pp. 119 – 120). Vienna, Austria:
Austrian Computer Society.
• Geleijnse, G., Korst, J. [2008]. Search engine-based web information
extraction. In J. Cardoso, M. Lytras (Eds.), Semantic web engineering in the
knowledge society.
• Geleijnse, G., Korst, J., De Boer, V. [2006]. Instance classification using
co-occurrences on the web. In Proceedings of the iswc 2006 workshop on web
content mining with human language technologies (webconmine). Athens,
GA. (http://orestes.ii.uam.es/workshop/3.pdf)
• Geleijnse, G., Korst, J., Pronk, V. [2006]. Google-based information
extraction. In F. de Jong, W. Kraaij (Eds.), Proceedings of the 6th dutch-belgian
information retrieval workshop (dir 2006) (pp. 39 – 46). Delft, the
Netherlands: Neslia Paniculata, Enschede.
• Geleijnse, G., Schedl, M., Knees, P. [2007]. The quest for ground truth
in musical artist tagging in the social web era. In S. Dixon, D. Bainbridge,
R. Typke (Eds.), Proceedings of the eighth international conference on music
information retrieval (ismir’07) (pp. 525 – 530). Vienna, Austria: Austrian
Computer Society.
• Geleijnse, G., Sekulovski, D., Korst, J., Kater, B., Pauws, S., Vignoli, F.
[2008]. Enriching music with synchronized lyrics, images and colored lights.
In First international conference on ambient media and systems
(ambi-sys 2008). Quebec, QC, Canada.
• Korst, J., Geleijnse, G. [2006]. Efficient lyrics retrieval and alignment. In
W. Verhaegh, E. Aarts, W. ten Kate, J. Korst, S. Pauws (Eds.), Proceed-
ings third philips symposium on intelligent algorithms (soia 2006) (pp. 205
– 218). Eindhoven, the Netherlands.
• Korst, J., Geleijnse, G., De Jong, N., Verschoor, M. [2006]. Ontology-based
extraction of information from the World Wide Web. In W. Verhaegh,
E. Aarts, J. Korst (Eds.), Intelligent algorithms in ambient and biomedical
computing (pp. 149 – 167). Heidelberg, Germany: Springer.
• Sekulovski, D., Geleijnse, G., Kater, B., Korst, J., Pauws, S., Clout, R.
[2008]. Enriching text with images and colored light. In Proceedings of the
is&t/spie 20th annual electronic imaging symposium. San Jose, CA.
The following patent applications relate to this thesis.
• Geleijnse, G., Korst, J. [2005]. Method, system and device for obtaining a
representation of a text. WO2007057799.
• Geleijnse, G., Korst, J., Sekulovski, D. [2007]. Method and apparatus for
enabling simultaneous reproduction of a first media item and a second media
item. Filed.
• Korst, J., Geleijnse, G. [2005]. Method of obtaining a representation of a
text. WO2007057809.
• Korst, J., Geleijnse, G., Pauws, S. [2006]. Method and electronic device for
aligning a song with its lyrics. WO2007129250.

Summary
Information Extraction from the Web using
a Search Engine
The web is currently the de facto source for finding an arbitrary piece of
information. Intelligent applications can benefit from the collective knowledge
of the internet community as found on the web. However, the vast majority of
the information on the web is represented in a human-friendly format using
natural language texts. Such information in natural language texts is not
machine interpretable. As intelligent applications, for example recommender
systems, may benefit from structured information, we focus on the extraction of
information from the web.
We present approaches to find, extract and structure information from natural
language texts on the web. Such structured information can be expressed and
shared using the standard semantic web languages and hence be machine
interpreted.
Information extraction focuses on the identification of instances (given names
and terms like Technische Universiteit Eindhoven, Carla Bruni and Amarillo) of
classes (e.g. university, person, or location). Apart from the identification
of such instances, their relations are to be discovered in a collection of
texts (e.g. the relation between Amsterdam and the Netherlands). Inspired by
the semantic web community, we specify the information demand (e.g. ‘Find the
names of all countries in the world’, ‘Given a list of pop artists, which one
is said to be most related to Amy Winehouse?’) using an ontology. The
information extraction problem is expressed in terms of an ontology population
problem.
Other information extraction tasks focus on the corpus rather than on the
ontology. Where their goal is to identify all instances and relations as
expressed in the texts, our goal is to find only the demanded information
expressed in the initial ontology. As information on the web can be assumed to
be redundantly available, we do not have to recognize each formulation of a
fact of interest. For example, the statement Amsterdam is the capital of the
Netherlands is expressed in many texts using diverse formulations. To extract
this information from the web, we may not have to recognize all of the
encountered formulations.
In the thesis, a simple ontology population method using patterns is proposed.
Patterns are commonly occurring phrases that are typically used to express a
given relation. We combine such patterns and known instances into search
engine queries. For example, if Anton Philips is a known instance and ‘was a’
is a pattern expressing the relation between a person and his profession, we
combine the two into the query “Anton Philips was a”. Subsequently, we extract
instances and relations from the retrieved documents. The use of the
constructed queries proves to be an effective mechanism to access highly
relevant texts on the one hand and to identify relations on the other hand.
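The query construction described above can be sketched in a few lines of
Python. This is a minimal illustration, not the implementation used in the
thesis: the function names, the regular expression and the sample snippets are
assumptions made for the example, and the snippets merely stand in for texts
that a search engine would return (no real search API is called).

```python
import re
from collections import Counter

# Made-up snippets standing in for search engine results (assumption).
SNIPPETS = [
    "Anton Philips was a businessman from Zaltbommel.",
    "It is often noted that Anton Philips was a businessman.",
    "Anton Philips was a collector of art.",
]


def build_query(instance: str, pattern: str) -> str:
    """Combine a known instance and a pattern into an exact-phrase query."""
    return f'"{instance} {pattern}"'


def extract_candidates(instance: str, pattern: str, snippets: list) -> Counter:
    """Count the single word that directly follows the queried phrase.

    Taking exactly one word is a simplification chosen for this sketch.
    """
    phrase = re.escape(f"{instance} {pattern}")
    regex = re.compile(phrase + r"\s+(\w+)", re.IGNORECASE)
    counts = Counter()
    for snippet in snippets:
        for match in regex.finditer(snippet):
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    print(build_query("Anton Philips", "was a"))
    print(extract_candidates("Anton Philips", "was a", SNIPPETS).most_common())
```

Counting how often each candidate occurs is one simple way to exploit the
redundancy of information on the web: a candidate found in several
independent snippets is more likely to be correct than one found only once.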
After discussing a general approach to populate an ontology, we focus on two
subproblems: the identification of effective patterns and the recognition of
the instances of the defined classes in texts. The presented approach contains
a bootstrapping mechanism, as learned instances and patterns are used to
formulate new search engine queries.
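The bootstrapping idea can be illustrated with a small Python sketch. It is
written under strong simplifying assumptions and is not the algorithm of the
thesis: a toy list of sentences replaces the search engine, instances are
single capitalized words, and the helper names (learn_patterns,
apply_patterns, bootstrap) are hypothetical.

```python
import re

# Toy stand-in for the web (assumption): in the described approach these
# texts would be retrieved through search engine queries.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Berlin, the capital city of Germany, lies on the Spree.",
    "Madrid, the capital city of Spain, is a large city.",
]


def learn_patterns(pairs, corpus, max_tokens=6):
    """Collect the short phrases found between the two instances of a known pair."""
    patterns = set()
    for subject, obj in pairs:
        regex = re.compile(re.escape(subject) + r"(.+?)" + re.escape(obj))
        for sentence in corpus:
            match = regex.search(sentence)
            if match:
                phrase = match.group(1).strip()
                if phrase and len(phrase.split()) <= max_tokens:
                    patterns.add(phrase)
    return patterns


def apply_patterns(patterns, corpus):
    """Find new instance pairs that are connected by a known pattern."""
    pairs = set()
    for pattern in patterns:
        regex = re.compile(
            r"\b([A-Z]\w+)\s*" + re.escape(pattern) + r"\s+([A-Z]\w+)"
        )
        for sentence in corpus:
            for match in regex.finditer(sentence):
                pairs.add((match.group(1), match.group(2)))
    return pairs


def bootstrap(seed_pairs, corpus, max_rounds=10):
    """Alternate pattern learning and instance extraction until nothing changes."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(max_rounds):
        new_patterns = patterns | learn_patterns(pairs, corpus)
        new_pairs = pairs | apply_patterns(new_patterns, corpus)
        if new_patterns == patterns and new_pairs == pairs:
            break
        patterns, pairs = new_patterns, new_pairs
    return pairs, patterns


if __name__ == "__main__":
    found_pairs, found_patterns = bootstrap({("Paris", "France")}, CORPUS)
    print(sorted(found_pairs))
    print(sorted(found_patterns))
```

Starting from a single seed pair, the first round learns one pattern and finds
a second pair; that pair in turn exposes a second pattern, which is the only
route to the third pair in this toy corpus. A realistic system would also need
to filter unreliable patterns and instances, which this sketch omits.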
The approach is illustrated with several case studies. In order to benchmark
our method, we first extract facts from the web. The second part focuses on
the extraction of inferable information, i.e. information that is not present
as such on the web, but can be derived by combining data from multiple
documents. The last part of the thesis focuses on the discovery of
community-based information. By combining content of multiple documents, the
wisdom of the crowds, we create descriptions for instances such as books and
popular artists.
We show that we can reliably extract information from the web using simple
techniques. Furthermore, making use of the redundancy of information on the web,
the recall of relevant information is high for the studied domains.

Acknowledgements
First of all, I would like to thank Jan Korst, my daily supervisor at Philips
Research. I have greatly enjoyed our countless discussions and cups of coffee.
Together we learned a lot about language, music, history and cooking. Jan has
carefully read all my writings and has stimulated me to sharpen my arguments
and to aim for a deeper understanding.
I am grateful to my promotor Emile Aarts for his guidance and enthusiasm.
Emile always managed to surprise me with a new insight, a different
perspective or a challenging question. Our discussions were both fun and
stimulating.
The members of the core committee, Franciska de Jong, Paul De Bra and Paul
Buitelaar, have thoroughly read the draft version of my thesis. I would like
to thank them for their valuable feedback and remarks.
The Media Interaction group of Philips Research, currently named User
Experiences, has shown me that there is more to an algorithm than its
correctness and efficiency. I thank my roommates for making me feel at home in
the office. Zharko, Evelien, Edgar, Dario, Maria, Greg, Thérèse, Annick,
Carolien, Marco, Javed, Bram, Marjolein and Peter: thank you. A special thanks
goes to Dragan for all the discussions and reviewing. As a member of
successively the intelligent algorithms, the interaction algorithms and the
computational intelligence cluster, I found myself in a stimulating and
knowledgeable research environment. The algoritmenclub provided an inspiring
platform to broaden my scope. I would like to thank all cluster and club
members. I thank the management of Philips Research, in particular Maurice
Groten and Reinder Haakma, for providing a great working environment.
Finally, I’d like to thank friends and family, especially Coen, Janneke, Hans
and Stijn.

Biography
Gijs Geleijnse was born in Breda on June 12, 1979. In 1997 he received his VWO
diploma from the Newmancollege in Breda. In the same year he started his
computing science studies at Eindhoven University of Technology. In the summer
of 2004 he graduated with honors, with a specialization in formal methods. His
master’s thesis, supervised by Rob Nederpelt, deals with the comparison between
two formal languages for mathematics.
In September 2004, he started as a Holst junior in the Media Interaction Group
of Philips Research in Eindhoven. His work was carried out in the context of
the Dutch BSIK program MultimediaN. His research, which resulted in this
thesis, several papers, inventions and patent applications, was carried out
under supervision of Jan Korst and Emile Aarts.
In August 2008, Gijs joined the User Experiences group of Philips Research
as a research scientist.

Document Outline

Contents
1. Introduction
2. A pattern-based approach to web information extraction
3. Two subproblems in extracting information from the web using patterns
4. Evaluation: extracting factual information from the web
5. Application: extracting inferable information from the web
6. Discovering information by extracting community data
7. Conclusions
Bibliography
Publications
Summary
Acknowledgements
Biography
