Thesis (Selection of subject) (version: 368)
Thesis details
Structured Data Extraction from Unstructured Text
Thesis title in Czech: Extrakcia štruktúrovaných dát z neštruktúrovaného textu
Thesis title in English: Structured Data Extraction from Unstructured Text
Key words: extrakcia štruktúrovaných dát, extrakčné pravidlá, (semi)automatická indukcia wrapperov
English key words: structured data extraction, extraction rules, (semi)automatic wrapper induction
Academic year of topic announcement: 2011/2012
Thesis type: diploma thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: doc. Mgr. Martin Nečaský, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 04.11.2011
Date of assignment: 07.11.2011
Confirmed by Study dept. on: 03.07.2013
Date and time of defence: 09.09.2013 00:00
Date of electronic submission: 01.08.2013
Date of submission of printed version: 02.08.2013
Date of proceeded defence: 09.09.2013
Opponents: RNDr. Michal Kopecký, Ph.D.
 
 
 
Guidelines
The author of this thesis will address the problem of automatic structured data extraction from semi-formatted plain text [1,2,3]. The input consists of a collection of text documents, an ontology describing the data domain for which the data should be extracted, and a configuration file with extraction rules. Basic methods are currently being implemented as part of a student software project. In the thesis, the author will compare the success rate of the proposed extraction method with that of methods published in the current literature.
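The setup described above can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes, purely for illustration, that extraction rules are regular expressions keyed by ontology property names, and that success is measured as precision and recall against a hand-labelled set. The names RULES, SAMPLE, GOLD, extract and precision_recall, as well as the sample document and its properties, are all hypothetical.

```python
import re

# Hypothetical extraction rules: each key stands for an ontology property
# (assumed names, not taken from the thesis) and each value is a regular
# expression whose single capture group holds the extracted value.
RULES = {
    "title": re.compile(r"^Title:\s*(.+)$", re.MULTILINE),
    "price": re.compile(r"^Price:\s*(\d+(?:\.\d+)?)\s*EUR$", re.MULTILINE),
}


def extract(document, rules):
    """Return the set of (property, value) pairs the rules match in one document."""
    found = set()
    for prop, pattern in rules.items():
        for match in pattern.finditer(document):
            found.add((prop, match.group(1).strip()))
    return found


def precision_recall(extracted, expected):
    """Compare extracted pairs with a hand-labelled gold set."""
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Invented semi-formatted document and its gold annotation.
SAMPLE = "Title: Used bicycle\nPrice: 120 EUR\nContact: see below\n"
GOLD = {("title", "Used bicycle"), ("price", "120"), ("contact", "see below")}

result = extract(SAMPLE, RULES)
precision, recall = precision_recall(result, GOLD)
print(sorted(result), precision, recall)
```

In this sketch no rule covers the "contact" property, so every extracted pair is correct while one gold pair is missed. That gap between precision and recall is the kind of difference a comparison with published methods would report.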
References
[1] Dayne Freitag, Andrew McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 584-589, July 30-August 3, 2000.

[2] AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, and Ba-Quy Vuong. 2009. Information extraction challenges in managing unstructured data. SIGMOD Rec. 37, 4 (March 2009), 14-20

[3] Ronen Feldman, James Sanger. The text mining handbook: advanced approaches in analyzing unstructured data. 2006. ISBN 978-0521836579.