Structured Data Extraction from Unstructured Text
Thesis title in Czech: | Extrakcia štruktúrovaných dát z neštruktúrovaného textu |
---|---|
Thesis title in English: | Structured Data Extraction from Unstructured Text |
Key words: | extrakcia štruktúrovaných dát, extrakčné pravidlá, (semi)automatická indukcia wrapperov |
English key words: | structured data extraction, extraction rules, (semi)automatic wrapper induction |
Academic year of topic announcement: | 2011/2012 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Software Engineering (32-KSI) |
Supervisor: | doc. Mgr. Martin Nečaský, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 04.11.2011 |
Date of assignment: | 07.11.2011 |
Confirmed by Study dept. on: | 03.07.2013 |
Date and time of defence: | 09.09.2013 00:00 |
Date of electronic submission: | 01.08.2013 |
Date of submission of printed version: | 02.08.2013 |
Date the defence took place: | 09.09.2013 |
Opponents: | RNDr. Michal Kopecký, Ph.D. |
Guidelines |
The author of this thesis will address the problem of automatically extracting structured data from semi-formatted plain text [1,2,3]. The input consists of a collection of text documents, an ontology describing the data domain for which data should be extracted, and a configuration file with extraction rules. Basic methods are currently being implemented as part of a student software project. In the thesis, the author will compare the success of the proposed extraction method with methods published in the current literature. |
References |
[1] Dayne Freitag, Andrew McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 584-589, July 30-August 3, 2000.
[2] AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, Ba-Quy Vuong. Information Extraction Challenges in Managing Unstructured Data. SIGMOD Rec. 37, 4 (March 2009), pp. 14-20.
[3] Ronen Feldman, James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. 2006. ISBN 978-0521836579. |