Thesis (Selection of subject)

Web Data Extraction

Thesis title in Czech:	Extrakce dat z webu
Thesis title in English:	Web Data Extraction
Key words:	systém na extrakci dat z webu, webový wrapper, omezené prostředí, rozšíření webového prohlížeče
English key words:	web data extraction system, web wrapper, safe execution, restricted environment, web browser extension
Academic year of topic announcement:	2015/2016
Thesis type:	diploma thesis
Thesis language:	English
Department:	Department of Software Engineering (32-KSI)
Supervisor:	doc. RNDr. Irena Holubová, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	20.07.2016
Date of assignment:	21.07.2016
Confirmed by Study dept. on:	26.07.2016
Date and time of defence:	12.09.2016 10:30
Date of electronic submission:	28.07.2016
Date of submission of printed version:	28.07.2016
Date of completed defence:	12.09.2016
Opponents:	Mgr. Marek Polák, Ph.D.

Guidelines

The vast majority of the information on the Internet is designed for human consumption and therefore has no specific structure. The area of web data extraction thus focuses on extracting important information from this unstructured data into a structured form, using special programs called web wrappers.
This work will focus on the safe execution of web wrappers in a restricted environment, e.g., in web browsers. First, the author will analyze the existing approaches and evaluate their capabilities and open problems. On the basis of these findings, the author will propose, implement, and evaluate their own solution targeting the selected issues, with an emphasis on modularity and extensibility.

References

Alberto HF Laender, Berthier A Ribeiro-Neto, Altigran S da Silva, and Juliana S Teixeira. A brief survey of web data extraction tools. ACM SIGMOD Record, 31(2):84–93, 2002.

Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner. Web data extraction, applications and techniques: A survey. Knowledge-Based Systems, 70:301–323, 2014.

Nicholas Kushmerick. Finite-state approaches to web information extraction. In Information Extraction in the Web Era, pages 77–91. Springer, 2003.

Arnaud Sahuguet and Fabien Azavant. Building intelligent web applications using lightweight wrappers. Data & Knowledge Engineering, 36(3):283–316, 2001.

Mary Elaine Califf and Raymond J Mooney. Bottom-up relational learning of pattern matching rules for information extraction. The Journal of Machine Learning Research, 4:177–210, 2003.

Giovanni Grasso, Tim Furche, and Christian Schallhart. Effective web scraping with OXPath. In Proceedings of the 22nd international conference on World Wide Web companion, pages 23–26. International World Wide Web Conferences Steering Committee, 2013.

Tim Furche, Georg Gottlob, Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, and Cheng Wang. DIADEM: domain-centric, intelligent, automated data extraction methodology. In Proceedings of the 21st international conference companion on World Wide Web, pages 267–270. ACM, 2012.