Web Data Extraction
Thesis title in Czech: | Extrakce dat z webu |
---|---|
Thesis title in English: | Web Data Extraction |
Key words: | systém na extrakci dat z webu, webový wrapper, omezené prostředí, rozšíření webového prohlížeče |
English key words: | web data extraction system, web wrapper, safe execution, restricted environment, web browser extension |
Academic year of topic announcement: | 2015/2016 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Software Engineering (32-KSI) |
Supervisor: | doc. RNDr. Irena Holubová, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 20.07.2016 |
Date of assignment: | 21.07.2016 |
Confirmed by Study dept. on: | 26.07.2016 |
Date and time of defence: | 12.09.2016 10:30 |
Date of electronic submission: | 28.07.2016 |
Date of submission of printed version: | 28.07.2016 |
Date the defence took place: | 12.09.2016 |
Opponents: | Mgr. Marek Polák, Ph.D. |
Guidelines |
The vast majority of the information on the Internet is designed for human consumption and therefore has no specific structure. The area of web data extraction thus focuses on extracting important information from unstructured data into a structured form using special programs called web wrappers.
This work will focus on the safe execution of web wrappers in a restricted environment, e.g., in web browsers. First, the author will analyze the existing approaches and evaluate their capabilities and open problems. On the basis of these findings, the author will propose, implement, and evaluate their own solution targeting the selected issues, with an emphasis on modularity and extensibility. |
References |
Alberto HF Laender, Berthier A Ribeiro-Neto, Altigran S da Silva, and Juliana S Teixeira. A brief survey of web data extraction tools. ACM Sigmod Record, 31(2):84–93, 2002.
Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner. Web data extraction, applications and techniques: A survey. Knowledge-Based Systems, 70:301–323, 2014.
Nicholas Kushmerick. Finite-state approaches to web information extraction. In Information Extraction in the Web Era, pages 77–91. Springer, 2003.
Arnaud Sahuguet and Fabien Azavant. Building intelligent web applications using lightweight wrappers. Data & Knowledge Engineering, 36(3):283–316, 2001.
Mary Elaine Califf and Raymond J Mooney. Bottom-up relational learning of pattern matching rules for information extraction. The Journal of Machine Learning Research, 4:177–210, 2003.
Giovanni Grasso, Tim Furche, and Christian Schallhart. Effective web scraping with OXPath. In Proceedings of the 22nd international conference on World Wide Web companion, pages 23–26. International World Wide Web Conferences Steering Committee, 2013.
Tim Furche, Georg Gottlob, Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, and Cheng Wang. DIADEM: domain-centric, intelligent, automated data extraction methodology. In Proceedings of the 21st international conference companion on World Wide Web, pages 267–270. ACM, 2012. |