AI-based Structured Web Data Extraction
Thesis title in Czech: | Extrakce strukturovaných dat z webu pomocí umělé inteligence |
---|---|
Thesis title in English: | AI-based Structured Web Data Extraction |
Key words: | extrakce strukturovaných dat z webu|scrapování webu|automatické scrapování|umělá inteligence |
English key words: | structured web information extraction|web content mining|web scraping|wrapper generation|artificial intelligence |
Academic year of topic announcement: | 2021/2022 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Software Engineering (32-KSI) |
Supervisor: | RNDr. Jakub Klímek, Ph.D. |
Author: | Mgr. Jan Joneš - assigned and confirmed by the Study Dept. |
Date of registration: | 26.10.2021 |
Date of assignment: | 27.10.2021 |
Confirmed by Study dept. on: | 29.03.2022 |
Date and time of defence: | 15.06.2022 09:00 |
Date of electronic submission: | 05.05.2022 |
Date of submission of printed version: | 16.05.2022 |
Date of completed defence: | 15.06.2022 |
Opponents: | Mgr. Ladislav Peška, Ph.D. |
Guidelines |
Useful information from the web is commonly extracted using scrapers [1].
These scrapers usually have to be created and maintained manually by programmers, which can lead to high costs. The goal of this thesis is therefore to develop a machine learning model that aids the creation of automated scrapers. The student will survey related work [2, 3] and, based on the findings, design, train, document and evaluate a model capable of identifying the important information on previously unseen websites. The model will target pages that share a similar structure across different websites, e.g., product pages or articles. The model trained on a public dataset [2], together with its results on that dataset, will be published in an appropriate open-access repository, and all related code will be published on GitHub to ensure reproducibility of the work. In addition, a second model will be trained on a private dataset provided by the company Apify. That model will not be published, but it will be evaluated and compared with the model trained on the public dataset. |
References |
[1] Azir, Mohd Amir Bin Mohd, and Kamsuriah Binti Ahmad. "Wrapper approaches for web data extraction: A review." 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI). IEEE, 2017.
[2] Hao, Qiang, et al. "From one tree to a forest: a unified solution for structured web data extraction." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011.
[3] Zhou, Yichao, et al. "Simplified DOM Trees for Transferable Attribute Extraction from the Web." arXiv preprint arXiv:2101.02415 (2021). |