Thesis (Selection of subject)

AI-based Structured Web Data Extraction

Thesis title in Czech:	Extrakce strukturovaných dat z webu pomocí umělé inteligence
Thesis title in English:	AI-based Structured Web Data Extraction
Key words:	extrakce strukturovaných dat z webu|scrapování webu|automatické scrapování|umělá inteligence
English key words:	structured web information extraction|web content mining|web scraping|wrapper generation|artificial intelligence
Academic year of topic announcement:	2021/2022
Thesis type:	diploma thesis
Thesis language:	English
Department:	Department of Software Engineering (32-KSI)
Supervisor:	RNDr. Jakub Klímek, Ph.D.
Author:	Mgr. Jan Joneš - assigned and confirmed by the Study Dept.
Date of registration:	26.10.2021
Date of assignment:	27.10.2021
Confirmed by Study dept. on:	29.03.2022
Date and time of defence:	15.06.2022 09:00
Date of electronic submission:	05.05.2022
Date of submission of printed version:	16.05.2022
Date defence was held:	15.06.2022
Opponents:	Mgr. Ladislav Peška, Ph.D.

Guidelines

Useful information from the web is commonly extracted using scrapers [1].
These scrapers usually need to be manually created and maintained by programmers, which can lead to high costs.
Thus, the goal of this thesis is to develop a machine learning model to aid in the creation of automated scrapers.
The student will survey related work [2, 3] and, based on the findings, design, train, document, and evaluate a model capable of identifying the important information on previously unseen websites.
The model will target pages that share a similar structure across different websites, e.g., product pages or articles.
The model trained on a public dataset [2], together with its results on that dataset, will be published in an appropriate open-access repository, and all related code will be published on GitHub to ensure reproducibility of the work.
In addition, a model will also be trained on a private dataset provided by the company Apify.
The model trained on the private dataset will not be published, but it will be evaluated and compared to the model trained on the public dataset.
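
To illustrate why manually created scrapers are costly to maintain, the following is a minimal sketch of a hand-written scraper, assuming two hypothetical product pages. The HTML snippets, the class names `price` and `cost`, and the helper `scrape_price` are illustrative assumptions, not taken from the datasets or tools named in the guidelines. A rule written for one site's markup returns nothing on another site that presents the same attribute differently, which is the gap a learned model is meant to close.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element whose class attribute equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._depth = 0      # nesting depth inside the matched element (0 = outside)
        self._done = False   # set once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self._done or tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1
        elif dict(attrs).get("class") == self.target_class:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._depth > 0:
            self.parts.append(data)


def scrape_price(html, css_class="price"):
    """Hand-written rule (hypothetical): the price is the text of the element with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).strip()
    return text or None


# Two made-up product pages that show the same attribute with different markup.
SITE_A = '<div class="product"><h1>Lamp</h1><span class="price">19.99 EUR</span></div>'
SITE_B = '<article><h2>Lamp</h2><p class="cost"><b>19.99</b> EUR</p></article>'

price_a = scrape_price(SITE_A)  # the rule was written for this layout
price_b = scrape_price(SITE_B)  # the same rule finds nothing on the other site
print(price_a, price_b)
```

In this sketch, supporting the second site requires a programmer to inspect it and pass a different class name (`scrape_price(SITE_B, "cost")`), and that effort repeats for every new or redesigned website.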

References

[1] Azir, Mohd Amir Bin Mohd, and Kamsuriah Binti Ahmad. "Wrapper approaches for web data extraction: A review." 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI). IEEE, 2017.
[2] Hao, Qiang, et al. "From one tree to a forest: a unified solution for structured web data extraction." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011.
[3] Zhou, Yichao, et al. "Simplified DOM Trees for Transferable Attribute Extraction from the Web." arXiv preprint arXiv:2101.02415 (2021).