AI-based Structured Web Data Extraction
Thesis title in Czech: | Extrakce strukturovaných dat z webu pomocí umělé inteligence |
---|---|
Thesis title in English: | AI-based Structured Web Data Extraction |
Keywords (Czech): | extrakce strukturovaných dat z webu|scrapování webu|automatické scrapování|umělá inteligence |
Keywords (English): | structured web information extraction|web content mining|web scraping|wrapper generation|artificial intelligence |
Academic year of topic announcement: | 2021/2022 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Software Engineering (32-KSI) |
Supervisor: | RNDr. Jakub Klímek, Ph.D. |
Author: | Mgr. Jan Joneš - assigned and confirmed by the Study Department |
Date of registration: | 26.10.2021 |
Date of assignment: | 27.10.2021 |
Date of confirmation by the Study Department: | 29.03.2022 |
Date and time of defence: | 15.06.2022 09:00 |
Date of electronic submission: | 05.05.2022 |
Date of printed submission: | 16.05.2022 |
Date of completed defence: | 15.06.2022 |
Opponents: | Mgr. Ladislav Peška, Ph.D. |
Guidelines |
Useful information from the web is commonly extracted using scrapers [1].
These scrapers usually have to be created and maintained manually by programmers, which can lead to high costs. The goal of this thesis is therefore to develop a machine learning model that aids in the creation of automated scrapers. The student will survey related work [2, 3] and, based on the findings, design, train, document, and evaluate a model capable of identifying the important information on previously unseen websites. The model will target pages that share a similar structure across different websites, e.g., product pages or articles. The model trained on a public dataset [2], together with its results on that dataset, will be published in an appropriate open-access repository, and all related code will be published on GitHub to ensure the reproducibility of the work. In addition, a model will be trained on a private dataset provided by the company Apify. This second model will not be published, but it will be evaluated and compared with the model trained on the public dataset. |
References |
[1] Azir, Mohd Amir Bin Mohd, and Kamsuriah Binti Ahmad. "Wrapper approaches for web data extraction: A review." 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI). IEEE, 2017.
[2] Hao, Qiang, et al. "From one tree to a forest: a unified solution for structured web data extraction." Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2011.
[3] Zhou, Yichao, et al. "Simplified DOM Trees for Transferable Attribute Extraction from the Web." arXiv preprint arXiv:2101.02415 (2021). |