Thesis topics (Thesis selection)

Web page analyzer for scraping

Thesis title in Czech:	Analyzátor webových stránek pro extrakci dat
Thesis title in English:	Web page analyzer for scraping
Keywords (Czech):	extrakce dat z webu; analyzátor webových stránek
Keywords (English):	web scraping; page analyser
Academic year of topic announcement:	2021/2022
Thesis type:	Bachelor's thesis
Thesis language:	English
Department:	Department of Theoretical Computer Science and Mathematical Logic (32-KTIML)
Supervisor:	RNDr. Kateřina Macková
Author:	hidden - assigned and confirmed by the Study Department
Date of registration:	02.02.2022
Date of assignment:	03.02.2022
Date of confirmation by the Study Department:	11.05.2023
Date and time of defence:	29.06.2023 09:00
Date of electronic submission:	11.05.2023
Date of submission of printed version:	11.05.2023
Date of completed defence:	29.06.2023
Opponents:	Mgr. Tomáš Petříček, Ph.D.

Guidelines

Web scraping is a technique for automatically extracting data from the web. The extracted data can then be used for a variety of applications in data and market analysis. Data is typically spread across a web page and resides in several sources, such as the page's HTML, embedded JSON-LD, and schema.org markup. The problem is that not all data can be extracted from the HTML returned by the initial page load, because some of it is loaded dynamically through additional requests. One way to perform these requests is to use browser automation tools such as Playwright or Puppeteer, but these consume a lot of resources. The goal is to replicate those requests using a plain HTTP client.
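As an illustration of one of the data sources named above, the following minimal sketch pulls embedded JSON-LD out of a page's HTML using only the Python standard library. It is not taken from the thesis: the class name `JsonLdExtractor`, the helper `extract_json_ld`, and the sample page `SAMPLE_HTML` are assumptions made up for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; a real page would be fetched over HTTP first.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example kettle",
 "offers": {"@type": "Offer", "price": "24.90", "priceCurrency": "EUR"}}
</script>
</head><body><h1>Example kettle</h1></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the decoded contents of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # JSON-LD is embedded as <script type="application/ld+json">.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html):
    """Return a list with one decoded JSON value per JSON-LD block in html."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_json_ld(SAMPLE_HTML)` yields a one-element list whose `offers` entry carries the price, without rendering the page in a browser. Data that only arrives through later requests would not be found this way, which is the case the assignment targets.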

The student will study the relevant literature on working with browsers, web automation, and current data extraction techniques. The student will then create software that analyzes a given web page, locates the desired data in the communication between the client and the server, and guides users on how to extract it automatically and efficiently.
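The core idea of locating a value in recorded client-server communication could be sketched as follows. This is only a sketch under stated assumptions, not the design of the thesis software: the record layout in `RECORDED` (a simplified, hypothetical stand-in for a browser network log), the example URLs, and the helpers `find_paths`, `locate`, and `build_replay` are all invented for illustration. Only JSON responses are searched here.

```python
import json
import urllib.request

# Hypothetical recorded exchanges; the field names are assumptions for this sketch.
RECORDED = [
    {"method": "GET", "url": "https://shop.example/product/42",
     "request_headers": {"Accept": "text/html"},
     "response_body": "<html><body>Loading...</body></html>"},
    {"method": "GET", "url": "https://shop.example/api/products/42",
     "request_headers": {"Accept": "application/json"},
     "response_body": '{"product": {"name": "Example kettle", "offers": [{"price": 24.9}]}}'},
]


def find_paths(node, target, path=""):
    """Yield a path for every leaf of a decoded JSON value whose text equals target."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, target, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, target, f"{path}[{index}]")
    elif str(node) == target:
        yield path


def locate(exchanges, target):
    """Return (exchange, json_path) pairs for JSON responses that contain target."""
    hits = []
    for exchange in exchanges:
        try:
            body = json.loads(exchange["response_body"])
        except (json.JSONDecodeError, TypeError):
            continue  # not JSON; a fuller tool would also search HTML and plain text
        hits.extend((exchange, path) for path in find_paths(body, target))
    return hits


def build_replay(exchange):
    """Build, but do not send, a plain HTTP request mirroring the recorded one."""
    return urllib.request.Request(
        exchange["url"],
        headers=exchange.get("request_headers", {}),
        method=exchange["method"],
    )
```

With this sample data, `locate(RECORDED, "24.9")` points at the API response rather than the initial HTML, and `build_replay` on that exchange produces a request that `urllib.request.urlopen` could send without a browser. Real sites often also require cookies, tokens, or request bodies, which this sketch ignores.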

References

[1] vanden Broucke, Seppe & Baesens, Bart. (2018). Practical Web Scraping for Data Science. 10.1007/978-1-4842-3582-9.
[2] Glez-Peña, Daniel & Lourenco, Anália & López-Fernández, Hugo & Reboiro-Jato, Miguel & Fdez-Riverola, Florentino. (2013). Web scraping technologies in an API world. Briefings in bioinformatics. 15. 10.1093/bib/bbt026.
[3] Diouf, Rabiyatou & Sarr, Edouard & Sall, Ousmane & Birregah, Babiga & Bousso, M. & Mbaye, Sény. (2019). Web Scraping: State-of-the-Art and Areas of Application. 6040-6042. 10.1109/BigData47090.2019.9005594.
[4] Patel, Jay. (2020). Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale. 10.1007/978-1-4842-6576-5.
[5] Zhao, Bo. (2017). Web Scraping. 10.1007/978-3-319-32001-4_483-1.
[6] Olston, Christopher & Najork, Marc. (2010). Web Crawling. Foundations and Trends in Information Retrieval. 4. 175-246. 10.1561/1500000017.