Thesis topics (Thesis selection)

Interactive web crawling and data extraction

Thesis title in Czech:	Interaktivní procházení webu a extrakce dat
Thesis title in English:	Interactive web crawling and data extraction
Keywords:	Web crawling, Web data extraction, Web scraping, AJAX, RIA, Rich Internet Application, browser automation
Keywords in English:	Web crawling, Web data extraction, Web scraping, AJAX, RIA, Rich Internet Application, browser automation
Academic year of topic announcement:	2017/2018
Thesis type:	diploma thesis
Thesis language:	English
Department:	Department of Distributed and Dependable Systems (32-KDSS)
Supervisor:	Mgr. Pavel Ježek, Ph.D.
Author:	hidden - assigned and confirmed by the Study Department
Date of registration:	13.02.2018
Date of assignment:	13.02.2018
Date of confirmation by the Study Department:	13.02.2018
Date and time of defence:	10.09.2018 00:00
Date of electronic submission:	20.07.2018
Date of submission of the printed version:	20.07.2018
Date of completed defence:	10.09.2018
Opponents:	doc. Mgr. Martin Nečaský, Ph.D.

Guidelines

The rise of rich internet applications brings new challenges to web crawling and web data extraction. The need to extract data from web pages for data mining purposes has increased significantly. Current tools allow users to process only certain types of non-dynamic pages. Universal tools are usually provided as programming libraries, so the user needs a certain level of technical expertise. Debugging of a crawler and data extractor is usually limited to simple logging, which is not an intuitive process. These factors prevent business users from using these tools.

The thesis should analyze current techniques used for web crawling and web data extraction for rich internet applications. The thesis should design a solution that enables users to visually define data extraction for a web page. As crawling and data extraction of static pages is considered a solved problem, the goal of the thesis is to address crawling and extraction for rich internet applications. An implementation of the proposed crawling and extraction approach should be provided as an inherent part of the thesis.

References

* Alberto HF Laender, Berthier A Ribeiro-Neto, Altigran S Da Silva, and Juliana S Teixeira.
A brief survey of web data extraction tools.
ACM SIGMOD Record, 31(2):84–93, 2002.

* Sriram Raghavan and Hector Garcia-Molina.
Crawling the hidden web.
Technical report, Stanford, 2000.

* Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner.
Web data extraction, applications and techniques: A survey.
Knowledge-Based Systems, 70:301–323, 2014.

* Suryakant Choudhary, Mustafa Emre Dincturk, Seyed M Mirtaheri, Ali Moosavi, Gregor Von Bochmann, Guy-Vincent Jourdan, and Iosif Viorel Onut.
Crawling rich internet applications: the state of the art.
In Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, pages 146–160. IBM Corp., 2012.