Interactive web crawling and data extraction
Thesis title in Czech: | Interaktivní procházení webu a extrakce dat |
---|---|
Thesis title in English: | Interactive web crawling and data extraction |
Keywords: | Web crawling, Web data extraction, Web scraping, AJAX, RIA, Rich Internet Application, browser automation |
Keywords in English: | Web crawling, Web data extraction, Web scraping, AJAX, RIA, Rich Internet Application, browser automation |
Academic year of topic announcement: | 2017/2018 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | Mgr. Pavel Ježek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Department |
Date of registration: | 13.02.2018 |
Date of assignment: | 13.02.2018 |
Date of confirmation by the Study Department: | 13.02.2018 |
Date and time of defence: | 10.09.2018 00:00 |
Date of electronic submission: | 20.07.2018 |
Date of printed submission: | 20.07.2018 |
Date of completed defence: | 10.09.2018 |
Opponents: | doc. Mgr. Martin Nečaský, Ph.D. |
Guidelines |
The rise of rich internet applications brings new challenges to web crawling and web data extraction. The need to extract data from web pages for data mining purposes has increased significantly. Current tools allow users to process only certain types of non-dynamic pages. Universal tools are usually provided as programming libraries, so the user needs a certain level of technical expertise. Debugging of a crawler and data extractor is usually limited to simple logging, which is not an intuitive process. These factors prevent business users from using these tools.
The thesis should analyze current techniques used for web crawling and web data extraction for rich internet applications. It should design a solution that enables users to visually define data extraction for a web page. As crawling and data extraction of static pages is considered solved, the goal of the thesis is to address crawling and extraction for rich internet applications. An implementation of the proposed crawling and extraction approach should be provided as an inherent part of the thesis. |
References |
* Alberto HF Laender, Berthier A Ribeiro-Neto, Altigran S Da Silva, and Juliana S Teixeira. A brief survey of web data extraction tools. ACM SIGMOD Record, 31(2):84–93, 2002.
* Sriram Raghavan and Hector Garcia-Molina. Crawling the hidden web. Technical report, Stanford, 2000.
* Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner. Web data extraction, applications and techniques: A survey. Knowledge-Based Systems, 70:301–323, 2014.
* Suryakant Choudhary, Mustafa Emre Dincturk, Seyed M Mirtaheri, Ali Moosavi, Gregor Von Bochmann, Guy-Vincent Jourdan, and Iosif Viorel Onut. Crawling rich internet applications: the state of the art. In Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, pages 146–160. IBM Corp., 2012. |