Thesis (Selection of subject) (version: 278)
Assignment details
OCR for tabular data
Thesis title in Czech: OCR pro tabulková data
Thesis title in English: OCR for tabular data
Key words in Czech: OCR, digitalizace, archivace, účetní data
English key words: OCR, digitalization, archiving, accounting data
Academic year of topic announcement: 2018/2019
Type of assignment: Bachelor's thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: Mgr. Miroslav Kratochvíl
Author: hidden - assigned by the advisor
Date of registration: 25.08.2018
Date of assignment: 28.08.2018
Guidelines
Digitalization is the process of converting the content of legacy media into a digital, computer-accessible form. Digitalization of text is well established and supported by OCR-related image processing techniques. Tabular text data, which are common in business and accounting systems, remain problematic: a simple OCR algorithm does not take into account the placement of table cells or the relations between them.
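
To illustrate why cell placement matters, the following minimal sketch groups recognized words into table rows by their position on the page. It assumes a hypothetical OCR step that has already produced words with pixel coordinates; the `Word` type, the `group_into_rows` helper, and the `row_tolerance` parameter are illustrative names, not part of the assignment or of any particular OCR library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with the position of its bounding box (hypothetical OCR output)."""
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def group_into_rows(words, row_tolerance=10):
    """Group words into rows, then order each row from left to right.

    A word joins the current row when its top edge lies within
    row_tolerance pixels of the first word placed in that row.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(word.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Slightly skewed scan: words in the same row do not share an exact y value.
words = [
    Word("Item", 10, 12), Word("Price", 200, 10),
    Word("Paper", 10, 52), Word("4.50", 200, 50),
]
table = group_into_rows(words)
```

A fixed pixel tolerance is the simplest possible choice; real scans with rotation, merged cells, or multi-line cells would likely need a more robust layout analysis.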

The goal of this thesis is to implement user-friendly software capable of converting image data into tabular form. The resulting software should cope with common deficiencies in scanner-generated input and output an intermediate textual representation of the page as tabular data (including, e.g., sub-tables, margins, colors, and non-tabular text or image elements) that can be easily converted to formats such as CSV, XLS, or TeX.
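
One possible shape for such an intermediate representation is sketched below, purely as an illustration. The `Cell` and `Table` types and the `to_csv` and `to_tex_rows` helpers are hypothetical names chosen for this example; the assignment does not prescribe a concrete format, and features such as sub-tables, margins, and colors are left out.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One table cell addressed by zero-based row and column (hypothetical format)."""
    row: int
    col: int
    text: str


@dataclass
class Table:
    cells: list = field(default_factory=list)

    def to_rows(self):
        """Expand the sparse cell list into a dense grid; missing cells become ''."""
        if not self.cells:
            return []
        n_rows = max(c.row for c in self.cells) + 1
        n_cols = max(c.col for c in self.cells) + 1
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in self.cells:
            grid[c.row][c.col] = c.text
        return grid


def to_csv(table):
    """Serialize the table as CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table.to_rows())
    return buf.getvalue()


def to_tex_rows(table):
    """Render the rows of a TeX tabular body; special characters are NOT escaped."""
    return "\n".join(" & ".join(row) + r" \\" for row in table.to_rows())


table = Table([
    Cell(0, 0, "Item"), Cell(0, 1, "Price"),
    Cell(1, 0, "Paper"), Cell(1, 1, "4.50"),
])
csv_text = to_csv(table)
tex_text = to_tex_rows(table)
```

Keeping the representation independent of any output format means each exporter only has to walk the same grid, which is one way to make conversion to CSV, XLS, or TeX straightforward.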
References
Kari Pulli (NVIDIA), Anatoly Baksheev, Kirill Kornyakov, Victor Eruhimov. Real-time computer vision with OpenCV. Communications of the ACM, June 2012.

Gary Bradski. The OpenCV Library. Dr. Dobb's Journal, 2000.

R. Fisher, S. Perkins, A. Walker, E. Wolfart. Hypermedia Image Processing Reference, 2003.

Craige Thomas. Extracting Table Data From PDFs with OCR, September 2011.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html