Thesis (Selection of subject) (version: 368)
Thesis details
OCR for tabular data
Thesis title in Czech: OCR pro tabulková data
Thesis title in English: OCR for tabular data
Key words: OCR, digitalizace, archivace, účetní data
English key words: OCR, digitalization, archiving, accounting data
Academic year of topic announcement: 2018/2019
Thesis type: Bachelor's thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: RNDr. Miroslav Kratochvíl, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 25.08.2018
Date of assignment: 28.08.2018
Confirmed by Study dept. on: 03.12.2018
Date and time of defence: 27.06.2019 09:30
Date of electronic submission: 17.05.2019
Date of submission of printed version: 17.05.2019
Date the defence took place: 27.06.2019
Opponents: Mgr. Vít Šefl
 
 
 
Guidelines
Digitalization is the process of converting the content of legacy media into a digital, computer-accessible form. Digitalization of text is well established and supported by OCR-related image processing techniques. Digitalization of tabular text data, which is common in business and accounting systems, remains problematic for simple OCR algorithms that do not take the placement of table cells or the relations between them into account.

The goal of this thesis is to implement user-friendly software capable of converting image data to tabular form. The resulting software should cope with common deficiencies in scanner-generated input and output an intermediate textual representation of the page as tabular data (including, e.g., sub-tables, margins, colors, or non-tabular text or image elements) that can be easily converted to formats such as CSV, XLS, or TeX.
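
As an illustration only, the idea of an intermediate tabular representation can be sketched in Python as follows. The sketch assumes that some OCR step has already produced text fragments with pixel coordinates; the names Word, band_indices, words_to_grid and grid_to_csv, as well as the tolerance values, are hypothetical and are not part of the thesis assignment. It groups fragments into rows and columns by position and writes the resulting grid as CSV.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized text fragment and the top-left corner of its box (pixels)."""
    text: str
    x: int
    y: int


def band_indices(values, tolerance):
    """Map each coordinate to a band index.

    A coordinate joins the current band when it lies within `tolerance`
    of the previous (smaller) coordinate; otherwise a new band starts.
    """
    mapping = {}
    index = -1
    previous = None
    for v in sorted(set(values)):
        if previous is None or v - previous > tolerance:
            index += 1
        mapping[v] = index
        previous = v
    return mapping


def words_to_grid(words, row_tol=10, col_tol=30):
    """Place words into a rectangular grid of cells (assumed tolerances)."""
    if not words:
        return []
    rows = band_indices([w.y for w in words], row_tol)
    cols = band_indices([w.x for w in words], col_tol)
    n_rows = max(rows.values()) + 1
    n_cols = max(cols.values()) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    # Fill in reading order so that fragments sharing a cell are joined sensibly.
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        r, c = rows[w.y], cols[w.x]
        grid[r][c] = (grid[r][c] + " " + w.text).strip()
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        Word("Item", 12, 20), Word("Price", 210, 22),
        Word("Paper", 10, 61), Word("4.50", 212, 60),
    ]
    print(grid_to_csv(words_to_grid(sample)), end="")
```

A position-based grouping like this is deliberately naive: it ignores ruling lines, merged cells, sub-tables and skewed scans, which are exactly the cases the assignment asks the software to handle.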
References
Kari Pulli (NVIDIA), Anatoly Baksheev, Kirill Kornyakov, Victor Eruhimov: Real-time computer vision with OpenCV. Communications of the ACM, June 2012

Gary Bradski: The OpenCV Library. Dr. Dobb's Journal, 2000

R. Fisher, S. Perkins, A. Walker, E. Wolfart: Hypermedia image processing reference, 2003

Craige Thomas: Extracting Table Data From PDFs with OCR, September 2011
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html