OCR for tabular data
Thesis title in Czech: | OCR pro tabulková data |
---|---|
Thesis title in English: | OCR for tabular data |
Key words: | OCR, digitalizace, archivace, účetní data |
English key words: | OCR, digitalization, archiving, accounting data |
Academic year of topic announcement: | 2018/2019 |
Thesis type: | Bachelor's thesis |
Thesis language: | English |
Department: | Department of Software Engineering (32-KSI) |
Supervisor: | RNDr. Miroslav Kratochvíl, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 25.08.2018 |
Date of assignment: | 28.08.2018 |
Confirmed by Study dept. on: | 03.12.2018 |
Date and time of defence: | 27.06.2019 09:30 |
Date of electronic submission: | 17.05.2019 |
Date of submission of printed version: | 17.05.2019 |
Date of proceeded defence: | 27.06.2019 |
Opponents: | Mgr. Vít Šefl |
Guidelines |
Digitalization is the process of converting the content of legacy media to a digital, computer-accessible form. Digitalization of text is currently well established and supported by OCR-related image processing techniques. Digitalization of tabular text data, which is common in business and accounting systems, remains problematic for a simple OCR algorithm that does not account for the placement of table cells or the relations between them.
The goal of this thesis is to implement user-friendly software capable of converting image data to tabular form. The resulting software should be able to cope with common deficiencies in scanner-generated input and output an intermediate textual representation of the page as tabular data (including, e.g., sub-tables, margins, colors, or non-tabular text or image elements) that can be easily converted to, e.g., CSV, XLS, or TeX format. |
References |
Kari Pulli (NVIDIA), Anatoly Baksheev, Kirill Kornyakov, Victor Eruhimov: Real-time computer vision with OpenCV, Communications of the ACM, June 2012
Gary Bradski: The OpenCV Library, Dr. Dobb's Journal, 2000
R. Fisher, S. Perkins, A. Walker, E. Wolfart: Hypermedia image processing reference, 2003
Craige Thomas: Extracting Table Data From PDFs with OCR, September 2011 |