Thesis (Selection of subject)

Natural Language Correction With Focus on Czech

Thesis title in Czech:	Automatická korekce textu se zaměřením na češtinu
Thesis title in English:	Natural Language Correction With Focus on Czech
Key words in Czech:	automatická korekce textu | oprava gramatiky | generování diakritiky | datasety | zpracování přirozeného jazyka
English key words:	natural language correction | grammatical error correction | diacritics restoration | datasets | Czech
Academic year of topic announcement:	2016/2017
Thesis type:	dissertation
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	RNDr. Milan Straka, Ph.D.
Author:	Mgr. Jakub Náplava, Ph.D. - assigned and confirmed by the Study Dept.
Date of registration:	20.09.2017
Date of assignment:	20.09.2017
Confirmed by Study dept. on:	03.10.2017
Date and time of defence:	28.06.2022 13:00
Date of electronic submission:	01.04.2022
Date of submission of printed version:	01.04.2022
Date of completed defence:	28.06.2022
Opponents:	Roman Grundkiewicz
	Mgr. et Mgr. Ondřej Dušek, Ph.D.

Guidelines

In recent years, deep neural networks have been used to solve complex machine-learning problems and have achieved state-of-the-art results in many areas. Since 2014, deep neural networks have also been applied to natural language processing, improving state-of-the-art results in machine translation, dependency parsing, named entity recognition, and many other text processing applications.

One interesting and very useful text processing application is natural language correction, which aims to correct a variety of errors in input text, ranging from simple spelling errors and missing diacritical marks to complex grammatical errors in syntax, and even stylistic and semantic errors.
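To make the "missing diacritical marks" case concrete, the following minimal Python sketch builds one (input, target) pair for diacritics restoration by stripping the marks from correct Czech text. It is only an illustration of the task, not the method used in the thesis; the function names `strip_diacritics` and `make_pair` are hypothetical.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks (e.g. 'č' -> 'c') and keep the base letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, then recompose whatever remains.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> Tuple[str, str]:
    """Return (text without diacritics, original text) as one training pair."""
    return strip_diacritics(sentence), sentence


source, target = make_pair("Příliš žluťoučký kůň úpěl ďábelské ódy.")
print(source)  # Prilis zlutoucky kun upel dabelske ody.
print(target)  # Příliš žluťoučký kůň úpěl ďábelské ódy.
```

A correction system for this error type would be given `source` and asked to produce `target`. Pairs of this kind can be generated from any plain-text corpus that already contains correct diacritics.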

To the best of our knowledge, deep neural networks are state-of-the-art in English grammatical error correction (Chollampatt et al., 2016; Xie et al., 2016) and in Czech diacritization, spelling error correction, and grammatical error correction (Náplava, 2017). They provide models that can be trained in an end-to-end fashion and require only annotated training data and a large plain-text corpus. However, many challenges remain unsolved: for instance, devising training methods that overcome the lack of annotated data, making better use of unannotated data, designing neural network architectures capable of complex error correction (grammatical, stylistic, and semantic errors), or constructing a single model capable of correcting a large variety of error types, to name a few. Furthermore, to allow practical usage, the runtime performance of existing models has to be improved.

The goal of the thesis is to improve natural language correction performance, most likely by utilizing deep learning methods.

References

- Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, Andrew Y. Ng: Neural Language Correction with Character-Based Attention. https://arxiv.org/abs/1603.09727

- Shamil Chollampatt, Kaveh Taghipour, Hwee Tou Ng: Neural Network Translation Models for Grammatical Error Correction. https://arxiv.org/abs/1606.00189

- Jason Lee, Kyunghyun Cho, Thomas Hofmann: Fully Character-Level Neural Machine Translation without Explicit Segmentation. https://arxiv.org/abs/1610.03017

- Jakub Náplava: Natural Language Correction, Master's thesis, 2017. To be submitted.

- Michal Richter: Advanced Czech Spellchecker, Master's thesis, 2010. https://is.cuni.cz/webapps/zzp/detail/45334/