Thesis details
Oprava gramatiky v češtině
Thesis title in Czech: Oprava gramatiky v češtině
Thesis title in English: Czech Grammar Error Correction
Keywords (Czech): oprava gramatiky|GECCC|čeština
Keywords (English): grammar error correction|GECCC|Czech
Academic year of announcement: 2022/2023
Thesis type: diploma thesis
Thesis language:
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: RNDr. Milan Straka, Ph.D.
Author: hidden - assigned and confirmed by the Student Office
Date of registration: 14.04.2023
Date of assignment: 27.06.2023
Confirmed by the Student Office on: 13.10.2023
Consultants: Mgr. Jakub Náplava, Ph.D.
Guidelines
Grammatical Error Correction (GEC) is a long-studied task, with most research conducted on English. Recently, this task was also examined for Czech by Náplava et al. (2022), who created a new dataset (GECCC), analyzed metrics, and evaluated Transformer-based neural models on it. The goal of this diploma thesis is to follow up on their work, create an easy-to-deploy state-of-the-art GEC model for Czech, and examine its behavior on typical Czech errors. The development of a state-of-the-art model comprises multiple steps (experiments). The diploma thesis plan is outlined below:

1. Implement the data-generation and model-training pipelines. The data-generation pipeline should be implemented so that it is easy to add or remove a noising rule (e.g., delete a word or replace a word by its spell-checker suggestion). The model-training pipeline should use the popular Transformers library with an emphasis on easy model replaceability (apart from the traditional BERT-base model, the ByT5 model should also be evaluated) – 1st semester (out of 3 semesters in total).
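The pluggable noising pipeline described above could be organized as follows. This is only a minimal sketch with illustrative names, not the thesis's actual implementation: each rule is a plain function over a token list, and registering or removing a rule is just editing a single list.

```python
import random

def delete_word(tokens, rng):
    """Drop one random token (simulates an omitted word)."""
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def swap_adjacent(tokens, rng):
    """Swap two neighbouring tokens (simulates a word-order error)."""
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Adding/removing a noising rule is just list membership.
NOISING_RULES = [delete_word, swap_adjacent]

def noise_sentence(tokens, rng, p=0.3):
    """Apply each registered rule independently with probability p."""
    for rule in NOISING_RULES:
        if rng.random() < p:
            tokens = rule(tokens, rng)
    return tokens

rng = random.Random(42)
print(noise_sentence("dnes je venku hezky".split(), rng))
```

Keeping the rules as free functions (rather than hard-coding them inside the pipeline) makes an ablation over noising rules a one-line change.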

2. Examine how the model behaves not only in terms of raw scores on the GECCC dataset, but also identify typical Czech errors (e.g., mně / mě; s / z) and test the models on them. Should the model perform poorly on these errors, the GEC model could be re-trained with such error types included in the training set. Furthermore, one-to-one word replacement using MorfFlex and DeriNet for pre-training data generation should be experimented with. Optionally, a neural model trained in the reverse direction (introducing errors rather than correcting them) could be trained and used to create pre-training data – 2nd semester (out of 3 semesters in total).
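Targeted testing on the typical error types mentioned above could be sketched as follows. All names here are hypothetical, and `model` stands in for the trained corrector: for every confusable token in a gold sentence, the opposite member of its confusion set is injected and we check whether the model restores the original.

```python
# Hand-built confusion sets for typical Czech errors (illustrative subset).
CONFUSION_SETS = {"mně": "mě", "mě": "mně", "s": "z", "z": "s"}

def inject_error(tokens, index):
    """Replace the token at `index` with its confusable counterpart."""
    out = list(tokens)
    out[index] = CONFUSION_SETS[out[index]]
    return out

def error_type_accuracy(model, gold_sentences):
    """Inject one confusion-set error at a time into each gold sentence
    and count how often `model` (tokens -> tokens) restores the gold."""
    total = hits = 0
    for gold in gold_sentences:
        for i, token in enumerate(gold):
            if token in CONFUSION_SETS:
                noisy = inject_error(gold, i)
                total += 1
                hits += model(noisy) == gold
    return hits / total if total else 0.0
```

A do-nothing baseline (`lambda tokens: tokens`) scores 0.0 here by construction, which gives a quick sanity check for the harness itself.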

3. Finally, multiple models should be combined in an ensemble. One possible way is to follow Qorib et al. (2022) – 3rd semester (out of 3 semesters in total).
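Qorib et al. (2022) roughly decide, per proposed edit, whether to include it using a learned classifier over which systems suggested it. A much simpler majority vote over the union of edits, sketched below with illustrative names, conveys the same edit-level combination idea (it is not their actual method).

```python
from collections import Counter

def combine_edits(system_edits, threshold=2):
    """system_edits: one set of edits per system; each edit is a hashable
    tuple such as (start, end, replacement). Keep an edit if at least
    `threshold` systems propose it."""
    counts = Counter(e for edits in system_edits for e in set(edits))
    return {e for e, c in counts.items() if c >= threshold}

systems = [
    {(1, 2, "mně"), (4, 5, "z")},
    {(1, 2, "mně")},
    {(1, 2, "mně"), (0, 1, "Dnes")},
]
print(combine_edits(systems))
```

Operating on edits rather than full output sentences lets systems agree partially, which is why edit-level combination tends to outperform picking a single best hypothesis.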
References
Bryant, Christopher, et al. "Grammatical Error Correction: A Survey of the State of the Art." https://arxiv.org/pdf/2211.05166.pdf - an extensive GEC survey from November 2022.

Náplava, Jakub, et al. "Czech grammar error correction with a large and diverse corpus." Transactions of the Association for Computational Linguistics 10 (2022): 452-467. https://aclanthology.org/2022.tacl-1.26/

Qorib, Muhammad, Seung-Hoon Na, and Hwee Tou Ng. "Frustratingly easy system combination for grammatical error correction." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. https://aclanthology.org/2022.naacl-main.143/
 