Thesis details
Oprava gramatiky v češtině
Thesis title in Czech: Oprava gramatiky v češtině
Thesis title in English: Czech Grammar Error Correction
Keywords (Czech): oprava gramatiky|GECCC|čeština
Keywords (English): grammar error correction|GECCC|Czech
Academic year of announcement: 2022/2023
Thesis type: diploma thesis
Thesis language:
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: RNDr. Milan Straka, Ph.D.
Author: hidden - assigned and confirmed by the Student Office
Date of registration: 14.04.2023
Date of assignment: 27.06.2023
Confirmed by the Student Office on: 13.10.2023
Consultants: Mgr. Jakub Náplava, Ph.D.
Guidelines
Grammatical Error Correction (GEC) is a long-studied task, with most research conducted on English. Recently, this task was also examined for Czech by Náplava et al. (2022), who created a new dataset (GECCC), analyzed metrics, and evaluated Transformer-based neural models on it. The goal of this diploma thesis is to follow up on their work, create an easy-to-deploy state-of-the-art GEC model for Czech, and examine its behavior on typical Czech errors. The development of a state-of-the-art model comprises multiple steps (experiments). The diploma thesis plan is outlined below:

1. Implement the data-generation and model-training pipelines. The data-generation pipeline should be implemented so that it is easy to add or remove a noising rule (e.g., delete a word or replace a word by its spell-checker suggestion). The model-training pipeline should use the popular Transformers library with an emphasis on easy model replaceability (apart from the traditional BERT-base model, the ByT5 model should also be evaluated) – 1st semester (out of 3 semesters in total).
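The pluggable noising pipeline described above could be organized as follows. This is only a minimal sketch with illustrative names, not the thesis's actual implementation: each rule is a plain function over a token list, and registering or removing a rule is just editing a single list.

```python
import random

def delete_word(tokens, rng):
    """Drop one random token (simulates an omitted word)."""
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def swap_adjacent(tokens, rng):
    """Swap two neighbouring tokens (simulates a word-order error)."""
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Adding/removing a noising rule is just list membership.
NOISING_RULES = [delete_word, swap_adjacent]

def noise_sentence(tokens, rng, p=0.3):
    """Apply each registered rule independently with probability p."""
    for rule in NOISING_RULES:
        if rng.random() < p:
            tokens = rule(tokens, rng)
    return tokens

rng = random.Random(42)
print(noise_sentence("dnes je venku hezky".split(), rng))
```

Keeping the rules as free functions (rather than hard-coding them inside the pipeline) makes an ablation over noising rules a one-line change.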

2. Examine how the model behaves not only in terms of raw scores on the GECCC dataset, but also identify typical Czech errors (e.g., mně / mě; s / z) and test the models on them. Should the model perform poorly on these errors, the GEC model could be re-trained with such error types included in the training set. Furthermore, one-to-one word replacement using MorfFlex and DeriNet for pre-training data generation should be experimented with. Optionally, a neural model trained in the reverse direction (introducing errors rather than correcting them) could be trained and used to create pre-training data – 2nd semester (out of 3 semesters in total).
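Targeted testing on the typical error types mentioned above could be sketched as follows. All names here are hypothetical, and `model` stands in for the trained corrector: for every confusable token in a gold sentence, the opposite member of its confusion set is injected and we check whether the model restores the original.

```python
# Hand-built confusion sets for typical Czech errors (illustrative subset).
CONFUSION_SETS = {"mně": "mě", "mě": "mně", "s": "z", "z": "s"}

def inject_error(tokens, index):
    """Replace the token at `index` with its confusable counterpart."""
    out = list(tokens)
    out[index] = CONFUSION_SETS[out[index]]
    return out

def error_type_accuracy(model, gold_sentences):
    """Inject one confusion-set error at a time into each gold sentence
    and count how often `model` (tokens -> tokens) restores the gold."""
    total = hits = 0
    for gold in gold_sentences:
        for i, token in enumerate(gold):
            if token in CONFUSION_SETS:
                noisy = inject_error(gold, i)
                total += 1
                hits += model(noisy) == gold
    return hits / total if total else 0.0
```

A do-nothing baseline (`lambda tokens: tokens`) scores 0.0 here by construction, which gives a quick sanity check for the harness itself.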

3. Finally, multiple models should be combined in an ensemble. One possible way is to follow Qorib et al. (2022) – 3rd semester (out of 3 semesters in total).
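Qorib et al. (2022) roughly decide, per proposed edit, whether to include it using a learned classifier over which systems suggested it. A much simpler majority vote over the union of edits, sketched below with illustrative names, conveys the same edit-level combination idea (it is not their actual method).

```python
from collections import Counter

def combine_edits(system_edits, threshold=2):
    """system_edits: one set of edits per system; each edit is a hashable
    tuple such as (start, end, replacement). Keep an edit if at least
    `threshold` systems propose it."""
    counts = Counter(e for edits in system_edits for e in set(edits))
    return {e for e, c in counts.items() if c >= threshold}

systems = [
    {(1, 2, "mně"), (4, 5, "z")},
    {(1, 2, "mně")},
    {(1, 2, "mně"), (0, 1, "Dnes")},
]
print(combine_edits(systems))
```

Operating on edits rather than full output sentences lets systems agree partially, which is why edit-level combination tends to outperform picking a single best hypothesis.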
References
Bryant, Christopher, et al. "Grammatical Error Correction: A Survey of the State of the Art." https://arxiv.org/pdf/2211.05166.pdf - an extensive GEC survey from November 2022.

Náplava, Jakub, et al. "Czech grammar error correction with a large and diverse corpus." Transactions of the Association for Computational Linguistics 10 (2022): 452-467. https://aclanthology.org/2022.tacl-1.26/

Qorib, Muhammad, Seung-Hoon Na, and Hwee Tou Ng. "Frustratingly easy system combination for grammatical error correction." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. https://aclanthology.org/2022.naacl-main.143/
 