Thesis (Selection of subject) (version: 368)
Thesis details
Generování českých poetických slok a jejich evaluace
Thesis title in Czech: Generování českých poetických slok a jejich evaluace
Thesis title in English: Generation of Czech poetic strophes and their evaluation
Key words: česká poezie|zpracování přirozeného jazyka|neuronové sítě|automatická evaluace
English key words: Czech poetry|natural language processing|neural networks|automatic evaluation
Academic year of topic announcement: 2023/2024
Thesis type: diploma thesis
Thesis language:
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: Mgr. Rudolf Rosa, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 05.12.2023
Date of assignment: 05.12.2023
Confirmed by Study dept. on: 05.12.2023
Date of electronic submission: 21.04.2024
Advisors: Mgr. Tomáš Musil
Guidelines
The aim of the thesis is to devise a method for generating Czech poetic strophes that meet the formal constraints of poetry while being interesting to read.
An important part of the work is also to devise automated methods for evaluating the quality of the generated texts.

The suggested approach is to use current pretrained transformer-based neural language models, adapted and fine-tuned for the tasks as needed.
For the generation of strophes, the suggestion is to use generative LMs (such as GPT-2).
For the subsequent evaluation of the generated texts, the suggestion is to use masked LMs (such as BERT) as encoders for a multi-class classifier and/or a regressor.

Adequate training data will be required to train the models. The main suggestion is to use the Corpus of Czech Verse, which contains Czech poetry annotated (among other features) for rhyme scheme, meter, year of publication, and author. Other datasets may also be available and potentially useful.
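To illustrate what a rhyme scheme annotation expresses, the following is a minimal sketch of a naive rule-based labeler. It is not the evaluation method suggested above (which relies on fine-tuned masked LMs) and it does not reproduce how the corpus is annotated. It simply assumes that two lines rhyme when the last two letters of their final words match, which is a rough approximation. The function name `rhyme_scheme`, the parameter `suffix_len`, and the toy strophe are all invented for this illustration.

```python
def rhyme_scheme(lines, suffix_len=2):
    """Label each verse line with a letter (A, B, ...).

    Assumption: two lines rhyme if the last `suffix_len` letters of their
    final words are identical. This is a crude, spelling-based heuristic,
    not a phonetic analysis.
    """
    labels = {}   # maps a line ending to its letter
    scheme = []
    for line in lines:
        words = line.split()
        # Take the last word, lowercase it, and drop punctuation.
        last = "".join(ch for ch in words[-1].lower() if ch.isalpha()) if words else ""
        ending = last[-suffix_len:]
        if ending not in labels:
            # Assign the next unused letter; adequate for short strophes.
            labels[ending] = chr(ord("A") + len(labels))
        scheme.append(labels[ending])
    return "".join(scheme)


# Toy strophe written for this example only (not taken from any corpus).
strophe = [
    "Na nebi svítí hvězda,",
    "v údolí šumí les,",
    "ptáček se vrací do hnízda,",
    "a v dálce štěká pes.",
]
print(rhyme_scheme(strophe))  # prints ABAB
```

A heuristic of this kind could at most serve as a quick sanity check on generated strophes; it ignores pronunciation, so a learned classifier or a phonetic transcription would be needed for reliable results.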
References
- Petr Plecháč and Robert Kolár. 2015. The corpus of Czech verse. Studia metrica et poetica, 2(1), 107-118.

- Arturo Oncevay and Kervy Rivas Rojas. 2020. Revisiting Neural Language Modelling with Syllables. arXiv:2010.12881

- B. Orekhov and F. Fischer. 2020. Neural reading. Orbis Litterarum, 75, 230–246. https://doi.org/10.1111/oli.12274

- Kai-Ling Lo, Rami Ariss, and Philipp Kurz. 2022. GPoeT-2: A GPT-2 Based Poem Generator. arXiv:2205.08847

- Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a token-free future with pre-trained byte-to-byte models. arXiv:2105.13626

- Jonas Belouadi and Steffen Eger. 2023. ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models. arXiv:2212.10474

- Lukáš Chaloupský. 2022. Automatic generation of medical reports from chest X-rays in Czech. Diploma thesis, supervisor Rudolf Rosa. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

- Wietse de Vries and Malvina Nissim. 2021. As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 836–846, Online. Association for Computational Linguistics.

- PyTorch documentation: https://pytorch.org/docs/stable/index.html

- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers/index

- Hugging Face Tokenizers documentation: https://huggingface.co/docs/tokenizers/index
 