Thesis topics (Thesis selection)


Generation of Czech poetic strophes and their evaluation

Thesis title in Czech:	Generování českých poetických slok a jejich evaluace
Thesis title in English:	Generation of Czech poetic strophes and their evaluation
Keywords (Czech):	česká poezie|zpracování přirozeného jazyka|neuronové sítě|automatická evaluace
Keywords (English):	Czech poetry|natural language processing|neural networks|automatic evaluation
Academic year of topic announcement:	2023/2024
Thesis type:	Master's thesis
Thesis language:
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	Mgr. Rudolf Rosa, Ph.D.
Author:	hidden - assigned and confirmed by the Study Department
Date of registration:	05.12.2023
Date of assignment:	05.12.2023
Date of confirmation by the Study Department:	05.12.2023
Date of electronic submission:	21.04.2024
Advisors:	Mgr. Tomáš Musil

Guidelines

The aim of the thesis is to devise a method to generate Czech poetic strophes that meet formal constraints of poetry while being interesting to read.
An important component of the work is also to devise automated methods for evaluating the quality of the generated texts.
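One formal constraint that lends itself to automatic checking is the rhyme scheme. The following is a minimal sketch of such a check, not the method prescribed by the thesis. It assumes that two lines rhyme when their last two letters match, which is a crude orthographic heuristic; real Czech rhyme detection would likely need phonetic information. The function names `naive_rhyme_scheme` and `matches_scheme`, the `suffix_len` parameter, and the toy strophe are all invented for illustration.

```python
def naive_rhyme_scheme(lines, suffix_len=2):
    """Return a scheme string such as "ABAB".

    ASSUMPTION: lines "rhyme" when their final `suffix_len` letters are
    identical. This is an orthographic approximation, not real rhyme detection.
    """
    labels = {}  # line ending -> scheme letter
    scheme = []
    for line in lines:
        # Keep letters only, so trailing punctuation does not affect the ending.
        letters = [ch for ch in line.lower() if ch.isalpha()]
        ending = "".join(letters[-suffix_len:])
        if ending not in labels:
            # Assign the next unused letter (A, B, C, ...).
            labels[ending] = chr(ord("A") + len(labels))
        scheme.append(labels[ending])
    return "".join(scheme)


def matches_scheme(lines, target, suffix_len=2):
    """Check whether the detected scheme equals the requested one."""
    return naive_rhyme_scheme(lines, suffix_len) == target


# Toy strophe written for this example (not taken from any corpus).
toy_strophe = [
    "V dálce slyším tichý hlas,",
    "pomalu končí den,",
    "zastavil se starý čas,",
    "přichází ke mně sen.",
]

print(naive_rhyme_scheme(toy_strophe))  # ABAB
```

A check like this could serve as a cheap baseline to compare against a learned evaluator, since it needs no training data but will misjudge rhymes that differ in spelling.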

The suggested approach is to use current pretrained transformer-based neural language models, adapted and fine-tuned for the tasks as needed.
For the generation of strophes, the suggestion is to use generative LMs (such as GPT-2).
For the subsequent evaluation of the generated texts, the suggestion is to use masked LMs (such as BERT) as encoders for a multi-class classifier and/or a regressor.

Adequate training data will be required to train the models. The main suggestion is to use the Corpus of Czech Verse, which contains Czech poetry annotated (among other features) for rhyme scheme, meter, year of publication, and author. Other datasets may also be available and potentially useful.
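One possible way to use such annotations when fine-tuning a generative LM is to prepend them to each strophe as a plain-text control prefix, so that the same prefix can later be used as a prompt. The sketch below only illustrates that idea. The `AnnotatedStrophe` class, the `# ... # ... # ...` prefix layout, and the annotation values are assumptions made for this example; they do not reflect the actual format of the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedStrophe:
    """Hypothetical container for one strophe and a few of its annotations."""
    lines: List[str]
    rhyme_scheme: str
    meter: str
    year: int


def control_prefix(rhyme_scheme: str, meter: str, year: int) -> str:
    """Build the (assumed) control line that precedes the verse text."""
    return f"# {rhyme_scheme} # {meter} # {year}\n"


def to_training_text(strophe: AnnotatedStrophe) -> str:
    """Serialize a strophe into a single training string for a generative LM."""
    prefix = control_prefix(strophe.rhyme_scheme, strophe.meter, strophe.year)
    return prefix + "\n".join(strophe.lines)


# Toy strophe with placeholder annotations (invented, not verified analysis).
example = AnnotatedStrophe(
    lines=[
        "V dálce slyším tichý hlas,",
        "pomalu končí den,",
        "zastavil se starý čas,",
        "přichází ke mně sen.",
    ],
    rhyme_scheme="ABAB",
    meter="trochee",
    year=1900,
)

print(to_training_text(example))
```

At generation time, `control_prefix("ABAB", "trochee", 1900)` alone would be passed to the model as the prompt, leaving the verse lines to be generated.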

References

- Petr Plecháč and Robert Kolár. 2015. The corpus of Czech verse. Studia Metrica et Poetica, 2(1), 107–118.

- Arturo Oncevay and Kervy Rivas Rojas. 2020. Revisiting Neural Language Modelling with Syllables. arXiv:2010.12881

- B. Orekhov and F. Fischer. 2020. Neural reading. Orbis Litterarum, 75, 230–246. https://doi.org/10.1111/oli.12274

- Kai-Ling Lo, Rami Ariss, and Philipp Kurz. 2022. GPoeT-2: A GPT-2 Based Poem Generator. arXiv:2205.08847

- Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a token-free future with pre-trained byte-to-byte models. arXiv:2105.13626

- Jonas Belouadi and Steffen Eger. 2023. ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models. arXiv:2212.10474

- Lukáš Chaloupský. 2022. Automatic generation of medical reports from chest X-rays in Czech. Master's thesis, supervised by Rudolf Rosa. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

- Wietse de Vries and Malvina Nissim. 2021. As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 836–846, Online. Association for Computational Linguistics.

- PyTorch documentation: https://pytorch.org/docs/stable/index.html

- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers/index

- Hugging Face Tokenizers documentation: https://huggingface.co/docs/tokenizers/index