Thesis topics (Thesis selection)

Multi-Label Language Identification for Similar Languages

Thesis title in Czech:	Multi-Label klasifikace pro rozlišení podobných jazyků
Thesis title in English:	Multi-Label Language Identification for Similar Languages
Keywords (Czech):	identifikace jazyka\|zpracování přirozeného jazyka\|podobné jazyky
Keywords (English):	language identification\|natural language processing\|similar languages
Academic year of topic announcement:	2024/2025
Thesis type:	diploma thesis
Thesis language:
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	Mgr. Jindřich Libovický, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	19.03.2025
Date of assignment:	24.03.2025
Date of confirmation by the Study Dept.:	26.03.2025

Guidelines

Language identification is an important step in collecting and filtering data for language modeling and machine translation. Most existing methods treat it as a multi-class classification problem and return only the single most probable language of a text. However, a more useful question for data filtering is "Can this text be in language X?", especially in ambiguous cases where a sentence is valid in more than one language. This thesis will address this challenge.
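
The difference between the two formulations can be illustrated with a minimal Python sketch. It is not the method prescribed by this assignment: the logits, the language codes, and the 0.5 threshold are hypothetical values chosen only for illustration. The multi-class view applies a softmax and keeps one language; the multi-label view scores each language independently with a sigmoid and keeps every language above the threshold.

```python
import math


def softmax(logits):
    """Turn per-language logits into one distribution that sums to 1."""
    peak = max(logits.values())
    exps = {lang: math.exp(z - peak) for lang, z in logits.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def multiclass_predict(logits):
    """Multi-class view: return the single most probable language."""
    probs = softmax(logits)
    return max(probs, key=probs.get)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_predict(logits, threshold=0.5):
    """Multi-label view: answer "can this text be in language X?" per language."""
    return {lang for lang, z in logits.items() if sigmoid(z) >= threshold}


# Hypothetical logits for a sentence that is valid in both Czech and Slovak.
logits = {"ces": 2.1, "slk": 1.9, "pol": -3.0}

single_label = multiclass_predict(logits)  # keeps only one language
all_labels = multilabel_predict(logits)    # keeps every plausible language
print(single_label, sorted(all_labels))
```

With these made-up scores, the multi-class decision discards Slovak even though it is almost as likely as Czech, while the multi-label decision retains both.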

In particular, it will explore multi-label classification for language identification, focusing on similar languages (e.g., Czech-Slovak, Danish-Norwegian). The student will:

* Create a challenge set of ambiguous sentences, some valid in multiple languages.
* Evaluate existing models on the dataset, e.g., dedicated models (fastText LID, OpenLID) and generative LLMs.
* Design, conduct, and evaluate experiments to train multi-label models based on state-of-the-art pre-trained encoders and compare their effectiveness.
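
One possible way to score predictions on such a challenge set is sketched below in Python. This is an illustrative assumption, not a metric fixed by the assignment: the function names, the toy gold labels, and the toy predictions are all hypothetical. Each example carries a set of valid languages, and the sketch reports per-language precision, recall, and F1 together with the exact-match rate over whole label sets.

```python
def per_language_scores(gold, predicted):
    """Precision, recall, and F1 for each language over paired label sets."""
    languages = set().union(*gold, *predicted)
    scores = {}
    for lang in sorted(languages):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if lang in g and lang in p)
        fp = sum(1 for g, p in pairs if lang not in g and lang in p)
        fn = sum(1 for g, p in pairs if lang in g and lang not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[lang] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def exact_match(gold, predicted):
    """Fraction of examples whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical challenge-set annotations: each sentence lists all valid languages.
gold = [{"ces", "slk"}, {"ces"}, {"slk"}, {"dan", "nob"}]
# Hypothetical system output for the same four sentences.
predicted = [{"ces"}, {"ces"}, {"slk", "ces"}, {"dan", "nob"}]

scores = per_language_scores(gold, predicted)
for lang, s in scores.items():
    print(lang, {k: round(v, 3) for k, v in s.items()})
print("exact match:", exact_match(gold, predicted))
```

Per-language scores show which language of a similar pair is over- or under-predicted, while exact match penalizes any sentence for which the full set of valid languages is not recovered.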

The core of the thesis is empirical research, particularly the design and evaluation of computational experiments for better language identification.

References

Burchell, L., Birch, A., Bogoychev, N., & Heafield, K. (2023, July). An Open Dataset and Model for Language Identification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 865-879).

Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., ... & Adeyemi, M. (2022). Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10, 50-72.

Costa-Jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., ... & NLLB Team. (2022). No Language Left Behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.