Explainable LLM-based evaluation for NLG using error analysis
| Thesis title in Czech | Vysvětlitelná evaluace generování textu založená na velkých jazykových modelech a chybové analýze |
|---|---|
| Thesis title in English | Explainable LLM-based evaluation for NLG using error analysis |
| Keywords in Czech | generování přirozeného jazyka, evaluace, velké jazykové modely, zpracování přirozeného jazyka |
| Keywords in English | natural language generation, evaluation, large language models, natural language processing |
| Academic year of topic announcement | 2023/2024 |
| Thesis type | Master's thesis |
| Thesis language | Czech |
| Department | Institute of Formal and Applied Linguistics (32-UFAL) |
| Supervisor | Ing. Mateusz Lango, Ph.D. |
| Author | hidden |
| Date of registration | 16.07.2024 |
| Date of assignment | 16.07.2024 |
| Date of confirmation by the Study Department | 22.07.2024 |
| Date and time of defence | 04.02.2025 09:00 |
| Date of electronic submission | 10.01.2025 |
| Date of submission of the printed version | 09.01.2025 |
| Date of defence | 04.02.2025 |
| Opponents | Mgr. Jindřich Libovický, Ph.D. |
| Consultants | Mgr. et Mgr. Ondřej Dušek, Ph.D. |
Guidelines
Traditional evaluation metrics based on n-gram overlap, such as BLEU or ROUGE, are still frequently used in natural language generation (NLG) evaluation. However, these metrics do not correlate well with human judgments (Liu et al., 2016), require reference texts, and primarily capture formal rather than semantic similarity. With the adoption of deep neural networks in NLP, several model-based evaluation methods have been proposed (Zhang et al., 2019; Yuan et al., 2021; Zhong et al., 2022). These methods show higher correlations with expert annotations, but they are usually specialized for a single task or a small number of related tasks and require expensive data collection and training. Recently, large language models (LLMs) have been applied to automatic NLG evaluation, offering a more flexible and better-performing alternative (Liu et al., 2023; Kocmi and Federmann, 2023; Xu et al., 2023). However, many of these methods rely on proprietary LLMs, which are usually prohibitively expensive and suffer from transparency and reproducibility issues. An alternative approach involves fine-tuning open-source LLMs, resulting in more transparent and efficient evaluation models. Nonetheless, these models often lack a sufficient degree of explainability and do not provide uncertainty estimates.
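The limitation of n-gram overlap metrics mentioned above can be illustrated with a minimal sketch (not part of the assignment). It assumes the `sacrebleu` package is installed; the example sentences are made up.

```python
# Minimal sketch: an n-gram metric such as BLEU scores a meaning-preserving
# paraphrase low because it shares few surface n-grams with the reference.
import sacrebleu

reference = ["The proposed system clearly outperforms the baseline model."]
paraphrase = "The new approach is significantly better than the baseline."  # same meaning, different wording
verbatim = "The proposed system clearly outperforms the baseline model."    # exact copy of the reference

for hypothesis in (paraphrase, verbatim):
    score = sacrebleu.sentence_bleu(hypothesis, reference)
    print(f"BLEU = {score.score:5.1f}  |  {hypothesis}")
# The paraphrase receives a much lower BLEU score than the verbatim copy,
# which is one reason such metrics correlate poorly with human judgments.
```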
This thesis investigates the use of open-source LLMs for explainable, reference-free NLG evaluation. It will experiment with fine-tuning evaluator models to provide fine-grained error analyses across diverse NLG tasks and evaluation aspects. In addition to textual explanations and numeric scores, the thesis will explore including uncertainty estimates to improve the reliability and interpretability of the results. Training data will be sourced from a combination of existing datasets, model outputs, and synthetic assessments provided by larger open-source LLM(s). Various prompting techniques will be employed to obtain high-quality synthetic data. The proposed approach will be assessed using meta-evaluation benchmarks, with the correlation between model-provided scores and human judgments serving as the main metric.
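The meta-evaluation step described above, correlating evaluator scores with human judgments, can be sketched as follows (not part of the assignment). The score arrays are hypothetical placeholders; real experiments would use benchmark annotations and the fine-tuned evaluator's outputs. SciPy is assumed to be available.

```python
# Minimal sketch of metric meta-evaluation: rank correlation between
# human quality ratings and the scores produced by an automatic evaluator.
from scipy.stats import spearmanr, kendalltau

human_scores = [4.0, 2.5, 5.0, 3.0, 1.5, 4.5]  # hypothetical human ratings
model_scores = [3.8, 2.9, 4.7, 3.4, 2.0, 4.1]  # hypothetical evaluator outputs

rho, _ = spearmanr(human_scores, model_scores)
tau, _ = kendalltau(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```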
References
T. Kocmi and C. Federmann, “Large Language Models Are State-of-the-Art Evaluators of Translation Quality,” in Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, Jun. 2023, pp. 193–203. https://aclanthology.org/2023.eamt-1.19
C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau, “How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, Nov. 2016, pp. 2122–2132. https://aclanthology.org/D16-1230
Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, Dec. 2023, pp. 2511–2522. https://aclanthology.org/2023.emnlp-main.153
W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Wang, and L. Li, “INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, Dec. 2023, pp. 5967–5994. https://aclanthology.org/2023.emnlp-main.310
W. Yuan, G. Neubig, and P. Liu, “BARTScore: Evaluating Generated Text as Text Generation.” arXiv, Jun. 2021. https://doi.org/10.48550/arXiv.2106.11520
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT.” arXiv, Apr. 2019. https://doi.org/10.48550/arXiv.1904.09675
M. Zhong, Y. Liu, D. Yin, Y. Mao, Y. Jiao, P. Liu, C. Zhu, H. Ji, and J. Han, “Towards a Unified Multi-Dimensional Evaluator for Text Generation,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, Dec. 2022, pp. 2023–2038. https://aclanthology.org/2022.emnlp-main.137