Thesis (Selection of subject) (version: 390)
Thesis details
Predicting Word Importance Using Pre-Trained Language Models
Thesis title in Czech: Predikcia dôležitosti slov pomocou predtrénovaných jazykových modelov
Thesis title in English: Predicting Word Importance Using Pre-Trained Language Models
Key words: dôležitosť slov|jazykové modelovanie
English key words: word importance|language modeling
Academic year of topic announcement: 2023/2024
Thesis type: Bachelor's thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: Mgr. Dávid Javorský
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 30.10.2023
Date of assignment: 03.11.2023
Confirmed by Study dept. on: 03.11.2023
Date and time of defence: 05.09.2024 09:00
Date of electronic submission: 17.07.2024
Date of submission of printed version: 17.07.2024
Date of proceeded defence: 05.09.2024
Opponents: Mgr. Dominik Macháček, Ph.D.
Advisors: doc. RNDr. Ondřej Bojar, Ph.D.
Guidelines
Words are the smallest discrete units of a language that carry a particular meaning, and their contributions to the decision-making processes of neural models, or of human brains, are undoubtedly unequal.

The goal of this thesis is therefore to examine a small set of possible definitions of word importance (with a focus on semantic importance), and to train a neural model capable of assigning these importance scores to each input word.

This will be accomplished by leveraging a masked language modeling approach (i.e., fine-tuning pre-trained language models) and repurposing its paradigm: instead of predicting which words are missing, our objective will be to predict which words have been inserted.
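The repurposed objective can be viewed as a token-labeling task: given a sentence into which extra words have been inserted, the model learns to predict, for each token, whether it belongs to the original sentence or was inserted. A minimal sketch of how such training examples could be constructed is below; the function name, filler vocabulary, and random-insertion scheme are illustrative assumptions, not the thesis's actual data-generation procedure.

```python
import random

def make_insertion_example(sentence, filler_vocab, n_insert=2, seed=0):
    """Insert random filler words into a whitespace-tokenized sentence
    and return (tokens, labels), where label 1 marks an inserted token.

    This mirrors the repurposed MLM paradigm: rather than recovering
    masked words, the model must spot which words were inserted.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    labels = [0] * len(tokens)          # 0 = original word
    for _ in range(n_insert):
        pos = rng.randint(0, len(tokens))
        tokens.insert(pos, rng.choice(filler_vocab))
        labels.insert(pos, 1)           # 1 = inserted word
    return tokens, labels

tokens, labels = make_insertion_example(
    "the cat sat on the mat", ["banana", "quickly", "green"])
```

A model fine-tuned on such pairs could then emit a per-token probability of insertion, which serves as a proxy for how "expendable" or important each word is in context.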

As part of the evaluation, the goal of the thesis is to implement an annotation tool for collecting gold labels for importance scores (linked to our suggested definitions of importance) and to compare them to the scores acquired with our proposed method. A valuable extension of this work would be to evaluate the importance scores on a downstream task, e.g. keyword identification.
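One common way to compare predicted importance scores against collected gold labels is a rank correlation such as Spearman's; this is an illustrative choice, not necessarily the metric adopted in the thesis. A self-contained sketch (equivalent to `scipy.stats.spearmanr`, implemented here without external dependencies):

```python
def rankdata(xs):
    """Assign 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of ties
        avg = (i + j) / 2 + 1           # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(gold, pred):
    """Spearman correlation = Pearson correlation of the ranks."""
    rg, rp = rankdata(gold), rankdata(pred)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    sg = sum((a - mg) ** 2 for a in rg) ** 0.5
    sp = sum((b - mp) ** 2 for b in rp) ** 0.5
    return cov / (sg * sp)
```

For example, `spearman(gold_scores, model_scores)` over a held-out set of annotated sentences would quantify how well the model's importance ranking agrees with the human one.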
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Dávid Javorský, Ondřej Bojar, and François Yvon. 2023. Assessing Word Importance Using Models Trained for Semantic Tasks. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8846–8856, Toronto, Canada. Association for Computational Linguistics.

Sushant Kafle and Matt Huenerfauth. 2018. A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Matej Martinc, Blaž Škrlj, and Senja Pollak. 2022. TNT-KID: Transformer-based Neural Tagger for Keyword Identification. Natural Language Engineering, 28(4):409–448.
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html