Global and Local Constraints for Neural Models of Natural Language Processing
Title in Czech: Globální a lokální omezení pro modely v oblasti zpracování přirozeného jazyka
Title in English: Global and Local Constraints for Neural Models of Natural Language Processing
Academic year of listing: 2021/2022
Thesis type: dissertation
Thesis language:
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Ondřej Bojar, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 20.09.2021
Date of assignment: 20.09.2021
Confirmed by Study Department on: 21.09.2021
Guidelines
State-of-the-art methods in natural language processing rely on deep neural models. Complex text and speech processing tasks like machine translation, speech recognition or various types of dialogue processing are routinely carried out with these models, with quality primarily determined by the amount of training data available for the given task. The dependence on large amounts of training data brings a new problem: an inherent bias towards the facts and formulations exemplified in the data. If the model is not quite sure, or the input example is not covered well in the training data, the model will choose some "typical" solution. This is the best strategy for minimizing the risk of error when deciding under uncertainty, but it can lead to poor or even disastrous results, especially if the output required for the given use case does not match the distribution of outputs covered in the training data.

In many situations, the user of a trained model can provide not just the input but also additional, specific pieces of information which critically influence, in various respects, which outputs are desired and which are not. For instance, in speech translation, knowing the gender of the speaker is critical information when translating from a language which rarely expresses gender into a language which requires this information for every verb. Another useful constraint is that the output has to match certain presentation criteria, be it the overall length or a segmentation into shorter units than generally available parallel texts exhibit. Some of these constraints are global, in the sense that the whole run of the model during a given session should reflect them; others are local, in the sense that the model's previous output affects their values. Pronoun co-reference, or a term translation choice that needs to stay consistent across the whole document, are examples of such local constraints.
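One established technique for injecting global information of this kind is the "side constraint" approach of Sennrich et al. (2016, listed below): the desired property is encoded as a pseudo-token prepended to the source sentence, so that the model learns to condition on it. A minimal sketch in Python; the tag format and bucket names are illustrative assumptions, not a fixed scheme:

def add_side_constraints(source, speaker_gender=None, length_bucket=None):
    """Prepend pseudo-tokens encoding global constraints to the source text.

    During training, the tags are derived from the reference outputs;
    at test time, the user sets them to steer the trained model.
    """
    tags = []
    if speaker_gender is not None:   # e.g. "F" or "M"
        tags.append("<gender:%s>" % speaker_gender)
    if length_bucket is not None:    # e.g. "short", "normal", "long"
        tags.append("<len:%s>" % length_bucket)
    return " ".join(tags + [source])

# Example: mark the speaker as female for an English-Czech speech translation pair
print(add_side_constraints("I was late yesterday.", speaker_gender="F"))
# -> <gender:F> I was late yesterday.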

Current neural models do not lend themselves easily to such external guidance or constraining. The goal of the thesis is thus to propose and empirically evaluate techniques for injecting additional, more or less formalized, knowledge and constraints into a trained model in order to steer its behaviour (global constraining), and to study techniques promoting the correct handling of longer-distance phenomena, where the model makes a decision and then adheres to it within the given output sequence and over a number of subsequent predictions.
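For lexical constraints enforced at decoding time, a natural starting point is the constrained beam search of Post and Vilar (2018, listed below), available off the shelf, e.g., in the Hugging Face transformers library via generate(force_words_ids=...). A hedged sketch; the checkpoint name and the forced German term are example choices only:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any sequence-to-sequence MT checkpoint would do; this one is an example.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

source = "The parties signed the agreement yesterday."
# Force a fixed terminology choice in the output (a lexical constraint):
force_words_ids = tokenizer(["Vertrag"], add_special_tokens=False).input_ids

inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(**inputs,
                         force_words_ids=force_words_ids,
                         num_beams=5,  # constrained search requires beam search
                         max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Such hard decoding-time constraints cover only part of the problem; whether and how they extend to softer, session-level (global) requirements is among the questions the thesis should explore.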

Each of the considered techniques will be studied empirically, using a sizeable dataset for a particular task. The primary intended applications lie in the areas of machine translation, speech recognition and speech translation. Summarization, dialogue systems or their components would also be interesting targets.

While generally applicable solutions would be ideal, it is conceivable that different tasks will perform best with different techniques. Providing as unified a picture as possible is thus an inherent part of the work.
References
Surafel Melaku Lakew, Mattia Di Gangi, and Marcello Federico. Controlling the output length of neural machine translation. arXiv preprint arXiv:1910.10408, 2019.

Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

Alina Karakanta, Matteo Negri, and Marco Turchi. Is 42 the answer to everything in subtitling-oriented speech translation? arXiv preprint arXiv:2006.01080, 2020.

Danni Liu, Jan Niehues, and Gerasimos Spanakis. Adapting end-to-end speech recognition for readable subtitles. In Proceedings of the 17th International Workshop on Spoken Language Translation (IWSLT 2020), 2020.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California, June 2016. Association for Computational Linguistics.

Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems (NeurIPS), 2018. CoRR, abs/1811.01135.

Angela Fan, David Grangier, and Michael Auli. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia, July 2018. Association for Computational Linguistics.

Sudha Rao and Joel Tetreault. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

Petra Barančíková and Ondřej Bojar. Costra 1.1: An inquiry into geometric properties of sentence spaces. In Petr Sojka, Ivan Kopeček, Karel Pala, and Aleš Horák, editors, Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Brno, Czech Republic, September 8-11, 2020, Proceedings, volume 12284 of Lecture Notes in Computer Science, pages 135–143. Springer, 2020.
http://hdl.handle.net/11234/1-3248

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia, July 2018. Association for Computational Linguistics.