Continuous Sentence Representations in Neural Machine Translation
| Thesis title in Czech: | Spojité reprezentace vět v neuronovém strojovém překladu |
|---|---|
| Thesis title in English: | Continuous Sentence Representations in Neural Machine Translation |
| Key words: | věty, reprezentace, neuronový strojový překlad |
| English key words: | sentence, representation, neural machine translation |
| Academic year of topic announcement: | 2017/2018 |
| Thesis type: | diploma thesis |
| Thesis language: | English |
| Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
| Supervisor: | doc. RNDr. Ondřej Bojar, Ph.D. |
| Author: | hidden |
| Date of registration: | 19.04.2018 |
| Date of assignment: | 19.04.2018 |
| Confirmed by Study dept. on: | 11.05.2018 |
| Date and time of defence: | 18.06.2018 09:00 |
| Date of electronic submission: | 11.05.2018 |
| Date of submission of printed version: | 11.05.2018 |
| Date of proceeded defence: | 18.06.2018 |
| Opponents: | Mgr. Rudolf Rosa, Ph.D. |
Guidelines
Deep learning has brought new possibilities for learning continuous abstract representations of the processed units; neural machine translation (NMT) is one such area. Empirically, NMT has clearly surpassed other approaches to machine translation, but it is still rather unclear whether the improvement NMT provides comes from a better "understanding" of the translated sentences (i.e. representations of words and sentences that correspond to their meaning as perceived by humans), or simply from better "shallow" modelling of how the input sequence of words should be transformed to obtain the target sequence.
The goal of the thesis is to explore the continuous representations of sentences learned by NMT and to empirically test to what extent some aspects of sentence meaning are captured in the representation. One particular complication is that the current best-performing NMT architectures rely on the so-called "attention mechanism". This mechanism allows the model to keep reconsidering all the individual source words while the output is generated, so there is no longer any single point in the architecture where the sentence representation would be stored. The thesis must either resolve this problem or, as a fall-back option, rely on non-attentional approaches, despite their lower translation quality. The primary language pair for the creation of NMT models will be English-to-Czech, to benefit from the existing large parallel corpus CzEng. Optionally, other language pairs can be considered. The limiting factor is the availability of datasets that allow testing various aspects of meaning in the learned representations.
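To make the complication concrete, here is a minimal sketch (not part of the official assignment; all names, dimensions, and the dot-product scoring variant are illustrative assumptions) contrasting attention, which recomputes a different context vector for every decoder step, with a simple mean-pooling baseline that does yield a single fixed-size sentence vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder states for a 5-word source sentence, hidden size 8 (illustrative numbers).
encoder_states = rng.standard_normal((5, 8))  # shape: (source_len, hidden)

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: builds a fresh context vector for one decoder step."""
    scores = encoder_states @ decoder_state          # one score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over source words
    return weights @ encoder_states                  # weighted sum, shape (hidden,)

# Two different decoder states yield two different "views" of the same source:
c1 = attention_context(rng.standard_normal(8), encoder_states)
c2 = attention_context(rng.standard_normal(8), encoder_states)
print(np.allclose(c1, c2))  # False: no single stored sentence representation

# A fall-back fixed-size sentence vector: pooling over the encoder states.
sentence_vector = encoder_states.mean(axis=0)        # shape (hidden,)
```

Mean-pooling is only one possible way of collapsing the encoder states; Conneau et al. (2017), cited below, use max-pooling over hidden states for a similar purpose.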
References
Ondřej Bojar, et al. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech, and Dialogue: 19th International Conference, TSD 2016, number 9924 in Lecture Notes in Computer Science, pages 231–238, 2016. ISBN 978-3-319-45509-9.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In EMNLP, 2014.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 3294–3302, Cambridge, MA, USA, 2015.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.