Continuous Sentence Representations in Neural Machine Translation
| Thesis title in Czech: | Spojité reprezentace vět v neuronovém strojovém překladu |
|---|---|
| Thesis title in English: | Continuous Sentence Representations in Neural Machine Translation |
| Key words: | věty, reprezentace, neuronový strojový překlad |
| English key words: | sentence, representation, neural machine translation |
| Academic year of topic announcement: | 2017/2018 |
| Thesis type: | diploma thesis |
| Thesis language: | English |
| Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
| Supervisor: | doc. RNDr. Ondřej Bojar, Ph.D. |
| Author: | hidden |
| Date of registration: | 19.04.2018 |
| Date of assignment: | 19.04.2018 |
| Confirmed by Study dept. on: | 11.05.2018 |
| Date and time of defence: | 18.06.2018 09:00 |
| Date of electronic submission: | 11.05.2018 |
| Date of submission of printed version: | 11.05.2018 |
| Date of proceeded defence: | 18.06.2018 |
| Opponents: | Mgr. Rudolf Rosa, Ph.D. |
Guidelines
Deep learning has brought new possibilities for learning continuous abstract representations of the processed units; neural machine translation (NMT) is one such area. Empirically, NMT has clearly surpassed other approaches to machine translation, but it is still rather unclear whether the improvement NMT provides comes from a better "understanding" of the translated sentences (i.e. representations of words and sentences that correspond to their meaning as perceived by humans), or simply from better "shallow" modelling of how the input sequence of words should be transformed to obtain the target sequence.
The goal of the thesis is to explore the continuous representations of sentences learned by NMT and to empirically test to what extent some aspects of sentence meaning are captured in the representation. One particular complication is that the current best-performing NMT architectures rely on the so-called "attention mechanism". This mechanism allows the model to keep reconsidering all the individual source words while the output is generated, so there is no longer any single point in the architecture where the sentence representation would be stored. The thesis must either resolve this problem or, as a fall-back option, rely on non-attentional approaches, despite their lower translation quality. The primary language pair for the creation of NMT models will be English-to-Czech, to benefit from the existing large parallel corpus CzEng. Optionally, other language pairs can be considered. The limiting factor is the availability of datasets that allow testing various aspects of meaning in the learned representations.
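To make the complication concrete, here is a minimal sketch (not part of the official assignment; all names, dimensions, and the dot-product scoring variant are illustrative assumptions) contrasting attention, which recomputes a different context vector for every decoder step, with a simple mean-pooling baseline that does yield a single fixed-size sentence vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder states for a 5-word source sentence, hidden size 8 (illustrative numbers).
encoder_states = rng.standard_normal((5, 8))  # shape: (source_len, hidden)

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: builds a fresh context vector for one decoder step."""
    scores = encoder_states @ decoder_state          # one score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over source words
    return weights @ encoder_states                  # weighted sum, shape (hidden,)

# Two different decoder states yield two different "views" of the same source:
c1 = attention_context(rng.standard_normal(8), encoder_states)
c2 = attention_context(rng.standard_normal(8), encoder_states)
print(np.allclose(c1, c2))  # False: no single stored sentence representation

# A fall-back fixed-size sentence vector: pooling over the encoder states.
sentence_vector = encoder_states.mean(axis=0)        # shape (hidden,)
```

Mean-pooling is only one possible way of collapsing the encoder states; Conneau et al. (2017), cited below, use max-pooling over hidden states for a similar purpose.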
References
Ondřej Bojar, et al. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech, and Dialogue: 19th International Conference, TSD 2016, number 9924 in Lecture Notes in Computer Science, pages 231–238, 2016. ISBN 978-3-319-45509-9.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In EMNLP, 2014.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 3294–3302, Cambridge, MA, USA, 2015.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.