Continuous Sentence Representations in Neural Machine Translation
| Title in Czech: | Spojité reprezentace vět v neuronovém strojovém překladu |
|---|---|
| Title in English: | Continuous Sentence Representations in Neural Machine Translation |
| Keywords (Czech): | věty, reprezentace, neuronový strojový překlad |
| Keywords (English): | sentence, representation, neural machine translation |
| Academic year of announcement: | 2017/2018 |
| Thesis type: | Master's thesis |
| Thesis language: | English |
| Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
| Supervisor: | doc. RNDr. Ondřej Bojar, Ph.D. |
| Author: | hidden |
| Date of registration: | 19.04.2018 |
| Date of assignment: | 19.04.2018 |
| Confirmed by the Study Department on: | 11.05.2018 |
| Date and time of defence: | 18.06.2018 09:00 |
| Date of electronic submission: | 11.05.2018 |
| Date of printed submission: | 11.05.2018 |
| Date of defence: | 18.06.2018 |
| Opponents: | Mgr. Rudolf Rosa, Ph.D. |
Guidelines
Deep learning has brought new possibilities in learning continuous, abstract representations of the processed units. Neural machine translation (NMT) is one such area. Empirically, NMT has clearly surpassed other approaches to machine translation, but it remains unclear whether the improvement NMT provides comes from a better "understanding" of the translated sentences (i.e., representations of words and sentences that correspond to their meaning as perceived by humans), or simply from better "shallow" modelling of how the input sequence of words should be transformed to obtain the target sequence.
The goal of the thesis is to explore the continuous representations of sentences learned by NMT and to empirically test to what extent some aspects of sentence meaning are captured in these representations. One particular complication is that the current best-performing NMT architectures rely on the so-called "attention mechanism". This mechanism allows the model to keep reconsidering all the individual source words while the output is generated, so there is no longer any single point in the architecture where the sentence representation would be stored. The thesis must either resolve this problem or, as a fall-back option, rely on non-attentional approaches, despite their lower translation quality. The primary language pair for building the NMT models will be English-to-Czech, to benefit from the existing large corpus CzEng. Optionally, other language pairs can be considered. The limiting factor is the availability of datasets that allow testing various aspects of meaning in the learned representations.
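To illustrate the fall-back, non-attentional setting mentioned above, the following PyTorch sketch shows one common way to obtain a single continuous sentence vector: mean-pooling the hidden states of a recurrent encoder. This is only an illustrative sketch, not part of the assignment; the class name, layer sizes, and vocabulary size are hypothetical.

```python
import torch
import torch.nn as nn

class PoolingEncoder(nn.Module):
    """Illustrative encoder: mean-pools bidirectional GRU states into
    one fixed-size sentence vector (all names and sizes hypothetical)."""

    def __init__(self, vocab_size: int = 1000, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> states: (batch, seq_len, 2 * hid_dim)
        states, _ = self.rnn(self.embed(token_ids))
        # Average over the time axis: one continuous vector per sentence.
        return states.mean(dim=1)

encoder = PoolingEncoder()
tokens = torch.randint(0, 1000, (1, 7))  # a single 7-token "sentence"
sentence_vector = encoder(tokens)
print(sentence_vector.shape)             # torch.Size([1, 256])
```

Unlike an attentional model, this pooled vector is a single, fixed point in the architecture, which is exactly what makes such representations straightforward to probe for aspects of sentence meaning.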
References
Ondřej Bojar, et al. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech, and Dialogue: 19th International Conference, TSD 2016, number 9924 in Lecture Notes in Computer Science, pages 231–238, 2016. ISBN 978-3-319-45509-9.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In EMNLP, 2014.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 3294–3302, Cambridge, MA, USA, 2015.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.