Thesis topics (Topic selection) (version: 368)
Thesis details
Continuous Sentence Representations in Neural Machine Translation
Thesis title in Czech: Spojité reprezentace vět v neuronovém strojovém překladu
Title in English: Continuous Sentence Representations in Neural Machine Translation
Keywords in Czech: věty, reprezentace, neuronový strojový překlad
Keywords in English: sentence, representation, neural machine translation
Academic year of announcement: 2017/2018
Thesis type: master's thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Ondřej Bojar, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 19.04.2018
Date of assignment: 19.04.2018
Date of confirmation by the Study Department: 11.05.2018
Date and time of defence: 18.06.2018 09:00
Date of electronic submission: 11.05.2018
Date of printed submission: 11.05.2018
Date of defence: 18.06.2018
Reviewers: Mgr. Rudolf Rosa, Ph.D.
Guidelines
Deep learning has brought new possibilities for learning continuous abstract representations of the processed units, and neural machine translation (NMT) is one such area. Empirically, NMT has clearly surpassed other approaches to machine translation, but it remains unclear whether the improvement NMT provides is due to a better "understanding" of the translated sentences (i.e. representations of words and sentences that correspond to their meaning as perceived by humans), or simply due to better "shallow" modelling of how the input sequence of words should be transformed to obtain the target sequence.

The goal of the thesis is to explore the continuous representations of sentences learned by NMT and to empirically test to what extent some aspects of sentence meaning are captured in these representations. One particular complication is that the current best-performing NMT architectures rely on the so-called "attention mechanism". This mechanism allows the model to keep reconsidering all the individual source words while the output is generated, so there is no longer any single point in the architecture where the sentence representation would be stored. The thesis must either resolve this problem or, as a fall-back option, rely on non-attentional approaches, despite their lower translation quality.
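To illustrate the complication, the following minimal NumPy sketch (an illustration only, not the method prescribed by the assignment) contrasts dot-product attention, which produces a fresh context vector at every decoding step, with mean-pooling of encoder states, one simple way to still obtain a single fixed-size sentence vector:

```python
import numpy as np

def attention_context(encoder_states, query):
    # Dot-product attention: the decoder re-weights all source states
    # at every output step, so no single sentence vector is ever stored.
    scores = encoder_states @ query                # (T,) one score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over source positions
    return weights @ encoder_states                # (d,) step-specific context vector

def mean_pooled_representation(encoder_states):
    # One crude fall-back: average the encoder states over all source
    # positions to get a single fixed-size sentence representation.
    return encoder_states.mean(axis=0)             # (d,)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))   # toy encoder output: 5 source words, hidden size 4
q = rng.normal(size=4)        # toy decoder query at one time step
context = attention_context(H, q)
sentence_vec = mean_pooled_representation(H)
```

With attention, a different `context` vector exists for each decoder query, whereas `sentence_vec` is query-independent, which is why probing experiments on attentional models must first decide where such a vector should come from.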

The primary language pair used for building the NMT models will be English-to-Czech, to benefit from the existing large corpus CzEng. Optionally, other language pairs can be considered. The limiting factor is the availability of datasets that allow testing various aspects of meaning in the learned representations.
References
Ondřej Bojar, et al. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech, and Dialogue: 19th International Conference, TSD 2016, number 9924 in Lecture Notes in Computer Science, pages 231–238, 2016. ISBN 978-3-319-45509-9.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In EMNLP, 2014.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3294–3302, Cambridge, MA, USA, 2015.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.