Feature Selection for Factored Phrase-Based Machine Translation
Thesis title in Czech: | Feature Selection for Factored Phrase-Based Machine Translation |
---|---|
Thesis title in English: | Feature Selection for Factored Phrase-Based Machine Translation |
Key words: | strojový překlad, faktorové modely, výběr rysů |
English key words: | machine translation, factored models, feature selection |
Academic year of topic announcement: | 2010/2011 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Ondřej Bojar, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 11.11.2010 |
Date of assignment: | 11.11.2010 |
Date and time of defence: | 07.09.2012 09:00 |
Date of electronic submission: | 01.08.2012 |
Date of submission of printed version: | 02.08.2012 |
Date the defence took place: | 07.09.2012 |
Opponents: | Mgr. Martin Popel, Ph.D. |
Guidelines |
Factored phrase-based models make it possible to incorporate additional features that explicitly handle various language phenomena in machine translation (MT). There is a three-sided trade-off between model complexity, available data and improvement in translation quality. On the one hand, the more features the model contains, the better its chance of capturing all the necessary details of linguistic constructions. On the other hand, the more features we use, the more data, or the more complex (and more expensive) back-off techniques, are required to overcome data sparseness.
The goal of the thesis is to study this three-sided trade-off by designing and implementing an automatic strategy for exploring the space of possible setups of factored phrase-based translation. Relevant measures include the relative importance of the features for MT quality, the theoretical complexity of the model given the vocabulary sizes of the features, and the empirical computational complexity of the setup in terms of disk space, memory, and training and translation time. Ideally, the thesis would suggest a heuristic estimate of the expected computational cost and/or MT quality based on the configuration of the model and the training and test data. As a working example, the thesis will focus on English-to-Czech MT and partially explore factored setups, including in particular new features inspired by or directly derived from the tectogrammatical annotation. |
References |
Philipp Koehn and Hieu Hoang: Factored Translation Models. Proc. of EMNLP. 2007.
Ondřej Bojar: English-to-Czech Factored Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, ACL. 2007.
David Talbot and Miles Osborne: Modelling lexical redundancy for machine translation. Proc. of COLING/ACL. 2006.
Alexandra Birch, Miles Osborne and Philipp Koehn: CCG Supertags in Factored Statistical Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, ACL. 2007.
Zdeněk Žabokrtský and Ondřej Bojar: TectoMT, Developer's Guide. ÚFAL/CKL Technical Report TR-2008-38. |