Thesis (Selection of subject)


Feature Selection for Factored Phrase-Based Machine Translation

Thesis title in Czech:	Feature Selection for Factored Phrase-Based Machine Translation
Thesis title in English:	Feature Selection for Factored Phrase-Based Machine Translation
Key words:	strojový překlad, faktorové modely, výběr rysů
English key words:	machine translation, factored models, feature selection
Academic year of topic announcement:	2010/2011
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Ondřej Bojar, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	11.11.2010
Date of assignment:	11.11.2010
Date and time of defence:	07.09.2012 09:00
Date of electronic submission:	01.08.2012
Date of submission of printed version:	02.08.2012
Date of completed defence:	07.09.2012
Opponents:	Mgr. Martin Popel, Ph.D.

Guidelines

Factored phrase-based models allow additional features to be incorporated so that various language phenomena can be handled explicitly in machine translation (MT). There is a three-sided trade-off between model complexity, available data, and improvement in translation quality. On the one hand, the more features the model includes, the better its chance of capturing all necessary details of linguistic constructions. On the other hand, the more features we use, the more data, or the more complex (and more expensive) back-off techniques, are required to overcome data sparseness.

The goal of the thesis is to study this three-sided trade-off by designing and implementing an automatic strategy to explore the space of possible setups of factored phrase-based translation. Relevant measures include the relative importance of the features for MT quality, the theoretical complexity of the model given the vocabulary sizes of the features, and the empirical computational complexity of the setup in terms of disk space, memory, and training and translation time. Ideally, the thesis would suggest a heuristic estimate of the expected computational cost and/or MT quality based on the configuration of the model and the training and test data. As a working example, the thesis will focus on English-to-Czech MT and partially explore factored setups, including in particular new features inspired by or directly derived from the tectogrammatical annotation.

References

Philipp Koehn and Hieu Hoang: Factored Translation Models. Proc. of EMNLP. 2007.
Ondřej Bojar: English-to-Czech Factored Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, ACL. 2007.
David Talbot and Miles Osborne: Modelling lexical redundancy for machine translation. Proc. of COLING/ACL. 2006.
Alexandra Birch, Miles Osborne and Philipp Koehn: CCG Supertags in Factored Statistical Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, ACL. 2007.
Zdeněk Žabokrtský and Ondřej Bojar: TectoMT, Developer's Guide. ÚFAL/CKL Technical Report TR-2008-38.