Large-Scale Discriminative Training for Machine Translation into Morphologically-Rich Languages
Thesis title in Czech: | Large-Scale Discriminative Training for Machine Translation into Morphologically-Rich Languages |
---|---|
Thesis title in English: | Large-Scale Discriminative Training for Machine Translation into Morphologically-Rich Languages |
Academic year of topic announcement: | 2012/2013 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Ondřej Bojar, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 10.11.2011 |
Date of assignment: | 10.11.2011 |
Confirmed by Study dept. on: | 07.12.2012 |
Date and time of defence: | 21.01.2013 00:00 |
Date of electronic submission: | 07.12.2012 |
Date of submission of printed version: | 07.12.2012 |
Date of completed defence: | 21.01.2013 |
Opponents: | doc. Ing. Zdeněk Žabokrtský, Ph.D. |
Guidelines |
The aim of the thesis is to analyze the chances of improving machine translation (MT) quality when translating to morphologically rich languages (MRLs) using large-scale discriminative training methods.
When translating into an MRL, statistical (e.g. phrase-based) systems face two issues: the out-of-vocabulary rate (source and especially target words were not seen in the training data in the forms needed for a test sentence) and a harder choice of the correct word form. This thesis should first analyze the relative proportion of these two problems for a selected language pair (e.g. English-to-Serbian) and then focus on the latter, the choice of the word form. Standard phrase-based MT has only a limited number of features that the system can consider when choosing target words. The Margin Infused Relaxed Algorithm (MIRA) is a method that allows the set of features considered to be enlarged significantly. The thesis will design new features that could improve the choice of word forms on the basis of rich linguistic annotation of either the source sentence alone or both the source sentence and the target hypothesis. In the first attempt, experiments will be carried out with English-to-Serbian translation using rich information only on the English side. If tools for automatic analysis of Serbian (e.g. a reasonably good part-of-speech tagger) become available, some target-side features can also be designed and tested. The implemented features will later also be applied to English-to-Czech translation, where both the source and target sides offer rich annotation. |
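The first of the two issues above, words not seen in training in the required form, can be quantified with a simple token-level count. The following is a minimal sketch and not the method used in the thesis: the function name `oov_rate`, the `key` parameter, and the toy lemma table are hypothetical, and the data is made up. Comparing the rate over surface forms with the rate over lemmas gives a rough idea of how much of the out-of-vocabulary problem is due to unseen inflected forms of otherwise known words.

```python
from typing import Callable, Iterable


def oov_rate(
    train_tokens: Iterable[str],
    test_tokens: Iterable[str],
    key: Callable[[str], str] = lambda tok: tok,
) -> float:
    """Fraction of test tokens whose key (surface form by default) never occurs in training."""
    seen = {key(tok) for tok in train_tokens}
    test = [key(tok) for tok in test_tokens]
    if not test:
        return 0.0
    unseen = sum(1 for tok in test if tok not in seen)
    return unseen / len(test)


# Hypothetical toy lemma table; a real setup would use a morphological analyzer.
TOY_LEMMAS = {"cats": "cat", "sees": "see"}


def toy_lemma(tok: str) -> str:
    """Map a token to its lemma, falling back to the token itself."""
    return TOY_LEMMAS.get(tok, tok)


# Made-up data for illustration only.
train = ["the", "cat", "sees", "the", "dog"]
test = ["the", "dog", "sees", "cats"]

form_rate = oov_rate(train, test)                  # "cats" is an unseen form
lemma_rate = oov_rate(train, test, key=toy_lemma)  # its lemma "cat" was seen
```

In this toy case one of four test tokens is unseen as a surface form, while no test lemma is unseen, so the whole gap is attributable to morphology rather than to genuinely unknown words.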
References |
Eva Hasler, Barry Haddow and Philipp Koehn. Margin Infused Relaxed Algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, Volume 96, pages 69-78, October 2011.
Philipp Koehn. Statistical Machine Translation. Textbook, Cambridge University Press, January 2010.
Eleftherios Avramidis and Philipp Koehn. Enriching Morphologically Poor Languages for Statistical Machine Translation. ACL 2008.
Ondřej Bojar and Aleš Tamchyna. Improving Translation Model by Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 330-336, Edinburgh, Scotland, July 2011.
Jörg Tiedemann. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov, K. Bontcheva, G. Angelova and R. Mitkov (eds.), Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia, 2009.
Ondřej Bojar and Zdeněk Žabokrtský. CzEng 0.9: Large Parallel Treebank with Rich Annotation. The Prague Bulletin of Mathematical Linguistics, 92:63-83, 2009.
Philipp Koehn. A Web-Based Interactive Computer Aided Translation Tool. ACL Software demonstration, 2009. http://tool.statmt.org/ |