Thesis (Selection of subject)
Thesis details
Large-Scale Discriminative Training for Machine Translation into Morphologically-Rich Languages
Thesis title in Czech: Large-Scale Discriminative Training for Machine Translation into Morphologically-Rich Languages
Thesis title in English: Large-Scale Discriminative Training for Machine Translation into Morphologically-Rich Languages
Academic year of topic announcement: 2012/2013
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Ondřej Bojar, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 10.11.2011
Date of assignment: 10.11.2011
Confirmed by Study dept. on: 07.12.2012
Date and time of defence: 21.01.2013 00:00
Date of electronic submission: 07.12.2012
Date of submission of printed version: 07.12.2012
Date of proceeded defence: 21.01.2013
Opponents: doc. Ing. Zdeněk Žabokrtský, Ph.D.
Guidelines
The aim of the thesis is to assess how far large-scale discriminative training methods can improve machine translation (MT) quality when translating into morphologically rich languages (MRLs).

When translating into an MRL, statistical (e.g. phrase-based) systems face two issues: a high out-of-vocabulary (OOV) rate, because source and especially target words were not seen in the training data in the forms needed for a test sentence, and a harder choice of the correct word form. The thesis should first quantify the relative impact of these two problems for a selected language pair (e.g. English-to-Serbian) and then focus on the latter, the choice of the word form.
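As a first step towards quantifying the OOV problem, the rate can be measured directly on tokenized data. The following is a minimal sketch, not the thesis's actual tooling: the function name `oov_rate` and the toy sentences are assumptions made for illustration, and a real analysis would read the tokenized training and test corpora instead.

```python
def oov_rate(train_tokens, test_tokens):
    """Return (token-level, type-level) OOV rates of the test data
    relative to the vocabulary observed in the training data."""
    vocab = set(train_tokens)
    test_tokens = list(test_tokens)
    if not test_tokens:
        return 0.0, 0.0
    # Token-level rate: share of running test words unseen in training.
    unseen_tokens = sum(1 for tok in test_tokens if tok not in vocab)
    token_rate = unseen_tokens / len(test_tokens)
    # Type-level rate: share of distinct test word forms unseen in training.
    test_types = set(test_tokens)
    type_rate = len(test_types - vocab) / len(test_types)
    return token_rate, type_rate


# Toy, made-up data (diacritics omitted): the form "velika" is seen in
# training, but the test sentence needs the unseen inflected form "veliku".
train = "vidim kucu . kuca je velika .".split()
test = "vidim veliku kucu .".split()
token_rate, type_rate = oov_rate(train, test)
print(f"token OOV rate: {token_rate:.2f}, type OOV rate: {type_rate:.2f}")
```

Running the same measurement on lemmas instead of surface forms would indicate how much of the OOV rate is due to unseen inflections of known words rather than entirely unknown words.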

Standard phrase-based MT has only a limited number of features that the system can consider when choosing target words. The Margin Infused Relaxed Algorithm (MIRA) is a training method that allows the number of features to be enlarged significantly. The thesis will design new features that could improve the choice of word forms on the basis of rich linguistic annotation of either the source sentence alone or both the source sentence and the target hypothesis.

In the first stage, experiments will be carried out with English-to-Serbian translation using rich annotation of the English side only. If tools for the automatic analysis of Serbian (e.g. a reasonably good part-of-speech tagger) become available, some target-side features can also be designed and tested. The implemented features will later also be applied to English-to-Czech translation, where both the source and target sides offer rich annotation.
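One possible shape of a feature that uses rich annotation only on the English side is sketched below: each aligned word pair fires a sparse feature combining the source part-of-speech tag with the suffix of the target word, so no Serbian tagger is needed. The feature template, the function name `form_choice_features`, and the toy sentence pair are hypothetical; the features actually designed in the thesis may differ.

```python
from collections import Counter


def form_choice_features(src_tags, tgt_words, alignment, suffix_len=2):
    """Count sparse features pairing a source POS tag with a target suffix.

    `alignment` is a list of (source_index, target_index) pairs.
    """
    feats = Counter()
    for src_idx, tgt_idx in alignment:
        suffix = tgt_words[tgt_idx][-suffix_len:]
        feats[f"srcpos={src_tags[src_idx]}|tgtsuf={suffix}"] += 1
    return feats


# Toy example: "I see a big house" -> "vidim veliku kucu" (diacritics omitted).
src_tags = ["PRP", "VBP", "DT", "JJ", "NN"]
tgt_words = ["vidim", "veliku", "kucu"]
alignment = [(1, 0), (3, 1), (4, 2)]
feats = form_choice_features(src_tags, tgt_words, alignment)
print(dict(feats))
```

With a target-side tagger, the suffix could be replaced by the target morphological tag, which corresponds to the setting where both sides offer rich annotation.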
References
Eva Hasler, Barry Haddow and Philipp Koehn. Margin Infused Relaxed Algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, Volume 96, pages 69-78, October 2011.

Philipp Koehn. Statistical Machine Translation. Textbook, Cambridge University Press, January 2010.

Eleftherios Avramidis and Philipp Koehn. Enriching Morphologically Poor Languages for Statistical Machine Translation. ACL 2008.

Ondřej Bojar and Aleš Tamchyna. Improving Translation Model by Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 330-336, Edinburgh, Scotland, July 2011.

Jörg Tiedemann. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K. Bontcheva and G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia. 2009.

Ondřej Bojar and Zdeněk Žabokrtský. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 92:63-83, 2009.

Philipp Koehn. A Web-Based Interactive Computer Aided Translation Tool. ACL Software Demonstration, 2009. http://tool.statmt.org/
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html