Thesis (Selection of subject)

Analyzing Errors and Chances of Improving English to Urdu Phrase-Based Translation

Thesis title in Czech:	Analýza chyb a možností zlepšení frázového strojového překladu z angličtiny do urdštiny
Thesis title in English:	Analyzing Errors and Chances of Improving English to Urdu Phrase-Based Translation
Key words:	frázový překlad, jazyky s volným slovosledem, typy chyb v překladu
English key words:	phrase-based translation, free word order languages, error scheme
Academic year of topic announcement:	2009/2010
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Ondřej Bojar, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	09.11.2009
Date of assignment:	09.11.2009
Confirmed by Study dept. on:	29.04.2013
Date and time of defence:	06.09.2010 00:00
Date of electronic submission:	06.09.2010
Date of proceeded defence:	06.09.2010
Opponents:	RNDr. Daniel Zeman, Ph.D.

Guidelines

The aim of the thesis is to analyze errors in English to Urdu phrase-based or hierarchical phrase-based machine translation, and to propose and evaluate a few possible improvements in translation quality.

The first step consists of setting up and running a suitable MT system, e.g. Moses or Joshua, including the necessary collection of a small parallel corpus for training and evaluation. A thorough manual analysis of the system output on the given test corpus should reveal the most severe translation quality problems. The thesis should then attempt to tackle the identified issues by, e.g.: (1) preprocessing the English input, such as word reordering, (2) preprocessing the training corpus in order to reduce unnecessary lexical ambiguity, (3) using additional factors (in Moses factored translation) to better model target-side morphological coherence. For any of the options, either rule-based or statistical approaches may be applied. The utility of the proposed modifications to the translation pipeline has to be evaluated by both automatic MT metrics and human judgments on a small subset of the test corpus.
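
Options (1) and (3) can be illustrated with a minimal Python sketch. It assumes the English input is already tokenized and tagged with Penn Treebank POS tags, and that no token contains the "|" character, which separates factors in Moses factored input. The function names and the single reordering rule (move verbal tokens to the end of the clause, approximating the verb-final order of Urdu) are hypothetical and serve only as a toy example. They are not the method of the thesis or of any cited paper.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, Penn Treebank POS tag)

# Tags treated as verbal by this toy rule (an assumption, not a linguistic claim).
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}


def reorder_svo_to_sov(tokens: List[Token]) -> List[Token]:
    """Move verbal tokens to the end of the clause, before final punctuation."""
    body = list(tokens)
    final_punct: List[Token] = []
    # Peel off sentence-final punctuation (Penn Treebank tag ".").
    while body and body[-1][1] == ".":
        final_punct.insert(0, body.pop())
    verbs = [t for t in body if t[1] in VERB_TAGS]
    others = [t for t in body if t[1] not in VERB_TAGS]
    return others + verbs + final_punct


def to_factored(tokens: List[Token]) -> str:
    """Render tokens as 'surface|POS', the factor-separated form Moses reads."""
    return " ".join(f"{word}|{tag}" for word, tag in tokens)


if __name__ == "__main__":
    sentence = [("She", "PRP"), ("reads", "VBZ"), ("a", "DT"),
                ("book", "NN"), (".", ".")]
    print(to_factored(reorder_svo_to_sov(sentence)))
```

A realistic system would derive reordering rules from a parse rather than from POS tags alone, and would add further factors such as lemmas or morphological tags.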

References

Philipp Koehn and Hieu Hoang: Factored Translation Models. Proc. of EMNLP. 2007.

Ondřej Bojar: English-to-Czech Factored Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, ACL. 2007.

Alexandra Birch, Miles Osborne and Philipp Koehn: CCG Supertags in Factored Statistical Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, ACL. 2007.

Ondřej Bojar, Pavel Straňák and Daniel Zeman: English-Hindi Translation in 21 Days. Proc. of the 6th International Conference on Natural Language Processing (ICON-2008) NLP Tools Contest, International Institute of Information Technologies, Hyderabad, Pune, India, 2008.

Peng Xu, Jaeho Kang, Michael Ringgaard and Franz Och: Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages. Proc. of HLT/NAACL 2009.