Course, academic year 2023/2024
Statistical Machine Translation - NPFL087
Title: Statistický strojový překlad
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Actual: from 2020
Semester: summer
E-Credits: 5
Hours per week, examination: summer s.:2/2, C+Ex [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Additional information: https://ufal.mff.cuni.cz/courses/npfl087
Guarantor: doc. RNDr. Ondřej Bojar, Ph.D.
Class: DS, matematická lingvistika
Informatika Mgr. - Matematická lingvistika
Classification: Informatics > Computer and Formal Linguistics
Is incompatible with: NPFX087
Is interchangeable with: NPFX087
Annotation
Last update: T_UFAL (05.05.2017)
Participants in the seminar will become closely acquainted with methods of machine translation (MT) that rely on automatic processing of (large) training data, as well as with open-source implementations of these methods. We will cover a wide range of approaches organized along two axes: the level of linguistic analysis (uninformed, utilizing morphology, surface and deep syntax) and the depth of the machine learning methods used (classical statistical MT, which decomposes the input into pieces, and neural MT, which models the task end to end).
Aim of the course
Last update: T_UFAL (05.05.2017)

The goal is to provide (1) a broad overview of successful approaches to MT since 1990, including the recent developments due to deep learning after 2015, and (2) detailed technical knowledge of, and practical experience with, one of the approaches or an MT-related tool of the student's choice. The second goal often leads to the publication of the student's work at a relevant workshop.

Course completion requirements
Last update: doc. RNDr. Ondřej Bojar, Ph.D. (17.06.2019)

Key requirements:

Work on a project (alone or in a group of two to three).

Present project results (~30-minute talk).

Write a report (~4-page scientific paper).

Contributions to the grade:

10% homework and activity,

30% written exam,

50% project report,

10% project presentation.

The 'credit' (zápočet) is awarded for continuous work on the project throughout the semester. The 'credit' is not required before taking the written exam.

Final Grade: ≥50% good, ≥70% very good, ≥90% excellent.
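
As an illustrative sketch only (not part of the official course materials), the weights and thresholds above can be read as a weighted average of the four component scores. The function name `final_grade`, the 0-100 scale for each component, and the label returned for totals under 50% are assumptions; the source states neither how components are scored nor what happens below 50%.

```python
# Weights in percent, taken from "Contributions to the grade" above.
WEIGHTS = {
    "homework_and_activity": 10,
    "written_exam": 30,
    "project_report": 50,
    "project_presentation": 10,
}

# Thresholds from the "Final Grade" line, checked from highest to lowest.
THRESHOLDS = [(90, "excellent"), (70, "very good"), (50, "good")]


def final_grade(scores):
    """Return (total_percent, label) for component scores given on a 0-100 scale.

    `scores` maps each key of WEIGHTS to a percentage (assumed scale).
    """
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
    for minimum, label in THRESHOLDS:
        if total >= minimum:
            return total, label
    # Assumption: the source does not name the outcome below 50%.
    return total, "below 50%"


example = {
    "homework_and_activity": 80,
    "written_exam": 60,
    "project_report": 90,
    "project_presentation": 100,
}
print(final_grade(example))
```

With these hypothetical scores the project report alone contributes 45 of the 81 points, which reflects how heavily the project is weighted.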

Literature
Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (29.01.2019)
  • Philipp Koehn: Statistical Machine Translation. Cambridge University Press. ISBN: 978-0521874151, 2009.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst: Moses: Open Source Toolkit for Statistical Machine Translation, Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007. http://www.statmt.org/moses/
  • Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Ondřej Bojar, Chris Callison-Burch, Brooke Cowan, Chris Dyer, Hieu Hoang, Richard Zens, Alexandra Constantin, Christine Moran, and Evan Herbst: Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Confusion Network Decoding. Technical report, Johns Hopkins University, Center for Speech and Language Processing, 2006. http://ufal.mff.cuni.cz/~bojar/publications/2006-FILE-koehn_etal_jhuws_2006-2006-jhu-report.pdf
  • Ondřej Bojar: Exploiting Linguistic Data in Machine Translation. PhD thesis, ÚFAL, MFF UK, Prague, Czech Republic, October 2008. http://ufal.mff.cuni.cz/~bojar/publications/2008-FILE-bojar_phd-FINAL.pdf
  • Bonnie J. Dorr, Pamela Jordan, John W. Benoit: A Survey of Current Paradigms in Machine Translation, 1998.
  • Philipp Koehn, Franz Josef Och and Daniel Marcu: Statistical Phrase-Based Translation. 2003. http://people.csail.mit.edu/people/koehn/publications/phrase2003.pdf
  • Zhifei Li, Chris Callison-Burch, Sanjeev Khudanpur, Wren Thornton: Decoding in Joshua: Open Source, Parsing-Based Machine Translation. PBML 91, 2009. http://ufal.mff.cuni.cz/pbml/91/art-li.pdf
  • Vamshi Ambati, Alon Lavie: Improving Syntax-Driven Translation Models by Re-structuring Divergent and Nonisomorphic Parse Tree Structures. In Proceedings of AMTA 2008, 235-244. http://www.mt-archive.info/AMTA-2008-Ambati.pdf
  • Further selected papers from conferences (ACL, COLING, etc.) and ÚFAL/CKL technical reports.
Syllabus
Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (29.01.2019)
  1. Evaluating machine translation quality (manually and automatically). Empirical confidence bounds and reliability of MT metrics in general.
  2. Machine translation as a problem in information theory. Translation model, language model, general log-linear model. The space of partial hypotheses and search in that space ("decoding"), phrase-based translation. Open-source toolkit Moses.
  3. Neural MT overview: a direct model of translation probability, subword units, embeddings, sequence-to-sequence model. Open-source toolkits such as Neural Monkey, Nematus, OpenNMT, Marian.
  4. Parallel texts, alignment (sentence and word alignment, IBM models 1 to 3). Open-source tools for corpus preparation and alignment (hunalign, GIZA++).
  5. Neural MT details: attention in sequence-to-sequence models, self-attentive models.
  6. Optimization: tuning the parameters of a log-linear model (Minimum Error Rate Training, MERT). Specifics of training neural MT.
  7. Advanced NMT models: multi-task training, multi-lingual translation, multi-modal translation.
  8. Morphological pre-processing, utilizing morphological information in phrase-based and neural MT.
  9. Phrase-structure syntax in MT, translation based on (context-free) parsing. Generic hypergraph search.
  10. Shallow and deep dependency syntax in MT, including tectogrammatical layer and TectoMT.
  11. Presentation of students' contributions.
Students' contribution and grading:
  • Individuals or groups of two to three students choose a topic early in the term and then set up experiments, implement a modification of an existing MT system, or run baseline experiments with an available prototype of an alternative MT method. Each project concludes with a written report and a presentation of the results in the lectures.
  • The tutorials ("cviceni") of the subject are devoted to practical application of the algorithms and toolkits described, as well as to consultations on students' projects.
  • The final grade reflects knowledge of the topics discussed, the project report, and the project presentation.
 