Competing in Machine Translation - NPFL101

Title:	Soutěžní strojový překlad
Guaranteed by:	Institute of Formal and Applied Linguistics (32-UFAL)
Faculty:	Faculty of Mathematics and Physics
Actual:	from 2020
Semester:	winter
E-Credits:	3
Hours per week, examination:	winter s.:0/2, C [HT]
Capacity:	unlimited
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
State of the course:	taught
Language:	Czech, English
Teaching methods:	full-time
Additional information:	https://ufal.mff.cuni.cz/courses/npfl101
Note:	you can enroll for the course repeatedly; course can be enrolled in outside the study plan; enabled for web enrollment

Guarantor:	doc. RNDr. Ondřej Bojar, Ph.D.
Teacher(s):	doc. RNDr. Ondřej Bojar, Ph.D.
Class:	Informatics Bc.; Informatics Mgr. - Mathematical Linguistics
Classification:	Informatics > Computer and Formal Linguistics
Pre-requisite :	NSWI095

Annotation -

The seminar can serve as a supplement to Unix classes or as a very practical introduction to some aspects of computational linguistics. We will collectively improve existing tools and systems for statistical machine translation, including neural machine translation, and take part in competitions like http://www.statmt.org/wmt18/. Our primary focus will be on Czech and English, but other languages will be considered based on the interest of participants. Practically speaking, the seminar consists of scripting and operating a diverse collection of research tools and tackling a wide range of techn

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (25.01.2018)

Course completion requirements -

You can enroll in the NPFL101 seminar repeatedly, i.e. in more than one year.

Every year, the key requirement for receiving the credit is to submit a report describing your project for the seminar. Depending on your particular project, we may also agree on a presentation at the seminar, which then contributes content to your report.

The report shall be at least 2 to 4 pages long and include a proper introduction (the "big picture" of what your work is contributing to), technical details, and a standard conclusion. You can work on your project alone or in a small group, as agreed at the seminar.

If the resulting project leads to a workshop or a conference paper, there is no need to write a separate report.

The submission of the report is an iterative process: you send me a draft, and I will typically ask you for minor or major revisions. We iterate until the report is well written and rounded and I accept it. In this sense, the report can be "submitted" many times.

Last update: Bojar Ondřej, doc. RNDr., Ph.D. (10.10.2017)

Literature -

Bojar Ondřej, Chatterjee Rajen, Federmann Christian, Graham Yvette, Haddow Barry, Huang Shujian, Huck Matthias, Koehn Philipp, Liu Qun, Logacheva Varvara, Monz Christof, Negri Matteo, Post Matt, Rubino Raphael, Specia Lucia, Turchi Marco: Findings of the 2017 Conference on Machine Translation (WMT17). In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-945626-96-8, pp. 169-214, 2017.

http://www.statmt.org/wmt18/

Philipp Koehn: Statistical Machine Translation. Cambridge University Press. ISBN: 978-0521874151, 2009.

also including the chapter on neural MT: https://arxiv.org/abs/1709.07809

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (25.01.2018)

Syllabus -

At the seminar, we will improve machine translation systems (especially translation into Czech) and take part in annual translation competitions like http://www.statmt.org/wmt18/. Our systems have repeatedly achieved relatively good results, and we won in three consecutive years, 2013-2015, beating Google Translate among others.

Statistical machine translation is a challenging task, especially in terms of the volume of data processed. It is quite common to work in parallel on dozens of computers, and a single experiment can easily need 100 GB of disk and 100 GB of RAM. Neural machine translation, in turn, requires GPUs with at least 8 GB of RAM and training for days or weeks.

We will rely on existing tools that are implemented in a mixture of languages such as Python, C/C++, Perl, Bash, and others. Very often, we will parallelize the calculations on the computing cluster of the department or MetaCentrum, including powerful graphics cards (GPUs).

During the semester, we will collectively improve open-source machine translation systems. People interested in natural language processing or deep learning will focus on analyzing or designing tricks and modifying models for better translation quality; general software engineers can focus on the infrastructure of the experimentation environment or the optimization of existing tools.

The seminar assumes only high school knowledge of the formal description of natural languages.

The seminar will take place at the Unix laboratory.

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (25.01.2018)