Course, academic year 2017/2018
Competing in Machine Translation - NPFL101
Czech title: Soutěžní strojový překlad
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2015
Semester: winter
E-Credits: 3
Hours per week, examination: winter s.:0/2 C [hours/week]
Capacity: unlimited
Min. number of students: unlimited
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Note: you can enroll for the course repeatedly
the course can be enrolled in outside the study plan
enabled for web enrollment
Guarantor: RNDr. Ondřej Bojar, Ph.D.
Pre-requisite: NSWI095
Annotation
Last update: T_UFAL (09.05.2012)

The seminar can serve as a supplement to Unix classes or as a very practical introduction to some aspects of computational linguistics. We will collectively improve a statistical machine translation system (especially between Czech and English) and take part in competitions such as http://www.statmt.org/wmt12/. Working in a Unix environment, we will script a large collection of available tools and tackle a wide range of technical issues, including the need to parallelize computations over large datasets.
Terms of passing the course
Last update: RNDr. Ondřej Bojar, Ph.D. (10.10.2017)

You can enroll in the NPFL101 seminar repeatedly, i.e. in more than one year.

Every year, the key requirement, for which you will receive the credit, is to submit a report describing your project for the seminar. Depending on your particular project, we may also agree on a presentation at the seminar, which then contributes content to your report.

The report should be at least 2-4 pages long and include a proper introduction (the "big picture" of what your work contributes to), technical details, and a standard conclusion. You can work on your project alone or in a small group, as agreed at the seminar.

If the resulting project leads to a workshop or a conference paper, there is no need to write a separate report.

The submission of the report is an iterative process: you send me a draft, and I will typically ask you for minor or major revisions. We iterate until the report is well written and well rounded and I accept it. In this sense, the report can be "submitted" many times.

Literature
Last update: T_UFAL (09.05.2012)

Chris Callison-Burch, Philipp Koehn, Christof Monz and Omar Zaidan: Findings of the 2011 Workshop on Statistical Machine Translation. EMNLP 2011 Workshop on Statistical Machine Translation. Edinburgh.

http://www.statmt.org/wmt11/

Philipp Koehn: Statistical Machine Translation. Cambridge University Press. ISBN: 978-0521874151, 2009.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst: Moses: Open Source Toolkit for Statistical Machine Translation, Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.

http://www.statmt.org/moses/

Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Ondřej Bojar, Chris Callison-Burch, Brooke Cowan, Chris Dyer, Hieu Hoang, Richard Zens, Alexandra Constantin, Christine Moran, and Evan Herbst: Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Confusion Network Decoding. Technical report, Johns Hopkins University, Center for Speech and Language Processing, 2006.

http://ufal.mff.cuni.cz/~bojar/publications/2006-FILE-koehn_etal_jhuws_2006-2006-jhu-report.pdf

Syllabus
Last update: T_UFAL (09.05.2012)

(To further motivate the participation, the syllabus has been translated from Czech only by machine.)

The seminar will improve machine translation systems (especially the Czech translation) and to participate in the annual competition with them in translation, http://www.statmt.org/wmt12/. With our system, shall consult regularly on the front walls of the competition, Google Translate, but safe from Czech commercial systems.

Statistical machine translation is a particularly demanding task in terms of volume of processed data. Quite generally, therefore, working in parallel on dozens of computers and is not a problem for one experiment effectively use 100 gigabytes drive and 100 GB of RAM. With a small model may translate the same software on the OLPC (One Laptop per Child).

To the maximum extent relying on existing tools that are implemented in a mixture of languages such as Perl, C/C++, Bash, Python, Java. That is why I would like to welcome seminar and pure software engineers, even without any knowledge or interest in computer linguistics.

During the semester we will collectively improve free software implementation of training and the actual translation. In addition to toy models that can be prepared and run on individual computers in the lab will try to create a makeshift lab in the cluster, and compute in parallel. Some efforts will have to pay Cylinders and space so that we can effectively (and parallel) use, without too much strain on the network. Those interested in natural language processing focuses on the design tricks and editing models for better quality translation, others help with the infrastructure and possibly with existing optimization tools.

The seminar assumes knowledge of only secondary formal description of natural languages.

The seminar will take place in the Unix lab.

 