Thesis Topics (Thesis Selection)

Towards Machine Translation Based on Monolingual Texts

Thesis title in Czech:	Strojový překlad na základě jednojazyčných textů
Thesis title in English:	Towards Machine Translation Based on Monolingual Texts
Keywords in Czech:	strojový překlad | neřízené učení | hluboké neuronové sítě | nízkozdrojové jazyky | zpracování přirozeného jazyka
Keywords in English:	machine translation | unsupervised learning | deep neural networks | low-resource languages | natural language processing
Academic year of topic announcement:	2017/2018
Thesis type:	dissertation
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Ondřej Bojar, Ph.D.
Author:	hidden - assigned and confirmed by the Study Department
Date of registration:	27.09.2018
Date of assignment:	27.09.2018
Date of confirmation by the Study Department:	29.10.2018
Date and time of defence:	09.02.2024 09:00
Date of electronic submission:	12.11.2023
Date of printed submission:	14.11.2023
Date of completed defence:	09.02.2024
Opponents:	Dr. Cristina Espana-Bonet
	RNDr. Martin Čmejrek, Ph.D.

Guidelines

The current state of the art in data-driven machine translation (both classical statistical MT and neural MT, NMT) relies heavily on parallel data, i.e. texts that have previously been translated by humans. This type of resource is usually not a natural by-product of other activities, except in limited domains such as nation-wide regulations in countries with more than one official language. Some news agencies do provide stories in two or more languages, but in general the construction of parallel corpora is expensive.

So-called 'comparable corpora' are a different kind of resource. They contain different texts in various languages, but thanks to some particular commonality, the texts in all the languages are known to describe the same events, things, etc. While a comparable corpus may contain no strictly parallel sentences, it is safe to assume that at least translations of individual words or phrases will be present. Examples of comparable corpora include Wikipedia or top news articles published around the same date. The literature offers many methods for extracting translational units from such text collections or for providing them with additional statistics.

The aim of the thesis is to investigate methods of training MT using monolingual texts. The goal can be approached from a wide range of angles, e.g. focusing on methods for finding and obtaining well-matching sources from the web, relying on very large existing collections of text (e.g. CommonCrawl) and devising methods that extract the best-matching sentence pairs, or designing novel ways of NMT training that benefit from monolingual data. The developed methods should be as language-independent as possible, but during development the main focus will be on a few language pairs, e.g. Czech-Japanese.

An inherent part of the thesis is a careful evaluation in a range of settings. The mentioned Czech-Japanese could serve as a rather low-resource language pair. English-Czech, English-French, or another relevant pair will also be tested to check the utility of the method for languages where large parallel data are readily available. The main domain will be news texts, but it would also be interesting to apply the system in a cross-domain setting or to consider existing domain-adaptation techniques.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. CoRR, abs/1710.11041.

Jakub Kúdela, Irena Holubová, and Ondřej Bojar. 2017. Extracting parallel paragraphs from Common Crawl. The Prague Bulletin of Mathematical Linguistics, (107):36–59.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 130–140, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition - Volume 9, ULA '02, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthew Garvey Snover. 2010. Improving Statistical Machine Translation Using Comparable Corpora. Doctoral dissertation, University of Maryland.

Ivan Vulić and Marie-Francine Moens. 2012. Detecting highly confident word translations from comparable corpora without any prior knowledge. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 449–459, Stroudsburg, PA, USA. Association for Computational Linguistics.