Thesis (Selection of subject)

Unsupervised and Semi-Supervised Multilingual Learning for Resource-Poor Languages

Thesis title in Czech:	Unsupervised and Semi-Supervised Multilingual Learning for Resource-Poor Languages
Thesis title in English:	Unsupervised and Semi-Supervised Multilingual Learning for Resource-Poor Languages
Key words:	přirozený jazyk, strojové učení, morfologie, syntaxe
English key words:	natural language, machine learning, morphology, syntax
Academic year of topic announcement:	2011/2012
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	RNDr. Daniel Zeman, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	23.10.2011
Date of assignment:	25.10.2011
Confirmed by Study dept. on:	11.11.2011
Date and time of defence:	07.09.2012 09:00
Date of electronic submission:	03.08.2012
Date of submission of printed version:	03.08.2012
Date defence took place:	07.09.2012
Opponents:	doc. Mgr. Barbora Vidová Hladká, Ph.D.
Advisors:	doc. Ing. Zdeněk Žabokrtský, Ph.D.

Guidelines

The goal of the thesis is to explore methods of natural language analysis (e.g. part-of-speech tagging, morphological analysis, and syntactic parsing) for languages with few or no linguistically annotated resources. Possible approaches include, but are not limited to, the following:

1. Unsupervised monolingual methods. Reimplement and test published algorithms for unsupervised learning of linguistic structure (POS tagging, parsing).
2. Multilingual learning. Existing resources for resource-rich languages are reused for new languages by projecting linguistic structure across aligned parallel corpora.

The two approaches could also be combined. For instance, two languages would first be tagged in an unsupervised fashion to obtain a common set of coarse-grained part-of-speech tags; a parser would then be projected from the resource-rich language using parallel alignment and the common tagset (as in McDonald et al. 2011).
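
As an illustration of porting structure across an aligned parallel corpus, the following minimal Python sketch projects part-of-speech tags from a tagged source sentence onto a target sentence. It is only one possible reading of the approach, not a prescribed method. The function name project_tags, the representation of the alignment as (source index, target index) pairs, the majority-vote rule, and the fallback tag "X" for unaligned words are all assumptions made for this example.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_length, default_tag="X"):
    """Project POS tags from source tokens onto target tokens.

    source_tags   -- list of tags, one per source token
    alignment     -- list of (source_index, target_index) pairs (assumed format)
    target_length -- number of tokens in the target sentence
    default_tag   -- tag given to target tokens with no aligned source token
    """
    # Collect one vote per alignment link for each target position.
    votes = [Counter() for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1

    # Each target token receives its most frequently aligned source tag.
    projected = []
    for counter in votes:
        if counter:
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(default_tag)
    return projected


# Hypothetical three-token source sentence aligned to a four-token target.
source_tags = ["DET", "NOUN", "VERB"]
alignment = [(0, 0), (1, 1), (2, 3)]
print(project_tags(source_tags, alignment, target_length=4))
```

The projected tags could then serve as noisy training data for a target-language tagger. In practice, alignment errors would probably have to be filtered or smoothed before training.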

The work should include an objective evaluation on at least one language for which annotated resources are available for testing purposes. A sample application to one or more resource-poor languages, with subjective evaluation and discussion, would be a plus.
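
One common way to evaluate unsupervised tagging objectively against gold annotation is many-to-one accuracy: each induced cluster is mapped to the gold tag it co-occurs with most often, and the mapped labels are then scored as ordinary tagging accuracy. The sketch below is a minimal Python illustration of that metric. The choice of metric and the function name many_to_one_accuracy are assumptions for this example; the guidelines do not prescribe a particular measure.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(induced, gold):
    """Score induced cluster labels against gold tags (many-to-one mapping)."""
    if len(induced) != len(gold):
        raise ValueError("induced and gold must have the same length")
    if not gold:
        return 0.0

    # Count how often each induced cluster co-occurs with each gold tag.
    cooccurrence = defaultdict(Counter)
    for cluster, tag in zip(induced, gold):
        cooccurrence[cluster][tag] += 1

    # Map every cluster to its most frequent gold tag.
    mapping = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in cooccurrence.items()
    }

    correct = sum(1 for cluster, tag in zip(induced, gold) if mapping[cluster] == tag)
    return correct / len(gold)


# Hypothetical toy data: two induced clusters, five tokens.
induced = [0, 0, 1, 1, 1]
gold = ["N", "N", "V", "V", "N"]
print(many_to_one_accuracy(induced, gold))
```

Because several clusters may map to the same gold tag, this measure tends to reward a larger number of clusters, so scores are only comparable between systems that induce the same number of clusters.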

References

Benjamin Snyder and Regina Barzilay: Unsupervised Multilingual Learning for Morphological Segmentation, ACL 2008.
Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay: Unsupervised Multilingual Learning for POS Tagging, EMNLP 2008.
Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay: Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches, JAIR 36 (2009).
Benjamin Snyder, Tahira Naseem, and Regina Barzilay: Unsupervised Multilingual Grammar Induction, ACL 2009.
Ryan McDonald, Slav Petrov, and Keith Hall: Multi-Source Transfer of Delexicalized Dependency Parsers, EMNLP 2011.
Daniel Zeman and Philip Resnik: Cross-Language Parser Adaptation Between Related Languages, NLPLPL / IJCNLP 2008.
Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak: Bootstrapping Parsers via Syntactic Projection across Parallel Texts, Natural Language Engineering 11(3):311-325, 2005.