Unsupervised and Semi-Supervised Multilingual Learning for Resource-Poor Languages
Thesis title in Czech: | Unsupervised and Semi-Supervised Multilingual Learning for Resource-Poor Languages |
---|---|
Thesis title in English: | Unsupervised and Semi-Supervised Multilingual Learning for Resource-Poor Languages |
Key words: | přirozený jazyk, strojové učení, morfologie, syntaxe |
English key words: | natural language, machine learning, morphology, syntax |
Academic year of topic announcement: | 2011/2012 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | RNDr. Daniel Zeman, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 23.10.2011 |
Date of assignment: | 25.10.2011 |
Confirmed by Study dept. on: | 11.11.2011 |
Date and time of defence: | 07.09.2012 09:00 |
Date of electronic submission: | 03.08.2012 |
Date of submission of printed version: | 03.08.2012 |
Date of completed defence: | 07.09.2012 |
Opponents: | doc. Mgr. Barbora Vidová Hladká, Ph.D. |
Advisors: | doc. Ing. Zdeněk Žabokrtský, Ph.D. |
Guidelines |
The goal of the thesis is to explore how methods of natural language analysis (e.g. part-of-speech tagging, morphological and syntactic analysis) can be applied to languages for which few or no linguistically annotated resources are available. Possible approaches include, but are not limited to, the following:
1. Unsupervised monolingual methods: reimplement and test published algorithms for unsupervised learning of linguistic structure (POS tagging, parsing).
2. Multilingual learning: existing resources for resource-rich languages are reused for new languages by porting the structure across aligned parallel corpora.
The two approaches could also be combined. For instance, two languages could first be tagged in an unsupervised fashion to obtain a common set of coarse-grained part-of-speech tags; a parser would then be projected from the resource-rich language using the parallel alignment and the common tagset (as in McDonald et al. 2011). The work should include an objective evaluation on at least one language for which annotated resources are available for testing purposes. A sample application to one or more resource-poor languages, with subjective evaluation and discussion, would be a plus. |
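The idea of porting annotation across an aligned parallel corpus (approach 2) can be illustrated with a minimal sketch for coarse part-of-speech tags. Everything named here is an assumption made for illustration and is not taken from the thesis or the cited papers: the function name `project_tags`, the representation of the alignment as 0-based (source index, target index) pairs, and the fallback tag `X` for unaligned tokens are all hypothetical. A real system would obtain the alignment from a word aligner and would have to deal with alignment noise.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len, unaligned_tag="X"):
    """Project coarse POS tags from a source sentence onto a target sentence.

    source_tags   -- list of tags, one per source token
    alignment     -- iterable of (source_index, target_index) pairs, 0-based
    target_len    -- number of tokens in the target sentence
    unaligned_tag -- hypothetical fallback tag for target tokens with no link
    """
    # One vote counter per target token.
    votes = [Counter() for _ in range(target_len)]
    for src_i, tgt_i in alignment:
        votes[tgt_i][source_tags[src_i]] += 1

    projected = []
    for counter in votes:
        if counter:
            # Majority vote among the tags of all aligned source tokens.
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(unaligned_tag)
    return projected


# Toy example: English "the dog barks" aligned to Czech "pes štěká".
english_tags = ["DET", "NOUN", "VERB"]
links = [(1, 0), (2, 1)]  # dog -> pes, barks -> štěká
print(project_tags(english_tags, links, target_len=2))  # ['NOUN', 'VERB']
```

Majority voting is only one simple way to resolve many-to-one alignments; the projected tags would normally serve as noisy training data for a tagger or parser of the target language rather than as final output.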
References |
Benjamin Snyder and Regina Barzilay: Unsupervised Multilingual Learning for Morphological Segmentation, ACL 2008.
Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay: Unsupervised Multilingual Learning for POS Tagging, EMNLP 2008.
Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay: Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches, JAIR 36 (2009).
Benjamin Snyder, Tahira Naseem, and Regina Barzilay: Unsupervised Multilingual Grammar Induction, ACL 2009.
Ryan McDonald, Slav Petrov, and Keith Hall: Multi-Source Transfer of Delexicalized Dependency Parsers, EMNLP 2011.
Daniel Zeman and Philip Resnik: Cross-Language Parser Adaptation Between Related Languages, NLPLPL / IJCNLP 2008.
Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak: Bootstrapping Parsers via Syntactic Projection across Parallel Texts, Natural Language Engineering 11(3):311-325, 2005. |