Creation of a Dependency Treebank for Yoruba using Parallel Data
Thesis title in Czech: | Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat |
---|---|
Title in English: | Creation of a Dependency Treebank for Yoruba using Parallel Data |
Keywords: | dependency syntax, universal dependencies, low-resource languages |
Keywords in English: | dependency parsing, annotation, parallel data, projection, UDPipe, part-of-speech tagging, low-resource |
Academic year of topic announcement: | 2017/2018 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | RNDr. Daniel Zeman, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Department |
Date of registration: | 08.03.2018 |
Date of assignment: | 11.03.2018 |
Date of confirmation by the Study Department: | 07.08.2018 |
Date and time of defence: | 11.09.2018 09:00 |
Date of electronic submission: | 21.07.2018 |
Date of submission of printed version: | 20.07.2018 |
Date the defence took place: | 11.09.2018 |
Opponents: | Mgr. Rudolf Rosa, Ph.D. |
Guidelines |
The goal of the thesis is to create a small dependency treebank for Yoruba, a language with very few pre-existing machine-readable resources. The treebank will follow the Universal Dependencies annotation standard; however, certain language-specific guidelines for Yoruba will have to be specified. Known techniques for porting resources from resource-rich languages will be tested, in particular projection of annotation across parallel bilingual data. Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data will be verified manually in order to evaluate the annotation quality. |
References |
* Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak (2005). Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325.
* Daniel Zeman and Philip Resnik (2008). Cross-Language Parser Adaptation between Related Languages. In: IJCNLP 2008 Workshop on NLP for Less Privileged Languages, pp. 35-42, International Institute of Information Technology, Hyderabad, India.
* Željko Agić, Dirk Hovy, and Anders Søgaard (2015). If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In: The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015).
* Universal Dependencies v2 guidelines (2014-2018): http://universaldependencies.org/ |
Preliminary scope of work in English |
The goal of this thesis is to create a dependency treebank for Yorùbá, a language with very few pre-existing machine-readable resources. The treebank follows the Universal Dependencies (UD) annotation standard; in addition, certain language-specific guidelines for Yorùbá were specified. Known techniques for porting resources from resource-rich languages were tested, in particular projection of annotation across parallel bilingual data.
Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data was verified manually in order to evaluate the annotation quality. In addition, a UDPipe model was trained on the manually annotated data. |
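
The idea of projecting annotation across parallel data can be illustrated with a minimal sketch. It is not the procedure used in the thesis: it assumes a one-to-one word alignment is already available, and the function name `project_tree`, the 0-based head encoding (with -1 for the root), the fallback tag `X`, and the hand-made English-Yorùbá example pair are all hypothetical choices made for illustration.

```python
def project_tree(src_upos, src_heads, alignment, tgt_len):
    """Copy POS tags and dependency heads from a source sentence
    to a target sentence through a word alignment.

    src_upos  : list of UPOS tags, one per source token
    src_heads : list of 0-based head indices per source token, -1 for the root
    alignment : dict mapping source index -> target index (assumed one-to-one)
    tgt_len   : number of tokens in the target sentence
    """
    tgt_upos = ["X"] * tgt_len      # "X" marks tokens that received no tag
    tgt_heads = [None] * tgt_len    # None marks tokens that received no head

    for s, t in alignment.items():
        tgt_upos[t] = src_upos[s]
        h = src_heads[s]
        if h == -1:
            tgt_heads[t] = -1                 # the root stays the root
        elif h in alignment:
            tgt_heads[t] = alignment[h]       # follow the aligned head
        # otherwise the head has no aligned counterpart; leave it as None

    return tgt_upos, tgt_heads


# Hand-made illustration (assumed sentence pair and alignment):
# English: I(0) see(1) the(2) dog(3)
# Yorùbá:  Mo(0) rí(1) ajá(2) náà(3)
src_upos = ["PRON", "VERB", "DET", "NOUN"]
src_heads = [1, -1, 3, 1]
alignment = {0: 0, 1: 1, 2: 3, 3: 2}

tgt_upos, tgt_heads = project_tree(src_upos, src_heads, alignment, tgt_len=4)
print(tgt_upos)   # ['PRON', 'VERB', 'NOUN', 'DET']
print(tgt_heads)  # [1, -1, 1, 2]
```

Real alignments are rarely one-to-one, so unaligned or multiply aligned tokens need extra handling, which is one reason projected annotation is typically checked against a manually verified sample.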