Universal Corpus and NLP tools
Thesis title in Czech: | |
---|---|
Thesis title in English: | Universal Corpus and NLP tools. |
Key words: | korpus, zpracování přirozeného jazyka |
English key words: | corpus, natural language processing |
Academic year of topic announcement: | 2015/2016 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Pavel Pecina, Ph.D. |
Author: |
Guidelines |
A Universal Corpus can be understood as an immensely large corpus that contains billions of words in thousands of languages and uses a common data format that enables annotation on various different layers. Such a corpus would allow the implementation and easy training of statistical NLP tools (e.g. part-of-speech taggers and parsers) for a wide range of languages. However, building a complete Universal Corpus is a long-term project requiring a lot of effort. Thus, this research focuses on creating a smaller-scale prototype of the Universal Corpus and a limited set of NLP tools based on data from this corpus.
The goals of the project include: 1) theoretical research into existing approaches to building a corpus, 2) a proposal of an appropriate data model/format for the Universal Corpus, 3) building a prototype of the Universal Corpus using data in tens of languages, and 4) implementation and evaluation of NLP tools for processing data in the Universal Corpus (language identification and part-of-speech tagging). |
References |
(1) Steven P. Abney and Steven Bird. The human language project: Building a universal corpus of the world's languages. In Association for Computational Linguistics (ACL), pages 88-97, Uppsala, Sweden, 2010.
(2) Steven Abney and Steven Bird. Towards a data model for the universal corpus. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, BUCC '11, pages 120-127, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
(3) Timothy Baldwin, Jonathan Pool, and Susan M. Colowick. Panlex and lextract: Translating all words of all languages of the world. In Conference on Computational Linguistics (COLING) (Demos), pages 37-40, Beijing, China, 2010.
(4) Steven Bird. A scalable method for preserving oral literature from small languages. In International Conference on Asia-Pacific Digital Libraries (ICADL), pages 5-14, Gold Coast, Australia, 2010.
(5) Timothy Baldwin and Marco Lui. Fast, accurate standalone language identification toolkit. ALTA Conference 2012. |