Thesis (Selection of subject)

Universal Corpus and NLP tools

Thesis title in Czech:
Thesis title in English:	Universal Corpus and NLP tools
Key words:	korpus, zpracování přirozeného jazyka
English key words:	corpus, natural language processing
Academic year of topic announcement:	2015/2016
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Pavel Pecina, Ph.D.
Author:

Guidelines

A Universal Corpus can be understood as an immensely large corpus that contains billions of words in thousands of languages and uses a common data format enabling annotation on various different layers. Such a corpus would allow the implementation and easy training of statistical NLP tools (e.g. part-of-speech taggers, parsers, etc.) for a wide range of languages. However, building a complete Universal Corpus is a long-term project requiring a lot of effort. Thus, this research focuses on creating a smaller-scale prototype of a Universal Corpus and a limited set of NLP tools based on data from this corpus.
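
As an illustration of what "a common data format enabling annotation on various layers" could mean in practice, here is a minimal sketch. It is not the data model proposed in the thesis or in the cited work: the class names `Token` and `Document`, the `layer` helper, the layer name "pos", the tag values, and the use of three-letter language codes are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Token:
    """One token plus its annotations, keyed by layer name (e.g. "pos")."""
    form: str
    annotations: dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """A text in a single language, identified by a language code."""
    doc_id: str
    lang: str  # assumed here to be a three-letter code such as "eng"
    tokens: list[Token] = field(default_factory=list)

    def layer(self, name: str) -> list[str | None]:
        """Return one annotation layer, with None where a token lacks it."""
        return [tok.annotations.get(name) for tok in self.tokens]


# Hypothetical example document: two tokens carry a "pos" annotation,
# the last one has no annotation on that layer.
doc = Document(
    doc_id="example-1",
    lang="eng",
    tokens=[
        Token("Dogs", {"pos": "NOUN"}),
        Token("bark", {"pos": "VERB"}),
        Token("."),
    ],
)
```

Because every document shares the same structure regardless of language, a tool can read a layer with `doc.layer("pos")` without knowing which language it is processing. Missing annotations are represented explicitly, which matters when only some languages are annotated on a given layer.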

The goals of the project include: 1) theoretical research into existing approaches to building a corpus, 2) a proposal of an appropriate data model/format for the Universal Corpus, 3) building a prototype of the Universal Corpus using data in tens of languages, and 4) implementation and evaluation of NLP tools for processing data in the Universal Corpus (language identification and part-of-speech tagging).
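
To make the language identification goal more concrete, the sketch below shows one simple, well-known baseline: comparing character trigram counts with cosine similarity. It is only an illustration under stated assumptions, not the method the thesis is expected to use. The function names, the two toy training sentences, and the language codes are invented for this example, and a usable system would need far more training data per language.

```python
from __future__ import annotations

import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: dict[str, str]) -> dict[str, Counter]:
    """Build one n-gram profile per language from raw sample text."""
    return {lang: char_ngrams(text) for lang, text in samples.items()}


def identify(text: str, profiles: dict[str, Counter]) -> str:
    """Return the language whose profile is most similar to the input text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (hypothetical); real profiles would come from the corpus.
TOY_SAMPLES = {
    "eng": "the quick brown fox jumps over the lazy dog and the cat",
    "ces": "příliš žluťoučký kůň úpěl ďábelské ódy a pes štěká na kočku",
}
profiles = train(TOY_SAMPLES)
```

With these toy profiles, `identify("the dog and the fox", profiles)` selects "eng" and `identify("žluťoučký kůň a pes", profiles)` selects "ces". Training only needs raw text labelled by language, which is why a corpus covering many languages in one format makes it easy to extend such a tool to new languages.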

References

(1) Steven P. Abney and Steven Bird. The human language project: Building a universal corpus of the world's languages. In Association for Computational Linguistics (ACL), pages 88-97, Uppsala, Sweden, 2010.
(2) Steven Abney and Steven Bird. Towards a data model for the universal corpus. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, BUCC '11, pages 120-127, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
(3) Timothy Baldwin, Jonathan Pool, and Susan M. Colowick. PanLex and LEXTRACT: Translating all words of all languages of the world. In Conference on Computational Linguistics (COLING) (Demos), pages 37-40, Beijing, China, 2010.
(4) Steven Bird. A scalable method for preserving oral literature from small languages. In International Conference on Asia-Pacific Digital Libraries (ICADL), pages 5-14, Gold Coast, Australia, 2010.
(5) Timothy Baldwin, Marco Lui. Fast, accurate standalone language identification toolkit. ALTA Conference 2012.