Thesis (Selection of subject) (version: 368)
Thesis details
Universal Corpus and NLP tools
Thesis title in Czech:
Thesis title in English: Universal Corpus and NLP tools.
Key words: korpus, zpracování přirozeného jazyka
English key words: corpus, natural language processing
Academic year of topic announcement: 2015/2016
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Pavel Pecina, Ph.D.
Author:
Guidelines
A Universal Corpus can be understood as an immensely large corpus that contains billions of words in thousands of languages and uses a common data format that supports annotation on various layers. Such a corpus would allow statistical NLP tools (e.g. part-of-speech taggers and parsers) to be implemented and easily trained for a wide range of languages. However, building a complete Universal Corpus is a long-term project requiring a lot of effort. This thesis therefore focuses on creating a smaller-scale prototype of the Universal Corpus and a limited set of NLP tools based on data from this corpus.

The goals of the project are: 1) theoretical research into existing approaches to building a corpus; 2) a proposal of an appropriate data model/format for the Universal Corpus; 3) building a prototype of the Universal Corpus using data in tens of languages; 4) implementation and evaluation of NLP tools for processing data in the Universal Corpus (language identification and part-of-speech tagging).
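As an illustration only, a common data format with several annotation layers could be sketched as below. This is a minimal sketch, not the data model proposed in the thesis or in the references: it assumes one label per token in each layer, and all class, field, and layer names (Document, Corpus, layers, "pos") are hypothetical, as is the use of ISO 639-3 language codes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """One text in one language; annotations are kept as separate layers."""
    doc_id: str
    language: str  # assumption: an ISO 639-3 code such as "eng" or "ces"
    tokens: List[str]
    # Hypothetical layout: layer name -> exactly one label per token.
    layers: Dict[str, List[str]] = field(default_factory=dict)

    def add_layer(self, name: str, labels: List[str]) -> None:
        """Attach an annotation layer, e.g. part-of-speech tags."""
        if len(labels) != len(self.tokens):
            raise ValueError("a layer needs exactly one label per token")
        self.layers[name] = labels


@dataclass
class Corpus:
    """A collection of documents sharing the same format across languages."""
    documents: List[Document] = field(default_factory=list)

    def by_language(self, language: str) -> List[Document]:
        """Return all documents written in the given language."""
        return [d for d in self.documents if d.language == language]


# Usage: one English document with a part-of-speech layer.
doc = Document("d1", "eng", ["Dogs", "bark"])
doc.add_layer("pos", ["NOUN", "VERB"])
corpus = Corpus([doc])
```

Keeping each layer separate from the token list means that a new annotation type can be added for some languages without changing the format of documents that lack it.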
References
(1) Steven P. Abney and Steven Bird. The human language project: Building a universal corpus of the world's languages. In Association for Computational Linguistics (ACL), pages 88-97, Uppsala, Sweden, 2010.
(2) Steven Abney and Steven Bird. Towards a data model for the universal corpus. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, BUCC '11, pages 120-127, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
(3) Timothy Baldwin, Jonathan Pool, and Susan M. Colowick. PanLex and LEXTRACT: Translating all words of all languages of the world. In Conference on Computational Linguistics (COLING) (Demos), pages 37-40, Beijing, China, 2010.
(4) Steven Bird. A scalable method for preserving oral literature from small languages. In International Conference on Asia-Pacific Digital Libraries (ICADL), pages 5-14, Gold Coast, Australia, 2010.
(5) Timothy Baldwin and Marco Lui. Fast, accurate standalone language identification toolkit. ALTA Conference, 2012.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html