Universal Corpus and NLP tools
Thesis title in Czech: | |
---|---|
Thesis title in English: | Universal Corpus and NLP tools |
Keywords in Czech: | korpus, zpracování přirozeného jazyka |
Keywords in English: | corpus, natural language processing |
Academic year of topic announcement: | 2015/2016 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Pavel Pecina, Ph.D. |
Author: |
Guidelines |
A Universal Corpus can be understood as an immensely large corpus that contains billions of words in thousands of languages and uses a common data format that enables annotation on various layers. Such a corpus would allow the implementation and easy training of statistical NLP tools (e.g. part-of-speech taggers and parsers) for a wide range of languages. However, building a complete Universal Corpus is a long-term project that requires considerable effort. This research therefore focuses on creating a smaller-scale prototype of the Universal Corpus and a limited set of NLP tools based on data from this corpus.
The goals of the project include: 1) theoretical research into existing approaches to building a corpus, 2) a proposal of an appropriate data model/format for the Universal Corpus, 3) building a prototype of the Universal Corpus using data in tens of languages, and 4) implementation and evaluation of NLP tools for processing data in the Universal Corpus (language identification and part-of-speech tagging). |
References |
(1) Steven P. Abney and Steven Bird. The human language project: Building a universal corpus of the world's languages. In Association for Computational Linguistics (ACL), pages 88-97, Uppsala, Sweden, 2010.
(2) Steven Abney and Steven Bird. Towards a data model for the universal corpus. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, BUCC '11, pages 120-127, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
(3) Timothy Baldwin, Jonathan Pool, and Susan M. Colowick. PanLex and LEXTRACT: Translating all words of all languages of the world. In Conference on Computational Linguistics (COLING) (Demos), pages 37-40, Beijing, China, 2010.
(4) Steven Bird. A scalable method for preserving oral literature from small languages. In International Conference on Asia-Pacific Digital Libraries (ICADL), pages 5-14, Gold Coast, Australia, 2010.
(5) Timothy Baldwin, Marco Lui. Fast, accurate standalone language identification toolkit. ALTA Conference 2012. |