Thesis topics (Thesis selection)

Universal Corpus and NLP tools

Thesis title in Czech:
Thesis title in English:	Universal Corpus and NLP tools
Keywords in Czech:	korpus, zpracování přirozeného jazyka
Keywords in English:	corpus, natural language processing
Academic year of topic announcement:	2015/2016
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Pavel Pecina, Ph.D.
Author:

Guidelines

A Universal Corpus can be understood as an immensely large corpus that contains billions of words in thousands of languages and uses a common data format that enables annotation on various different layers. Such a corpus would allow statistical NLP tools (e.g. part-of-speech taggers and parsers) to be implemented and easily trained for a wide range of languages. However, building a complete Universal Corpus is a long-term project requiring a lot of effort. This research therefore focuses on creating a smaller-scale prototype of the Universal Corpus and a limited set of NLP tools based on data from this corpus.

The goals of the project are: 1) a survey of existing approaches to building a corpus; 2) a proposal of an appropriate data model and format for the Universal Corpus; 3) building a prototype of the Universal Corpus using data in tens of languages; 4) implementation and evaluation of NLP tools for processing data in the Universal Corpus (language identification and part-of-speech tagging).
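
As an illustration only, goals 2 and 4 might be prototyped along the following lines. This is a minimal sketch in Python: the `Document` class, its field names, the three-letter language codes, and the character-trigram overlap score are all assumptions made for this example. They are not the data model proposed in the literature below, nor the method used by any of the listed tools.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """One corpus entry: tokens plus named annotation layers (hypothetical model)."""
    doc_id: str
    language: str  # assumed to be a three-letter language code such as "eng"
    tokens: list[str]
    layers: dict[str, list[str]] = field(default_factory=dict)

    def add_layer(self, name: str, labels: list[str]) -> None:
        # Each layer must supply exactly one label per token.
        if len(labels) != len(self.tokens):
            raise ValueError("layer length must match token count")
        self.layers[name] = labels


def trigram_profile(text: str) -> Counter:
    """Count character trigrams in lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def build_profiles(docs: list[Document]) -> dict[str, Counter]:
    """Merge the trigram counts of all documents that share a language."""
    profiles: dict[str, Counter] = {}
    for doc in docs:
        text = " ".join(doc.tokens)
        profiles.setdefault(doc.language, Counter()).update(trigram_profile(text))
    return profiles


def identify_language(text: str, profiles: dict[str, Counter]) -> str:
    """Return the language whose profile overlaps most with the text's trigrams."""
    query = trigram_profile(text)

    def overlap(profile: Counter) -> int:
        return sum(min(count, profile[gram]) for gram, count in query.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))
```

A usage example on toy data: build two documents, attach a part-of-speech layer to one of them with `add_layer("pos", [...])`, call `build_profiles` on the list, and pass the result to `identify_language`. Storing every annotation layer as a list parallel to the tokens keeps the format uniform across languages; a real system would need far larger profiles and a proper statistical model.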

References

(1) Steven P. Abney and Steven Bird. The human language project: Building a universal corpus of the world's languages. In Association for Computational Linguistics (ACL), pages 88-97, Uppsala, Sweden, 2010.
(2) Steven Abney and Steven Bird. Towards a data model for the universal corpus. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, BUCC '11, pages 120-127, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
(3) Timothy Baldwin, Jonathan Pool, and Susan M. Colowick. PanLex and LEXTRACT: Translating all words of all languages of the world. In International Conference on Computational Linguistics (COLING) (Demos), pages 37-40, Beijing, China, 2010.
(4) Steven Bird. A scalable method for preserving oral literature from small languages. In International Conference on Asia-Pacific Digital Libraries (ICADL), pages 5-14, Gold Coast, Australia, 2010.
(5) Timothy Baldwin and Marco Lui. Fast, accurate standalone language identification toolkit. ALTA Conference, 2012.