Assessing the impact of manual corrections in the Groningen Meaning Bank
Thesis title (Czech record): | Assessing the impact of manual corrections in the Groningen Meaning Bank |
---|---|
Title in English: | Assessing the impact of manual corrections in the Groningen Meaning Bank |
Keywords (Czech): | korpus, slovní druhy, anotace, opravy, NLP |
Keywords (English): | corpus, part-of-speech, annotation, correction, NLP |
Academic year of announcement: | 2014/2015 |
Thesis type: | diploma thesis |
Language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Markéta Lopatková, Ph.D. |
Author: | hidden |
Date of registration: | 05.03.2015 |
Date of assignment: | 05.03.2015 |
Confirmed by Study dept. on: | 16.03.2015 |
Date and time of defence: | 03.02.2016 09:00 |
Date of electronic submission: | 03.12.2015 |
Date of submission of printed version: | 04.12.2015 |
Date of defence: | 03.02.2016 |
Opponents: | doc. Mgr. Barbora Vidová Hladká, Ph.D. |
Guidelines |
Developing large-scale annotated text corpora usually comes at great cost. Employing human experts to annotate linguistic phenomena and ensuring inter-annotator agreement costs time and money, which are often limited. Nonetheless, there is a constant need for gold-standard corpora suitable for many NLP tasks (e.g. as training data for supervised machine learning). The Groningen Meaning Bank (GMB) project (Basile et al., 2012) is developing a corpus of English texts with rich syntactic and semantic annotations, which are generated semi-automatically. Annotations in the GMB come from two main sources. Initial annotations are provided by a set of NLP tools, the C&C tool-chain (Curran et al., 2007). These annotations are then corrected and refined by human annotators, who are either experts, applying corrections in a wiki-like fashion, or non-experts, who help with annotation by playing a game with a purpose called Wordrobe. Within the GMB project, these corrections are called Bits of Wisdom, or BOWs for short. At the moment there are more than 150,000 BOWs that fix tokenization, syntactic annotation (e.g. POS tags, NE tags), and semantic annotation.
The main question of the thesis is: how can the BOWs be used to effectively retrain the tools and eventually improve annotations on the whole GMB corpus? The underlying hypothesis is that (near) gold-standard annotations can be acquired by applying an iterative bootstrapping approach to the annotation. The most obvious step in utilizing the information available from the BOWs is simply to retrain the statistical models used in the tagging process on the corrected data from the corpus. Improvement in performance can then be measured by the number of correct tags with respect to the BOWs. Besides the quantitative evaluation, a qualitative evaluation might prove beneficial for determining useful factors that influence the training. It might, for example, be interesting to look for recurring patterns in the mistakes the taggers make and for techniques that help to fix them. An alternative to using BOWs as an evaluation measure is to use held-out data from the corpus as a test set; thanks to the great number of BOWs, this data can be considered (almost) gold standard. The main focus of this work will be on the morphological and syntactic annotations of the corpus, since a sophisticated system for tokenization already exists. Correct syntactic annotation is, moreover, a major precondition for correct semantic annotation. The work will focus primarily on Part-of-Speech (POS) annotation. POS annotation does not require any other layer of annotation and is thus not influenced by mistakes made on other layers. The correctness of the POS tags, however, influences the quality of annotation on other layers, such as Named Entity (NE) annotation. Improving the POS tagging is hence deemed an interesting starting point. Once a set of techniques shows promising results on POS tagging, it might be interesting to try to transfer that technique to NE tagging. |
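The retraining-and-evaluation cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the GMB's actual interfaces: the dictionary-based representation of tags and BOWs, and the function names `apply_bows` and `bow_accuracy`, are assumptions made for the example.

```python
# Illustrative sketch of one bootstrapping step: apply BOW corrections
# to automatic tags (producing training data for the next iteration)
# and measure tagger accuracy against the BOWs.
# Data structures are hypothetical, not the GMB's real format.

def apply_bows(tags, bows):
    """Return a copy of the automatic tags with BOW corrections applied.

    tags: {token_id: tag} produced by the tagger
    bows: {token_id: corrected_tag} contributed by annotators
    """
    corrected = dict(tags)
    corrected.update(bows)          # BOWs override the automatic tags
    return corrected

def bow_accuracy(tags, bows):
    """Fraction of BOW-covered tokens where the tagger already agrees."""
    if not bows:
        return 1.0
    hits = sum(1 for tok, tag in bows.items() if tags.get(tok) == tag)
    return hits / len(bows)

# Toy example: four automatically tagged tokens, two BOW corrections.
auto = {1: "NN", 2: "VBZ", 3: "DT", 4: "NN"}
bows = {2: "VBD", 4: "NN"}

gold = apply_bows(auto, bows)   # corrected data for retraining
acc = bow_accuracy(auto, bows)  # 0.5: the tagger agrees on one of two BOWs
```

In a full bootstrapping loop, the corrected data would be fed back as training material for the tagger, and the BOW accuracy tracked across iterations to measure improvement.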
Literature |
Basile, V., Bos, J., Evang, K., and Venhuizen, N. (2012). Developing a large semantically annotated corpus. In LREC, volume 12, pages 3196-3200.
Curran, J. R., Clark, S., and Bos, J. (2007). Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 33-36. Association for Computational Linguistics.
The Groningen Meaning Bank: http://gmb.let.rug.nl |