Thesis (Selection of subject)
Thesis details
Assessing the impact of manual corrections in the Groningen Meaning Bank
Thesis title in Czech: Assessing the impact of manual corrections in the Groningen Meaning Bank
Thesis title in English: Assessing the impact of manual corrections in the Groningen Meaning Bank
Key words: corpus, parts of speech, annotation, corrections, NLP
English key words: corpus, part-of-speech, annotation, correction, NLP
Academic year of topic announcement: 2014/2015
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Markéta Lopatková, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 05.03.2015
Date of assignment: 05.03.2015
Confirmed by Study dept. on: 16.03.2015
Date and time of defence: 03.02.2016 09:00
Date of electronic submission: 03.12.2015
Date of submission of printed version: 04.12.2015
Date of proceeded defence: 03.02.2016
Opponents: doc. Mgr. Barbora Vidová Hladká, Ph.D.
Guidelines
Developing large-scale annotated text corpora usually comes at great cost. Employing human experts to annotate linguistic phenomena and ensuring inter-annotator agreement costs time and money, both of which are often limited. Nonetheless, there is a constant need for gold-standard corpora suitable for many NLP tasks (e.g. as training data for supervised machine learning).

The Groningen Meaning Bank (GMB) project (Basile et al., 2012) is developing a corpus of English texts with rich syntactic and semantic annotations, which are generated semi-automatically. Annotations in the GMB come from two main sources. Initial annotations are provided by a set of NLP tools, the C&C tool-chain (Curran et al., 2007). These annotations are then corrected and refined by human annotators, who are either experts, applying corrections in a wiki-like fashion, or non-experts, contributing annotations by playing a game with a purpose called Wordrobe. Within the GMB project, these corrections are called Bits of Wisdom, or BOWs for short. At the moment there are more than 150,000 BOWs fixing tokenization, syntactic annotation (e.g. POS tags, NE tags), and semantic annotation.
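Conceptually, a BOW can be pictured as a small correction record that overrides one automatic annotation at one position. The sketch below illustrates this idea in Python; the Bow class, its fields, and the apply_bows helper are hypothetical simplifications for illustration, not the actual GMB data model:

    from dataclasses import dataclass

    @dataclass
    class Bow:
        """One Bit of Wisdom: a single correction to one annotation layer.
        (Hypothetical simplification; the real GMB data model is richer.)"""
        doc_id: str       # document the correction belongs to
        token_index: int  # position of the affected token
        layer: str        # annotation layer, e.g. "pos" or "ne"
        value: str        # the corrected annotation, e.g. "NNP"

    def apply_bows(annotations, bows):
        """Overlay human corrections on top of automatic annotations.
        `annotations` maps doc_id -> layer -> list of tags."""
        for bow in bows:
            annotations[bow.doc_id][bow.layer][bow.token_index] = bow.value
        return annotations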
The main question of the thesis is:
How can the BOWs be used to effectively retrain the tools to eventually improve annotations on the whole GMB corpus?
The underlying hypothesis is that (near) gold standard annotations can be acquired by applying an iterative bootstrapping approach to the annotation.
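In outline, such a bootstrapping loop alternates three steps: annotate the corpus automatically, overlay the human corrections, and retrain the tools on the corrected data. A minimal sketch follows, reusing the hypothetical apply_bows helper above; tagger.annotate and tagger.train are assumed interfaces standing in for the retrained tools, not the actual C&C tool-chain API:

    def bootstrap(corpus, bows, tagger, iterations=5):
        """Iteratively retrain a tagger on BOW-corrected annotations.
        `tagger.annotate` and `tagger.train` are assumed interfaces."""
        for _ in range(iterations):
            annotations = tagger.annotate(corpus)      # automatic pass
            corrected = apply_bows(annotations, bows)  # overlay corrections
            tagger.train(corpus, corrected)            # retrain on corrected data
        return tagger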

The most obvious step in utilizing the information available from the BOWs is simply to retrain the statistical models used in the tagging process on the corrected data from the corpus. Improvement in performance can then be measured as the proportion of tags that agree with the BOWs.
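One way to operationalize this measure, again assuming the simplified Bow representation sketched earlier:

    def accuracy_wrt_bows(annotations, bows):
        """Fraction of BOW-corrected positions the retrained tagger gets right."""
        if not bows:
            return 0.0
        correct = sum(
            annotations[b.doc_id][b.layer][b.token_index] == b.value
            for b in bows
        )
        return correct / len(bows)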
Besides a quantitative evaluation, a qualitative one might prove beneficial in determining which factors usefully influence the training. It might, for example, be interesting to look for recurring patterns in the mistakes the taggers make and for techniques that help to fix them.
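For instance, tallying (predicted, gold) tag pairs over the corrected tokens quickly surfaces the most frequent confusions. A minimal sketch:

    from collections import Counter

    def error_patterns(predicted, gold, n=10):
        """Count (predicted, gold) tag confusions over aligned token
        sequences to surface recurring tagger mistakes."""
        confusions = Counter(
            (p, g) for p, g in zip(predicted, gold) if p != g
        )
        return confusions.most_common(n)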
An alternative to using the BOWs as an evaluation measure is to use held-out data from the corpus as a test set. Thanks to the large number of BOWs, this data can be considered (almost) gold standard.
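A simple document-level split could look as follows; the helper and its parameters are illustrative assumptions:

    import random

    def heldout_split(doc_ids, test_fraction=0.1, seed=0):
        """Reserve a fraction of documents as a near-gold test set
        and train on the rest (document-level split)."""
        ids = sorted(doc_ids)
        random.Random(seed).shuffle(ids)
        cut = int(len(ids) * (1 - test_fraction))
        return ids[:cut], ids[cut:]

Splitting at the document level rather than the sentence level avoids leaking near-duplicate contexts between training and test data.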

The main focus of this work will be on the morphological and syntactic annotations of the corpus, since a sophisticated system for tokenization already exists. Correct syntactic annotation is, moreover, a prerequisite for correct semantic annotation. The work will focus primarily on Part-of-Speech (POS) annotation. POS annotation does not require any other layer of annotation and is thus not influenced by mistakes made on other layers. The correctness of the POS tags, however, influences the quality of annotation on other layers, such as Named Entities (NE). Improving the POS tagging is hence deemed a promising starting point. Once a set of techniques shows promising results on POS tagging, it might be interesting to try to transfer it to NE tagging.
References
Basile, V., Bos, J., Evang, K., and Venhuizen, N. (2012). Developing a large semantically annotated corpus. In LREC 2012, pages 3196-3200.
Curran, J. R., Clark, S., and Bos, J. (2007). Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 33-36. Association for Computational Linguistics.
Groningen Meaning Bank: http://gmb.let.rug.nl