Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT
Thesis title in Czech: | Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT |
---|---|
Thesis title in English: | Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT |
Key words: | závislostní korpusy, detekce chyb, oprava chyb, variační n-gramy |
English key words: | dependency treebanks, error detection, error correction, variation n-grams |
Academic year of topic announcement: | 2013/2014 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. Ing. Zdeněk Žabokrtský, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 27.03.2014 |
Date of assignment: | 27.03.2014 |
Confirmed by Study dept. on: | 02.04.2014 |
Date and time of defence: | 05.06.2015 00:00 |
Date of electronic submission: | 07.05.2015 |
Date of submission of printed version: | 07.05.2015 |
Date the defence took place: | 05.06.2015 |
Opponents: | RNDr. David Mareček, Ph.D. |
Guidelines |
The goal of the work is to improve the quality of the multilingual treebank HamleDT, which contains dependency syntactic structures for thirty languages. First, the student will study the annotation conventions used in the individual resources integrated into HamleDT. After collecting empirical observations on the annotation and transformation flaws present in the current version of HamleDT, the student will design criteria for measuring the quality of the HamleDT data from two viewpoints: the data should be maximally consistent within each language, and at the same time the annotation principles used for the individual languages should be unified as much as possible (within the obvious limits imposed by the typological differences among the languages). The student will implement software tools for detecting and correcting inconsistencies in HamleDT and will evaluate their impact on data quality using statistical measures.
|
References |
Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič: HamleDT: To Parse or Not to Parse? In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Copyright © European Language Resources Association, İstanbul, Turkey, ISBN 978-2-9517408-7-7, pp. 2735-2741, 2012
Markus Dickinson and W. Detmar Meurers (2003). Detecting Inconsistencies in Treebanks. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003). Växjö, Sweden.
Václav Novák, Magda Ševčíková: Unsupervised Detection of Annotation Inconsistencies Using Apriori Algorithm. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), Copyright © Association for Computational Linguistics, Suntec, Singapore, ISBN 978-1-932432-52-7, pp. 138-141, 2009
Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). Increasing the recall of corpus annotation error detection. Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007). Bergen, Norway. |