Thesis (Selection of subject) (version: 368)
Thesis details
Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT
Thesis title in Czech: Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT
Thesis title in English: Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT
Key words: závislostní korpusy, detekce chyb, oprava chyb, variační n-gramy
English key words: dependency treebanks, error detection, error correction, variation n-grams
Academic year of topic announcement: 2013/2014
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 27.03.2014
Date of assignment: 27.03.2014
Confirmed by Study dept. on: 02.04.2014
Date and time of defence: 05.06.2015 00:00
Date of electronic submission: 07.05.2015
Date of submission of printed version: 07.05.2015
Date the defence took place: 05.06.2015
Opponents: RNDr. David Mareček, Ph.D.
 
 
 
Guidelines
The goal of the thesis is to improve the quality of the multilingual treebank HamleDT, which contains dependency syntactic structures for thirty languages. First, the student will study the annotation conventions used in the individual resources integrated into HamleDT. After collecting empirical observations about annotation and transformation flaws in the current version of HamleDT, the student will design criteria for measuring the quality of the HamleDT data from two viewpoints: the data should be maximally consistent within each language, and the annotation principles used for the individual languages should be unified as far as possible (within the limits imposed by typological differences among the languages). The student will implement software tools for detecting and correcting inconsistencies in HamleDT and will evaluate their impact on data quality using statistical measures.
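As an illustration of the kind of within-language consistency check the guidelines call for, the following is a minimal sketch loosely inspired by the variation n-gram idea named in the key words. It is not the method prescribed for the thesis: the token format `(form, head, deprel)`, the function name `find_label_variation`, and the toy labels are all assumptions made for this example, and the sketch ignores surrounding context, which the published approaches use to filter out genuine ambiguity.

```python
from collections import defaultdict

# Hypothetical token format: (form, head, deprel). `head` is a 1-based index
# into the sentence; 0 marks the artificial root. The labels are illustrative.
def find_label_variation(sentences):
    """Return the (dependent form, head form) pairs that were annotated
    with more than one distinct dependency label, with label counts."""
    labels = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for form, head, deprel in sentence:
            head_form = "<ROOT>" if head == 0 else sentence[head - 1][0]
            labels[(form.lower(), head_form.lower())][deprel] += 1
    # Pairs with a single label are consistent; keep only the varying ones.
    return {
        pair: dict(counts)
        for pair, counts in labels.items()
        if len(counts) > 1
    }

toy_treebank = [
    [("She", 2, "Sb"), ("reads", 0, "Pred"), ("books", 2, "Obj")],
    [("He", 2, "Sb"), ("reads", 0, "Pred"), ("books", 2, "Adv")],
    [("They", 2, "Sb"), ("sell", 0, "Pred"), ("books", 2, "Obj")],
]

variation = find_label_variation(toy_treebank)
for (dependent, head), counts in variation.items():
    print(f"{dependent} <- {head}: {counts}")
```

A pair flagged this way is only a candidate for manual or rule-based inspection: the same word pair can legitimately carry different relations in different sentences.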

References
Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič (2012). HamleDT: To Parse or Not to Parse? In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association, İstanbul, Turkey, ISBN 978-2-9517408-7-7, pp. 2735-2741.

Markus Dickinson and W. Detmar Meurers (2003). Detecting Inconsistencies in Treebanks. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003). Växjö, Sweden.

Václav Novák, Magda Ševčíková (2009). Unsupervised Detection of Annotation Inconsistencies Using Apriori Algorithm. In Proceedings of the Third Linguistic Annotation Workshop (LAW III). Association for Computational Linguistics, Suntec, Singapore, ISBN 978-1-932432-52-7, pp. 138-141.

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). Increasing the Recall of Corpus Annotation Error Detection. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007). Bergen, Norway.
 