Finding errors and inconsistencies in the CorefUD coreference dataset
Thesis title in Czech: | Hledání chyb a nekonzistencí v koreferenčním datasetu CorefUD |
---|---|
Thesis title in English: | Finding errors and inconsistencies in the CorefUD coreference dataset |
Keywords in Czech: | koreference, detekce chyb v anotaci, CorefUD |
Keywords in English: | coreference, annotation error detection, CorefUD |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | bachelor's thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | Mgr. Martin Popel, Ph.D. |
Author: | hidden |
Date of registration: | 20.12.2022 |
Date of assignment: | 20.12.2022 |
Date of confirmation by the Study Department: | 08.03.2023 |
Date and time of defence: | 05.09.2024 09:00 |
Date of electronic submission: | 18.07.2024 |
Date of printed submission: | 18.07.2024 |
Date of completed defence: | 05.09.2024 |
Opponents: | Mgr. Michal Novák, Ph.D. |
Guidelines |
The goal of this thesis is to find and correct errors in the coreference annotation of CorefUD 1.0.
CorefUD (https://ufal.mff.cuni.cz/corefud/) is a collection of datasets in 11 languages that include coreference annotation. This annotation contains several types of errors: some are already present in the original datasets, while others are caused by bugs in the conversion from the original format. The thesis will attempt to find, classify and correct several types of these errors. The implementation will use the Udapi framework for Python (https://udapi.github.io). Possible error types include: wrong or inconsistent spans of coreference mentions; errors or inconsistencies caused by limitations of the original dataset format (e.g. with regard to discontinuous mentions, singletons or coordination structures); mismatches between dependency parsing and mention spans; and suspicious configurations of multiple mentions (crossing or interleaved spans). |
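As an illustration of the last error type, the following minimal sketch classifies pairs of mentions as nested, crossing, interleaved or separate. It deliberately does not use Udapi: each mention is represented as a plain set of word indices within one sentence, and the function names `classify_pair` and `find_suspicious` are hypothetical. The definitions used here (crossing = overlapping with neither mention containing the other; interleaved = disjoint with neither mention entirely preceding the other) are assumptions made for this sketch, not necessarily the definitions adopted in the thesis.

```python
from itertools import combinations


def classify_pair(a, b):
    """Classify the relation between two mentions.

    Each mention is a non-empty set of word indices within one sentence;
    a discontinuous mention simply has gaps in its indices.
    (Hypothetical helper, assumed definitions, see the lead-in.)
    """
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("a mention must contain at least one word")
    if a & b:
        # Overlapping mentions are unproblematic if one contains the other.
        return "nested" if a <= b or b <= a else "crossing"
    if max(a) < min(b) or max(b) < min(a):
        # One mention ends before the other starts.
        return "separate"
    # Disjoint, yet neither mention entirely precedes the other.
    return "interleaved"


def find_suspicious(mentions):
    """Return (i, j, label) for every crossing or interleaved pair."""
    found = []
    for (i, a), (j, b) in combinations(enumerate(mentions), 2):
        label = classify_pair(a, b)
        if label in ("crossing", "interleaved"):
            found.append((i, j, label))
    return found


if __name__ == "__main__":
    # Toy data: mentions 0 and 1 cross, mentions 2 and 3 interleave.
    toy_mentions = [{0, 1, 2}, {2, 3}, {5, 7}, {6, 8}]
    for i, j, label in find_suspicious(toy_mentions):
        print(f"mentions {i} and {j}: {label}")
```

The pairwise comparison is quadratic in the number of mentions, which is usually acceptable when it is run sentence by sentence. In a real pipeline the index sets would be derived from the mention spans stored in the CorefUD files.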
References |
Nedoluzhko Anna, Novák Michal, Popel Martin, Žabokrtský Zdeněk, Zeldes Amir, Zeman Daniel: CorefUD 1.0: Coreference Meets Universal Dependencies. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Copyright © European Language Resources Association, Marseille, France, ISBN 979-10-95546-72-6, pp. 4859-4872, 2022. https://aclanthology.org/2022.lrec-1.520/
Nedoluzhko Anna, Novák Michal, Popel Martin, Žabokrtský Zdeněk, Zeman Daniel: Is one head enough? Mention heads in coreference annotations compared with UD-style heads. In: Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021), Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-955917-14-8, pp. 101-114, 2021. https://aclanthology.org/2021.depling-1.10/
Popel Martin, Žabokrtský Zdeněk, Nedoluzhko Anna, Novák Michal, Zeman Daniel: Do UD Trees Match Mention Spans in Coreference Annotations? In: Findings of the Association for Computational Linguistics: EMNLP 2021, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-955917-10-0, pp. 3570-3576, 2021. https://aclanthology.org/2021.findings-emnlp.303/
Novák Václav, Razimova Magda: Unsupervised detection of annotation inconsistencies using apriori algorithm. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), 2009. https://aclanthology.org/W09-3024.pdf |