Thesis (Selection of subject)

Automatic Error Detection in Treebanks

Thesis title in Czech:	Automatické vyhledávání chyb v syntakticky anotovaných korpusech
Thesis title in English:	Automatic Error Detection in Treebanks
Academic year of topic announcement:	2008/2009
Thesis type:	diploma thesis
Thesis language:
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. Ing. Zdeněk Žabokrtský, Ph.D.
Author:

References

1. Documentation for the Prague Dependency Treebank (Czech), the
Penn Treebank (English), and the Tiger Treebank (German),
available on the web.

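2. Štěpánek, Jan: Závislostní zachycení větné struktury v anotovaném
syntaktickém korpusu (nástroje pro zajištění konzistence dat)
[Dependency-based representation of sentence structure in an annotated
syntactic corpus (tools for ensuring data consistency)],
doctoral dissertation, MFF UK, 2006.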

Preliminary scope of work

V současné době existují syntakticky anotované korpusy pro několik
jazyků (angličtina, čeština, němčina, maďarština, čínština) a pro
další vznikají. Ani u ručně značkovaných
korpusů nelze bohužel zaručit stoprocentní správnost anotace, ať už z
důvodu chyby anotátora, postupných úprav anotačního schématu,
vágních anotačních instrukcí atd., cílem práce je proto navrhnout a
implementovat metodu pro automatické odhalování a klasifikaci
chyb v těchto korpusech. Lze využít hypotézu,
že pokrývají-li dva syntaktické podstromy identickou (nebo v nějakém
ohledu podobnou) posloupnost slov vstupní věty, měly by samy být až na ohodnocení
kořenového uzlu rovněž identické (nebo podobné). Metoda bude
testována a vyhodnocena na existujících korpusech pro nejméně čtyři jazyky.

Preliminary scope of work in English

Nowadays, there are syntactically annotated corpora (treebanks) available
for various languages (such as English, Czech, German, Hungarian, Chinese), and
new treebanks are still being created. Unfortunately, even manual annotation cannot
guarantee 100% correctness: errors may be caused by annotators' wrong decisions,
an unstable annotation scheme, vague annotation instructions, and so on.
The goal of this work is to design and implement a system for automatic
detection and classification of treebank errors. The following hypothesis can
be used: if two syntactic subtrees cover the same (or in some respect similar) sequences of
words, then the subtrees themselves should be identical (or similar) too, with the possible
exception of their root labels. The system should be tested and evaluated on
existing treebanks for at least four languages.
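
The hypothesis above can be illustrated with a minimal sketch. It is not the method the thesis is expected to deliver, only one possible reading of the exact-match case ("same sequence of words"). It assumes a toy dependency format in which each token is a (form, head index, label) tuple and every sentence is a well-formed tree. The function names subtree_spans and find_inconsistencies, the labels, and the sample sentences are all hypothetical. The "similar" case and error classification are left out.

```python
from collections import defaultdict

# Assumed toy format: a sentence is a list of (form, head, label) tuples.
# head is the 0-based index of the governing token, or -1 for the sentence root.

def subtree_spans(sentence):
    """Yield (words, signature) for each multi-word subtree with a contiguous span."""
    children = defaultdict(list)
    for i, (_, head, _) in enumerate(sentence):
        if head >= 0:
            children[head].append(i)

    def collect(i):
        # All token indices in the subtree rooted at i (assumes no cycles).
        nodes = [i]
        for child in children[i]:
            nodes.extend(collect(child))
        return nodes

    for root in range(len(sentence)):
        nodes = sorted(collect(root))
        if len(nodes) < 2:
            continue  # a single word has no internal structure to compare
        start = nodes[0]
        if nodes[-1] - start + 1 != len(nodes):
            continue  # skip subtrees whose words are not adjacent
        words = tuple(sentence[i][0] for i in nodes)
        # Heads are stored relative to the span start so that the same phrase
        # compares equal at any position; the root's head and label are ignored.
        signature = tuple(
            (None, None) if i == root else (sentence[i][1] - start, sentence[i][2])
            for i in nodes
        )
        yield words, signature


def find_inconsistencies(treebank):
    """Map each word sequence with more than one analysis to its analyses."""
    seen = defaultdict(lambda: defaultdict(list))
    for sent_id, sentence in enumerate(treebank):
        for words, signature in subtree_spans(sentence):
            seen[words][signature].append(sent_id)
    return {w: dict(sigs) for w, sigs in seen.items() if len(sigs) > 1}


# Hypothetical data: sentence 2 attaches "the" to "old" instead of "man".
treebank = [
    [("saw", -1, "Pred"), ("the", 3, "Det"), ("old", 3, "Atr"), ("man", 0, "Obj")],
    [("the", 2, "Det"), ("old", 2, "Atr"), ("man", 3, "Sb"), ("slept", -1, "Pred")],
    [("the", 1, "Det"), ("old", 2, "Atr"), ("man", 3, "Sb"), ("left", -1, "Pred")],
]

for words, analyses in find_inconsistencies(treebank).items():
    print(" ".join(words))
    for signature, sentence_ids in analyses.items():
        print("  sentences", sentence_ids, "->", signature)
```

In this toy data, "the old man" is reported because it receives two different internal structures, while the differing root labels (Obj and Sb) do not trigger a report on their own. A flagged sequence is only a candidate: deciding which analysis is wrong, or whether both are legitimate, still needs a further step such as frequency counts or manual review.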