Thesis (Selection of subject)
Thesis details
Multilingual Semantic Annotation with Focus on a Low-Resource Language
Thesis title in Czech: Vícejazyčná sémantická anotace se zaměřením na jazyk s nedostatečnými zdroji
Thesis title in English: Multilingual Semantic Annotation with Focus on a Low-Resource Language
Key words: sémantika|syntax|jazykové zdroje|anotovaný korpus|jazyky s nedostatečnými zdroji
English key words: semantics|syntax|language resources|annotated corpus|low-resource languages
Academic year of topic announcement: 2020/2021
Thesis type: dissertation
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: RNDr. Daniel Zeman, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 18.02.2021
Date of assignment: 18.02.2021
Confirmed by Study dept. on: 25.05.2021
Guidelines
Semantic analysis (parsing) of natural language relies on richly annotated language data (corpora), which are not easy to obtain. Semantically annotated corpora exist for a few languages, such as Czech and English, while the majority of languages lack resources of this kind. The topic of this PhD project has two interrelated branches. In the linguistic branch, a scheme of semantic annotation will be proposed that is applicable to multiple typologically diverse languages, and selected semantic phenomena will be annotated on data from a language that lacks substantial resources of this kind. The phenomena to address will be selected during the initial stage of the project; options include anchored named entities, semantic roles, coreference, etc.

In the computational branch, multilingual and cross-lingual transfer techniques will be explored that can leverage existing resources in other languages and help with bootstrapping the resources in the target language. Such techniques have been proposed and tested (with varying levels of success) for morphological tagging, surface-syntactic parsing and semantic role labeling. In this work they should be tested and adapted for the language phenomena addressed in the linguistic branch of the project. The expected target language is Persian, but the applicability of the annotation scheme and the transfer techniques to other languages should be borne in mind and occasionally tested. Existing rich resources in Czech and English can be used as inspiration for the annotation scheme and as a source for the transfer experiments.
References
Jan Hajič, Eva Hajičová, Marie Mikulová, Jiří Mírovský. 2017. Prague Dependency Treebank. In: Handbook of Linguistic Annotation. Springer, pages 555-594.

Zdeněk Žabokrtský, Daniel Zeman, Magda Ševčíková. 2020. Sentence Meaning Representations Across Languages: What Can We Learn from Existing Frameworks? Computational Linguistics, vol. 46, no. 3, pages 605-665, September 2020, https://www.mitpressjournals.org/toc/coli/46/3

Kira Droganova, Daniel Zeman. 2019. Towards Deep Universal Dependencies. In: Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, Syntaxfest 2019), pages 144-152, Association for Computational Linguistics, Paris, France, ISBN 978-1-950737-63-5

Maryam Aminian, Mohammad Sadegh Rasooli, Mona Diab. 2020. Multitask Learning for Cross-Lingual Transfer of Broad-coverage Semantic Dependencies. In: Proceedings of EMNLP.

Rui Cai, Mirella Lapata. 2020. Alignment-free Cross-lingual Semantic Role Labeling. In: Proceedings of EMNLP.

Angel Daza, Anette Frank. 2020. X-SRL: A Parallel Cross-Lingual Semantic Role Labeling Dataset. In: Proceedings of EMNLP.

Jan A. Botha, Zifei Shan, Daniel Gillick. 2020. Entity Linking in 100 Languages. In: Proceedings of EMNLP.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html