Thesis (Selection of subject)

Modelling compounds for multilingual language data resources

Thesis title in Czech:	Modelování kompozit pro vícejazyčné zdroje jazykových dat
Thesis title in English:	Modelling compounds for multilingual language data resources
Key words:	kompozitum, slovotvorba, základové slovo, zdroj jazykových dat, vícejazyčný
English key words:	compound, word-formation, base word, language data resource, multilingual
Academic year of topic announcement:	2020/2021
Thesis type:	dissertation
Thesis language:	Czech
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. Mgr. Magda Ševčíková, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	08.09.2020
Date of assignment:	08.09.2020
Confirmed by Study dept. on:	30.09.2020
Date and time of defence:	27.09.2024 10:40
Date of electronic submission:	31.07.2024
Date of submission of printed version:	01.08.2024
Date of completed defence:	27.09.2024
Opponents:	RNDr. Jiří Hana, Ph.D.
	prof. Nabil Hathout

Guidelines

Compounds, defined generally as words based on more than one word (e.g., En. sun+flower > sunflower, Cz. ryba ‘fish’+lov ‘hunt’ > rybolov ‘fishery’), are an inherent part of existing language data resources. Their delimitation, however, differs considerably across languages, depending both on the grammatical structure of each language and on the particular linguistic tradition (Lieber & Štekauer 2011, Štekauer et al. 2012). The goal of the thesis is to elaborate a workable definition and representation of compound words that is robust and general enough to cover a number of typologically diverse languages, and that is both understandable to humans and implementable in multilingual language data resources (e.g., Kyjánek et al. 2019).
The thesis will deal with the identification of base words for compounds, aiming to delineate the boundaries between compounding and other word-formation processes (in particular, derivation and blending) and between compounding and syntax (cf. Russian город-сад ‘garden-city’ or English words with multiple spelling variants such as flowerpot / flower-pot / flower pot). The intra-word analysis will focus on both the syntactic and the semantic relationships between the parts of a compound; cf. German [Schule ‘school’+Jahr ‘year’]+Ende ‘end’ > Schuljahresende ‘end of the school year’, Cz. modrý ‘blue’+oko ‘eye’ > modrooký ‘blue-eyed’ (Scalise & Vogel 2010, Štichauer 2013). Extending the multilingual resources with a coherent compound annotation and classification will make the resulting data usable in linguistic typological studies as well as in Natural Language Processing tasks, e.g., when dealing with out-of-vocabulary words.
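The nested analysis [Schule+Jahr]+Ende > Schuljahresende can be illustrated with a minimal sketch of a data structure in which a compound points to two or more base words, each of which may itself be a compound. This is only an illustration under stated assumptions, not the representation developed in the thesis: the class names Word and Compound, the helper functions, and the intermediate lemma Schuljahr are hypothetical and introduced here for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Word:
    """A simple (non-compound) base word. Hypothetical class for illustration."""
    lemma: str


@dataclass(frozen=True)
class Compound:
    """A word with two or more base words; each base may itself be a compound."""
    lemma: str
    parents: Tuple[Union[Word, "Compound"], ...]


Node = Union[Word, Compound]


def leaf_lemmas(node: Node) -> List[str]:
    """Return the simple base words of a node, left to right."""
    if isinstance(node, Word):
        return [node.lemma]
    leaves: List[str] = []
    for parent in node.parents:
        leaves.extend(leaf_lemmas(parent))
    return leaves


def bracketing(node: Node) -> str:
    """Render the internal structure in bracket notation, e.g. [[A+B]+C]."""
    if isinstance(node, Word):
        return node.lemma
    return "[" + "+".join(bracketing(p) for p in node.parents) + "]"


# Assumed intermediate step: Schule + Jahr > Schuljahr, then Schuljahr + Ende.
schuljahr = Compound("Schuljahr", (Word("Schule"), Word("Jahr")))
schuljahresende = Compound("Schuljahresende", (schuljahr, Word("Ende")))

print(bracketing(schuljahresende))
print(leaf_lemmas(schuljahresende))
```

Allowing more than one parent per word is the design choice that separates this sketch from a purely derivational tree, where each word has at most one base word.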

References

Kyjánek, L., Žabokrtský, Z., Ševčíková, M. & Vidra, J. (2019). Universal Derivations Kickoff: A Collection of Harmonized Derivational Resources for Eleven Languages. In Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology (DeriMo 2019). Praha: ÚFAL MFF UK, pp. 101–110.
Lieber, R. & Štekauer, P. (eds.; 2011). The Oxford Handbook of Compounding. Oxford: Oxford University Press.
Scalise, S. & Vogel, I. (eds.; 2010). Cross-disciplinary issues in compounding. Amsterdam: Benjamins.
Štekauer, P., Valera, S. & Körtvélyessy, L. (2012). Word-Formation in the World’s Languages. Cambridge: Cambridge University Press.
Štichauer, P. (2013). Je možná nová klasifikace českých kompozit? Časopis pro moderní filologii, 95(2), 113–128.