Thesis (Selection of subject) (version: 368)
Thesis details
Splitting word compounds
Thesis title in Czech: Metody pro rozdělování slovních složenin
Thesis title in English: Splitting word compounds
Key words: zpracování přirozeného jazyka, slovní složeniny, rozdělování slovních složenin
English key words: natural language processing, word compounds, decompounding
Academic year of topic announcement: 2015/2016
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Pavel Pecina, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 06.04.2016
Date of assignment: 07.04.2016
Confirmed by Study dept. on: 17.05.2016
Date and time of defence: 30.01.2017 00:00
Date of electronic submission: 04.01.2017
Date of submission of printed version: 04.01.2017
Date on which the defence took place: 30.01.2017
Opponents: RNDr. Jaroslava Hlaváčová, Ph.D.
 
 
 
Guidelines
Languages such as German, Dutch, and Swedish tend to form (arbitrarily) long compounds, which poses problems for many Natural Language Processing tasks (including Machine Translation and Information Retrieval). In many cases it can be helpful to split compounds into smaller parts ("decompounding") before further processing. Decompounding systems have been built for specific languages (mostly German); examples of existing systems are JWord Splitter, Banana Splitter, the ASV Toolbox (Biemann et al., 2008), and the method presented in Larson et al. (2000). The goal of this thesis is to build on previous work and to develop and evaluate a decompounding method, based on unsupervised or semi-supervised approaches, that can be easily adapted to new languages. Depending on the results, the work can be extended with methods for the reverse operation: creating compounds in the output of Machine Translation.
References
Martha Larson, Daniel Willett, Joachim Köhler and Gerhard Rigoll. Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches. In Proceedings of the Sixth International Conference on Spoken Language Processing (ICSLP 2000 / INTERSPEECH 2000), Beijing, China, October 16-20, 2000.

Chris Biemann, Uwe Quasthoff, Gerhard Heyer and Florian Holz. ASV Toolbox: a Modular Collection of Language Exploration Tools. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html