Languages such as German, Dutch, and Swedish tend to form (arbitrarily) long compounds, which poses problems for many Natural Language Processing tasks (including Machine Translation and Information Retrieval). In many cases it is helpful to split compounds into smaller parts ("decompounding") before further processing. Decompounding systems have been built for specific languages (mostly German); examples of existing systems are JWord Splitter, Banana Splitter, the ASV Toolbox (Biemann et al., 2008), and the method presented in Larson et al. (2000). The goal of this thesis is to build on previous work and to develop and evaluate a method for word decompounding that can be easily adapted to new languages, based on unsupervised or semi-supervised approaches. Depending on the results, the work can be extended with methods for the reverse operation: creating compounds in the output of Machine Translation.
References
Martha Larson, Daniel Willett, Joachim Köhler and Gerhard Rigoll. Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches. In Proceedings of the Sixth International Conference on Spoken Language Processing (ICSLP 2000 / INTERSPEECH 2000), Beijing, China, October 16-20, 2000.
Chris Biemann, Uwe Quasthoff, Gerhard Heyer and Florian Holz. ASV Toolbox: a Modular Collection of Language Exploration Tools. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008.