Thesis (Selection of subject)

Transcriptor

Thesis title in Czech:	Transkriptor
Thesis title in English:	Transcriptor
Key words:	transkripce, transliterace, fonetická abeceda
English key words:	transcription, transliteration, phonetic alphabet
Academic year of topic announcement:	2014/2015
Thesis type:	Bachelor's thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	RNDr. Daniel Zeman, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	16.03.2015
Date of assignment:	16.03.2015
Confirmed by Study dept. on:	23.03.2015

Guidelines

Transcription of natural language text from one script to another is needed for various tasks such as:

- transcription of foreign personal or geographical names for use in a language other than the original
- pronunciation guide for foreigners
- input method on computers and other devices lacking a keyboard for the target script

Transcription, in contrast to transliteration, does not necessarily mean a one-to-one mapping between sets of characters. Transcription focuses on capturing the pronunciation using the spelling rules of another script AND language. For instance, transcription of the Russian name Чайковский into the Latin script may result in Chaikovsky, Tchaïkovski, Tschaikowski or Čajkovskij, among others, depending on the target language. This focus on pronunciation can be exploited by decomposing transcription into models of the pronunciation of each language involved, using the International Phonetic Alphabet (IPA). For a language L1, we could model the mapping between sequences of its characters and sequences of IPA symbols. We could then combine the models of two languages so that L1 → IPA → L2 renders the desired transcription L1 → L2.
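
The L1 → IPA → L2 composition can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the greedy longest-match strategy, and the tiny rule tables, which cover only the letters of the example name and ignore stress, vowel reduction and capitalization. It is not a proposed design for the thesis system.

```python
# Toy rule tables. Every mapping here is an illustrative assumption,
# not a complete or authoritative description of any language.
RU_TO_IPA = {"ч": "tɕ", "а": "a", "й": "j", "к": "k",
             "о": "o", "в": "v", "с": "s", "и": "i"}
IPA_TO_CS = {"tɕ": "č"}
IPA_TO_EN = {"tɕ": "ch", "ij": "y", "j": "i"}
IPA_TO_DE = {"tɕ": "tsch", "ij": "i", "j": "i", "v": "w"}


def apply_rules(text, rules):
    """Rewrite text left to right, always trying the longest rule first."""
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matches here: copy the symbol unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def transcribe(text, to_ipa, from_ipa):
    """Compose two models: source script -> IPA -> target spelling."""
    ipa = apply_rules(text.lower(), to_ipa)
    return apply_rules(ipa, from_ipa)


name = "Чайковский"
print(apply_rules(name.lower(), RU_TO_IPA))      # intermediate IPA-like string
print(transcribe(name, RU_TO_IPA, IPA_TO_CS))
print(transcribe(name, RU_TO_IPA, IPA_TO_EN))
print(transcribe(name, RU_TO_IPA, IPA_TO_DE))
```

With these toy tables, the same intermediate string yields a different spelling for each target language, which is the point of routing the transcription through IPA. Context-dependent rules are approximated here only by multi-symbol keys such as "ij"; a real rule formalism would likely need explicit contexts.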

The goal of the thesis is to test the approach on at least three languages, two of which use the Latin script and one of which uses a different script. A minimal solution involves the following:

- Design and implement a general system that transcribes text according to user-supplied rules.
- As an interface to the transcription system, implement a web-based application. It should provide means for designing transcription rules, importing sets of rules and applying sets of rules to user-supplied text or existing websites.
- Create a rule-based model of pronunciation of each language (i.e. bi-directional mapping Lx ↔ IPA).
- Create (or find online) test data with transcriptions for evaluation purposes.
- Use the models to test and evaluate all six transcription directions (or more if more languages are covered). Analyze the results in the thesis.
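
The import, inversion and evaluation steps listed above could be sketched as follows. The tab-separated rule format, the function names, the language codes and the choice of word accuracy as the metric are all hypothetical; the guidelines do not prescribe any of them.

```python
from itertools import permutations


def parse_rules(text):
    """Parse a hypothetical rule file: one 'source<TAB>target' pair per line.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split("\t")
        rules[source] = target
    return rules


def invert(rules):
    """Derive the reverse mapping (e.g. IPA -> Lx from Lx -> IPA).

    This loses information when two sources share one target, so a real
    bi-directional model would probably need hand-written reverse rules.
    """
    return {target: source for source, target in rules.items()}


def word_accuracy(transcribe, test_pairs):
    """Fraction of test items whose transcription equals the reference."""
    correct = sum(1 for source, gold in test_pairs if transcribe(source) == gold)
    return correct / len(test_pairs)


# Three languages give 3 * 2 = 6 ordered transcription directions.
languages = ["cs", "en", "ru"]  # assumed language codes
directions = list(permutations(languages, 2))

rules = parse_rules("# toy rules\nч\ttɕ\nв\tv\n")
print(rules)
print(invert(rules))
print(len(directions), directions)

# Placeholder transcriber and data, only to show how the metric is called.
print(word_accuracy(str.upper, [("a", "A"), ("b", "B"), ("c", "x")]))
```

Exact-match word accuracy is the simplest possible measure; softer character-level metrics may be more informative when many outputs are almost right.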

An optional enhancement would be to add a machine-learning module that would learn transcription rules and/or the context of their application from human-transcribed training data. A pre-existing, downloadable library implementing a machine-learning algorithm can be used for this purpose; the student would implement the pre- and postprocessing of the data. The focus of this enhancement would be on research rather than programming: what is the best way of preparing the training model in order to get good transcription rules?

References

Min Zhang, A Kumaran, Haizhou Li: Whitepaper of NEWS 2011 Shared Task on Machine Transliteration, 2011