Thesis (Selection of subject)

Orthography Standardization in Arabic Dialects

Thesis title in Czech:	Normalizace pravopisu v arabských dialektech
Thesis title in English:	Orthography Standardization in Arabic Dialects
Key words:	kontrola pravopisu | automatické opravy | arabština | dialekt
English key words:	spell checking | automatic corrections | Arabic | dialect
Academic year of topic announcement:	2020/2021
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	RNDr. Daniel Zeman, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	22.04.2021
Date of assignment:	22.04.2021
Confirmed by Study dept. on:	27.04.2021
Date and time of defence:	08.09.2021 09:00
Date of electronic submission:	09.08.2021
Date of submission of printed version:	06.08.2021
Date of proceeded defence:	08.09.2021
Opponents:	Mgr. Pavel Straňák, Ph.D.

Guidelines

The Arab world has a wide array of dialects: non-standard varieties of Arabic that are spoken natively and increasingly written on social media. Dialectal Arabic (DA) differs significantly from Modern Standard Arabic (MSA), which is the medium of choice for news, administrative, and literary texts. Many varieties of DA are not even mutually intelligible. It is therefore important that computational processing methods are not restricted to MSA and work well with DA, too. One major obstacle is that DA, being mainly spoken, lacks a standard orthography. (An existing proposal for standard DA spelling, called CODA*, is used for research purposes but not by ordinary users of DA.) Additionally, Arabic speakers tend to code-switch between MSA and their dialects when they write. The data available for research mostly come as very noisy social media web scrapes; in the absence of a standard orthography, authors invent their own spelling as they write.

The goal of the thesis is to propose, implement, and evaluate a method that automatically converts noisy DA text to a pre-defined standard based on CODA*. One possible approach is to treat the problem as machine translation from noisy text to the standard form, using sequence-to-sequence recurrent neural networks and related machine learning techniques. These data-driven methods can be compared with rule-based approaches. Test data will be obtained from recently released DA corpora, such as the Gumar Corpus. The method will be tested on at least one DA variety (testing on multiple dialects is recommended if time permits).
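To make the rule-based comparison point more concrete, the following is a minimal sketch of what a trivial rule-based baseline and a word-level evaluation might look like. It is only an illustration under stated assumptions: the three rules (removing the tatweel character, stripping short-vowel diacritics, collapsing letters repeated three or more times) are generic social-media clean-up heuristics, not CODA* rules, and the names `normalize` and `word_accuracy` are hypothetical. The metric also assumes that the system output and the reference have the same number of tokens, which a real conversion to CODA* would not guarantee.

```python
import re

# Assumption: these three rules are generic noise-reduction heuristics
# chosen for illustration; they are NOT taken from the CODA* guidelines.
TATWEEL = "\u0640"                          # Arabic elongation character
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
ELONGATION = re.compile(r"(.)\1{2,}")       # same character 3+ times in a row


def normalize(text: str) -> str:
    """Apply the illustrative clean-up rules to a noisy string."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    # Collapse expressive lengthening to a single letter.
    text = ELONGATION.sub(r"\1", text)
    return text


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace tokens that match the reference exactly.

    Assumes both strings have the same number of tokens.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if len(pred_tokens) != len(ref_tokens):
        raise ValueError("token counts differ; alignment is required first")
    if not ref_tokens:
        return 1.0
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)


if __name__ == "__main__":
    noisy = "حلوووو جـــميل"
    print(normalize(noisy))
```

A sequence-to-sequence model could be scored with the same `word_accuracy` function once its output is aligned to the reference, which would put the data-driven and rule-based systems on a common footing.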

References

* Nizar Habash et al. (2019). Unified Guidelines and Resources for Arabic Dialect Orthography

* Salam Khalifa et al. (2018). A Morphologically Annotated Corpus of Emirati Arabic. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1607

* Pravallika Etoori, Manoj Chinnakotla, Radhika Mamidi (2018). Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning. In: Proceedings of ACL 2018, Student Research Workshop. Melbourne, Australia. https://www.aclweb.org/anthology/P18-3021

* Marcelo Yuji Himoro, Antonio Pareja-Lora (2020). Towards a Spell Checker for Zamboanga Chavacano Orthography. In: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France. https://www.aclweb.org/anthology/2020.lrec-1.327