Orthography Standardization in Arabic Dialects
Thesis title in Czech: | Normalizace pravopisu v arabských dialektech |
---|---|
Thesis title in English: | Orthography Standardization in Arabic Dialects |
Keywords in Czech: | kontrola pravopisu, automatické opravy, arabština, dialekt |
Keywords in English: | spell checking, automatic corrections, Arabic, dialect |
Academic year of topic announcement: | 2020/2021 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | RNDr. Daniel Zeman, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Department |
Date of registration: | 22.04.2021 |
Date of assignment: | 22.04.2021 |
Date of confirmation by the Study Department: | 27.04.2021 |
Date and time of defence: | 08.09.2021 09:00 |
Date of electronic submission: | 09.08.2021 |
Date of printed submission: | 06.08.2021 |
Date of defence held: | 08.09.2021 |
Opponents: | Mgr. Bc. Pavel Straňák, Ph.D. |
Guidelines |
The Arab world has a wide array of dialects: non-standard varieties of Arabic that are spoken natively, and increasingly written on social media, across the region. Dialectal Arabic (DA) differs significantly from Modern Standard Arabic (MSA), which is the medium of choice for news, administrative, and literary topics. Many varieties of DA are not even mutually intelligible. It is therefore important that computational processing methods are not restricted to MSA and work well with DA, too. One major obstacle is that DA lacks a standard orthography, since it is mainly spoken. (An existing proposal for standard DA spelling, called CODA*, is used for research purposes but not by ordinary users of DA.) Additionally, Arabic speakers tend to code-switch between MSA and their dialects when they write. The data available for research mainly come in the form of very noisy social media web scrapes; in the absence of a standard orthography, authors invent their own spelling as they write.
The goal of the thesis is to propose, implement and evaluate a method that will automatically convert noisy DA text to a pre-defined standard based on CODA*. Possible ways of achieving the goal include treating the problem as machine translation from noisy text to the standard form, using sequence-to-sequence recurrent neural networks and related machine learning techniques. These data-driven methods can be compared to rule-based approaches. Data for testing will be obtained from recently released DA corpora, such as the Gumar Corpus. The method will be tested on at least one DA variety (testing it on multiple dialects is recommended if time permits). |
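The rule-based side of the comparison mentioned above could start from a baseline as small as the following sketch. The two rewrite rules (removing the tatweel elongation character and collapsing runs of three or more identical characters), the function names `normalize` and `word_accuracy`, and the one-to-one token alignment assumed by the metric are all illustrative assumptions. They are not CODA* rules and not the method of the thesis.

```python
import re

# Assumption: these two rules are generic noise-reduction steps chosen for
# illustration only; they are NOT taken from the CODA* guidelines.
TATWEEL = "\u0640"                   # Arabic elongation character (kashida)
REPEATS = re.compile(r"(.)\1{2,}")   # any character repeated 3 or more times


def normalize(text: str) -> str:
    """Apply the illustrative rewrite rules to a noisy input string."""
    text = text.replace(TATWEEL, "")   # drop decorative elongation
    text = REPEATS.sub(r"\1", text)    # collapse emphatic repetition to one char
    return " ".join(text.split())      # squeeze whitespace


def word_accuracy(predicted: str, gold: str) -> float:
    """Fraction of whitespace tokens that match the gold standard.

    Assumes predicted and gold text have the same number of tokens, which
    does not hold when normalization splits or merges words.
    """
    pred_tokens = predicted.split()
    gold_tokens = gold.split()
    if len(pred_tokens) != len(gold_tokens):
        raise ValueError("token counts differ; an alignment step is needed")
    if not gold_tokens:
        return 1.0  # nothing to get wrong
    matches = sum(p == g for p, g in zip(pred_tokens, gold_tokens))
    return matches / len(gold_tokens)


if __name__ == "__main__":
    noisy = "جمييييل جـــدا"
    print(normalize(noisy))
    print(word_accuracy(normalize(noisy), "جميل جدا"))
```

A data-driven system, such as a sequence-to-sequence model, could then be scored with the same `word_accuracy` function on the same held-out sentences, so that both approaches are compared on equal terms.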
References |
* Nizar Habash et al. (2019). Unified Guidelines and Resources for Arabic Dialect Orthography
* Salam Khalifa et al. (2018). A Morphologically Annotated Corpus of Emirati Arabic. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1607
* Pravallika Etoori, Manoj Chinnakotla, Radhika Mamidi (2018). Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning. In: Proceedings of ACL 2018, Student Research Workshop. Melbourne, Australia. https://www.aclweb.org/anthology/P18-3021
* Marcelo Yuji Himoro, Antonio Pareja-Lora (2020). Towards a Spell Checker for Zamboanga Chavacano Orthography. In: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France. https://www.aclweb.org/anthology/2020.lrec-1.327 |