Thesis topics (Thesis selection)


Orthography Standardization in Arabic Dialects

Thesis title in Czech:	Normalizace pravopisu v arabských dialektech
Thesis title in English:	Orthography Standardization in Arabic Dialects
Keywords in Czech:	kontrola pravopisu|automatické opravy|arabština|dialekt
Keywords in English:	spell checking|automatic corrections|Arabic|dialect
Academic year of topic announcement:	2020/2021
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	RNDr. Daniel Zeman, Ph.D.
Author:	hidden - assigned and confirmed by the Student Affairs Department
Date of registration:	22.04.2021
Date of assignment:	22.04.2021
Date of confirmation by the Student Affairs Department:	27.04.2021
Date and time of defence:	08.09.2021 09:00
Date of electronic submission:	09.08.2021
Date of printed submission:	06.08.2021
Date of completed defence:	08.09.2021
Opponents:	Mgr. Pavel Straňák, Ph.D.

Guidelines

The Arab world has a wide array of dialects: non-standard varieties of Arabic that are spoken natively, and increasingly written on social media, across the region. Dialectal Arabic (DA) differs significantly from Modern Standard Arabic (MSA), the medium of choice for news, administrative, and literary texts. Many varieties of DA are not even mutually intelligible. It is therefore important that computational processing methods are not restricted to MSA but also work well with DA. One major obstacle is that DA, being mainly spoken, lacks a standard orthography. (An existing proposal for standard DA spelling, called CODA*, is used for research purposes but not by ordinary users of DA.) In addition, Arabic speakers tend to code-switch between MSA and their dialects when they write. The data available for research come mainly from very noisy social media web scrapes; in the absence of a standard orthography, authors invent their own spelling as they write.

The goal of the thesis is to propose, implement, and evaluate a method that automatically converts noisy DA text to a predefined standard based on CODA*. One possible approach is to treat the problem as machine translation from noisy text to the standard form, using sequence-to-sequence recurrent neural networks and related machine learning techniques. These data-driven methods can be compared with rule-based approaches. Test data will be obtained from recently released DA corpora, such as the Gumar Corpus. The method will be tested with at least one DA variety; testing on multiple dialects is recommended if time permits.
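As an illustration only, the rule-based side of such a comparison might start from a sketch like the one below. It is not part of the assignment and does not implement CODA*: the three rules (stripping diacritics, removing tatweel, collapsing runs of three or more identical characters) are generic noise-reduction steps chosen here as assumptions, and the function names `normalize` and `word_accuracy` are hypothetical. The accuracy measure assumes that the system output and the reference have the same tokenization, which real data may not satisfy.

```python
import re

# Arabic diacritics (tashkeel), U+064B to U+0652: tanween, short vowels, shadda, sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")
# Tatweel (kashida), a decorative elongation character with no linguistic content.
TATWEEL = "\u0640"
# Any character repeated three or more times in a row (expressive lengthening).
REPEATS = re.compile(r"(.)\1{2,}")


def normalize(text: str) -> str:
    """Apply a few generic noise-reduction rules (illustrative, not CODA* rules)."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    # Assumption: a run of 3+ identical characters is reduced to a single one,
    # which can be wrong when the standard form has a doubled letter.
    text = REPEATS.sub(r"\1", text)
    # Collapse any remaining whitespace runs and trim the ends.
    return " ".join(text.split())


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of reference words matched at the same position in the prediction."""
    pred_words = predicted.split()
    ref_words = reference.split()
    if not ref_words:
        return 0.0
    matches = sum(p == r for p, r in zip(pred_words, ref_words))
    return matches / len(ref_words)


if __name__ == "__main__":
    noisy = "حلووووو"      # lengthened spelling
    reference = "حلو"      # hypothetical reference form, for illustration only
    cleaned = normalize(noisy)
    print(cleaned, word_accuracy(cleaned, reference))
```

A baseline of this kind mainly gives the data-driven models a reference point: any sequence-to-sequence system should at least beat a handful of hand-written rules on the same test set.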

References

* Nizar Habash et al. (2019). Unified Guidelines and Resources for Arabic Dialect Orthography

* Salam Khalifa et al. (2018). A Morphologically Annotated Corpus of Emirati Arabic. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1607

* Pravallika Etoori, Manoj Chinnakotla, Radhika Mamidi (2018). Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning. In: Proceedings of ACL 2018, Student Research Workshop. Melbourne, Australia. https://www.aclweb.org/anthology/P18-3021

* Marcelo Yuji Himoro, Antonio Pareja-Lora (2020). Towards a Spell Checker for Zamboanga Chavacano Orthography. In: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France. https://www.aclweb.org/anthology/2020.lrec-1.327