The Arabic language provides an extraordinary wealth of comparative material. It functions in four major registers: the daily language (Colloquial Arabic), the language of media (Modern Standard Arabic), the language of literature (Classical Arabic), and the religious language (Qurʾānic Arabic), all four overlapping to various degrees. Further, Colloquial Arabic exists in a multitude of dialects. These differ greatly not just between regions (e.g. North Africa / Eastern Mediterranean), but also within regions (e.g. Lebanon / Palestine), within countries (e.g. Nablus / al-Khalīl), and even within smaller areas (e.g. city / villages). Historically, the dialects tended to be used only in speech, not in writing. However, this changed dramatically with the advent of the digital age. Social media now carries many written expressions of Colloquial Arabic, with widely varying orthography.
In the context of recent AI developments, the following question arises: given that they are largely trained on Modern Standard Arabic, how good are Arabic Large Language Models at handling low-resource Arabic dialects? This question is explored in our RSE-funded project “Al-ʿĀmmīyah (Colloquial Arabic) and Generative AI – a snapshot of its emerging text-to-text abilities” (co-applicants: J. Zbrzezny (PI), E. Reiter, W. Zhao). The paper presents our experiments with three popular models (gpt-4o-mini, gpt-4o, gemini-2.0) and discusses how their outputs, generated from and into dialect with specially engineered prompts, were evaluated by our Palestinian team in the West Bank (A. Hroub et al.). The preliminary results of the evaluation show (1) the weakness of the models in creating authentic written expressions of local dialects, but also (2) their ability to level dialects towards Modern Standard Arabic. This is particularly concerning given the widespread use of autocorrect and predictive text functions on smartphone keyboards, especially among younger generations, which, in the long term, will affect the inner diversity of Arabic, making it a linguistically poorer language.
- Speaker: Jakub Zbrzezny
- Venue: Meston G05 and Microsoft Teams