- Technologies and Tools Used
- Introduction
- Use Cases
- Objectives
- Data Source
- Hypotheses
- Preparing the Arabic
- Our Datasets
- Arabic vs MSA models
- Dialect Classification Models
- n-grams
- Confusion Matrix
- Oversampling
- Under the hood
- Conclusions
- Key Learnings
- Next Steps
Pandas, numpy, matplotlib, seaborn, camel_tools, scipy, RegEx, WordCloud, sklearn, scikitplot, arabic_reshaper, bidi, os, codecs, imblearn, XGBoost
Unlike the languages more commonly used in modern NLP, Arabic is diglossic: it has two registers, a formal one and an informal one. The formal register, known as MSA (Modern Standard Arabic), is the same across the Arab world. The informal register, or dialect, which is what people actually speak, is very geographically specific.
Until recently it was rare to see dialectal Arabic written down, since a book, newspaper article, or academic paper would almost always be written in MSA. However, thanks in part to the advent of social media, dialect now appears in written form far more often. This presents translators (especially machine translators) with a problem: how do you know which 'Arabic' you are translating?
Being able to use the language of an Arabic text itself to ascertain which dialect it is written in would: 1- aid translators, 2- allow texts to be geographically located based on language alone, 3- give dialectologists some insight into what makes up a dialect.
The goal of this project is to create a classification model which uses the text of an Arabic tweet to ascertain 1- whether it is written in MSA or dialect and 2- which dialect region a dialect tweet comes from. I would like to outperform baseline accuracy in both cases.
The data for this project was derived from a corpus of tweets put together for the purpose of the 2nd NADI (Nuanced Arabic Dialect Identification Shared Task).
The data was collected through the Twitter API using the location data to establish province and country of origin.
(See Abdul-Mageed, M., Zhang, C., Elmadany, A., Bouamor, H., & Habash, N. (2021). NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task. ArXiv, abs/2103.08466.)
The Datasets I used were:
- Dataset 1: Country-level MSA identification: A total of 21,000 tweets in MSA, covering 21 Arab countries.
- Dataset 2: Country-level Dialect identification: A total of 21,000 tweets in dialect, covering 21 Arab countries.
My hypotheses are:
- It will be fairly easy to ascertain which tweets are in MSA and which are in dialect. This is because there are certain fairly common features of MSA which are very specific and are not shared by any of the dialects.
- It will, however, be more difficult to differentiate between specific dialects because of shared features across dialects.
- In Bag of Words models certain common words will do most of the work in differentiating dialects. Words like 'فين', which means 'where' in Egyptian, and 'لماذا', which means 'why' in MSA, will be obvious markers for a model.
Arabic and NLP don't get along very well. This is for a number of reasons:
- Script: The script goes right to left and letters take different forms depending on where they are in a word.
- Diacritics: Vowels are optional in Arabic and are more often than not omitted.
- Orthographic Ambiguity: Because of the optional vowels, it is often difficult to tell what a word is without context.
- Morphological Complexity: Words look very different depending on how they are used grammatically. This makes stemming very difficult. (Stemming is where you tell your model that although two words look different, you want to treat them as the same word - for example, ‘walk’ and ‘walks’.)
- Orthographic Inconsistency: The same words can be spelt differently (like doughnut/donut).
There are a number of things we can do to solve some of these issues and make our models better:
- Transliteration: Mapping each Arabic letter onto a letter of the Roman alphabet.
- Dediacritisation: Removing all the vowels from words, so there is no variation between words that are the same but differently vowelled.
- Normalisation: Forcing all words to be spelt the same way.
- Morphological Tokenisation: Splitting words up according to their morphological make-up (a minimal preprocessing sketch follows this list).
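The sketch below shows roughly what dediacritisation, normalisation, and transliteration look like with camel_tools. It is a minimal illustration, not the exact pipeline used in this project, and it leaves out morphological tokenisation, which in camel_tools requires a downloaded disambiguation model.

```python
# A minimal preprocessing sketch using camel_tools (illustrative only).
from camel_tools.utils.dediac import dediac_ar
from camel_tools.utils.normalize import (
    normalize_alef_ar,
    normalize_alef_maksura_ar,
    normalize_teh_marbuta_ar,
)
from camel_tools.utils.charmap import CharMapper
from camel_tools.tokenizers.word import simple_word_tokenize

def preprocess(text):
    """Dediacritise and normalise a raw Arabic string, then split it into tokens."""
    text = dediac_ar(text)                  # strip the optional vowel marks
    text = normalize_alef_ar(text)          # collapse alef variants into one form
    text = normalize_alef_maksura_ar(text)  # collapse alef maksura / ya
    text = normalize_teh_marbuta_ar(text)   # collapse teh marbuta / ha
    return simple_word_tokenize(text)

# Transliteration: the built-in 'ar2bw' mapper gives the Buckwalter romanisation.
ar2bw = CharMapper.builtin_mapper('ar2bw')
print(ar2bw.map_string('كتاب'))  # -> ktAb
```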
For our first model we want to differentiate between MSA and dialect. I labelled the datasets accordingly and had a look at the data.
Below are two WordClouds where the size of a word represents its frequency.
What they demonstrate is that although there are some differences between the two registers, the most common words are the same - they are after all the same language.
I ran a few models with different types of tokenisation. The best model I produced was the Logistic Regression model with Count Vectorisation (accuracy score 0.868 and F1 score 0.897).
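As a rough sketch of how such a model is wired up (the file name, column names, and hyperparameters here are illustrative, not the ones actually used):

```python
# Sketch of the MSA-vs-dialect model: count vectorisation feeding a logistic
# regression. File and column names are assumed for illustration.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv('msa_vs_dialect.csv')    # hypothetical combined, labelled dataset
X, y = df['tweet'], df['is_dialect']      # 1 = dialect, 0 = MSA (column names assumed)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

msa_vs_dialect = Pipeline([
    ('vectorise', CountVectorizer()),
    ('classify', LogisticRegression(max_iter=1000)),
])
msa_vs_dialect.fit(X_train, y_train)

preds = msa_vs_dialect.predict(X_test)
print(accuracy_score(y_test, preds), f1_score(y_test, preds))
```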
If we look under the hood of this model we discover a few interesting things. Here are the coefficients our model uses to classify each tweet; the larger a word appears, the more important it is to the model when classifying a tweet:
The words it uses to identify a tweet as MSA are often religious terms. The words it uses to mark something as dialect make sense to an Arabist. Amusingly, 'hahahahhaha' and variations on it are also a good predictor of dialect.
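A quick way to pull out the words behind a plot like this, assuming the fitted pipeline from the sketch above:

```python
# Extract the most influential words from the fitted logistic regression.
import numpy as np

vocab = msa_vs_dialect.named_steps['vectorise'].get_feature_names_out()
coefs = msa_vs_dialect.named_steps['classify'].coef_[0]

order = np.argsort(coefs)
print('strongest MSA markers:    ', vocab[order[:10]])        # most negative coefficients
print('strongest dialect markers:', vocab[order[-10:]][::-1])  # most positive coefficients
```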
The dataset split the dialect data points by country, which generated a severe class imbalance. In order to rectify this I grouped countries according to what I understood about Arabic dialect groups. This resulted in six dialect groups (a sketch of the country-to-group mapping follows the list):
- Maghrebi
- Nile Basin
- Levantine
- Gulf
- Yemeni
- Iraqi
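Below is a hypothetical sketch of how such a grouping can be applied. The file name, column names, country label strings, and the exact assignment of every NADI country are assumptions, shown only to illustrate the approach.

```python
# Map country labels onto broader dialect groups (illustrative assignment only).
import pandas as pd

dialect_group = {
    'Morocco': 'Maghrebi', 'Algeria': 'Maghrebi', 'Tunisia': 'Maghrebi', 'Libya': 'Maghrebi',
    'Egypt': 'Nile Basin', 'Sudan': 'Nile Basin',
    'Syria': 'Levantine', 'Lebanon': 'Levantine', 'Jordan': 'Levantine', 'Palestine': 'Levantine',
    'Saudi_Arabia': 'Gulf', 'Kuwait': 'Gulf', 'Qatar': 'Gulf', 'Bahrain': 'Gulf',
    'UAE': 'Gulf', 'Oman': 'Gulf',
    'Yemen': 'Yemeni',
    'Iraq': 'Iraqi',
    # ...remaining countries assigned to whichever group fits best
}

dialect_df = pd.read_csv('nadi_dialect_country.tsv', sep='\t')  # hypothetical file name
dialect_df['dialect_group'] = dialect_df['country'].map(dialect_group)
```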
For these models I also experimented with n-grams of different lengths and with morphological tokenisation.
n-grams define the length of the token sequences your model deals with. If your model works with 1-grams you are feeding single words into it; with 2-grams it also considers word pairs, and so on.
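A tiny illustration (in English, for readability) of what different n-gram settings actually hand to the model:

```python
# CountVectorizer with different n-gram ranges on a toy sentence.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat']

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(unigrams.get_feature_names_out())
# ['cat' 'mat' 'on' 'sat' 'the']

uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(uni_and_bi.get_feature_names_out())
# adds 'cat sat', 'on the', 'sat on', 'the cat', 'the mat'
```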
The ideal n-gram length was 1 or 2 - and there didn't seem to be much difference between them.
In order to reduce the dimensionality of my model I set a higher frequency threshold for which words/word pairs were included in it.
All my models beat the baseline (0.245) by a considerable amount, but my best model was the Naive Bayes model with Count Vectorisation, an n-gram length of 2, and morphological tokenisation (accuracy 0.586).
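A sketch of that model follows. The min_df threshold and cv value are illustrative, and the feature/label variables are assumed to come from the earlier grouping sketch.

```python
# Dialect-group model sketch: unigram + bigram counts with a minimum document
# frequency to keep dimensionality down, feeding a multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Morphologically tokenised tweets and their dialect-group labels
# (column names assumed, from the mapping sketch above).
X_dialect, y_group = dialect_df['tweet'], dialect_df['dialect_group']

dialect_nb = Pipeline([
    ('vectorise', CountVectorizer(ngram_range=(1, 2), min_df=5)),  # min_df is illustrative
    ('classify', MultinomialNB()),
])

print(cross_val_score(dialect_nb, X_dialect, y_group, cv=5).mean())
```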
However, if we take a look at how exactly this model makes its predictions, we see a considerable reluctance on the part of the model to predict the minority class - Yemeni.
I experimented with some Oversampling techniques to attempt to rectify this class imbalance but could not beat the previous best Accuracy Score.
This was undoubtedly because, even though we were artificially replicating the minority classes, the model had not seen enough variety within those classes to learn when to predict them. This shows in the results: although the model now predicts Yemeni more often, it usually predicts it incorrectly.
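For reference, this is roughly how oversampling can be wired in so that it only ever sees the training portion of each fold; the cross-validation leak mentioned under Key Learnings appears if you oversample before splitting. Variable names and parameters carry over from the earlier sketches.

```python
# Oversampling inside an imblearn Pipeline: RandomOverSampler runs on the
# training part of each cross-validation fold only, never on the held-out fold.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

oversampled_nb = Pipeline([
    ('vectorise', CountVectorizer(ngram_range=(1, 2), min_df=5)),
    ('oversample', RandomOverSampler(random_state=42)),
    ('classify', MultinomialNB()),
])

print(cross_val_score(oversampled_nb, X_dialect, y_group, cv=5).mean())
```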
In order to get a sense of how our models were assessing our tweets, let's have a look at our model's coefficients.
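A sketch of how per-class coefficients can be pulled out of a fitted multiclass linear model. The 'dialect_logreg' pipeline here is hypothetical (a logistic regression with the same step names as the earlier sketches); the Naive Bayes model exposes per-class log-probabilities rather than signed coefficients.

```python
# Top positive and negative features per dialect group for a multiclass
# linear model (hypothetical fitted pipeline 'dialect_logreg' assumed).
import numpy as np

vocab = dialect_logreg.named_steps['vectorise'].get_feature_names_out()
clf = dialect_logreg.named_steps['classify']

for group, row in zip(clf.classes_, clf.coef_):
    order = np.argsort(row)
    print(group)
    print('  strongest positive:', vocab[order[-5:]][::-1])
    print('  strongest negative:', vocab[order[:5]])
```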
These words/ phrases are for the most part very sensible classifiers of their various dialects. A few things stand out:
- Strangely, laughing appears as a pretty strong negative coefficient for Nile Basin Arabic.
- The negative coefficients are far stronger than the positive ones in Yemeni.
- The negative coefficients in Yemeni Arabic are not Yemeni words, they are words used in all dialects. This supports the theory that the model has simply not seen enough Yemeni tweets to ascertain what makes it unique.
- It is fairly simple to differentiate between MSA and dialect.
- The geographical variation in dialect lends itself readily to geographical classification.
- Our model suffered from class imbalance. Dialect classification needs more data to work better.
- Oversampling does not solve class imbalance in this case.
- Morphological Tokenisation, n-grams, and reducing dimensionality improve model performance.
- Due to the extreme dimensionality of my models, I was unable to run some of the more complex models/GridSearches.
- I overcame the difficulty of adapting NLP tools designed for languages that are far simpler morphologically and orthographically than Arabic.
- I attempted to overcome a class imbalance through Oversampling - although I was not successful, I learnt that a lack of variety in the data cannot be rectified by Random Oversampling.
- I noticed a data leak across the folds of my cross-validation when using Oversampling and sealed it by placing the oversampling step inside a pipeline.
- I explored the efficacy of different n-gram lengths, morphological tokenisation, and different models.
- In future I will use cloud computing to execute models with such massive dimensionality.
- I have learnt that Bag of Words models work fairly well for Arabic dialect classification, but due to limitations with the tools available for Arabic NLP it requires a lot more manual nudging on the part of the Data Scientist.
- Clustering, to test the hypothesis that geographical proximity correlates with language similarity.
- Experiment with Vector Embedding in Arabic.
- Model using AraBERT and AraELECTRA.
- Experiment with Deep Learning.