Dialectal Arabic Tools comprises the modules developed by the ALT team at the Qatar Computing Research Institute (QCRI) to handle Dialectal Arabic segmentation, POS tagging, diacritization, and more.
The segmentation module of Dialectal Arabic Tools
Dialectal Arabic Tools is compatible with: Python 2.7-3.5 or later.
Before you can use Dialectal Arabic Tools, you need to install a special version of Keras that includes a CRF layer. It is better to do installations within a virtual environment. The following web page shows how to create a virtual environment in a few straightforward steps. Use the following bash command to install the Keras version:
pip install git+https://github.com/phipleg/keras@crf
You can install Dialectal Arabic Tools by either:
- using pip (recommended)
- cloning this repo and using setup.py
Use the following bash command to install the package from the Python Package Index (PyPI):
pip install dialectal_arabic_tools
Clone the repo from GitHub using the following command:
git clone https://github.com/qcri/dialectal_arabic_tools.git
Alternatively, download the compressed file of the project and extract it. In either case, change to the project directory and run the following command to install Dialectal Arabic Tools:
python setup.py install
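After either installation route, a quick import check can confirm that the packages are visible to the current interpreter. The snippet below is a minimal sketch, not part of the package: `is_installed` is a hypothetical helper, and the module names are taken from the import statements and install commands in this README.

```python
import importlib


def is_installed(module_name):
    """Return True if the module can be imported, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Module names assumed from this README: the Keras fork and the package itself.
for name in ("keras", "dialectal_arabic_tools"):
    status = "found" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```

If either module is reported as missing, make sure the virtual environment used for the installation is the one that is currently active.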
The Dialectal Arabic Tools package is easy to use. The following code snippet uses the dialectal segmentation module to segment a string of Arabic script encoded in UTF-8:
>>> from dialectal_arabic_tools.segmentation import segmenter
>>> segmenter.segment_text(u"عنا تنتين بندورة جبلية وخمسة عروقة نعنع بيعملو سلطة .. شلوني معك؟")
'عنا تنتين بندور+ة جبلي+ة و+خمس+ة عروق+ة نعنع ب+يعمل+و سلط+ة شلون+ي مع+ك ؟'
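In the sample output above, the morphemes of a token appear to be joined by "+" and tokens are separated by spaces. Assuming that format holds in general, the segmented string can be broken into per-token morpheme lists with plain Python. `split_morphemes` below is a hypothetical helper for illustration, not a function of the package.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def split_morphemes(segmented_text, marker="+"):
    """Return one list of morphemes per whitespace-separated token.

    Assumes `marker` joins the morphemes of a token, as in the sample output.
    """
    return [token.split(marker) for token in segmented_text.split()]


# A fragment of the segmented output shown above.
segmented = "عنا تنتين بندور+ة جبلي+ة و+خمس+ة"
morphemes_per_token = split_morphemes(segmented)

# Number of morphemes in each token.
print([len(morphemes) for morphemes in morphemes_per_token])  # [1, 1, 2, 2, 3]
```

Note that a literal "+" in the input text would be indistinguishable from a segmentation marker under this simple approach.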
Furthermore, you can use the segmentation module to segment a text file of Arabic script encoded in UTF-8. Just use segment_file instead of segment_text. The segment_file function requires two positional parameters: the file to be segmented and the name of the file to write the output to.
>>> from dialectal_arabic_tools.segmentation import segmenter
>>> segmenter.segment_file(r'/path/to/text/file/you/need/to/segment.txt', r'output/file/path.txt')
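When the sentences live in memory rather than on disk, they can be passed through a file-based function by writing them to a temporary UTF-8 file first. The following is a sketch under stated assumptions: `segment_sentences` is a hypothetical wrapper, it takes the file-based function as a parameter (for example `segmenter.segment_file`), and it assumes the output file is UTF-8 text that can be read back line by line, which this README does not state explicitly.

```python
import io
import os
import shutil
import tempfile


def segment_sentences(sentences, segment_file):
    """Run unicode sentences through a callable(input_path, output_path).

    Writes one sentence per line to a temporary UTF-8 file, calls
    `segment_file`, and returns the lines of the output file.
    """
    work_dir = tempfile.mkdtemp()
    try:
        input_path = os.path.join(work_dir, "input.txt")
        output_path = os.path.join(work_dir, "output.txt")
        with io.open(input_path, "w", encoding="utf-8") as handle:
            for sentence in sentences:
                handle.write(sentence + u"\n")
        segment_file(input_path, output_path)
        # Assumption: the output is UTF-8 text, read back one line at a time.
        with io.open(output_path, "r", encoding="utf-8") as handle:
            return [line.rstrip(u"\n") for line in handle]
    finally:
        # Remove the temporary directory even if segmentation fails.
        shutil.rmtree(work_dir, ignore_errors=True)
```

With the package installed, a call might look like `segment_sentences([u"شلوني معك؟"], segmenter.segment_file)`. Using `io.open` with an explicit encoding keeps the helper usable on both Python 2.7 and Python 3.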
Younes Samih, Mohamed Eldesouki, Mohammed Attia, Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Laura Kallmeyer (2017), Learning from Relatives: Unified Dialectal Arabic Segmentation, Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 432-441.
Mohamed Eldesouki, Younes Samih, Ahmed Abdelali, Mohammed Attia, Hamdy Mubarak, Kareem Darwish, Laura Kallmeyer (2017), Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM, arXiv preprint arXiv:1708.05891.
Younes Samih, Mohammed Attia, Mohamed Eldesouki, Ahmed Abdelali, Hamdy Mubarak, Laura Kallmeyer, Kareem Darwish (2017), A Neural Architecture for Dialectal Arabic Segmentation, Proceedings of the Third Arabic Natural Language Processing Workshop, pages 46-54.
You can ask questions and join the development discussion:
- On the Dialectal Arabic Tools Google group.
- On the Dialectal Arabic Tools Slack channel. Use this link to request an invitation to the channel.
You can also post bug reports and feature requests (and only those) in GitHub issues. Make sure to read our guidelines first.