GitHub - Nabeelhassan/Urdu: Collection of Urdu datasets for POS, NER and NLP tasks

Summary Dataset

This is a summarization dataset that you can use to train an abstractive summarization model. It contains 3 files, i.e. train, test and val. The data is in JSONL format.

Every line has these keys.

id
url
title
summary
text

You can easily read the data with pandas:

import pandas as pd
test = pd.read_json("summary/urdu_test.jsonl", lines=True)
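Before training, it can be worth checking that every record really carries the five keys listed above. The following is a minimal sketch using only the standard library; the helper name missing_keys is illustrative and not part of this repository.

```python
import json

# Keys every record is expected to carry, as listed above.
EXPECTED_KEYS = {"id", "url", "title", "summary", "text"}

def missing_keys(jsonl_lines):
    """Return (line_number, missing_key_set) for records lacking expected keys."""
    problems = []
    for number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - set(record)
        if missing:
            problems.append((number, missing))
    return problems
```

For a file on disk, pass the open file object, e.g. missing_keys(open("summary/urdu_test.jsonl", encoding="utf-8")). An empty list means every record has all five keys.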

POS dataset

Urdu dataset for POS training. This is a small dataset that can be used to train a part-of-speech tagger for the Urdu language. The structure of the dataset is simple: each line holds a word followed by its tag, i.e.

word TAG
word TAG

The tagset used to build the dataset is taken from Sajjad's Tagset. To get the large dataset, you need to purchase a license. Contact: [email protected]
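The word TAG layout can be parsed with a few lines of Python. This is a minimal sketch under two assumptions that the text above does not confirm: the word and tag are separated by whitespace, and a blank line marks the end of a sentence. The function name and the sample tokens are placeholders.

```python
def read_tagged_sentences(lines):
    """Group `word TAG` lines into sentences of (word, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        word, tag = line.rsplit(None, 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences

# Placeholder tokens and tags, only to show the shape of the result.
sample = ["word1 TAG1", "word2 TAG2", "", "word3 TAG1"]
print(read_tagged_sentences(sample))
```

A line that contains only one field raises a ValueError here, which makes malformed lines easy to spot.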

NER Datasets

The following datasets can be used for NER tasks.

UNER Dataset

Happy to announce that the UNER (Urdu Named Entity Recognition) dataset is available for NLP apps. The following NER tags are used in the dataset:

PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME

If you want to read more about the dataset, check the paper Urdu NER. The NER dataset is encoded in UTF-16.
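Because many tools expect UTF-8, it may help to re-encode the file once before use. The following is a minimal sketch; the function name and the output path in the usage note are assumptions, not part of the repository.

```python
def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 and return the number of lines written."""
    count = 0
    # The "utf-16" codec reads the byte order mark, if present, to pick the byte order.
    with open(src_path, encoding="utf-16") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
            count += 1
    return count
```

For example, convert_utf16_to_utf8("ner/uner.txt", "ner/uner_utf8.txt") would write a UTF-8 copy next to the original, assuming the file lives at that path in your checkout.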

MK-PUCIT Dataset

The latest dataset for Urdu NER is available. For more information, check the paper MK-PUCIT.

Entities used in the dataset are:

Other
Organization
Person
Location

The MK-PUCIT author also provided a Dropbox link to download the data: Dropbox

IJNLP 2008 dataset

The IJNLP dataset has the following NER tags.

O
LOCATION
PERSON
TIME
ORGANIZATION
NUMBER
DESIGNATION

Jahangir dataset

The Jahangir dataset has the following NER tags.

O
PERSON
LOCATION
ORGANIZATION
DATE
TIME

Datasets for Sentiment Analysis

IMDB Urdu Movie Review Dataset.

This dataset is taken from IMDB Urdu. The reviews were translated using Google Translate. It has only two labels, i.e.

positive
negative

Roman Dataset

This dataset can be used for sentiment analysis of Roman Urdu. It has 3 classes for classification.

Neutral
Positive
Negative

If you need more information about this dataset, check out the link Roman Urdu Dataset.

Products & Services dataset

This dataset was collected for sentiment analysis from different sources, such as social media and the web, and covers various products and services. It contains 3 classes.

pos
neg
neu

Daraz Products dataset

This dataset consists of reviews taken from Daraz. You can use it for sentiment analysis as well as spam or ham classification. It contains the following columns.

Product_ID
Date
Rating
Spam(1) and Not Spam(0)
Reviews
Sentiment
Features

The dataset is taken from Kaggle: daraz

Urdu Dataset

Here is a small dataset for sentiment analysis. It has the following classification labels:

P
N
O

Link to the paper: Paper
GitHub link to the data: Urdu Corpus V1

News Datasets

Urdu News Dataset 1M

This dataset (news/urdu-news-dataset-1M.tar.xz) is taken from Urdu News Dataset 1M. It has 4 classes and can be used for classification and other NLP tasks. I have removed unnecessary columns.

Business & Economics
Entertainment
Science & Technology
Sports

Real-Fake News

This dataset (news/real_fake_news.tar.gz) is used for classification of real and fake news; see Fake News Dataset. The dataset contains news from the following domains.

Technology
Education
Business
Sports
Politics
Entertainment

News Headlines

The headlines dataset (news/headlines.csv.tar.gz) is taken from Urdu News Headlines. The original dataset is in Excel format; I've converted it to CSV for experiments. It can be used for clustering and classification.
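The news datasets above ship as compressed tar archives (.tar.xz and .tar.gz). The following is a minimal sketch for unpacking them with the standard library; the function name and output directory are assumptions, and the file names inside each archive are not documented here, so the helper simply returns whatever it finds.

```python
import shutil
import tarfile
from pathlib import Path

def unpack_archive(archive_path, out_dir):
    """Copy every regular file in a tar archive into out_dir; return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:*" lets tarfile detect the compression (gzip, xz, ...) by itself.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            # Keep only the base name, so files are flattened into out_dir.
            target = out_dir / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, unpack_archive("news/headlines.csv.tar.gz", "news/extracted") returns the list of extracted files, which can then be passed to pandas. Flattening to base names is a deliberate choice here so that paths stored in the archive cannot write outside the output directory.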

RAW corpus and models

COUNTER (COrpus of Urdu News TExt Reuse) Dataset

This dataset is collected from journalism and can be used for Urdu NLP research. For more information, see the resource COUNTER.

Urdu model for SpaCy

An Urdu model for SpaCy is available now. You can use it to build NLP apps easily. Install the package in your working environment.

pip install ur_model-0.0.0.tar.gz

You can use it with the following code.

import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")

NLP Tutorials for Urdu

Check out my articles related to Urdu NLP tasks:

POS Tagging: Urdu POS Tagging using MLP
NER: How to build NER dataset for Urdu language?, Named Entity Recognition for Urdu
Word 2 Vector: How to build Word 2 Vector for Urdu language
Word and Sentence Similarity: Urdu Word and Sentence Similarity using SpaCy
Tokenization: Urdu Tokenization using SpaCy
Urdu Language Model: How to build Urdu language model in SpaCy

These articles are available on UrduNLP.

Some Helpful Tips

Download Single file from GitHub

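If you want to get only a raw file (text or code), use the curl command with the raw URL. The URL must contain /raw/ rather than /blob/, otherwise curl saves the HTML page instead of the file, i.e.

curl -LJO https://github.com/mirfan899/Urdu/raw/master/ner/uner.txt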
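Links copied from the GitHub file browser contain /blob/ and return the rendered HTML page rather than the file itself. A small helper, sketched here under the illustrative name to_raw_url, rewrites such a link into the raw.githubusercontent.com form that serves the file contents.

```python
from urllib.parse import urlsplit

def to_raw_url(blob_url):
    """Turn a github.com /blob/ page URL into its raw.githubusercontent.com counterpart."""
    parts = urlsplit(blob_url)
    # Path layout: /<owner>/<repo>/blob/<branch>/<path to file>
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])

print(to_raw_url("https://github.com/mirfan899/Urdu/blob/master/ner/uner.txt"))
```

The resulting URL can be passed to curl or to any HTTP client.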

Concatenate files

cd data
cat */*.txt > file_name.txt
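Where a shell is not available, the same concatenation can be sketched in Python. The function name is an assumption; the glob pattern mirrors the */*.txt pattern of the command above, and files are appended in sorted path order.

```python
from pathlib import Path

def concatenate_txt(root, output_path):
    """Append every <root>/<subdir>/*.txt file to output_path; return the file count."""
    files = sorted(Path(root).glob("*/*.txt"))  # same pattern as the shell glob
    with open(output_path, "wb") as out:
        for path in files:
            out.write(path.read_bytes())  # raw bytes, like cat
    return len(files)
```

For example, concatenate_txt("data", "data/file_name.txt") collects the .txt files from every subdirectory of data. Writing the output directly under the root keeps it out of the */*.txt pattern on later runs.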

MK-PUCIT

Concatenate the files of MK-PUCIT into a single file using:

cat */*.txt > file_name.txt

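The original dataset has a bug: it contains both Others and Other, which are the same entity. If you want to use the dataset from the Dropbox link, use the following commands to clean it.

import pandas as pd
# Column order is assumed to be word, then tag, matching the output format shown after this code.
# names must be an ordered list; a set does not guarantee column order.
data = pd.read_csv("ner/mk-pucit.txt", sep="\t", names=["word", "tag"])
# Merge the duplicate label "Others" into "Other".
data["tag"] = data["tag"].replace({"Others": "Other"})
# Save as csv or txt, as needed, by changing the extension.
data.to_csv("ner/mk-pucit.txt", index=False, header=False, sep="\t")

Now the csv/txt file has the format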

word tag
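To confirm that the cleaning worked, the tags in the resulting file can be counted and compared with the four MK-PUCIT entities listed earlier. This is a minimal sketch that assumes tab-separated lines with the tag in the last field and tag spellings exactly as listed in this README; the function names are illustrative.

```python
from collections import Counter

# Entity labels listed for MK-PUCIT in this README.
EXPECTED_TAGS = {"Other", "Organization", "Person", "Location"}

def tag_counts(lines):
    """Count the tag (last tab-separated field) of each non-empty line."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        counts[line.rsplit("\t", 1)[-1].strip()] += 1
    return counts

def unexpected_tags(lines):
    """Return the set of tags that are not among the four expected labels."""
    return set(tag_counts(lines)) - EXPECTED_TAGS
```

Calling unexpected_tags(open("ner/mk-pucit.txt", encoding="utf-8")) on the cleaned file should return an empty set; a leftover Others would show up in the result.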

Note

If you have a dataset (link) and want to contribute, feel free to create a PR.
