ArabicText-NTG

Introduction

To evaluate the generation ability of pretrained models, we have published ArabicText-NTG, a large-scale, multi-topic Arabic news title generation dataset. It contains over 1M Arabic news articles paired with their titles, making it the largest dataset of its kind in the Arab world. The titles and contents have been strictly cleaned and manually filtered to ensure high quality. We also provide baseline results and evaluation scripts for a fair comparison. Feel free to download the dataset at BAAIData.

Dataset Statistics

Split	# Docs	Avg content len (token/char)	Avg title len (token/char)
TRAIN	836.3K	260.7/1314.0	9.26/48.7
VALID	104.5K	259.0/1306.1	9.24/48.6
TEST	104.5K	258.2/1303.7	9.28/48.8
Total	1.04M	260.2/1312.2	9.26/48.7
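
The per-split averages in the table can be approximated with a short script. The sketch below is only illustrative: the field names `title` and `content` and the whitespace tokenization are assumptions, since this README does not state the dataset schema or the tokenizer behind the token counts.

```python
from statistics import mean


def length_stats(samples):
    """Average (token, char) lengths of contents and titles.

    Assumes each sample is a dict with hypothetical "content" and "title"
    keys, and counts tokens by splitting on whitespace.
    """
    def avg_len(field):
        texts = [s[field] for s in samples]
        return (mean(len(t.split()) for t in texts), mean(len(t) for t in texts))

    return {"content": avg_len("content"), "title": avg_len("title")}


# Toy samples for illustration only; they are not taken from the dataset.
samples = [
    {"title": "خبر عاجل", "content": "هذا نص الخبر"},
    {"title": "عنوان ثان للخبر", "content": "نص قصير"},
]
stats = length_stats(samples)
for field, (tokens, chars) in stats.items():
    print(f"{field}: {tokens:.2f}/{chars:.1f}")
```

Numbers produced this way may differ from the table if the published statistics were computed with a different tokenizer.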

Getting Started

Requirements

# to calculate the BLEU score
pip install sacrebleu
# to calculate ROUGE scores correctly, install a patched version of rouge_score that fixes bugs with Arabic text
pip install git+https://github.com/ARBML/rouge_score_ar

Usage

Run scripts/eval.sh to get the evaluation results.
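
To show why tokenization matters when scoring Arabic titles, here is a minimal, standard-library sketch of a unigram-overlap (ROUGE-1 style) F1 score. It is not the implementation used by scripts/eval.sh or by the patched rouge_score package. The function names are hypothetical, and `ascii_tokenize` only illustrates one plausible failure mode (an ASCII-only pattern), not a confirmed description of any library.

```python
import re
from collections import Counter


def ascii_tokenize(text):
    # Hypothetical ASCII-only tokenizer: everything outside a-z0-9 becomes
    # a separator, so Arabic text yields no tokens at all.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def unicode_tokenize(text):
    # In Python 3, \w matches Unicode letters and digits, so Arabic words
    # are kept. Caveat: combining diacritics are not matched by \w and
    # would split a diacritized word.
    return re.findall(r"\w+", text.lower())


def rouge1_f1(reference, candidate, tokenize=unicode_tokenize):
    """Unigram-overlap F1 between a reference title and a generated title."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "الرئيس يزور القاهرة اليوم"
candidate = "الرئيس يزور القاهرة"
print(rouge1_f1(reference, candidate))                  # Unicode-aware score
print(rouge1_f1(reference, candidate, ascii_tokenize))  # collapses to 0.0
```

With the ASCII-only tokenizer, every Arabic pair scores zero regardless of quality, which is the kind of problem a language-aware tokenizer avoids.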

Baselines

Baseline results are on the way.
