The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformations augment text datasets in diverse ways, including: randomizing names and numbers, changing style/syntax, paraphrasing, KB-based paraphrasing ... and whatever creative augmentation you contribute. We invite submissions of transformations to this framework by way of GitHub pull request.
Paper accepted at NEJLT 2023 here.
The framework organizers can be contacted at [email protected].
Table of contents
- Colab notebook
- Installation
- How do I create a transformation?
- How do I create a filter?
- Motivation
- Review Criteria for Accepting Submissions
- Some Ideas for Transformations
To quickly see transformations and filters in action, run through our colab notebook.
If you need inspiration for what transformations to implement, check out GEM-benchmark#75, where some ideas and previous papers are discussed. So far, contributions have focused on morphological inflections, character level changes, and random noise. The best new pull requests will be dissimilar from these existing contributions.
Requirements
- Python 3.7
Instructions
# When creating a new transformation, replace this with your forked repository (see below)
git clone https://github.com/GEM-benchmark/NL-Augmenter.git
cd NL-Augmenter
python setup.py sdist
pip install -e .
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
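After running the commands above, a short script can confirm that the pieces are importable. This is only a sketch: the module name nlaugmenter is inferred from the nlaugmenter/ directory in the repository, and en_core_web_sm is the name under which the spaCy model package is expected to install; adjust both if your setup differs.

```python
import importlib.util
import sys
from typing import List

# Assumed import names (not confirmed by this README): "nlaugmenter" is inferred
# from the nlaugmenter/ directory, "en_core_web_sm" from the spaCy model archive.
REQUIRED_MODULES = ["nlaugmenter", "spacy", "en_core_web_sm"]


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level modules from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Print the interpreter version so it can be compared with the requirement.
    print("Python version:", sys.version.split()[0])
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If a module is reported missing, rerunning the corresponding pip install line is the first thing to try.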
First, fork the repository in GitHub! 🍴
Your fork will have its own location, which we will call PATH_TO_YOUR_FORK.
Next, clone the forked repository and create a branch for your transformation, which here we will call my_awesome_transformation:
git clone $PATH_TO_YOUR_FORK
cd NL-Augmenter
git checkout -b my_awesome_transformation
We will base our transformation on an existing example. Create a new transformation directory by copying over an existing transformation. You can choose to copy from other transformation directories depending on the task you wish to create a transformation for. Check some of the existing pull requests and merged transformations first to avoid duplicating efforts or creating transformations too similar to previous ones.
cd nlaugmenter/transformations/
cp -r butter_fingers_perturbation my_awesome_transformation
cd my_awesome_transformation
- In the file transformation.py, rename the class ButterFingersPerturbation to MyAwesomeTransformation and choose one of the interfaces from the interfaces/ folder. See the full list of options here.
- Now put all your creativity in implementing the generate method. If you intend to use external libraries, add them with their version numbers in requirements.txt.
- Update my_awesome_transformation/README.md to describe your transformation.
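The steps above can be sketched as follows. To stay runnable on its own, the example defines a small stand-in base class instead of importing one from interfaces/; the stand-in's name and constructor, the prob parameter, and the character-swap logic are illustrative assumptions, so check the interface you actually choose for its real signature.

```python
import random
from typing import List


class SentenceOperation:
    """Stand-in for an interface from the interfaces/ folder.

    Assumption: the real base class may differ in name and constructor.
    """

    def __init__(self, seed: int = 0, max_outputs: int = 1):
        self.seed = seed
        self.max_outputs = max_outputs

    def generate(self, sentence: str) -> List[str]:
        raise NotImplementedError


class MyAwesomeTransformation(SentenceOperation):
    """Swaps two adjacent inner characters in randomly chosen words."""

    def __init__(self, seed: int = 0, max_outputs: int = 1, prob: float = 0.5):
        super().__init__(seed=seed, max_outputs=max_outputs)
        self.prob = prob  # hypothetical parameter: chance of perturbing a word

    def generate(self, sentence: str) -> List[str]:
        rng = random.Random(self.seed)  # local RNG keeps results reproducible
        outputs = []
        for _ in range(self.max_outputs):
            words = []
            for word in sentence.split(" "):
                if len(word) > 3 and rng.random() < self.prob:
                    # Pick an inner position so the first and last
                    # characters of the word stay in place.
                    i = rng.randrange(1, len(word) - 2)
                    word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
                words.append(word)
            outputs.append(" ".join(words))
        return outputs


if __name__ == "__main__":
    transformation = MyAwesomeTransformation(seed=0, max_outputs=2)
    for variant in transformation.generate("Transformations augment text datasets"):
        print(variant)
```

Seeding a local random.Random instance, rather than the global generator, is one way to keep the example pairs in test.json stable across runs.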
Testing and evaluating (Optional)
Once you are done, add at least 5 example pairs as test cases in the file test.json so that no one breaks your code inadvertently.
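One way to produce the example pairs is to run the transformation over a handful of inputs and dump the results as JSON. The sketch below is hedged: the key names (type, test_cases, class, inputs, outputs, sentence) are assumptions about the layout, and demo_generate is a placeholder for your own generate method. Mirror the keys of the test.json you copied from the existing transformation rather than relying on these.

```python
import json
from typing import Callable, List


def build_test_cases(
    generate: Callable[[str], List[str]],
    class_name: str,
    sentences: List[str],
) -> dict:
    """Pair each input sentence with the outputs the transformation produces."""
    cases = []
    for sentence in sentences:
        cases.append(
            {
                "class": class_name,
                "inputs": {"sentence": sentence},
                "outputs": [{"sentence": out} for out in generate(sentence)],
            }
        )
    # Assumed layout; compare with the test.json you copied before committing.
    return {"type": "my_awesome_transformation", "test_cases": cases}


if __name__ == "__main__":
    def demo_generate(sentence: str) -> List[str]:
        """Placeholder for MyAwesomeTransformation().generate."""
        return [sentence.upper()]

    examples = [
        "Andrew finally returned the French book to Chris.",
        "Sentences with gapping are hard to parse.",
        "The quick brown fox jumps over the lazy dog.",
        "She bought three apples and two oranges.",
        "NL-Augmenter welcomes new transformations.",
    ]
    data = build_test_cases(demo_generate, "MyAwesomeTransformation", examples)
    print(json.dumps(data, indent=4))
```

Review the generated pairs by hand before saving them, since the test only guards against later changes and does not judge whether the outputs are sensible.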
Once the transformation is ready, test it:
pytest -s --t=my_awesome_transformation
If you would like to evaluate your transformation against a common 🤗HuggingFace model, we encourage you to check evaluation.
Code Styling
To standardize the code, we use the black code formatter, which runs at pre-commit time.
To use the pre-commit hook, install pre-commit with pip install pre-commit (should already be installed if you followed the above instructions). Then run pre-commit install to install the hook. On future commits, you should see the black code formatter run on all Python files you've staged for commit.
Once the tests pass and you are happy with the transformation, submit it for review. First, commit and push your changes:
# Run from the repository root
git add nlaugmenter/transformations/my_awesome_transformation/*
git commit -m "Added my_awesome_transformation"
git push --set-upstream origin my_awesome_transformation
Finally, submit a pull request.
The last git push command prints a URL that can be copied into a browser to initiate such a pull request.
Alternatively, you can do so from the GitHub website.
✨ Congratulations, you've submitted a transformation to NL-Augmenter! ✨
We also accept pull requests for creating filters, which identify interesting subpopulations of a dataset. The process for adding a new filter is the same as above. All filter implementations must implement .filter instead of .generate and must be placed in the filters folder. Just as transformations transform examples of text, filters identify whether an example follows some pattern of text. The only difference is that while transformations return another example of the same input format, filters simply return True or False. For step-by-step instructions, follow these steps.
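As a minimal illustration of the True/False contract, here is a hypothetical length-based filter. The class name, the op and threshold parameters, and the whitespace tokenization are all assumptions made for this sketch; a real filter would build on one of the interfaces in the repository.

```python
import operator


class SentenceLengthFilter:
    """Hypothetical filter: selects sentences by their whitespace token count."""

    _OPS = {
        ">": operator.gt,
        "<": operator.lt,
        ">=": operator.ge,
        "<=": operator.le,
        "==": operator.eq,
    }

    def __init__(self, op: str = ">", threshold: int = 10):
        if op not in self._OPS:
            raise ValueError(f"Unsupported operator: {op!r}")
        self.compare = self._OPS[op]
        self.threshold = threshold

    def filter(self, sentence: str) -> bool:
        # Unlike generate, filter returns a boolean, not a new example.
        return self.compare(len(sentence.split()), self.threshold)


if __name__ == "__main__":
    long_sentences = SentenceLengthFilter(op=">", threshold=3)
    for text in ["Too short.", "This sentence has more than three tokens."]:
        print(text, "->", long_sentences.filter(text))
```

Applying such a filter over a dataset yields the subpopulation for which filter returns True, which can then be evaluated separately.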
If you are interested in NL-Augmenter, you may also be interested in the BIG-bench large scale collaborative benchmark for language models.
After all pull requests have been merged, 3 of the most creative implementations will be selected and featured on this README page and on the NL-Augmenter webpage.
@misc{dhole2021nlaugmenter,
title={NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation},
author={Kaustubh D. Dhole and Varun Gangal and Sebastian Gehrmann and Aadesh Gupta and Zhenhao Li and Saad Mahamood and Abinaya Mahendiran and Simon Mille and Ashish Srivastava and Samson Tan and Tongshuang Wu and Jascha Sohl-Dickstein and Jinho D. Choi and Eduard Hovy and Ondrej Dusek and Sebastian Ruder and Sajant Anand and Nagender Aneja and Rabin Banjade and Lisa Barthe and Hanna Behnke and Ian Berlot-Attwell and Connor Boyle and Caroline Brun and Marco Antonio Sobrevilla Cabezudo and Samuel Cahyawijaya and Emile Chapuis and Wanxiang Che and Mukund Choudhary and Christian Clauss and Pierre Colombo and Filip Cornell and Gautier Dagan and Mayukh Das and Tanay Dixit and Thomas Dopierre and Paul-Alexis Dray and Suchitra Dubey and Tatiana Ekeinhor and Marco Di Giovanni and Rishabh Gupta and Rishabh Gupta and Louanes Hamla and Sang Han and Fabrice Harel-Canada and Antoine Honore and Ishan Jindal and Przemyslaw K. Joniak and Denis Kleyko and Venelin Kovatchev and Kalpesh Krishna and Ashutosh Kumar and Stefan Langer and Seungjae Ryan Lee and Corey James Levinson and Hualou Liang and Kaizhao Liang and Zhexiong Liu and Andrey Lukyanenko and Vukosi Marivate and Gerard de Melo and Simon Meoni and Maxime Meyer and Afnan Mir and Nafise Sadat Moosavi and Niklas Muennighoff and Timothy Sum Hon Mun and Kenton Murray and Marcin Namysl and Maria Obedkova and Priti Oli and Nivranshu Pasricha and Jan Pfister and Richard Plant and Vinay Prabhu and Vasile Pais and Libo Qin and Shahab Raji and Pawan Kumar Rajpoot and Vikas Raunak and Roy Rinberg and Nicolas Roberts and Juan Diego Rodriguez and Claude Roux and Vasconcellos P. H. S. and Ananya B. Sai and Robin M. Schmidt and Thomas Scialom and Tshephisho Sefara and Saqib N. 
Shamsi and Xudong Shen and Haoyue Shi and Yiwen Shi and Anna Shvets and Nick Siegel and Damien Sileo and Jamie Simon and Chandan Singh and Roman Sitelew and Priyank Soni and Taylor Sorensen and William Soto and Aman Srivastava and KV Aditya Srivatsa and Tony Sun and Mukund Varma T and A Tabassum and Fiona Anting Tan and Ryan Teehan and Mo Tiwari and Marie Tolkiehn and Athena Wang and Zijian Wang and Gloria Wang and Zijie J. Wang and Fuxuan Wei and Bryan Wilie and Genta Indra Winata and Xinyi Wu and Witold Wydmański and Tianbao Xie and Usama Yaseen and M. Yee and Jing Zhang and Yue Zhang},
journal={Northern European Journal of Language Technology},
volume={9},
number={1},
year={2023}
}
Some transformations include components released under a different (permissive, open source) license. For license details, refer to the README.md and any license files in the transformation's or filter's directory.