The purpose of this repo is to bring together the main cleaning processes needed to get the most value out of a dirty linguistic dataset.
To that end, the concept of a "cleaning process" is divided in two:
- Normalizers: whose goal is to give consistency to the entire dataset => EXAMPLE
- Validators: whose goal is to check whether a piece of the dataset is valuable or not => EXAMPLE
Along the same lines, there is a third type of function, the "Helpers", whose goal is to avoid rewriting code that is used repeatedly by the normalizers and the validators => EXAMPLE
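To make the three roles concrete, here is a minimal sketch of what a helper, a normalizer, and a validator might look like. The function names are illustrative assumptions, not the repo's actual API:

```python
import re
import unicodedata

# Hypothetical helper: a small utility shared by normalizers and validators.
def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical normalizer: gives consistency to every entry in the dataset.
def normalize_unicode(text: str) -> str:
    """Apply Unicode NFC normalization, then collapse whitespace."""
    return collapse_whitespace(unicodedata.normalize("NFC", text))

# Hypothetical validator: decides whether an entry is worth keeping.
def is_long_enough(text: str, min_chars: int = 3) -> bool:
    """Reject entries that are too short to carry linguistic value."""
    return len(text) >= min_chars
```

Note how the helper carries no cleaning policy of its own; it only exists so the normalizer (and any validator that needs it) does not duplicate the same whitespace logic.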
Last but not least, note that these processes are designed to handle both monolingual and bilingual datasets => EXAMPLE
You can find a clear usage example for monolingual datasets HERE and for bilingual datasets HERE.
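The monolingual/bilingual distinction can be sketched as follows: a monolingual dataset is a list of sentences, while a bilingual one is a list of source/target pairs where a pair is only kept if both sides survive cleaning. These function names are assumptions for illustration:

```python
# Hypothetical sketch: the same cleaning idea applied to a monolingual
# corpus (one sentence per entry) and a bilingual one (src/tgt pairs).

def clean_monolingual(sentences):
    """Strip and lowercase each sentence, dropping empty ones."""
    return [s.strip().lower() for s in sentences if s.strip()]

def clean_bilingual(pairs):
    """Clean both sides of each pair; keep a pair only if both survive."""
    cleaned = ((src.strip(), tgt.strip()) for src, tgt in pairs)
    return [(s, t) for s, t in cleaned if s and t]
```

The key design point for bilingual data is that validity is decided per pair, not per side: dropping only one half of a translation pair would silently misalign the corpus.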
What is done is, on the one hand, to bring all the desired normalizers together into a single function, which defines what "normalize" means in this particular case. On the other hand, a similar thing is done for "validate", with the particularity that, in this case, the invalid parts of the dataset are kept track of.
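A minimal sketch of that composition, assuming hypothetical normalizer and validator registries (the names and checks below are illustrative, not the repo's actual ones):

```python
# Hypothetical pipeline: one `normalize` chaining all normalizers, and
# one `validate` that also records which entries were rejected and why.

NORMALIZERS = [str.strip, str.lower]

VALIDATORS = {
    "non_empty": lambda s: bool(s),
    "no_digits": lambda s: not any(c.isdigit() for c in s),
}

def normalize(sentence):
    """Apply every registered normalizer in order."""
    for fn in NORMALIZERS:
        sentence = fn(sentence)
    return sentence

def validate(sentence, rejected):
    """Run every validator; log the first failure and reject."""
    for name, check in VALIDATORS.items():
        if not check(sentence):
            rejected.append((name, sentence))  # keep track of invalid parts
            return False
    return True

def clean(dataset):
    """Normalize everything, then split into kept and rejected entries."""
    rejected = []
    kept = [s for s in map(normalize, dataset) if validate(s, rejected)]
    return kept, rejected
```

Tracking rejections by validator name is what makes the "keep track of the invalid parts" idea useful in practice: it tells you not just what was dropped, but which rule dropped it.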
Note: you may want to use some "cleaning processes" rather than others depending on the language(s) of the dataset being treated.
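For instance, a diacritic-stripping normalizer may be fine for English but harmful for Spanish, where accents and ñ are meaningful. A hedged sketch of per-language process selection (the registry, language codes, and choices are assumptions for illustration):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (safe for some languages, not for Spanish)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical registry mapping language codes to cleaning processes.
PROCESSES_BY_LANG = {
    "en": [str.lower, strip_diacritics],
    "es": [str.lower],  # keep accents and ñ: they are meaningful in Spanish
}

def normalize_for(lang: str, text: str) -> str:
    """Apply the processes registered for `lang` (lowercasing by default)."""
    for fn in PROCESSES_BY_LANG.get(lang, [str.lower]):
        text = fn(text)
    return text
```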
Although this repo was originally designed mainly with the English-Spanish combination in mind, the set of "cleaning processes" is expected to grow in order to cover as many languages as possible.
The contribution guidelines follow the guide HERE.
- Fork this repository
- Clone your forked repository
- Add your process
- Commit & push
- Create a pull request
- Star this repository
- Wait for the pull request to be merged
- Celebrate your first step into the open-source world, and keep contributing