A Python package for natural language preprocessing with nltk and Hunspell.
Includes:
- Case standardization
- Symbol standardization
- Extra-whitespace removal
- Stopword removal
- Simple spelling correction
- Lemmatization
Available utilities (a short usage sketch follows this list):
- clean_cases
- split_camel_cased
- clean_invalid_symbols
- clean_repeated_symbols
- clean_spaces
- remove_stopwords
- fix_spelling
- SpellChecker
- lemmatize
- clean
- soft_clean
- full_clean
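Each utility presumably takes a string and returns the processed string. Below is a minimal sketch of chaining a few of them; note that the module path textpreprocess.cleaners.en is an assumption (only textpreprocess.compound_cleaners.en is confirmed by the usage example further down), so check the package layout for the exact import:

# NOTE: this module path is an assumption; only
# textpreprocess.compound_cleaners.en is confirmed by this README.
from textpreprocess.cleaners.en import clean_spaces, fix_spelling, remove_stopwords

text = '  thiss is   a bery dirti text!  '
text = clean_spaces(text)      # collapse runs of extra whitespace
text = fix_spelling(text)      # Hunspell-based correction, e.g. 'thiss' -> 'this'
text = remove_stopwords(text)  # drop stopwords such as 'is' and 'a'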
Supported languages:
- Spanish
- English
Spell-checking functions rely on dictionary files, placed by default in the dictionaries
directory. This collection of dictionaries was added as a git submodule for convenience.
Lemmatization in Spanish relies on lemma dictionary files, placed by default in the lemmas
directory. This collection was also added as a git submodule. Feel free to propose your own!
To clone all submodules, run the following commands:
git submodule init
git submodule update
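Alternatively, the repository and its submodules can be cloned in one step (substitute the actual repository URL for the placeholder):

git clone --recurse-submodules <repository-url>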
Further reference on submodules is available in the git documentation.
The stopwords and wordnet corpora for the nltk package must be installed. A helper script is provided for easy setup. Simply run:
python setup.py
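If you would rather install the corpora by hand, the standard nltk downloader achieves the same result (this is presumably what the helper script automates):

import nltk

# Fetch the two corpora this package depends on.
nltk.download('stopwords')
nltk.download('wordnet')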
from textpreprocess.compound_cleaners.en import full_clean, soft_clean
text = " thiss is a bery :''{ñdirti text! "
full_clean(text) # -> 'this very dirt text'
soft_clean(text) # -> 'this is a very dirty text'
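Judging by the outputs, soft_clean normalizes casing, symbols, whitespace, and spelling, while full_clean additionally removes stopwords and lemmatizes.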
Special thanks to Vicente Oyanedel M. for his work on the first version of this package.