The purpose of this library is to make text classification easily available. It relies on three components which work together:
- Preprocessor: Reads in data and stores feature vectors and predictions
- Featurizer: Extracts features out of text data
- Classifier: Uses extracted features to train a classification model and do inference on unseen instances
The documentation where the functionality of each of the components is explained can be found here.
To install this library, execute the following commands in your terminal:
git clone https://github.com/bogdankostic/TextClassification.git
cd TextClassification
pip install -r requirements.txt
pip install --editable .
Training a new classifier requires only three steps:
- Read data using a
Preprocessor
- Extract features using a
Featurizer
- Pass the data with extracted feature to a
Classifier
Code example:
from text_classification.preprocessor.csv_preprocessor import CSVPreprocessor
from text_classification.featurizer.tweet_featurizer import TweetFeaturizer
from text_classification.classifier.class_average import ClassAverageClassifier
preprocessor = CSVPreprocessor(train_filename="train.tsv")
featurizer = TweetFeaturizer()
featurizer.extract_features(preprocessor)
classifier = ClassAverageClassifier()
classifier.train(preprocessor)
This library has been built around the Hillary Clinton and Donald Trump Tweets dataset, which can be downloaded from here: https://www.kaggle.com/benhamner/clinton-trump-tweets
The example script that trains and evaluates a model on the dataset can be found in this here. This example achieves an accuracy and a macro-averaged F1-score of 0.57.
Tests can be executed using the command pytest
from within
the test folder TextClassification/test
.