The purpose of this project was to complete an assignment given during the Advanced NLP course at FHNW. The assignment description (in German) can be found here; the main idea was to train an NLP model that does topic classification on a dataset of German news articles.
The project structure follows the directory convention that can be found here, with the exception that source files are not located in a /src folder. The reason for this is to avoid having src in import statements (`import src.reporting`) and to not deviate from pip defaults.
| Folder | Description |
|---|---|
| /data | Storage location for data. Raw data is downloaded to raw and stored in processed after preprocessing. Trained models are stored in the model subdirectory. |
| /docs | Documentation that is not part of the code. |
| /models | Location for model definitions. |
| /notebooks | Jupyter notebooks, stored in the eda (Exploratory Data Analysis), modeling (Modeling) and evaluation (Evaluation) directories. |
| /preprocessing | Extracted Python code that is used during data preprocessing. |
| /reporting | Extracted Python code that is used during reporting. |
| /tests | Module tests. |
The dataset consists of roughly 10,000 news articles classified into the categories "Sport", "Kultur", "Web", "Wirtschaft", "Inland", "Etat", "International", "Panorama" and "Wissenschaft". It was given to us as two datasets, one for training and one for testing.
I did Exploratory Data Analysis by looking at individual records, missing data, category distribution, text length and language, and by inspecting word clouds.
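A minimal sketch of the kind of checks involved, assuming a hypothetical file location and hypothetical column names (text, category):

```python
import pandas as pd

# Hypothetical path and column names; the real dataset schema may differ.
df = pd.read_csv("data/raw/train.csv")

print(df.isna().sum())                    # missing values per column
print(df["category"].value_counts())      # category distribution
df["text_length"] = df["text"].str.len()  # text length in characters
print(df["text_length"].describe())
```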
The modeling phase started by joining the given datasets (training, test) into a single one. Based on the class distribution I augmented a small amount of data to tackle class imbalance. Then tokenization, stemming and lemmatization were done, followed by a stratified split into training, test and validation data.
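A minimal sketch of such a stratified split with sklearn, assuming the joined dataframe and column names from the EDA sketch above (the ratios are placeholders, not necessarily the exact ones used):

```python
from sklearn.model_selection import train_test_split

# Two chained stratified splits: ~70% training, ~15% validation, ~15% test.
train_df, rest_df = train_test_split(
    df, test_size=0.3, stratify=df["category"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df["category"], random_state=42)
```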
We were told that a baseline model based on TF-IDF is a proper standard. This notebook can be found here.
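For illustration, a TF-IDF baseline could look like the following sklearn pipeline; the classifier and parameters are assumptions, not necessarily what the notebook uses:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF features feeding a simple linear classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(train_df["text"], train_df["category"])
print(baseline.score(val_df["text"], val_df["category"]))
```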
The first model was a Convolutional Neural Network using fasttext word embeddings and a convolutional layer.
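A sketch of such an architecture in tf.keras; the layer sizes are placeholders, and the embedding matrix is assumed to have been filled from the fasttext vectors beforehand:

```python
import numpy as np
import tensorflow as tf

# Placeholder dimensions; embedding_matrix would be filled from fasttext vectors.
vocab_size, embedding_dim, max_len, num_classes = 50_000, 300, 400, 9
embedding_matrix = np.zeros((vocab_size, embedding_dim))

cnn_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              weights=[embedding_matrix],
                              input_length=max_len, trainable=False),
    tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
cnn_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```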
The second model was a Recurrent Neural Network using fasttext word embeddings and an LSTM layer. I did some hyperparameter tuning based on TensorFlow utilities.
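A corresponding RNN sketch differs only in the sequence layer (same placeholders as in the CNN sketch above):

```python
# Reuses vocab_size, embedding_dim, embedding_matrix, max_len and num_classes
# from the CNN sketch; only the Conv1D/pooling block is swapped for an LSTM.
rnn_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              weights=[embedding_matrix],
                              input_length=max_len, trainable=False),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
rnn_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```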
In the end I settled on a transformer-based BERT model from Hugging Face.
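A sketch of how such a fine-tuning setup could look with the transformers library; the checkpoint name (bert-base-german-cased), the sequence length and the hyperparameters are assumptions:

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "bert-base-german-cased"  # assumed checkpoint, 9 target classes
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
bert = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=9)

# Integer-encode the categories and tokenize the texts (column names as above).
train_labels = train_df["category"].astype("category").cat.codes.to_numpy()
encodings = tokenizer(list(train_df["text"]), truncation=True,
                      padding=True, max_length=256, return_tensors="tf")

bert.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
             loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
             metrics=["accuracy"])
bert.fit(dict(encodings), train_labels, epochs=3, batch_size=16)
```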
For every model I persisted a sample file (e.g. BERT) containing two arrays: a) predictions and b) expectations. This allowed me to analyse and compare the different models later on during evaluation and to calculate different metrics based on that data.
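A sketch of this persistence step, with illustrative file and variable names (y_pred and y_true being the model predictions and the ground-truth labels as integer classes):

```python
import numpy as np
from sklearn.metrics import classification_report

# Persist predictions and expectations so models can be compared later on.
np.savez("data/processed/bert_sample.npz", predictions=y_pred, expectations=y_true)

# Later, during evaluation:
sample = np.load("data/processed/bert_sample.npz")
print(classification_report(sample["expectations"], sample["predictions"]))
```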
It was hard to refactor the notebooks into normal Python files for several reasons:
- Preprocessing (cleaning, tokenization, etc.) and training are highly coupled, and it is hard to extract code while keeping the flexibility to train different models. E.g. TF-IDF needs very different preprocessing (lemmatization, stemming, etc.) than RNNs and CNNs (word vectors) or transformers (pretrained tokenizers).
- Reloading: reloading code that lives outside the Jupyter notebooks often requires restarting the kernel, which makes refactoring even harder (a partial mitigation is sketched below).
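One partial mitigation (not a complete fix, and not necessarily the workflow used here) is IPython's autoreload extension, enabled at the top of a notebook:

```python
# IPython magics, run in the first notebook cell: edited modules are reloaded
# before each cell execution, which avoids some (but not all) kernel restarts.
%load_ext autoreload
%autoreload 2

import preprocessing, reporting  # project modules from this repository
```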
It is a nice feature of Jupyter notebooks that cell outputs are stored within the files, because it makes them easy to share and review. Those outputs are, however, very annoying under version control, since changes are detected on almost every run. An option would be to use Git smudge filters, but this would make our notebooks not reviewable anymore.
During preprocessing and training I used different libraries: sklearn (data split, base model) and tensorflow (RNN, CNN, BERT). It felt a bit awkward to train a model using tensorflow and evaluate it using sklearn (e.g. sklearn's confusion_matrix); I was aiming to use either one or the other.
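For illustration, the same confusion matrix can be computed in either ecosystem (y_true and y_pred being integer class labels, as in the persistence sketch above):

```python
import tensorflow as tf
from sklearn.metrics import confusion_matrix

cm_sklearn = confusion_matrix(y_true, y_pred)
cm_tf = tf.math.confusion_matrix(y_true, y_pred).numpy()  # equivalent result
```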
I used a Vertex AI Workbench on Google Cloud Platform for coding and training. My laptop does not have a suitable GPU, and GCP provides GPU support for a reasonable amount of money. Additionally, it integrates with GitHub repositories and more or less lets you follow proper development and project standards.
While using Vertex workbenches I experienced different issues:
- Stopping instances after work is a bad idea, because you might run into problems reallocating GPU resources afterwards.
- Even while running the Workbench continuously, I ran into an automatic instance update where my boot disk got lost.
- Generate an SSH key for checkout:
ssh-keygen -t rsa -b 4096 -C "[email protected]"
- Clone the project:
git clone [email protected]:raffaelschmid/nlp-topic-classification-german.git ~/code/nlp-topic-classification-german
- Add the following line to ~/.ipython/profile_default/ipython_kernel_config.py:
c.InteractiveShellApp.exec_lines = [ 'import sys; sys.path.append("/home/jupyter/code/nlp-topic-classification-german")' ]
The setup was done using pip; project requirements are stored in requirements.txt. After adding a dependency, update it with the following command:
# Attention: within fully equipped GCP template containers this might get big.
pip freeze > requirements.txt