We capture many scanned images of documents of various types, some taken on handheld devices and some using scanners. It therefore becomes increasingly important to organize these scanned documents, which requires reliable, high-quality classification of the scanned document images into several categories such as letter, form, etc.
This is part of the IndoML22 (Indian Symposium on Machine Learning, 2022) Datathon Challenge.
The training and validation data provided in the Datathon is a subset of the RVL-CDIP dataset: 16,000 grayscale images, with 1,000 images in each of the 16 categories into which the images are classified. The competition and the data were released as a Kaggle competition.
Images from the training set span 16 different categories (with their corresponding labels), as shown below:
A discussion of the data, with a few more images from both the training and validation sets, can be found in the data overview notebook.
The task is to build a model that classifies each image into its respective category; performance is evaluated using the Mean F1-Score. The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision and recall.

Precision is the ratio of true positives to all predicted positives (true positives plus false positives); recall is the ratio of true positives to all actual positives (true positives plus false negatives).
The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
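The metric above can be sketched in a few lines of NumPy. This is an illustrative implementation of a macro-averaged (mean) F1, not the competition's official scoring code; the function name `mean_f1` and the toy labels are assumptions.

```python
import numpy as np

def mean_f1(y_true, y_pred, n_classes=16):
    """Macro-averaged F1: per-class F1 from precision/recall, then the mean."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return float(np.mean(f1s))

# toy 3-class example
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(mean_f1(y_true, y_pred, n_classes=3))
```

Because each class contributes equally to the mean, a model cannot score well by excelling on a few frequent classes while failing on the rest.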
Various visual-feature-extraction-based methods were applied. Four of them are listed below; the first two use an EfficientNetV2L model pretrained on ImageNet:
- EfficientNet followed by FFN (EffNet)
- Partitioned Image based EfficientNet followed by FFN (EffNet-4Piece)
- InceptionResNetV2 along with RoI based Vision Transformer Network (IncResNet-RoI-ViT) [Model Report]
- ResNet-VGG-InceptionResNetV2 along with PCA followed by FFN (ResVGGInc-PCA-4Piece) [Model Report]
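To make the "partitioned image" idea in EffNet-4Piece concrete, here is a minimal NumPy sketch that splits a grayscale document image into four quadrants. The function name is an illustrative assumption; the downstream feature extraction with EfficientNetV2L and the FFN head are omitted.

```python
import numpy as np

def partition_into_quadrants(img):
    """Split a grayscale image (H, W) into its four quadrants.
    In an EffNet-4Piece-style setup, each piece would be passed
    through the backbone separately and the features combined."""
    h, w = img.shape
    h2, w2 = h // 2, w // 2
    return [
        img[:h2, :w2],   # top-left
        img[:h2, w2:],   # top-right
        img[h2:, :w2],   # bottom-left
        img[h2:, w2:],   # bottom-right
    ]

doc = np.arange(16).reshape(4, 4)          # toy 4x4 "document"
pieces = partition_into_quadrants(doc)
print([p.shape for p in pieces])           # four 2x2 quadrants
```

Partitioning lets the backbone see each region of a document page at higher effective resolution, which can help with layouts where discriminative content (letterheads, form fields) sits in one corner.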
The results of clustering the learnt penultimate-layer feature vectors for the above models on the training set are shown below:
| EffNet (Mean-F1: 0.6) | EffNet-4Piece (Mean-F1: 0.68) |
| --- | --- |
| IncResNet-RoI-ViT (Mean-F1: 0.755) | ResVGGInc-PCA-4Piece (Mean-F1: 0.785) |
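The clustering step can be sketched as a plain k-means over the penultimate-layer feature vectors. This is a generic illustration, not the exact procedure from the model reports; the `kmeans` function, the farthest-point initialisation, and the toy blob features are all assumptions.

```python
import numpy as np

def kmeans(features, k=16, n_iter=20):
    """Minimal k-means over feature vectors of shape (N, D), with greedy
    farthest-point initialisation. Sketches how penultimate-layer
    embeddings could be grouped into the 16 document categories."""
    centers = [features[0]]
    for _ in range(k - 1):  # farthest-point init: pick the point furthest from all centres
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign every vector to its nearest centre
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centres (keep the old centre if a cluster emptied)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels

# toy stand-in for real embeddings: two well-separated blobs
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                   rng.normal(5.0, 0.1, (50, 8))])
labels = kmeans(feats, k=2)
```

Well-separated clusters in this space are what the plots above visualise: the higher the Mean-F1, the more cleanly the learnt features group images of the same category.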
- Refer to the `IndoML22` folder; it contains a `README.txt` file with all the information about how to train the ViT model using `train.ipynb` and how to run inference with the trained model using `test.ipynb`.
- Colab Notebooks: `train.ipynb` and `test.ipynb`. Going through `README.txt` as mentioned above will help in better understanding the directory structure.
- Link to the Pretrained Model: to be updated.