This project focused on applying hybrid end-to-end Transformer-based models to recognize text in Spanish printed sources from the seventeenth century.
The model combines a ResNet-101 backbone for visual feature extraction with a Transformer for sequence modeling. ResNet-101 first extracts visual features, which are passed through a 1x1 convolutional layer to adapt their dimensionality. The Transformer then processes these features together with positional encodings, capturing both spatial and sequential information, and predicts token probabilities through linear layers for the Optical Character Recognition task.
Deployment -> Hugging Face Spaces
Bentham Dataset -> Download Link
To preprocess the data, run the main() function in data_preprocess/bentham_transform.py
or download the preprocessed dataset here: Bentham Preprocessed Data
Epochs: 200
Pretrained weights can be downloaded here -> Pretrain Weights
Test Result:
Evaluation Metrics Used
CER (Character Error Rate), WER (Word Error Rate), SER (Sequence Error Rate)
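The three metrics can be sketched as follows: CER and WER are edit distances normalized by reference length (over characters and whitespace-split words respectively), and SER is the fraction of lines with any error. The exact tokenization used by the project may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(refs, hyps):
    """Character errors divided by total reference characters."""
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / sum(len(r) for r in refs)

def wer(refs, hyps):
    """Word errors divided by total reference words."""
    return (sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
            / sum(len(r.split()) for r in refs))

def ser(refs, hyps):
    """Fraction of sequences (lines) that are not recognized exactly."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)
```

For example, if one of two lines has a single substituted character, SER is 0.5 while CER stays near the per-character error rate.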
The Transformer model, pre-trained on the Bentham dataset, was then fine-tuned on the target dataset (Spanish Literature). Pytesseract was used to segment entire pages into individual lines, which were preprocessed before training the model.
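The line segmentation step can be sketched as grouping pytesseract word boxes into line bounding boxes. This is a minimal illustration, not the project's actual preprocessing code; in the real pipeline the `data` dict would come from `pytesseract.image_to_data(page_image, output_type=pytesseract.Output.DICT)`, whose keys the grouping below assumes.

```python
def group_into_lines(data):
    """Merge word boxes sharing (block_num, par_num, line_num) into line boxes (l, t, r, b)."""
    lines = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():  # skip the empty tokens tesseract emits for structural levels
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        l, t = data["left"][i], data["top"][i]
        r, b = l + data["width"][i], t + data["height"][i]
        if key in lines:
            x0, y0, x1, y1 = lines[key]
            lines[key] = (min(x0, l), min(y0, t), max(x1, r), max(y1, b))
        else:
            lines[key] = (l, t, r, b)
    return [lines[k] for k in sorted(lines)]  # boxes in reading order
```

Each returned box can then be used to crop one text line from the page image before preprocessing.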
Epochs: 150
Fine Tuned weights can be downloaded here -> Fine Tune Weights
Loss Graph:
Test Results over all images:
Test PDF Page 15 P2 - Visual Result:
PDF Page 16 P2