This project focuses on improving OCR for handwritten texts, extending TrOCR to handle full-page structured text (paragraphs and essays). It includes scripts and methods used throughout the development process, from initial experimentation to the final optimized methodology.
- Run the Application with UI
Launch the application with a graphical interface using the command: `python ui.py`
- Generate Synthetic Data
Generate labeled synthetic data using the provided notebook: open and run `synthetic_data_generation_final.ipynb` in your Jupyter Notebook or preferred environment.
The project used a systematic approach to train a model for detecting bounding boxes of high-quality text patches:
- Dataset Creation:
  - Divided images into overlapping patches of fixed height (~2× font size) and full width.
  - Applied TrOCR to detect lines in patches.
  - Filtered good patches based on confidence scores and removed duplicates using BLEU scores.
- Model Training:
  - Trained a YOLO model (initialized with YOLOv11 weights) on approximately 1,100 labeled examples.
  - Trained for 100 epochs, achieving a final validation loss of <1.
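The dataset-creation steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names (`patch_boxes`, `simple_bleu`, `filter_and_dedup`), the stride, and the `min_conf` and `max_bleu` thresholds are all assumptions, and `simple_bleu` is a simplified word-level BLEU with add-one smoothing, not a reference implementation. The TrOCR call itself is left out; the sketch assumes each patch has already produced a `(text, confidence)` pair.

```python
import math
from collections import Counter


def patch_boxes(img_w, img_h, patch_h, stride):
    """Return (left, top, right, bottom) boxes for full-width, overlapping strips."""
    if stride <= 0 or stride > patch_h:
        raise ValueError("stride must be in (0, patch_h] so strips overlap or touch")
    last_top = max(img_h - patch_h, 0)
    tops = list(range(0, last_top + 1, stride))
    # Add a final strip if the regular stride does not reach the page bottom.
    if tops[-1] != last_top:
        tops.append(last_top)
    return [(0, top, img_w, min(top + patch_h, img_h)) for top in tops]


def ngrams(tokens, n):
    """Count the word n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU (assumption: word n-grams, add-one smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        if n == 1 and overlap == 0:
            return 0.0  # no shared words at all
        total = sum(cand_ngrams.values())
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_prec)


def filter_and_dedup(predictions, min_conf=0.9, max_bleu=0.8):
    """Keep confident (text, confidence) pairs, dropping near-repeats of kept text."""
    kept = []
    for text, conf in sorted(predictions, key=lambda p: p[1], reverse=True):
        if conf < min_conf:
            continue  # low-confidence patch: likely cuts through a line
        if any(simple_bleu(text, kept_text) >= max_bleu for kept_text, _ in kept):
            continue  # overlapping patches often read the same line twice
        kept.append((text, conf))
    return kept


# Hypothetical example: a 100 px tall page, 40 px strips, 20 px stride.
boxes = patch_boxes(800, 100, 40, 20)
# Hypothetical recognizer outputs for some patches.
predictions = [
    ("the quick brown fox", 0.95),
    ("the quick brown fox", 0.97),
    ("jumps over the lazy dog", 0.92),
    ("the quick brwn", 0.40),
]
good = filter_and_dedup(predictions)
```

Sorting by confidence before deduplication means that when two overlapping patches read the same line, the higher-confidence reading is the one that survives.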
- Enhanced OCR for Handwritten Text: Focuses on structured layouts while avoiding complex graphs or unstructured data.
- Synthetic Data Generation: Automated label generation using a brute-force patching approach and TrOCR.
- Efficient Text Detection: Optimized bounding box detection using a YOLO model to streamline the pipeline.
- Full-Page OCR Pipeline: Handles full-page text detection and recognition for structured text.
- Out-of-Vocabulary Characters: Some characters, like the Greek letter sigma, are mapped to visually or semantically similar known characters due to model limitations.
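To make the full-page flow concrete, here is a rough sketch of how detection and recognition might be chained. It is only an illustration under stated assumptions: `ocr_page`, `reading_order`, and the `detect` and `recognize` callables are hypothetical names standing in for the YOLO detector and the TrOCR recognizer, and the simple top-to-bottom sort assumes single-column structured text.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def reading_order(boxes: Sequence[Box]) -> List[Box]:
    """Sort line boxes top-to-bottom, then left-to-right (single-column assumption)."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def ocr_page(
    page: object,
    detect: Callable[[object], Sequence[Box]],
    recognize: Callable[[object, Box], str],
) -> str:
    """Detect line boxes, recognize each one, and join non-empty lines in order."""
    lines = [recognize(page, box) for box in reading_order(detect(page))]
    return "\n".join(line.strip() for line in lines if line.strip())


# Stub detector and recognizer (hypothetical) so the sketch runs without any model.
fake_boxes = [(0, 120, 800, 160), (0, 40, 800, 80)]
fake_text = {
    (0, 40, 800, 80): "Dear reader,",
    (0, 120, 800, 160): "this is line two.",
}
result = ocr_page(None, lambda page: fake_boxes, lambda page, box: fake_text[box])
```

Passing the detector and recognizer in as callables keeps the ordering logic testable on its own; in practice they would wrap the trained YOLO model and TrOCR.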