This Python application processes PDF files by extracting page text and embedded images, applying OCR (Optical Character Recognition) to the images, and storing the results in a SQLite database. It provides a graphical user interface (GUI) built with Tkinter to search and preview the PDF content, including OCR-extracted text.
- Extracts and stores text from PDF pages.
- Extracts images from PDF pages and applies OCR using Tesseract or EasyOCR.
- Stores image metadata and OCR-extracted text into the SQLite database.
- Provides a user-friendly GUI for searching through the stored data, including PDF text and OCR text.
- Allows for both single-file and folder-based (batch) PDF processing.
- Enables preview of PDFs with zoom and navigation features.
Before running the application, you need to install the following dependencies. You have two options depending on whether you're using Tesseract or EasyOCR for OCR functionality.
- Install the required Python packages using pip, based on your chosen OCR engine.
  - For Tesseract users:
        pip install -r requirements.txt
    This will install the following packages:
    - PyMuPDF
    - Pillow
    - pytesseract
    - threaded
  - For EasyOCR users:
        pip install -r requirements.txt
    This will install the following packages:
    - PyMuPDF
    - Pillow
    - easyocr
    - numpy
    - threaded
- Tesseract OCR Installation (if using Tesseract):
  - On Ubuntu:
        sudo apt install tesseract-ocr
  - On macOS (using Homebrew):
        brew install tesseract
  - On Windows: Download and install from Tesseract OCR for Windows.
- Ensure that tesseract is in your system's PATH if using Tesseract.
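The PATH requirement above can be checked from Python before launching the GUI. The following is a minimal sketch using the standard library's shutil.which; the helper name find_tesseract is hypothetical and is not part of the application.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH.

    Hypothetical helper for illustration; the application may check differently.
    """
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or use the EasyOCR script")
else:
    print(f"tesseract found at {path}")
```

If the check reports that the binary is missing on Windows, adding the Tesseract installation folder to PATH (or switching to pdf_processor_easyocr.py) are the two options the setup notes describe.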
For users who cannot install Tesseract on Windows or prefer a simpler setup, we provide the alternative pdf_processor_easyocr.py script, which uses EasyOCR for OCR extraction instead of Tesseract.
- To run the version of the app using EasyOCR, simply execute:
      python pdf_processor_easyocr.py
  This provides the same functionality without requiring the user to install Tesseract.
The application stores PDF data in a SQLite database called pdf_data.db. The following tables are used to store the extracted data:
- pdf_files: Stores metadata for each processed PDF file.
      CREATE TABLE pdf_files (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          file_name TEXT,
          file_path TEXT
      );
- pages: Stores text extracted from each PDF page.
      CREATE TABLE pages (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          pdf_id INTEGER,
          page_number INTEGER,
          text TEXT,
          FOREIGN KEY(pdf_id) REFERENCES pdf_files(id)
      );
- images: Stores metadata about extracted images from the PDF.
      CREATE TABLE images (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          pdf_id INTEGER,
          page_number INTEGER,
          image_name TEXT,
          image_ext TEXT,
          FOREIGN KEY(pdf_id) REFERENCES pdf_files(id)
      );
- ocr_text: Stores the text extracted via OCR from images.
      CREATE TABLE ocr_text (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          pdf_id INTEGER,
          page_number INTEGER,
          ocr_text TEXT,
          FOREIGN KEY(pdf_id) REFERENCES pdf_files(id)
      );
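As an illustration, the four tables above could be created from Python with the standard sqlite3 module. This is a sketch, not the application's actual setup code: the init_db name and the IF NOT EXISTS clauses are assumptions added so the function can be called repeatedly.

```python
import sqlite3

# Same columns as the schema described above; IF NOT EXISTS is an added assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pdf_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_name TEXT,
    file_path TEXT
);
CREATE TABLE IF NOT EXISTS pages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pdf_id INTEGER,
    page_number INTEGER,
    text TEXT,
    FOREIGN KEY(pdf_id) REFERENCES pdf_files(id)
);
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pdf_id INTEGER,
    page_number INTEGER,
    image_name TEXT,
    image_ext TEXT,
    FOREIGN KEY(pdf_id) REFERENCES pdf_files(id)
);
CREATE TABLE IF NOT EXISTS ocr_text (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pdf_id INTEGER,
    page_number INTEGER,
    ocr_text TEXT,
    FOREIGN KEY(pdf_id) REFERENCES pdf_files(id)
);
"""


def init_db(db_path="pdf_data.db"):
    """Open (or create) the database and make sure all four tables exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn
```

Calling init_db(":memory:") gives a throwaway database for experiments. Note that SQLite only enforces the FOREIGN KEY clauses when PRAGMA foreign_keys = ON has been set on the connection.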
- Running the Application:
  - To start the Tesseract-based version of the application, run the main() function in the pdf_search_gui.py script:
        python pdf_search_gui.py
  - To start the EasyOCR-based version of the application, run the main() function in pdf_processor_easyocr.py:
        python pdf_processor_easyocr.py
- Processing PDF Files:
  The application provides two options:
  - Single File: Select a single PDF file to process.
  - Batch Processing: Select a folder containing multiple PDFs for processing.
  After processing, the page text, image metadata, and OCR text are stored in the SQLite database.
- Searching for Text:
  In the GUI, enter a search term and press "Search". The application searches both the PDF page text and the OCR-extracted text from images. The results are displayed in a table showing the PDF file name, page number, and matching context.
- Previewing PDF Pages:
  From the search results, you can select a PDF and page to preview. The selected PDF page is displayed in the right-hand pane of the GUI, with zoom and navigation controls available.
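The single-file and batch options under Processing PDF Files can both be reduced to a list of PDF paths that is then processed one file at a time. The sketch below shows one way to build that list with pathlib; collect_pdfs is a hypothetical helper, and the choice to scan only the top level of the folder (no subfolders) is an assumption rather than documented behavior.

```python
from pathlib import Path


def collect_pdfs(target):
    """Return a sorted list of PDF paths for a single file or a folder.

    Hypothetical helper: a file yields a one-element list, a folder yields
    every *.pdf directly inside it, and anything else yields an empty list.
    """
    path = Path(target)
    if path.is_file():
        return [path] if path.suffix.lower() == ".pdf" else []
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    return []
```

Returning a list in both cases lets the same processing loop serve the single-file and batch buttons.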
- PDF Processing:
  - The Tesseract-based version (pdf_search_gui.py) uses pytesseract to perform OCR on images extracted from PDFs.
  - The EasyOCR-based version (pdf_processor_easyocr.py) uses EasyOCR to perform OCR on images extracted from PDFs. Images extracted from the PDF are converted to NumPy arrays before being passed to EasyOCR for processing.
- Database Interaction: The script inserts extracted PDF text, image metadata, and OCR results into the SQLite database. It provides search functionality for both PDF text and OCR-extracted text.
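The database interaction described above could look roughly like the following sketch, which writes one PDF's results in a single transaction. The store_pdf function, its parameters, and the 1-based page numbering are assumptions for illustration; in the application the page text would come from PyMuPDF and the OCR text from pytesseract or EasyOCR.

```python
import sqlite3


def store_pdf(conn, file_name, file_path, page_texts, images, ocr_results):
    """Insert one PDF's extracted data and return its pdf_files id.

    page_texts:  list of strings, one per page (numbered from 1 here, an assumption).
    images:      list of (page_number, image_name, image_ext) tuples.
    ocr_results: list of (page_number, text) tuples produced by the OCR engine.
    """
    with conn:  # commits on success, rolls back if any insert raises
        cur = conn.execute(
            "INSERT INTO pdf_files (file_name, file_path) VALUES (?, ?)",
            (file_name, file_path),
        )
        pdf_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO pages (pdf_id, page_number, text) VALUES (?, ?, ?)",
            [(pdf_id, n, text) for n, text in enumerate(page_texts, start=1)],
        )
        conn.executemany(
            "INSERT INTO images (pdf_id, page_number, image_name, image_ext) "
            "VALUES (?, ?, ?, ?)",
            [(pdf_id, n, name, ext) for n, name, ext in images],
        )
        conn.executemany(
            "INSERT INTO ocr_text (pdf_id, page_number, ocr_text) VALUES (?, ?, ?)",
            [(pdf_id, n, text) for n, text in ocr_results],
        )
    return pdf_id
```

Grouping the inserts in one transaction means a PDF that fails halfway through does not leave partial rows behind.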
- Text Extraction: For each page in the PDF, text is extracted using PyMuPDF and inserted into the pages table in the database.
- Image Extraction and OCR:
  - Tesseract version: For each image found in the PDF, metadata is saved in the images table. The image is passed to Tesseract to extract text via OCR, and the result is stored in the ocr_text table.
  - EasyOCR version (pdf_processor_easyocr.py): Each image is extracted from the PDF, converted into a NumPy array, and passed to EasyOCR for text extraction. The extracted text is stored in the ocr_text table.
  Note: EasyOCR accepts image input only in specific formats, including file paths, URLs, bytes, or NumPy arrays. The application automatically handles the conversion to a NumPy array before passing the image to EasyOCR.
- Search: The user can search both the pages table (PDF text) and the ocr_text table (OCR text from images). The results are combined and displayed in the GUI.
- Preview: The selected PDF file is opened and rendered in the GUI's canvas area, allowing the user to view the selected page.
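The Search step above combines matches from the pages and ocr_text tables. One way to express that is a single UNION ALL query, sketched below; the search function, the source column, and the use of LIKE are assumptions, and the application may instead run two queries and merge the results in Python.

```python
import sqlite3


def search(conn, term):
    """Return (file_name, page_number, source, text) rows containing term.

    source is 'page' for PDF text and 'ocr' for OCR text. LIKE is
    case-insensitive for ASCII letters in SQLite by default, and any % or _
    in term acts as a wildcard in this simple sketch.
    """
    pattern = f"%{term}%"
    sql = """
        SELECT f.file_name, p.page_number, 'page' AS source, p.text
        FROM pages AS p JOIN pdf_files AS f ON f.id = p.pdf_id
        WHERE p.text LIKE ?
        UNION ALL
        SELECT f.file_name, o.page_number, 'ocr' AS source, o.ocr_text
        FROM ocr_text AS o JOIN pdf_files AS f ON f.id = o.pdf_id
        WHERE o.ocr_text LIKE ?
        ORDER BY 1, 2, 3
    """
    return conn.execute(sql, (pattern, pattern)).fetchall()
```

Each returned row carries the file name and page number, which is the information the results table needs to open the matching page in the preview pane.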
- Single PDF File:
  - Open the application.
  - Choose a PDF file to process.
  - Search for text or OCR data using the search bar.
  - View the search results and select a page to preview.
- Batch Processing:
  - Select a folder containing multiple PDF files.
  - The application will process all PDFs and extract text, images, and OCR data.
  - Perform searches across all processed files.
- All errors during processing are logged to app.log, and the user is notified of issues via GUI pop-up messages.
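The error-handling behavior above can be approximated with the standard logging module. This is a sketch under assumptions: make_logger, process_all, the logger name, and the notify callback are hypothetical. In a Tkinter application, notify could wrap tkinter.messagebox.showerror.

```python
import logging


def make_logger(log_path="app.log"):
    """Return a logger that appends to log_path (configured only on first call)."""
    logger = logging.getLogger("pdf_processor")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def process_all(paths, process_one, logger, notify):
    """Run process_one on each path; log failures and report them via notify.

    Returns the list of paths that failed, so the rest of the batch continues.
    """
    failed = []
    for path in paths:
        try:
            process_one(path)
        except Exception:
            # logger.exception records the message plus the full traceback
            logger.exception("Failed to process %s", path)
            notify(f"Could not process {path}; see the log file for details")
            failed.append(path)
    return failed
```

Catching the exception per file keeps one corrupt PDF from aborting an entire batch run.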
This project is open-source and available under the MIT License.
- Add support for exporting search results.
- Improve image OCR accuracy with advanced preprocessing.
- Add annotations for highlighted text in preview mode.