Merge pull request #111 from Mayank202004/main
Added Extract Aadhar Card Details under Computer Vision
Showing 17 changed files with 1,179 additions and 0 deletions.
Binary file added
BIN
+2.91 KB
Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/__pycache__/app.cpython-312.pyc
Binary file not shown.
59 changes: 59 additions & 0 deletions
Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/app.py
@@ -0,0 +1,59 @@
from flask import Flask, request, jsonify
import easyocr
import re

app = Flask(__name__)
reader = easyocr.Reader(['en', 'hi'])  # Load EasyOCR with English and Hindi support


def extract_info(ocr_result):
    """Pull Aadhaar fields out of the OCR paragraphs with regular expressions."""
    first_name = middle_name = last_name = None
    gender = dob = year_of_birth = aadhaar_number = None

    for item in ocr_result:
        text = item[1]  # item[1] holds the recognized text of this paragraph

        # Check for gender and extract names. The first three Latin-script
        # words of the paragraph that mentions the gender are taken as the name.
        if re.search(r'Male|Female|पुरुष|महिला', text):
            name_match = re.findall(r'[A-Za-z]+', text)
            if len(name_match) >= 3:
                first_name, middle_name, last_name = name_match[:3]
            gender = 'Male' if 'Male' in text or 'पुरुष' in text else 'Female'

        # Extract DOB or Year of Birth
        dob_match = re.search(r'\b(\d{2}/\d{2}/\d{4})\b', text)
        if dob_match:
            dob = dob_match.group(1)
        elif 'Year of Birth' in text or 'जन्म वर्ष' in text:
            # Accept either the English or the Hindi label in front of the year
            yob_match = re.search(r'(?:Year of Birth|जन्म वर्ष)\s*:?\s*(\d+)', text)
            if yob_match:
                year_of_birth = yob_match.group(1)

        # Extract Aadhaar number
        aadhaar_match = re.search(r'\b\d{4}\s\d{4}\s\d{4}\b', text)
        if aadhaar_match:
            aadhaar_number = aadhaar_match.group(0)

    return {
        "First Name": first_name,
        "Middle Name": middle_name,
        "Last Name": last_name,
        "Gender": gender,
        "DOB": dob,
        "Year of Birth": year_of_birth,
        "Aadhaar Number": aadhaar_number
    }


@app.route('/extract', methods=['POST'])
def extract_data():
    # silent=True yields None instead of an error for a missing or malformed JSON body
    data = request.get_json(silent=True) or {}
    image_path = data.get('image_path')

    if not image_path:
        return jsonify({"error": "Image path is required"}), 400

    # Process the image with EasyOCR
    ocr_result = reader.readtext(image_path, paragraph=True)
    extracted_info = extract_info(ocr_result)

    return jsonify(extracted_info)


if __name__ == '__main__':
    app.run(debug=True)
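The field parsing in `app.py` can be tried without Flask or EasyOCR by feeding plain strings to the same kind of patterns. The following is a minimal sketch under stated assumptions, not the commit's code: `parse_fields` is a hypothetical stand-in for `extract_info`, the sample paragraph is made up from the values shown in this commit's results file, and the lookbehind on the Aadhaar pattern is an added safeguard so that the year of a date directly in front of the number is not read as its first group.

```python
import re
from typing import Dict, List, Optional

GENDER_RE = re.compile(r"Male|Female|पुरुष|महिला")
DOB_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
# Accept the English or the Hindi label, with an optional colon before the year.
YOB_RE = re.compile(r"(?:Year of Birth|जन्म वर्ष)\s*:?\s*(\d{4})")
# Assumption: reject a 4-digit group that is the tail of a date (preceded by "/")
# or part of a longer digit run.
AADHAAR_RE = re.compile(r"(?<![\d/])\d{4}\s\d{4}\s\d{4}(?!\d)")


def parse_fields(paragraphs: List[str]) -> Dict[str, Optional[str]]:
    """Scan OCR paragraphs and return the same keys that app.py returns."""
    fields: Dict[str, Optional[str]] = {
        "First Name": None, "Middle Name": None, "Last Name": None,
        "Gender": None, "DOB": None, "Year of Birth": None,
        "Aadhaar Number": None,
    }
    for text in paragraphs:
        if GENDER_RE.search(text):
            # First three Latin-script words of the gender paragraph are the name.
            words = re.findall(r"[A-Za-z]+", text)
            if len(words) >= 3:
                (fields["First Name"], fields["Middle Name"],
                 fields["Last Name"]) = words[:3]
            is_male = "Male" in text or "पुरुष" in text
            fields["Gender"] = "Male" if is_male else "Female"

        dob = DOB_RE.search(text)
        yob = YOB_RE.search(text)
        if dob:
            fields["DOB"] = dob.group(1)
        elif yob:
            fields["Year of Birth"] = yob.group(1)

        number = AADHAAR_RE.search(text)
        if number:
            fields["Aadhaar Number"] = number.group(0)
    return fields


# Made-up paragraph: the date sits directly in front of the Aadhaar number.
sample = ["Rahul Ramesh Gaikwad Male DOB: 23/08/1995 2058 6470 5393"]
result = parse_fields(sample)
print(result)
```

On this sample the pattern `\b\d{4}\s\d{4}\s\d{4}\b` would match `1995 2058 6470`, which is why the sketch adds the lookbehind. Name extraction still depends on the name being the first Latin-script text in the paragraph that mentions the gender.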
53 changes: 53 additions & 0 deletions
Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/readme.md
@@ -0,0 +1,53 @@
# Flask OCR and Aadhaar Information Extraction API

This project is a Flask-based API that extracts key fields (name, gender, date of birth, and Aadhaar number) from images of Aadhaar cards using the `EasyOCR` library. The API reads both **English** and **Hindi** text.

---

## Features

- **Optical Character Recognition (OCR)**: Uses `EasyOCR` to read text from Aadhaar card images.
- **Multi-language Support**: Reads both **English** and **Hindi** text.
- **Regex Matching**: Uses regular expressions to identify key pieces of information such as:
  - First, Middle, and Last Names
  - Gender
  - Date of Birth (DOB)
  - Year of Birth (YOB)
  - Aadhaar Number
- **REST API**: Provides a `/extract` POST endpoint to process Aadhaar card images.

---

## How It Works

1. **The user sends a POST request** to the `/extract` endpoint with a JSON body whose `image_path` field points to an Aadhaar card image that the server can read.
2. The **EasyOCR** library reads the image and extracts the text in **English** and **Hindi**.
3. The text is processed to extract important details such as:
   - First Name, Middle Name, Last Name
   - Gender
   - Date of Birth or Year of Birth
   - Aadhaar Number
4. The extracted information is returned in JSON format. If `image_path` is missing, the endpoint responds with status 400 and an error message.

---

## Install Required Libraries
Install all required Python libraries using the requirements.txt file:

`pip install -r requirements.txt`

This will install:

- Flask (for creating the API server)
- EasyOCR (for extracting text from images)
- regex (for pattern matching in the text; `app.py` itself uses the standard-library `re` module)

## How to Run the API
Once the libraries are installed, start the Flask development server:

`python app.py`

The API will then be available at http://127.0.0.1:5000/.
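With the server running, the `/extract` endpoint can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. `build_extract_request` and `extract` are hypothetical helper names, and the URL assumes the default development-server address given in this README.

```python
import json
import urllib.request

# Assumption: the server runs locally on Flask's default port.
API_URL = "http://127.0.0.1:5000/extract"


def build_extract_request(image_path: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request that /extract expects: a JSON body with image_path."""
    body = json.dumps({"image_path": image_path}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(image_path: str, url: str = API_URL) -> dict:
    """Send the request and decode the JSON response into a dict."""
    with urllib.request.urlopen(build_extract_request(image_path, url)) as response:
        return json.load(response)
```

Calling `extract(...)` with a path that exists on the machine running the server returns the extracted fields as a dictionary. Because the API receives a path rather than the image bytes, client and server need access to the same file system. A request without `image_path` produces a 400 response, which `urllib` raises as `urllib.error.HTTPError`.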
@@ -0,0 +1,73 @@
# Aadhaar Information Extraction Project

This project extracts key fields from Aadhaar card images using Optical Character Recognition (OCR). Two approaches have been implemented:

1. **Tesseract OCR with Pre-processing**: The image is converted to greyscale and passed to Tesseract OCR. The extracted text is then searched with regular expressions for the relevant fields.
2. **EasyOCR**: This approach uses the `EasyOCR` library, which supports multi-language OCR. Both **Hindi** and **English** are enabled when reading the card. The extracted text is then processed with regular expressions in the same way.

---

## Project Theory and Overview

The purpose of the project is to automate the extraction of critical Aadhaar information such as:

- **First Name**
- **Middle Name**
- **Last Name**
- **Gender**
- **Date of Birth (DOB)**
- **Aadhaar Number**

### Approach 1: Tesseract OCR with Pre-processing

Tesseract OCR is a popular open-source OCR engine, but it has limitations when working with complex documents like Aadhaar cards. The approach consists of the following steps:

- **Image Pre-processing**: The image is converted to greyscale to enhance text visibility.
- **Text Extraction**: Tesseract extracts the text from the greyscale image.
- **Post-processing**: Regular expressions (`re`) are used to search for relevant patterns in the text, such as names, gender, DOB, and Aadhaar numbers.

#### Limitations of Tesseract Approach:
- **Low Accuracy**: Due to the complex fonts and mixed-language content on Aadhaar cards, Tesseract often struggles with accurate extraction, especially for Hindi text.
- **Reliance on Pre-processing**: Image quality and pre-processing techniques significantly affect Tesseract's output.

### Approach 2: EasyOCR with Multi-language Support (English and Hindi)

To overcome the limitations of Tesseract, the second approach uses **EasyOCR**, a deep-learning-based OCR library that supports multiple languages, including both **English** and **Hindi**. This enables better extraction from Aadhaar cards, which typically contain text in both languages.

- **Text Extraction**: EasyOCR reads the Aadhaar card image and extracts text from both Hindi and English regions.
- **Regular Expressions for Information Extraction**: Once the text is extracted, regular expressions are used to identify and extract specific pieces of information, including:
  - First, Middle, and Last Names
  - Gender (Male/Female in both Hindi and English)
  - Date of Birth (DOB)
  - Aadhaar Number in `XXXX XXXX XXXX` format

#### Advantages of EasyOCR Approach:
- **Higher Accuracy**: The combination of Hindi and English support allows for better recognition of names and other details from Aadhaar cards.
- **No Need for Extensive Pre-processing**: EasyOCR works well on the original image, without intensive pre-processing steps.

---

## How It Works

1. **Input**: The user provides an image of an Aadhaar card.
2. **Processing**:
   - The image is processed by either Tesseract OCR (with greyscale conversion) or EasyOCR.
   - The extracted text is then scanned using regular expressions to find important details.
3. **Output**: The extracted information is presented in a structured format, including:
   - Name (First, Middle, Last)
   - Gender
   - Date of Birth (DOB)
   - Aadhaar Number

---

## Installation

### 1. Clone the Repository:
`git clone https://github.com/UppuluriKalyani/ML-Nexus`

`cd <project-directory>`

### 2. Install Requirements
`pip install -r requirements.txt`
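The two approaches described in this README differ only in the OCR step; the regular-expression step is shared. The following is a rough sketch of that structure, assuming Pillow, `pytesseract`, and `easyocr` are installed when the corresponding backend is used. The function names, the `lang="eng"` setting, and the reduced field set (DOB and Aadhaar number only) are assumptions made for illustration, not the project's code.

```python
import re
from typing import Callable, Dict, List, Optional

# An OCR backend takes an image path and returns the recognized text blocks.
OcrBackend = Callable[[str], List[str]]

DOB_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
# Literal spaces keep the match on one line; the lookbehind skips the year of a date.
AADHAAR_RE = re.compile(r"(?<![\d/])\d{4} \d{4} \d{4}(?!\d)")


def tesseract_backend(image_path: str) -> List[str]:
    """Approach 1: greyscale conversion followed by Tesseract."""
    from PIL import Image  # lazy imports: only needed when this backend runs
    import pytesseract

    grey = Image.open(image_path).convert("L")  # "L" is Pillow's 8-bit greyscale mode
    # lang="eng" is an assumption; the README does not state the language setting.
    return [pytesseract.image_to_string(grey, lang="eng")]


def easyocr_backend(image_path: str) -> List[str]:
    """Approach 2: EasyOCR with English and Hindi enabled."""
    import easyocr

    reader = easyocr.Reader(["en", "hi"])
    # detail=0 returns plain strings instead of [box, text] entries.
    return reader.readtext(image_path, detail=0, paragraph=True)


def extract(image_path: str, backend: OcrBackend) -> Dict[str, Optional[str]]:
    """Run the chosen OCR backend, then apply the shared regex step."""
    fields: Dict[str, Optional[str]] = {"DOB": None, "Aadhaar Number": None}
    for block in backend(image_path):
        dob = DOB_RE.search(block)
        number = AADHAAR_RE.search(block)
        if dob:
            fields["DOB"] = dob.group(0)
        if number:
            fields["Aadhaar Number"] = number.group(0)
    return fields
```

Passing the backend as a function keeps the comparison fair, since both approaches go through the same parsing code: `extract(path, tesseract_backend)` versus `extract(path, easyocr_backend)`. It also lets the regex step be tested with a stub that returns fixed strings.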
@@ -0,0 +1,85 @@
# Result Comparison: Aadhaar Information Extraction

This document showcases the results of extracting Aadhaar card information using two different approaches:

1. **Tesseract OCR with Image Pre-processing**
2. **EasyOCR with Multi-language Support (English and Hindi)**

Screenshots are provided for the extracted information, along with an API result screenshot.

---

## 1. Tesseract OCR Approach

In this approach, the Aadhaar card image is first converted to greyscale and then passed through the Tesseract OCR engine. Regular expressions (`re`) are used to extract key information such as names, gender, date of birth, and Aadhaar number from the extracted text.

### Screenshot for Tesseract OCR Result:
![Tesseract OCR Result](assets/images/tesseract.png)

#### Challenges:
- **Accuracy**: Tesseract struggles with mixed-language documents (English + Hindi).
- **Pre-processing Required**: The image needs to be pre-processed (converted to greyscale) to improve text extraction.
- **Hindi Text**: Tesseract handles Hindi text less reliably, which reduces its accuracy for Aadhaar cards that include Hindi.

---

## 2. EasyOCR Approach

The EasyOCR approach uses multi-language support for both Hindi and English, making it a better fit for Aadhaar card text recognition. The extracted text is processed using regular expressions to find relevant details.

### Output for EasyOCR:
- **First Name**: `Rahul`
- **Middle Name**: `Ramesh`
- **Last Name**: `Gaikwad`
- **Gender**: `Male`
- **DOB**: `23/08/1995`
- **Aadhaar Number**: `2058 6470 5393`

### Screenshot for EasyOCR Result:
![EasyOCR Result](assets/images/easyocr.png)

### After Extraction
![EasyOCR Result](assets/images/Output.png)

#### Advantages:
- **Higher Accuracy**: EasyOCR performs significantly better with mixed-language documents, making it well suited to Aadhaar cards.
- **Multi-language Support**: Supports both English and Hindi, improving text extraction accuracy.
- **No Heavy Pre-processing**: Works well without needing extensive image manipulation.

---

## Comparison of Results

| Feature                | Tesseract OCR                          | EasyOCR                           |
|------------------------|----------------------------------------|-----------------------------------|
| **Languages**          | English only                           | English and Hindi support         |
| **Accuracy**           | Low to Medium                          | High                              |
| **Pre-processing**     | Requires greyscale conversion          | Minimal pre-processing needed     |
| **Performance**        | Faster but less accurate               | Slightly slower but more accurate |
| **Aadhaar Extraction** | Struggles with Hindi and complex fonts | Handles both languages well       |

---

## API Result Screenshot

Here is the expected result returned from the API after extracting information from the Aadhaar card image:

![EasyOCR Result](assets/images/api_response.png)

### Input Body JSON:
{
  "image_path": "C:\\Users\\mayan\\Downloads\\fs.jpeg"
}
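The doubled backslashes in the request body above are JSON escaping: a single `\` inside a JSON string must be written as `\\`. A small sketch of how the same body can be produced from an ordinary Windows path, so the escaping does not have to be typed by hand (the path is the one from this example):

```python
import json

# Raw string: the path as it appears on disk, with single backslashes.
windows_path = r"C:\Users\mayan\Downloads\fs.jpeg"

# json.dumps applies the escaping that the request body needs.
body = json.dumps({"image_path": windows_path})
print(body)

# Decoding the body gives back the original path.
decoded_path = json.loads(body)["image_path"]
```

The server receives the decoded value, that is, the path with single backslashes.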
### API Response:
```json
{
  "First Name": "Rahul",
  "Middle Name": "Ramesh",
  "Last Name": "Gaikwad",
  "Gender": "Male",
  "DOB": "23/08/1995",
  "Aadhaar Number": "2058 6470 5393"
}
```
Binary file added
BIN
+53.1 KB
Computer Vision/Extracting Aadhar Details/assets/images/api_response.png