Merge pull request #111 from Mayank202004/main

Added Extract Aadhar Card Details under Computer Vision
UppuluriKalyani · Oct 6, 2024 · a5d28d6
2 parents 4fdc0d5 + ae62fd0
commit a5d28d6
Showing 17 changed files with 1,179 additions and 0 deletions.
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/0.8 b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/0.8
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/7.0 b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/7.0
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/__pycache__/app.cpython-312.pyc b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/__pycache__/app.cpython-312.pyc
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/app.py b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/app.py
@@ -0,0 +1,59 @@
+from flask import Flask, request, jsonify
+import easyocr
+import re
+
+app = Flask(__name__)
+reader = easyocr.Reader(['en', 'hi'])  # Load EasyOCR with English and Hindi support
+
+def extract_info(ocr_result):
+    first_name = middle_name = last_name = gender = dob = year_of_birth = aadhaar_number = None
+
+    for item in ocr_result:
+        text = item[1]
+
+        # Check for gender and extract names
+        if re.search(r'Male|Female|पुरुष|महिला', text):
+            name_match = re.findall(r'[A-Za-z]+', text)
+            if len(name_match) >= 3:
+                first_name, middle_name, last_name = name_match[:3]
+            gender = 'Male' if 'Male' in text or 'पुरुष' in text else 'Female'
+
+            # Extract DOB or Year of Birth
+            dob_match = re.search(r'\b(\d{2}/\d{2}/\d{4})\b', text)
+            if dob_match:
+                dob = dob_match.group(1)
+            elif 'Year of Birth' in text or 'जन्म वर्ष' in text:
+                yob_match = re.search(r'(?:Year of Birth|जन्म वर्ष)\s*:?\s*(\d{4})', text)
+                year_of_birth = yob_match.group(1) if yob_match else None
+
+        # Extract Aadhaar number
+        aadhaar_match = re.search(r'\b\d{4}\s\d{4}\s\d{4}\b', text)
+        if aadhaar_match:
+            aadhaar_number = aadhaar_match.group(0)
+
+    return {
+        "First Name": first_name,
+        "Middle Name": middle_name,
+        "Last Name": last_name,
+        "Gender": gender,
+        "DOB": dob,
+        "Year of Birth": year_of_birth,
+        "Aadhaar Number": aadhaar_number
+    }
+
+@app.route('/extract', methods=['POST'])
+def extract_data():
+    data = request.get_json(silent=True) or {}  # Tolerate a missing or non-JSON body
+    image_path = data.get('image_path')
+
+    if not image_path:
+        return jsonify({"error": "Image path is required"}), 400
+
+    # Process the image with EasyOCR
+    ocr_result = reader.readtext(image_path, paragraph=True)
+    extracted_info = extract_info(ocr_result)
+
+    return jsonify(extracted_info)
+
+if __name__ == '__main__':
+    app.run(debug=True)
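
Review note on `extract_info`: in `app.py` the DOB and Year of Birth patterns are only tried when the same OCR paragraph also contains a gender keyword, so a date that EasyOCR places in its own paragraph would be missed. The sketch below is one possible variant, not the PR's code, that checks each field independently. `extract_fields`, `SAMPLE_OCR`, and `GENDER_WORDS` are hypothetical names, the bounding boxes are made up, and the sample only mimics the `[bounding_box, text]` items that the PR reads through `item[1]`. The text values are the ones shown in RESULT.md.

```python
import re

# Hypothetical stand-in for reader.readtext(image_path, paragraph=True):
# each item is [bounding_box, text]; only the text (index 1) is used.
SAMPLE_OCR = [
    [[[0, 0], [200, 0], [200, 20], [0, 20]], "Rahul Ramesh Gaikwad"],
    [[[0, 30], [200, 30], [200, 50], [0, 50]], "DOB: 23/08/1995"],
    [[[0, 60], [200, 60], [200, 80], [0, 80]], "पुरुष / Male"],
    [[[0, 90], [200, 90], [200, 110], [0, 110]], "2058 6470 5393"],
]

# Same DOB and Aadhaar patterns as app.py.
DOB_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
AADHAAR_RE = re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b")
GENDER_WORDS = {"Male": "Male", "पुरुष": "Male", "Female": "Female", "महिला": "Female"}


def extract_fields(ocr_result):
    """Scan every OCR paragraph for each field, keeping the first hit per field."""
    fields = {"Gender": None, "DOB": None, "Aadhaar Number": None}
    for item in ocr_result:
        text = item[1]
        if fields["Gender"] is None:
            for word, label in GENDER_WORDS.items():
                if word in text:
                    fields["Gender"] = label
                    break
        if fields["DOB"] is None:
            dob_match = DOB_RE.search(text)
            if dob_match:
                fields["DOB"] = dob_match.group(1)
        if fields["Aadhaar Number"] is None:
            aadhaar_match = AADHAAR_RE.search(text)
            if aadhaar_match:
                fields["Aadhaar Number"] = aadhaar_match.group(0)
    return fields
```

Unlike `extract_info`, this variant keeps the first match for each field rather than the last, and it leaves name extraction out entirely.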
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/pip b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/pip
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/python b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/python
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/readme.md b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/readme.md
@@ -0,0 +1,53 @@
+# Flask OCR and Aadhaar Information Extraction API
+
+This project is a Flask-based API that extracts relevant information (like Name, Gender, Date of Birth, and Aadhaar number) from images of Aadhaar cards using the `EasyOCR` library. The API supports both **English** and **Hindi** text extraction.
+
+---
+
+## Features
+
+- **Optical Character Recognition (OCR)**: Uses `EasyOCR` to read text from Aadhaar card images.
+- **Multi-language Support**: Supports both **English** and **Hindi** text extraction.
+- **Regex Matching**: Uses regular expressions to identify key pieces of information such as:
+  - First, Middle, and Last Names
+  - Gender
+  - Date of Birth (DOB)
+  - Year of Birth (YOB)
+  - Aadhaar Number
+- **REST API**: Provides a `/extract` POST endpoint to process Aadhaar card images.
+
+---
+
+## How It Works
+
+1. **User sends the path of an Aadhaar card image** to the `/extract` endpoint as a JSON body (`image_path`) via a POST request.
+2. The **EasyOCR** library reads the image and extracts the text in **English** and **Hindi**.
+3. The text is processed to extract important details such as:
+   - First Name, Middle Name, Last Name
+   - Gender
+   - Date of Birth or Year of Birth
+   - Aadhaar Number
+4. The extracted information is returned in JSON format.
+
+---
+
+## Install Required Libraries
+Install all required Python libraries using the requirements.txt file:
+
+
+`pip install -r requirements.txt`
+
+This will install:
+
+ - Flask (For creating the API server)
+ - EasyOCR (For extracting text from images)
+ - regex (For pattern matching in the text)
+
+## How to Run the API
+Once you’ve installed all the necessary libraries, follow these steps to run the Flask application:
+
+Start the Flask development server:
+
+`python app.py`
+
+The Flask API will be running on http://127.0.0.1:5000/.
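
Review note: the readme stops at starting the server and does not show a request. A minimal client sketch follows, assuming the server is reachable at the default address given above and that the endpoint accepts the `image_path` JSON body that `app.py` reads. `build_extract_request` and `send_extract_request` are hypothetical helper names, and only the Python standard library is used.

```python
import json
import urllib.request


def build_extract_request(image_path, base_url="http://127.0.0.1:5000"):
    """Build a POST request for /extract carrying {"image_path": ...} as JSON."""
    payload = json.dumps({"image_path": image_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_extract_request(request):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send_extract_request(build_extract_request(path))` would return the extracted fields as a dictionary. Since `app.py` passes `image_path` straight to `reader.readtext`, the path is resolved on the machine running the Flask server, not on the client. `json.dumps` also takes care of escaping backslashes, so a Windows path can be passed as an ordinary string.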
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/uvicorn b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/uvicorn
diff --git a/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/uvicorn) b/Computer Vision/Extracting Aadhar Details/OCR ADHAAR API/uvicorn)
diff --git a/Computer Vision/Extracting Aadhar Details/README.md b/Computer Vision/Extracting Aadhar Details/README.md
@@ -0,0 +1,73 @@
+# Aadhaar Information Extraction Project
+
+This project focuses on extracting relevant information from Aadhaar card images using Optical Character Recognition (OCR). Two approaches have been implemented:
+
+1. **Tesseract OCR with Pre-processing**: In this approach, the image is converted to greyscale and passed to Tesseract OCR. The text extracted is processed using regular expressions to find the useful information.
+2. **EasyOCR**: This approach utilizes the `EasyOCR` library, which supports multi-language OCR. Both **Hindi** and **English** are used to extract the text from Aadhaar cards. The extracted text is then processed using regular expressions to extract key Aadhaar information.
+
+---
+
+## Project Theory and Overview
+
+The purpose of the project is to automate the extraction of critical Aadhaar information such as:
+
+- **First Name**
+- **Middle Name**
+- **Last Name**
+- **Gender**
+- **Date of Birth (DOB)**
+- **Aadhaar Number**
+
+### Approach 1: Tesseract OCR with Pre-processing
+
+Tesseract OCR is a popular open-source OCR engine but has certain limitations when working with complex documents like Aadhaar cards. To improve accuracy, the following pre-processing techniques are used:
+
+- **Image Pre-processing**: The image is converted to greyscale to enhance text visibility.
+- **Text Extraction**: Tesseract extracts the text from the greyscale image.
+- **Post-processing**: Regular expressions (`re`) are used to search for relevant patterns in the text, such as names, gender, DOB, and Aadhaar numbers.
+
+#### Limitations of Tesseract Approach:
+- **Low Accuracy**: Due to the complex fonts and mixed-language content on Aadhaar cards, Tesseract often struggles with accurate extraction, especially for Hindi text.
+- **Reliance on Pre-processing**: Image quality and pre-processing techniques significantly affect Tesseract's output.
+
+### Approach 2: EasyOCR with Multi-language Support (English and Hindi)
+
+To overcome the limitations of Tesseract, the second approach uses **EasyOCR**, a more advanced OCR library that supports multiple languages, including both **English** and **Hindi**. This enables better extraction from Aadhaar cards, which typically contain text in both languages.
+
+- **Text Extraction**: EasyOCR reads the Aadhaar card image and extracts text from both Hindi and English regions.
+- **Regular Expressions for Information Extraction**: Once the text is extracted, regular expressions are used to identify and extract specific pieces of information, including:
+  - First, Middle, and Last Names
+  - Gender (Male/Female in both Hindi and English)
+  - Date of Birth (DOB)
+  - Aadhaar Number in `XXXX XXXX XXXX` format
+
+#### Advantages of EasyOCR Approach:
+- **Higher Accuracy**: The combination of Hindi and English support allows for better recognition of names and other details from Aadhaar cards.
+- **No Need for Extensive Pre-processing**: EasyOCR works well on the original image, without heavy pre-processing steps.
+
+---
+
+## How It Works
+
+1. **Input**: The user provides an image of an Aadhaar card.
+2. **Processing**:
+   - The image is processed by either Tesseract OCR (with greyscale conversion) or EasyOCR.
+   - The extracted text is then scanned using regular expressions to find important details.
+3. **Output**: The extracted information is presented in a structured format, including:
+   - Name (First, Middle, Last)
+   - Gender
+   - Date of Birth (DOB)
+   - Aadhaar Number
+
+---
+
+## Installation
+
+### 1. Clone the Repository:
+`git clone https://github.com/UppuluriKalyani/ML-Nexus`
+
+
+`cd <project-directory>`
+
+### 2. Install Requirements
+`pip install -r requirements.txt`
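
Review note: the README describes the Aadhaar number as appearing in `XXXX XXXX XXXX` format, and the pattern in `app.py` requires exactly one whitespace character between the groups. OCR output does not always preserve that spacing, so a small normalisation step could be useful. The sketch below is an assumption about how that might look, not part of this PR; `normalize_aadhaar` is a hypothetical helper.

```python
import re

# Three 4-digit groups, optionally separated by whitespace or a hyphen,
# and not embedded in a longer run of digits.
_AADHAAR_LOOSE_RE = re.compile(r"(?<!\d)(\d{4})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)")


def normalize_aadhaar(text):
    """Return the first 12-digit number in `text` as 'XXXX XXXX XXXX', or None."""
    match = _AADHAAR_LOOSE_RE.search(text)
    if match is None:
        return None
    return " ".join(match.groups())
```

This only normalises the layout. It does not check that the digits form a valid Aadhaar number.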
diff --git a/Computer Vision/Extracting Aadhar Details/RESULT.md b/Computer Vision/Extracting Aadhar Details/RESULT.md
@@ -0,0 +1,85 @@
+# Result Comparison: Aadhaar Information Extraction
+
+This document showcases the results of extracting Aadhaar card information using two different approaches:
+
+1. **Tesseract OCR with Image Pre-processing**
+2. **EasyOCR with Multi-language Support (English and Hindi)**
+
+Screenshots are provided for the extracted information, along with an API result screenshot.
+
+---
+
+## 1. Tesseract OCR Approach
+
+In this approach, the Aadhaar card image is first converted to greyscale and then passed through the Tesseract OCR engine. Regular expressions (`re`) are used to extract key information such as names, gender, date of birth, and Aadhaar number from the extracted text.
+
+### Screenshot for Tesseract OCR Result:
+![Tesseract OCR Result](assets/images/tesseract.png)
+
+#### Challenges:
+- **Accuracy**: Tesseract struggles with mixed-language documents (English + Hindi).
+- **Pre-processing Required**: The image needs to be pre-processed (converted to greyscale) to improve text extraction.
+- **Hindi Text**: Tesseract doesn't handle Hindi text as well, which reduces its accuracy for Aadhaar cards that include Hindi.
+
+---
+
+## 2. EasyOCR Approach
+
+The EasyOCR approach uses multi-language support for both Hindi and English, making it a better fit for Aadhaar card text recognition. The extracted text is processed using regular expressions to find relevant details.
+
+### Output for EasyOCR:
+- **First Name**: `Rahul`
+- **Middle Name**: `Ramesh`
+- **Last Name**: `Gaikwad`
+- **Gender**: `Male`
+- **DOB**: `23/08/1995`
+- **Aadhaar Number**: `2058 6470 5393`
+
+### Screenshot for EasyOCR Result:
+![EasyOCR Result](assets/images/easyocr.png)
+
+### After Extraction
+![EasyOCR Result](assets/images/Output.png)
+
+#### Advantages:
+- **Higher Accuracy**: EasyOCR performs significantly better with mixed-language documents, making it ideal for Aadhaar cards.
+- **Multi-language Support**: Supports both English and Hindi, improving text extraction accuracy.
+- **No Heavy Pre-processing**: Works well without needing extensive image manipulation.
+
+---
+
+## Comparison of Results
+
+| Feature                | Tesseract OCR                          | EasyOCR                           |
+|------------------------|----------------------------------------|-----------------------------------|
+| **Languages**          | English only                           | English and Hindi support         |
+| **Accuracy**           | Low to Medium                          | High                              |
+| **Pre-processing**     | Requires greyscale conversion          | Minimal pre-processing needed     |
+| **Performance**        | Faster but less accurate               | Slightly slower but more accurate |
+| **Aadhaar Extraction** | Struggles with Hindi and complex fonts | Handles both languages well       |
+
+---
+
+## API Result Screenshot
+
+Here is the expected result returned from the API after extracting information from the Aadhaar card image:
+
+![EasyOCR Result](assets/images/api_response.png)
+
+### Input Body JSON:
+{
+
+    "image_path": "C:\\Users\\mayan\\Downloads\\fs.jpeg"
+
+}
+
+### API Response:
+```json
+{
+  "First Name": "Rahul",
+  "Middle Name": "Ramesh",
+  "Last Name": "Gaikwad",
+  "Gender": "Male",
+  "DOB": "23/08/1995",
+  "Aadhaar Number": "2058 6470 5393"
+}
diff --git a/Computer Vision/Extracting Aadhar Details/assets/images/Output.png b/Computer Vision/Extracting Aadhar Details/assets/images/Output.png
diff --git a/Computer Vision/Extracting Aadhar Details/assets/images/api_response.png b/Computer Vision/Extracting Aadhar Details/assets/images/api_response.png
diff --git a/Computer Vision/Extracting Aadhar Details/assets/images/easyocr.png b/Computer Vision/Extracting Aadhar Details/assets/images/easyocr.png
diff --git a/Computer Vision/Extracting Aadhar Details/assets/images/tesseract.png b/Computer Vision/Extracting Aadhar Details/assets/images/tesseract.png