
Data Watchdog

logo page


About The Project

Data Watchdog is a tool built to find and classify Personally Identifiable Information (PII) like names, emails, Aadhaar numbers, and PAN numbers in different types of data storage. It works with databases like MySQL and cloud services like Google Cloud and Amazon S3. The tool supports various file types, including unstructured files (such as .txt, .log, .jpg, .png, .jpeg, .pdf, .mp3, .mp4) and structured files (such as .csv).

Objective: Storing personal data comes with risks, and businesses need to follow rules to protect it. Data Watchdog helps companies find and classify personal data in their systems, making sure they follow data privacy laws like GDPR and CCPA, and reduce the risk of data breaches.

Features:

  1. Data Ingestion and Continuous Integration: Efficiently handles data from multiple sources, including Amazon AWS Cloud and SQL databases, as well as various file formats such as text files, log files, images, CSVs, and PDFs. Once configured, it continuously integrates and updates data fetched from the cloud.

  2. PII Detection: Utilizes machine learning and advanced techniques to identify personally identifiable information (PII). Provides a comprehensive list of detected PII across various file types; a minimal detection-and-scoring sketch is given after this list.

  3. Drilldown: Calculates a risk score based on the type of detected PII. Classifies data into categories and buckets, allowing users to view information at different levels of granularity.

  4. Data Visualization: Offers analytics related to detected PII, including metrics such as mean risk per file, mean risk per file type, total PII counts per file, and identification of the riskiest PII elements.
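
The sketch below illustrates the detection and scoring ideas from items 2 and 3 in their simplest form. It is a minimal, regex-only example: the actual pipeline also relies on machine-learning models, and the patterns, weights, and function names here are assumptions made for illustration rather than the project's API.

import re

# Illustrative regex patterns; the real pipeline also uses ML-based detection.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # 12-digit Aadhaar number
    "pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),         # Indian PAN format
}

# Hypothetical per-type weights used to roll detections up into a risk score.
RISK_WEIGHTS = {"email": 1, "aadhaar": 3, "pan": 3}

def detect_pii(text):
    """Return a dict mapping each PII type to the matches found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

def risk_score(detections):
    """Sum the weight of every detected PII element."""
    return sum(RISK_WEIGHTS[name] * len(matches) for name, matches in detections.items())

sample = "Contact john@example.com, PAN ABCDE1234F, Aadhaar 1234 5678 9012"
found = detect_pii(sample)
print(found, risk_score(found))  # one email (1) + one PAN (3) + one Aadhaar (3) -> 7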

Demo

data.watchdog.mp4

Methodology

We primarily deal with 6 types of files: Text Files (.txt, .log, .docx), Image Files (.png, .jpg, .jpeg), PDF Files (.pdf), CSV Files (.csv), Audio Files (.mp3), and Video Files (.mp4). Details about each PII extraction process can be found here.
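
Conceptually, the extraction entry point dispatches on the file extension and hands each file to the matching extractor. The sketch below is a simplified illustration of that routing; the extractor names are assumptions, and the real logic lives in model/detect.py and model/utils.py.

from pathlib import Path

# Hypothetical extractor names chosen for the example; see model/utils.py for the real ones.
EXTRACTORS = {
    ".txt": "extract_text", ".log": "extract_text", ".docx": "extract_text",
    ".png": "extract_image_ocr", ".jpg": "extract_image_ocr", ".jpeg": "extract_image_ocr",
    ".pdf": "extract_pdf",
    ".csv": "extract_csv",
    ".mp3": "extract_audio_transcript",
    ".mp4": "extract_video_transcript",
}

def route_file(path):
    """Pick an extraction strategy based on the file extension."""
    ext = Path(path).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {ext}")
    return EXTRACTORS[ext]

print(route_file("assets/temp/invoice.pdf"))  # -> "extract_pdf"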

High Level Design

flowchart

Sequence Diagram

sequence diagram

API Contract

The API contract for the project can be found here.

Tech Stack

  • Python

  • Flask

  • HTML

  • CSS

  • JavaScript

  • Hugging Face

File Structure

👨‍💻Data Watchdog
 ┣ 📂assets
 ┃ ┣ 📂demo
 ┃ ┣ 📂img          
 ┃ ┣ 📂temp                            // Sample files for testing
 ┃ ┣ 📂results                         // Output
 ┃ ┣ 📄problem-statement.pdf
 ┣ 📂client                            // Frontend        
 ┃ ┣ 📂static    
 ┃ ┣ 📂templates    
 ┃ ┃ ┣ 📄home.html
 ┃ ┃ ┣ 📄configure.html
 ┃ ┃ ┣ 📄dashboard.html 
 ┃ ┃ ┣ 📄view.html                           
 ┣ 📂model                                      
 ┃ ┣ 📄detect.py                       // Core functionality
 ┃ ┣ 📄utils.py 
 ┃ ┣ 📄analytics.py
 ┃ ┣ 📄postprocess.py
 ┃ ┣ 📄main.py
 ┃ ┣ 📄requirements.txt
 ┃ ┣ 📄README.md            
 ┣ 📂server                            // Backend 
 ┃ ┣ 📂archive  
 ┃ ┣ 📄app.py  
 ┃ ┣ 📄requirements.txt 
 ┣ 📄api_contract.yaml
 ┣ 📄data-watchdog-ppt.pptx  
 ┣ 📄Configure.md   
 ┣ 📄README.md

Getting Started

Installation

Clone the project by typing the following command in your Terminal/Command Prompt

git clone git@github.com:PritK99/Data-Watchdog.git

Navigate to the Data-Watchdog folder

cd Data-Watchdog

Usage

Create a virtual environment to install all the dependencies

python -m venv data-watchdog

Activate the virtual environment

For Windows: data-watchdog\Scripts\activate

For Linux: source data-watchdog/bin/activate

Open a new terminal in the root folder and navigate to the server folder

cd server/

Install all the required dependencies

pip install -r requirements.txt

Note

Paths to poppler and pytesseract are required in utils.py to perform PDF-to-image conversion and OCR, respectively.

For poppler refer here.

For pytesseract refer here.

Please replace the paths in utils.py with your paths.
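
As a rough sketch of what these edits typically look like (the actual variable names in utils.py may differ, and the Windows-style paths are placeholders for your own installation):

import pytesseract
from pdf2image import convert_from_path

# Placeholder paths; replace them with the locations from your own installation.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
POPPLER_PATH = r"C:\poppler-24.02.0\Library\bin"

# PDF -> page images via poppler, then OCR each page with Tesseract.
pages = convert_from_path("assets/temp/sample.pdf", poppler_path=POPPLER_PATH)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)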

In addition to this, we require ffmpeg to deal with multimedia files such as audio and video. For the installation, please refer to this YouTube video.
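
Once ffmpeg is available on your PATH, audio can be pulled out of multimedia files before transcription. A minimal example (the file names are placeholders, not paths used by the project):

import subprocess

# Extract a mono, 16 kHz WAV track from a video so it can be transcribed.
subprocess.run(
    ["ffmpeg", "-y", "-i", "assets/temp/sample.mp4", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)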

Once all the above steps are completed, run the app.py script using the following command in a terminal from the server directory:

python app.py  

Running the server will load the frontend at http://localhost:5000/

Note
The configuration page requires the connection details of the SQL database or cloud storage you want to analyze. For demo purposes, we use PostgreSQL hosted on Render and LocalStack, which lets us simulate an AWS cloud environment locally. Steps for configuring these are provided in the Configure.md file.
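
For reference, connecting to LocalStack's simulated S3 from Python usually looks like the snippet below. The endpoint, dummy credentials, and bucket name are standard LocalStack defaults used for illustration, not values specific to this project:

import boto3

# LocalStack exposes AWS-compatible APIs on localhost:4566 and accepts dummy credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="data-watchdog-demo")
s3.upload_file("assets/temp/sample.txt", "data-watchdog-demo", "sample.txt")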

Screenshots of the Website

Home Page

home page

home page

Configuration Page

config page

Analytics and Dashboard

dashboard page

Drilldown Page

drilldown page

Output File

output.csv

Contributors

License

MIT License
