
Data Watchdog

logo page


About The Project

Data Watchdog is a tool built to find and classify Personally Identifiable Information (PII) like names, emails, Aadhaar numbers, and PAN numbers in different types of data storage. It works with databases like MySQL and cloud services like Google Cloud and Amazon S3. The tool supports various file types, including unstructured files (such as .txt, .log, .jpg, .png, .jpeg, .pdf, .mp3, .mp4) and structured files (such as .csv).

Objective: Storing personal data comes with risks, and businesses need to follow rules to protect it. Data Watchdog helps companies find and classify personal data in their systems, making sure they follow data privacy laws like GDPR and CCPA, and reduce the risk of data breaches.

Features:

  1. Data Ingestion and Continuous Integration: Efficiently handles data from multiple sources, including Amazon AWS Cloud and SQL databases, as well as various file formats such as text files, log files, images, CSVs, and PDFs. Once configured, it continuously integrates and updates data fetched from the cloud.

  2. PII Detection: Utilizes machine learning and advanced techniques to identify personally identifiable information (PII). Provides a comprehensive list of detected PII across various file types; a minimal detection-and-scoring sketch is given after this list.

  3. Drilldown: Calculates a risk score based on the type of detected PII. Classifies data into categories and buckets, allowing users to view information at different levels of granularity.

  4. Data Visualization: Offers analytics related to detected PII, including metrics such as mean risk per file, mean risk per file type, total PII counts per file, and identification of the riskiest PII elements.
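
The sketch below illustrates the detection and scoring ideas from items 2 and 3 in their simplest form. It is a minimal, regex-only example: the actual pipeline also relies on machine-learning models, and the patterns, weights, and function names here are assumptions made for illustration rather than the project's API.

import re

# Illustrative regex patterns; the real pipeline also uses ML-based detection.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # 12-digit Aadhaar number
    "pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),         # Indian PAN format
}

# Hypothetical per-type weights used to roll detections up into a risk score.
RISK_WEIGHTS = {"email": 1, "aadhaar": 3, "pan": 3}

def detect_pii(text):
    """Return a dict mapping each PII type to the matches found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

def risk_score(detections):
    """Sum the weight of every detected PII element."""
    return sum(RISK_WEIGHTS[name] * len(matches) for name, matches in detections.items())

sample = "Contact john@example.com, PAN ABCDE1234F, Aadhaar 1234 5678 9012"
found = detect_pii(sample)
print(found, risk_score(found))  # one email (1) + one PAN (3) + one Aadhaar (3) -> 7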

Demo

data.watchdog.mp4

Methodology

We primarily deal with 6 types of files: Text Files (.txt, .log, .docx), Image Files (.png, .jpg, .jpeg), PDF Files (.pdf), CSV Files (.csv), Audio Files (.mp3), and Video Files (.mp4). Details about each PII extraction process can be found here.
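
Conceptually, the extraction entry point dispatches on the file extension and hands each file to the matching extractor. The sketch below is a simplified illustration of that routing; the extractor names are assumptions, and the real logic lives in model/detect.py and model/utils.py.

from pathlib import Path

# Hypothetical extractor names chosen for the example; see model/utils.py for the real ones.
EXTRACTORS = {
    ".txt": "extract_text", ".log": "extract_text", ".docx": "extract_text",
    ".png": "extract_image_ocr", ".jpg": "extract_image_ocr", ".jpeg": "extract_image_ocr",
    ".pdf": "extract_pdf",
    ".csv": "extract_csv",
    ".mp3": "extract_audio_transcript",
    ".mp4": "extract_video_transcript",
}

def route_file(path):
    """Pick an extraction strategy based on the file extension."""
    ext = Path(path).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {ext}")
    return EXTRACTORS[ext]

print(route_file("assets/temp/invoice.pdf"))  # -> "extract_pdf"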

High Level Design

flowchart

Sequence Diagram

sequence diagram

API Contract

The API contract for the project can be found here.

Tech Stack

  • Python

  • Flask

  • HTML

  • CSS

  • JavaScript

  • Hugging Face

File Structure

👨‍💻Data Watchdog
 ┣ 📂assets
 ┃ ┣ 📂demo
 ┃ ┣ 📂img          
 ┃ ┣ 📂temp                            // Sample files for testing
 ┃ ┣ 📂results                         // Output
 ┃ ┣ 📄problem-statement.pdf
 ┣ 📂client                            // Frontend        
 ┃ ┣ 📂static    
 ┃ ┣ 📂templates    
 ┃ ┃ ┣ 📄home.html
 ┃ ┃ ┣ 📄configure.html
 ┃ ┃ ┣ 📄dashboard.html 
 ┃ ┃ ┣ 📄view.html                           
 ┣ 📂model                                      
 ┃ ┣ 📄detect.py                       // Core functionality
 ┃ ┣ 📄utils.py 
 ┃ ┣ 📄analytics.py
 ┃ ┣ 📄postprocess.py
 ┃ ┣ 📄main.py
 ┃ ┣ 📄requirements.txt
 ┃ ┣ 📄README.md            
 ┣ 📂server                            // Backend 
 ┃ ┣ 📂archive  
 ┃ ┣ 📄app.py  
 ┃ ┣ 📄requirements.txt 
 ┣ 📄api_contract.yaml
 ┣ 📄data-watchdog-ppt.pptx  
 ┣ 📄Configure.md   
 ┣ 📄README.md

Getting Started

Installation

Clone the project by typing the following command in your Terminal/Command Prompt

git clone git@github.com:PritK99/Data-Watchdog.git

Navigate to the Data-Watchdog folder

cd Data-Watchdog

Usage

Create a virtual environment to install all the dependencies

python -m venv data-watchdog

Activate the virtual environment

For Windows: data-watchdog\Scripts\activate

For Linux: source data-watchdog/bin/activate

Open a new terminal in the root folder and navigate to the server folder

cd server/

Install all the required dependencies

pip install -r requirements.txt

Note

Paths to poppler and pytesseract are required in utils.py to perform PDF-to-image conversion and OCR, respectively.

For poppler refer here.

For pytesseract refer here.

Please replace the paths in utils.py with your paths.
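
As a rough sketch of what these edits typically look like (the actual variable names in utils.py may differ, and the Windows-style paths are placeholders for your own installation):

import pytesseract
from pdf2image import convert_from_path

# Placeholder paths; replace them with the locations from your own installation.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
POPPLER_PATH = r"C:\poppler-24.02.0\Library\bin"

# PDF -> page images via poppler, then OCR each page with Tesseract.
pages = convert_from_path("assets/temp/sample.pdf", poppler_path=POPPLER_PATH)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)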

In addition to this, we require ffmpeg to deal with multimedia files such as audio and video. For the installation, please refer to this YouTube video.
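
Once ffmpeg is available on your PATH, audio can be pulled out of multimedia files before transcription. A minimal example (the file names are placeholders, not paths used by the project):

import subprocess

# Extract a mono, 16 kHz WAV track from a video so it can be transcribed.
subprocess.run(
    ["ffmpeg", "-y", "-i", "assets/temp/sample.mp4", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)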

Once all the above steps are completed, run the app.py script using the following command in a terminal from the server directory:

python app.py  

Running the server will load the frontend at http://localhost:5000/

Note
The configuration page requires the connection details of the SQL database or cloud storage you want to analyze. For demo purposes, we use PostgreSQL hosted on Render and LocalStack, which lets us simulate an AWS cloud environment locally. Steps for configuring these are provided in the Configure.md file.
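
For reference, connecting to LocalStack's simulated S3 from Python usually looks like the snippet below. The endpoint, dummy credentials, and bucket name are standard LocalStack defaults used for illustration, not values specific to this project:

import boto3

# LocalStack exposes AWS-compatible APIs on localhost:4566 and accepts dummy credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="data-watchdog-demo")
s3.upload_file("assets/temp/sample.txt", "data-watchdog-demo", "sample.txt")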

Screenshots of the Website

Home Page

home page

home page

Configuration Page

config page

Analytics and Dashboard

dashboard page

Drilldown Page

drilldown page

Output File

output.csv

Contributors

License

MIT License
