Data Watchdog is a tool built to find and classify Personally Identifiable Information (PII) like names, emails, Aadhaar numbers, and PAN numbers in different types of data storage. It works with databases like MySQL and cloud services like Google Cloud and Amazon S3. The tool supports various file types, including unstructured files (such as `.txt`, `.log`, `.jpg`, `.png`, `.jpeg`, `.pdf`, `.mp3`, `.mp4`) and structured files (such as `.csv`).
**Objective:** Storing personal data comes with risks, and businesses must follow rules to protect it. Data Watchdog helps companies find and classify personal data in their systems so they can comply with data privacy laws such as GDPR and CCPA and reduce the risk of data breaches.
Features:
- **Data Ingestion and Continuous Integration**: Efficiently handles data from multiple sources, including Amazon AWS Cloud and SQL databases, as well as various file formats such as text files, log files, images, CSVs, and PDFs. Once configured, it continuously integrates and updates data fetched from the cloud.
- **PII Detection**: Uses machine learning and advanced techniques to identify personally identifiable information (PII). Provides a comprehensive list of detected PII across various file types (see the sketch after this list).
- **Drilldown**: Calculates a risk score based on the type of detected PII. Classifies data into categories and buckets, allowing users to view information at different levels of granularity.
- **Data Visualization**: Offers analytics related to detected PII, including metrics such as mean risk per file, mean risk per file type, total PII counts per file, and identification of the riskiest PII elements.
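As a rough, regex-only sketch of the detection and risk-scoring ideas above (the actual project uses machine learning in `model/detect.py`; the patterns, weights, and function names below are illustrative placeholders, not the project's real rules):

```python
# Illustrative sketch of regex-based PII detection and risk scoring.
# Patterns and weights are placeholders, not Data Watchdog's actual logic.
import re

PII_PATTERNS = {
    "email":   re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # 12-digit Aadhaar number
    "pan":     re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),     # PAN format: AAAAA9999A
}

RISK_WEIGHTS = {"email": 1, "aadhaar": 5, "pan": 4}       # illustrative weights only

def detect_pii(text: str) -> dict:
    """Return a mapping of PII type -> list of matches found in the text."""
    return {kind: pattern.findall(text) for kind, pattern in PII_PATTERNS.items()}

def risk_score(detections: dict) -> int:
    """Aggregate a simple risk score from the detected PII counts."""
    return sum(RISK_WEIGHTS[kind] * len(matches) for kind, matches in detections.items())

if __name__ == "__main__":
    sample = "Contact john@example.com, Aadhaar 1234 5678 9012, PAN ABCDE1234F"
    found = detect_pii(sample)
    print(found, "risk:", risk_score(found))
```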
Demo video: `data.watchdog.mp4`
We primarily deal with 6 types of files: Text Files (`.txt`, `.log`, `.docx`), Image Files (`.png`, `.jpg`, `.jpeg`), PDF Files (`.pdf`), CSV Files (`.csv`), Audio Files (`.mp3`), and Video Files (`.mp4`). Details about each PII extraction process can be found here.
API Contract for the project can be found here.
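The project's actual extraction logic lives in `model/detect.py` and `model/utils.py`; purely as an illustration, here is a minimal sketch (assuming `pytesseract`, `pdf2image`, `Pillow`, and `pandas`) of how text could be pulled from several of these file types before PII detection. The function name and file handling are hypothetical.

```python
# Hypothetical dispatch of a file to a text-extraction routine by extension.
from pathlib import Path

def extract_text(path: str) -> str:
    """Return raw text extracted from a file, chosen by its extension."""
    ext = Path(path).suffix.lower()
    if ext in {".txt", ".log"}:
        return Path(path).read_text(errors="ignore")
    if ext in {".png", ".jpg", ".jpeg"}:
        import pytesseract                         # OCR on the image
        from PIL import Image
        return pytesseract.image_to_string(Image.open(path))
    if ext == ".pdf":
        import pytesseract
        from pdf2image import convert_from_path    # needs poppler installed
        pages = convert_from_path(path)            # PDF pages -> PIL images
        return "\n".join(pytesseract.image_to_string(page) for page in pages)
    if ext == ".csv":
        import pandas as pd
        return pd.read_csv(path).to_string()
    # .docx, .mp3, and .mp4 handling (via python-docx / ffmpeg) omitted in this sketch.
    raise ValueError(f"Unsupported file type in this sketch: {ext}")
```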
👨💻Data Watchdog
┣ 📂assets
┃ ┣ 📂demo
┃ ┣ 📂img
┃ ┣ 📂temp // Sample files for testing
┃ ┣ 📂results // Output
┃ ┣ 📄problem-statement.pdf
┣ 📂client // Frontend
┃ ┣ 📂static
┃ ┣ 📂templates
┃ ┃ ┣ 📄home.html
┃ ┃ ┣ 📄configure.html
┃ ┃ ┣ 📄dashboard.html
┃ ┃ ┣ 📄view.html
┣ 📂model
┃ ┣ 📄detect.py // Core functionality
┃ ┣ 📄utils.py
┃ ┣ 📄analytics.py
┃ ┣ 📄postprocess.py
┃ ┣ 📄main.py
┃ ┣ 📄requirements.txt
┃ ┣ 📄README.md
┣ 📂server // Backend
┃ ┣ 📂archive
┃ ┣ 📄app.py
┃ ┣ 📄requirements.txt
┣ 📄api_contract.yaml
┣ 📄data-watchdog-ppt.pptx
┣ 📄Configure.md
┣ 📄README.md
Clone the project by typing the following command in your Terminal/Command Prompt
git clone git@github.com:PritK99/Data-Watchdog.git
Navigate to the Data Watchdog folder
cd Data-Watchdog
Create a virtual environment to install all the dependencies
python -m venv data-watchdog
Activate the virtual environment
For Windows: data-watchdog\Scripts\activate
For Linux: source data-watchdog/bin/activate
Open a new terminal in the root folder and navigate to the server folder
cd server/
Install all the required dependencies
pip install -r requirements.txt
Note
Paths to `poppler` and `pytesseract` are required in `utils.py` to perform PDF-to-image conversion and OCR respectively. For `poppler`, refer here. For `pytesseract`, refer here. Please replace the paths in `utils.py` with your own paths. In addition to this, we require `ffmpeg` to deal with multimedia files such as audio and video. For its installation, please refer to this YouTube video.
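As an illustration of what those path settings might look like (the exact variable names in `utils.py` may differ; the Windows paths and sample file below are placeholders to replace with your own):

```python
# Hypothetical example of the poppler/tesseract paths the note refers to.
import pytesseract
from pdf2image import convert_from_path

# Placeholder Windows-style paths; on Linux these binaries are usually on PATH
# and the explicit settings can often be omitted.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
POPPLER_PATH = r"C:\poppler\Library\bin"

# Convert a sample PDF (placeholder filename) to images, then OCR each page.
pages = convert_from_path("assets/temp/sample.pdf", poppler_path=POPPLER_PATH)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:200])
```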
Once all the above steps are completed, run the app.py script using the following command in the terminal in the server directory:
python app.py
Running the server will load the frontend at http://localhost:5000/
Note
The configuration page requires the configuration of the SQL database or cloud storage that you want to analyze. For demo purposes we use PostgreSQL hosted on Render and LocalStack, which allows us to simulate an AWS cloud environment locally. Steps for configuring these are provided in the Configure.md file.
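As a rough illustration of how a LocalStack-backed S3 client can be pointed at the local edge endpoint with `boto3` (the bucket name and credentials below are placeholders, and the project's actual configuration flow is described in Configure.md):

```python
# Hypothetical smoke test that a local LocalStack S3 endpoint is reachable.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",   # default LocalStack edge port
    aws_access_key_id="test",               # LocalStack accepts any credentials
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="data-watchdog-demo")   # placeholder bucket name
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])
```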