Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project 2 DWS final version #30

Closed
wants to merge 7 commits into from

Conversation

offthepier
Copy link

This pull request introduces key enhancements and unit tests for the data pipeline script. The following changes have been made:

Detailed Inline Comments:

Added more detailed comments throughout the code, especially for complex operations such as chunking, encryption, and SQL command conversion.
Unit Tests for Critical Functions:

Created unit tests in the tests/test_pipeline.py file to verify the functionality of critical components:
encrypt_data: Tests to ensure sensitive data is properly encrypted.
is_valid_url: Tests URL validation logic.
sanitize_table_name: Tests that file names are properly sanitized for SQL compatibility.
Robust Encryption Key Management:

Implemented encryption for sensitive columns using the cryptography library, with a more secure key management system to ensure flexibility in production environments.
Improved User Interaction:

Added prompts that allow users to specify if they want to anonymize the data for each file and select sensitive columns dynamically.
Performance and Reliability Improvements:

Introduced chunked processing of large datasets to ensure memory efficiency.
Implemented retry mechanisms for file downloads to handle network interruptions gracefully.
Bug Fixes:

Fixed minor issues related to SQL command formatting and table name sanitization.
Testing:
Unit tests have been successfully added and run using the unittest framework. All tests passed without issues.

Copy link
Member

@ben-AI-cybersec ben-AI-cybersec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes needed, but overall good work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove separate license file, this should be covered by the repo licence.

Comment on lines +62 to +65
DREMIO_URL: The URL of your Dremio instance (e.g., http://your-dremio-url:9047)
DREMIO_USERNAME: Your Dremio username.
DREMIO_PASSWORD: Your Dremio password.
DREMIO_SOURCE: The source location in Dremio (e.g., project name).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double-up of configuration via .env file

Comment on lines +76 to +79
CSV file:
https://github.com/username/repo/blob/main/data/sample.csv
XLSX file:
https://github.com/username/repo/blob/main/data/sample.xlsx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These urls are not valid

return value

# Function to validate the given URLs by ensuring the scheme is http/https and netloc is present.
def is_valid_url(url):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add stripping to reduce change of stored XSS attacks

Comment on lines +72 to +73
encryption_key = Fernet.generate_key()
cipher_suite = Fernet(encryption_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please upgrade to 256bit encryption per redback policy

sys.exit(1)

# Function to sanitize table names, removing any special characters to ensure compatibility with SQL.
def sanitize_table_name(file_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a re.escape() to reduce risk of stored XSS attacks

return converted_commands

# Function to send an SQL command to the Dremio API.
def send_sql_command(command):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be parameterised?

@ben-AI-cybersec
Copy link
Member

Closing PR, please reopen if/when you make the required changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants