-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Project 2 DWS final version #30
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some changes needed, but overall good work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove separate license file, this should be covered by the repo licence.
DREMIO_URL: The URL of your Dremio instance (e.g., http://your-dremio-url:9047) | ||
DREMIO_USERNAME: Your Dremio username. | ||
DREMIO_PASSWORD: Your Dremio password. | ||
DREMIO_SOURCE: The source location in Dremio (e.g., project name). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double-up of configuration via .env file
CSV file: | ||
https://github.com/username/repo/blob/main/data/sample.csv | ||
XLSX file: | ||
https://github.com/username/repo/blob/main/data/sample.xlsx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These urls are not valid
return value | ||
|
||
# Function to validate the given URLs by ensuring the scheme is http/https and netloc is present. | ||
def is_valid_url(url): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add stripping to reduce change of stored XSS attacks
encryption_key = Fernet.generate_key() | ||
cipher_suite = Fernet(encryption_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please upgrade to 256bit encryption per redback policy
sys.exit(1) | ||
|
||
# Function to sanitize table names, removing any special characters to ensure compatibility with SQL. | ||
def sanitize_table_name(file_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a re.escape() to reduce risk of stored XSS attacks
return converted_commands | ||
|
||
# Function to send an SQL command to the Dremio API. | ||
def send_sql_command(command): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be parameterised?
Closing PR, please reopen if/when you make the required changes. |
This pull request introduces key enhancements and unit tests for the data pipeline script. The following changes have been made:
Detailed Inline Comments:
Added more detailed comments throughout the code, especially for complex operations such as chunking, encryption, and SQL command conversion.
Unit Tests for Critical Functions:
Created unit tests in the tests/test_pipeline.py file to verify the functionality of critical components:
encrypt_data: Tests to ensure sensitive data is properly encrypted.
is_valid_url: Tests URL validation logic.
sanitize_table_name: Tests that file names are properly sanitized for SQL compatibility.
Robust Encryption Key Management:
Implemented encryption for sensitive columns using the cryptography library, with a more secure key management system to ensure flexibility in production environments.
Improved User Interaction:
Added prompts that allow users to specify if they want to anonymize the data for each file and select sensitive columns dynamically.
Performance and Reliability Improvements:
Introduced chunked processing of large datasets to ensure memory efficiency.
Implemented retry mechanisms for file downloads to handle network interruptions gracefully.
Bug Fixes:
Fixed minor issues related to SQL command formatting and table name sanitization.
Testing:
Unit tests have been successfully added and run using the unittest framework. All tests passed without issues.