DPR2-1642: Include part-*.snappy.parquet in batch job during ingestio… #9199

Merged: koladeadewuyi-moj merged 2 commits into main from DPR2-1642 on Jan 3, 2025

@koladeadewuyi-moj koladeadewuyi-moj commented Dec 30, 2024

The replay pipeline's batch job does not process the diff files created by the reload pipeline. The batch job only looks for files matching the name pattern LOAD*.parquet, whereas the reload diff files are named part-*.snappy.parquet.

Steps to reproduce:

  • Run the ingestion pipeline if no data has been ingested yet.
  • Run the reload pipeline. This generates the reload diff files.
  • Run the replay pipeline. The batch job does not pick up the reload diff files.

To fix this, the replay pipeline supplies an extra argument to the batch job invocation so that both file name patterns are matched:
--dpr.batch.load.fileglobpattern : {part-*.snappy.parquet,LOAD*parquet}
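
As a rough illustration of why the new pattern helps, the sketch below checks a few file names against the old and new patterns. It assumes the batch job treats the value as a glob with brace alternation (as Hadoop-style globs do); that is an assumption, not something confirmed here. The helpers `expand_braces` and `matches` and the sample file names are hypothetical and exist only for this example.

```python
from fnmatch import fnmatchcase


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b,...} alternation groups into plain glob patterns (no nesting)."""
    start = pattern.find("{")
    if start == -1:
        return [pattern]
    end = pattern.find("}", start)
    if end == -1:
        # Unbalanced brace: treat the pattern literally.
        return [pattern]
    prefix, body, suffix = pattern[:start], pattern[start + 1:end], pattern[end + 1:]
    expanded: list[str] = []
    for alternative in body.split(","):
        expanded.extend(expand_braces(prefix + alternative + suffix))
    return expanded


def matches(filename: str, pattern: str) -> bool:
    """Return True if the file name matches any alternative of the glob pattern."""
    return any(fnmatchcase(filename, p) for p in expand_braces(pattern))


OLD_PATTERN = "LOAD*.parquet"
NEW_PATTERN = "{part-*.snappy.parquet,LOAD*parquet}"

# Hypothetical file names, for illustration only.
sample_files = [
    "LOAD00000001.parquet",
    "part-00000-abc.snappy.parquet",
    "_SUCCESS",
]

for name in sample_files:
    print(name, matches(name, OLD_PATTERN), matches(name, NEW_PATTERN))
```

Under these assumptions, the reload diff file name is matched only by the new pattern, while the LOAD file name is matched by both.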

@koladeadewuyi-moj koladeadewuyi-moj requested review from a team as code owners December 30, 2024 17:53
@github-actions github-actions bot added the environments-repository label Dec 30, 2024
@koladeadewuyi-moj koladeadewuyi-moj temporarily deployed to digital-prison-reporting-test December 30, 2024 17:55 — with GitHub Actions Inactive
@koladeadewuyi-moj koladeadewuyi-moj temporarily deployed to digital-prison-reporting-development December 30, 2024 17:55 — with GitHub Actions Inactive

Trivy Scan Success

Show Output

Trivy will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline


Running Trivy in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
2024-12-30T17:55:37Z INFO [vulndb] Need to update DB
2024-12-30T17:55:37Z INFO [vulndb] Downloading vulnerability DB...
2024-12-30T17:55:37Z INFO [vulndb] Downloading artifact... repo="public.ecr.aws/aquasecurity/trivy-db:2"
2024-12-30T17:55:40Z INFO [vulndb] Artifact successfully downloaded repo="public.ecr.aws/aquasecurity/trivy-db:2"
2024-12-30T17:55:40Z INFO [vuln] Vulnerability scanning is enabled
2024-12-30T17:55:40Z INFO [misconfig] Misconfiguration scanning is enabled
2024-12-30T17:55:40Z INFO [misconfig] Need to update the built-in checks
2024-12-30T17:55:40Z INFO [misconfig] Downloading the built-in checks...
160.80 KiB / 160.80 KiB [------------------------------------------------------] 100.00% ? p/s 100ms
2024-12-30T17:55:40Z INFO [secret] Secret scanning is enabled
2024-12-30T17:55:40Z INFO [secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning
2024-12-30T17:55:40Z INFO [secret] Please see also https://aquasecurity.github.io/trivy/v0.57/docs/scanner/secret#recommendation for faster secret detection
2024-12-30T17:55:41Z INFO [terraform scanner] Scanning root module file_path="."
2024-12-30T17:55:41Z WARN [terraform parser] Variable values was not found in the environment or variable files. Evaluating may not work correctly. module="root" variables="dms_replication_task_arn, glue_maintenance_compaction_job, glue_maintenance_retention_job, glue_s3_max_attempts, glue_s3_retry_max_wait_millis, glue_s3_retry_min_wait_millis, processed_files_check_max_attempts, processed_files_check_wait_interval_seconds, replication_task_id, s3_curated_path, s3_structured_path, step_function_execution_role_arn"
2024-12-30T17:55:41Z INFO Number of language-specific files num=0
2024-12-30T17:55:41Z INFO Detected config files num=1
trivy_exitcode=0

Checkov Scan Success

Show Output
*****************************

Checkov will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline

*****************************

Running Checkov in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
Excluding the following checks: CKV_GIT_1,CKV_AWS_126,CKV2_AWS_38,CKV2_AWS_39

checkov_exitcode=0

TFLint Scan Success

Show Output
*****************************

Setting default tflint config...
Running tflint --init...
Installing "terraform" plugin...
Installed "terraform" (source: github.com/terraform-linters/tflint-ruleset-terraform, version: 0.9.1)
tflint will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline

*****************************

Running tflint in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
Excluding the following checks: terraform_unused_declarations
tflint_exitcode=0

Trivy Scan Success

Show Output

Trivy will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline


Running Trivy in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
2024-12-30T19:42:30Z INFO [vulndb] Need to update DB
2024-12-30T19:42:30Z INFO [vulndb] Downloading vulnerability DB...
2024-12-30T19:42:30Z INFO [vulndb] Downloading artifact... repo="public.ecr.aws/aquasecurity/trivy-db:2"
2024-12-30T19:42:33Z INFO [vulndb] Artifact successfully downloaded repo="public.ecr.aws/aquasecurity/trivy-db:2"
2024-12-30T19:42:33Z INFO [vuln] Vulnerability scanning is enabled
2024-12-30T19:42:33Z INFO [misconfig] Misconfiguration scanning is enabled
2024-12-30T19:42:33Z INFO [misconfig] Need to update the built-in checks
2024-12-30T19:42:33Z INFO [misconfig] Downloading the built-in checks...
160.80 KiB / 160.80 KiB [------------------------------------------------------] 100.00% ? p/s 100ms
2024-12-30T19:42:34Z INFO [secret] Secret scanning is enabled
2024-12-30T19:42:34Z INFO [secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning
2024-12-30T19:42:34Z INFO [secret] Please see also https://aquasecurity.github.io/trivy/v0.57/docs/scanner/secret#recommendation for faster secret detection
2024-12-30T19:42:35Z INFO [terraform scanner] Scanning root module file_path="."
2024-12-30T19:42:35Z WARN [terraform parser] Variable values was not found in the environment or variable files. Evaluating may not work correctly. module="root" variables="dms_replication_task_arn, glue_maintenance_compaction_job, glue_maintenance_retention_job, glue_s3_max_attempts, glue_s3_retry_max_wait_millis, glue_s3_retry_min_wait_millis, processed_files_check_max_attempts, processed_files_check_wait_interval_seconds, replication_task_id, s3_curated_path, s3_structured_path, step_function_execution_role_arn"
2024-12-30T19:42:35Z INFO Number of language-specific files num=0
2024-12-30T19:42:35Z INFO Detected config files num=1
trivy_exitcode=0

Checkov Scan Success

Show Output
*****************************

Checkov will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline

*****************************

Running Checkov in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
Excluding the following checks: CKV_GIT_1,CKV_AWS_126,CKV2_AWS_38,CKV2_AWS_39

checkov_exitcode=0

TFLint Scan Success

Show Output
*****************************

Setting default tflint config...
Running tflint --init...
Installing "terraform" plugin...
Installed "terraform" (source: github.com/terraform-linters/tflint-ruleset-terraform, version: 0.9.1)
tflint will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline

*****************************

Running tflint in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
Excluding the following checks: terraform_unused_declarations
tflint_exitcode=0

@koladeadewuyi-moj koladeadewuyi-moj temporarily deployed to digital-prison-reporting-test December 30, 2024 19:46 — with GitHub Actions Inactive
@koladeadewuyi-moj koladeadewuyi-moj temporarily deployed to digital-prison-reporting-development December 30, 2024 19:46 — with GitHub Actions Inactive

github-actions bot commented Jan 3, 2025

Trivy Scan Success

Show Output

Trivy will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline


Running Trivy in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
2025-01-03T14:31:55Z INFO [vulndb] Need to update DB
2025-01-03T14:31:55Z INFO [vulndb] Downloading vulnerability DB...
2025-01-03T14:31:55Z INFO [vulndb] Downloading artifact... repo="public.ecr.aws/aquasecurity/trivy-db:2"
2025-01-03T14:31:57Z INFO [vulndb] Artifact successfully downloaded repo="public.ecr.aws/aquasecurity/trivy-db:2"
2025-01-03T14:31:57Z INFO [vuln] Vulnerability scanning is enabled
2025-01-03T14:31:57Z INFO [misconfig] Misconfiguration scanning is enabled
2025-01-03T14:31:57Z INFO [misconfig] Need to update the built-in checks
2025-01-03T14:31:57Z INFO [misconfig] Downloading the built-in checks...
160.80 KiB / 160.80 KiB [------------------------------------------------------] 100.00% ? p/s 100ms
2025-01-03T14:31:57Z INFO [secret] Secret scanning is enabled
2025-01-03T14:31:57Z INFO [secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning
2025-01-03T14:31:57Z INFO [secret] Please see also https://aquasecurity.github.io/trivy/v0.57/docs/scanner/secret#recommendation for faster secret detection
2025-01-03T14:31:58Z INFO [terraform scanner] Scanning root module file_path="."
2025-01-03T14:31:58Z WARN [terraform parser] Variable values was not found in the environment or variable files. Evaluating may not work correctly. module="root" variables="dms_replication_task_arn, glue_maintenance_compaction_job, glue_maintenance_retention_job, glue_s3_max_attempts, glue_s3_retry_max_wait_millis, glue_s3_retry_min_wait_millis, processed_files_check_max_attempts, processed_files_check_wait_interval_seconds, replication_task_id, s3_curated_path, s3_structured_path, step_function_execution_role_arn"
2025-01-03T14:31:58Z INFO Number of language-specific files num=0
2025-01-03T14:31:58Z INFO Detected config files num=1
trivy_exitcode=0

Checkov Scan Success

Show Output
*****************************

Checkov will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline

*****************************

Running Checkov in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
Excluding the following checks: CKV_GIT_1,CKV_AWS_126,CKV2_AWS_38,CKV2_AWS_39

checkov_exitcode=0

TFLint Scan Success

Show Output
*****************************

Setting default tflint config...
Running tflint --init...
Installing "terraform" plugin...
Installed "terraform" (source: github.com/terraform-linters/tflint-ruleset-terraform, version: 0.9.1)
tflint will check the following folders:
terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline

*****************************

Running tflint in terraform/environments/digital-prison-reporting/modules/domains/replay-pipeline
Excluding the following checks: terraform_unused_declarations
tflint_exitcode=0

@koladeadewuyi-moj koladeadewuyi-moj merged commit cd28b8e into main Jan 3, 2025
10 of 13 checks passed
@koladeadewuyi-moj koladeadewuyi-moj deleted the DPR2-1642 branch January 3, 2025 14:46
Labels
environments-repository Used to exclude PRs from this repo in our Slack PR update