
Made AWS uploader script idempotent when backfilling dates #372

Open · jkarpen opened this issue Aug 28, 2024 · 0 comments

From Ian:

Currently the AWS uploader script only uploads files to S3, and most of the uploaded files have a unique hash in the file name, so a re-upload never overwrites the original. This means that if we need to backfill a date that has, say, three of the four expected files, we get a large number of duplicate files covering the same time frame, and backfilling the same date multiple times accumulates largely duplicate files in the S3 bucket. That makes it difficult to know how to load data from the bucket into Snowflake, and more expensive to deduplicate within the data warehouse.

As an example, here are the files in the bucket for 2024-07-10 in District 10. It was first filled on 2024-07-11, then received new uploads on 2024-07-18 and 2024-07-22. Any attempt to backfill from this bucket into Snowflake will have to do extra work to deduplicate these files when loading, and it makes it more difficult to point to the bucket as an authoritative data lake source.

[Screenshot: S3 object listing for district=D10, 2024-07-10, showing duplicate files from the 2024-07-11, 2024-07-18, and 2024-07-22 uploads]

Instead, when backfilling data we should first clear out the files in the S3 bucket for the date and districts being backfilled. That way we could run the backfill as many times as we want and the S3 bucket would end up in the same state, with just one copy of the data per date.

This change would not need to be a heavy lift, and could be done with the AWS CLI:

aws s3 rm --recursive s3://caltrans-pems-prd-us-west-2-raw/db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/

aws s3 cp the_new_file.parquet s3://caltrans-pems-prd-us-west-2-raw/db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/
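For the uploader script itself, the same rm-then-cp pattern could be expressed with boto3. Below is a minimal sketch, assuming boto3 is available and reusing the bucket and prefix from the commands above; the function names (`clear_prefix`, `backfill_day`) and the single-file upload are illustrative, not the actual uploader's API:

```python
import boto3

BUCKET = "caltrans-pems-prd-us-west-2-raw"
PREFIX = (
    "db96_export_staging_area/tables/VDS30SEC/"
    "district=D10/year=2024/month=7/day=10/"
)

s3 = boto3.client("s3")


def clear_prefix(bucket: str, prefix: str) -> None:
    # Delete every object under the date/district prefix, mirroring
    # `aws s3 rm --recursive`, so repeated backfills start from a clean slate.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})


def backfill_day(bucket: str, prefix: str, local_file: str) -> None:
    # Clear first, then upload: running this twice leaves the bucket in
    # the same state as running it once, i.e. the backfill is idempotent.
    clear_prefix(bucket, prefix)
    s3.upload_file(local_file, bucket, prefix + local_file)


backfill_day(BUCKET, PREFIX, "the_new_file.parquet")
```

Whatever the exact shape, the key property is that the delete and the upload are scoped to the same date/district prefix, so the bucket converges to one copy of the data no matter how many times the backfill runs.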
