From Ian:
Currently the AWS uploader script only uploads files to S3. Most of the files that are uploaded have a unique hash in the file name. This means that if we need to backfill a date that, e.g., has three out of four of the expected files, we get a large number of duplicate files covering the same time frame. If we backfill the same date multiple times, we get an accumulation of largely duplicate files in the S3 bucket. This makes it difficult to know how to load data from the bucket to Snowflake, and more expensive to deduplicate within the data warehouse.
As an example, here are the files in the bucket for 2024-07-10 in District 10. It was first filled on 2024-07-11, and then received new uploads on 2024-07-18 and 2024-07-22. Any attempt to backfill from this bucket into Snowflake will have to do extra work to deduplicate these files when loading, and it makes it more difficult to point to the bucket as an authoritative data lake source.
Instead, when backfilling data we should clear out the files from the S3 bucket for the date and districts we are backfilling. This would ensure that we could run the backfill as many times as we want and the S3 bucket would end up in the same state, with just one copy of the data per date.
This change would not need to be a heavy lift, and could be done with the AWS CLI:
```sh
aws s3 rm --recursive s3://caltrans-pems-prd-us-west-2-raw/db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/
aws s3 cp the_new_file.parquet s3://caltrans-pems-prd-us-west-2-raw/db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/
```
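If it turns out to be more convenient to do the clear-then-upload step inside the uploader script itself rather than shelling out to the CLI, a minimal boto3 sketch of the same idea might look like the following. The bucket, prefix, and file name are taken from the example above; the function names (`clear_partition`, `backfill_partition`) are hypothetical, not anything in the existing script.

```python
# Sketch of an idempotent backfill step: delete everything under the
# date/district prefix, then upload the freshly exported file, so re-running
# the backfill always leaves exactly one copy of the data for that date.
import boto3

s3 = boto3.client("s3")

BUCKET = "caltrans-pems-prd-us-west-2-raw"
PREFIX = "db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/"


def clear_partition(bucket: str, prefix: str) -> None:
    """Delete every object under the given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each listing page holds at most 1000 keys, which is also the
            # per-request limit for delete_objects.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})


def backfill_partition(local_file: str, bucket: str, prefix: str) -> None:
    """Clear the partition, then upload the new export."""
    clear_partition(bucket, prefix)
    s3.upload_file(local_file, bucket, prefix + local_file.rsplit("/", 1)[-1])


if __name__ == "__main__":
    backfill_partition("the_new_file.parquet", BUCKET, PREFIX)
```

Either way, the key property is the same as with the CLI commands: the delete is scoped to a single date/district prefix, so repeated backfills are safe and the bucket stays authoritative.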