From Ian:
Currently the AWS uploader script only uploads files to S3. Most of the files that are uploaded have a unique hash in the file name. This means that if we need to backfill a date that, e.g., has three out of four of the expected files, we get a large number of duplicate files covering the same time frame. If we backfill the same date multiple times, we get an accumulation of largely duplicate files in the S3 bucket. This makes it difficult to know how to load data from the bucket to Snowflake, and more expensive to deduplicate within the data warehouse.
As an example, here are the files in the bucket for 2024-07-10 in District 10. It was first filled on 2024-07-11, and then received new uploads on 2024-07-18 and 2024-07-22. Any attempt to backfill from this bucket into Snowflake will have to do extra work to deduplicate these files when loading, and it makes it more difficult to point to the bucket as an authoritative data lake source.
Instead, when backfilling data we should clear out the files from the S3 bucket for the date and districts we are backfilling. This would ensure that we could run the backfill as many times as we want and the S3 bucket would end up in the same state, with just one copy of the data per date.
This change would not need to be a heavy lift, and could be done with the AWS CLI:
```sh
aws s3 rm --recursive s3://caltrans-pems-prd-us-west-2-raw/db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/
aws s3 cp the_new_file.parquet s3://caltrans-pems-prd-us-west-2-raw/db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/
```
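If it turns out to be more convenient to do the clear-then-upload step inside the uploader script itself rather than shelling out to the CLI, a minimal boto3 sketch of the same idea might look like the following. The bucket, prefix, and file name are taken from the example above; the function names (`clear_partition`, `backfill_partition`) are hypothetical, not anything in the existing script.

```python
# Sketch of an idempotent backfill step: delete everything under the
# date/district prefix, then upload the freshly exported file, so re-running
# the backfill always leaves exactly one copy of the data for that date.
import boto3

s3 = boto3.client("s3")

BUCKET = "caltrans-pems-prd-us-west-2-raw"
PREFIX = "db96_export_staging_area/tables/VDS30SEC/district=D10/year=2024/month=7/day=10/"


def clear_partition(bucket: str, prefix: str) -> None:
    """Delete every object under the given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each listing page holds at most 1000 keys, which is also the
            # per-request limit for delete_objects.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})


def backfill_partition(local_file: str, bucket: str, prefix: str) -> None:
    """Clear the partition, then upload the new export."""
    clear_partition(bucket, prefix)
    s3.upload_file(local_file, bucket, prefix + local_file.rsplit("/", 1)[-1])


if __name__ == "__main__":
    backfill_partition("the_new_file.parquet", BUCKET, PREFIX)
```

Either way, the key property is the same as with the CLI commands: the delete is scoped to a single date/district prefix, so repeated backfills are safe and the bucket stays authoritative.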