AMP-97070 Move Spark metadata upon transfer completion #8
base: AMP-96980
Conversation
print(f'Identified bucket: {bucket}, prefix: {prefix}')

# List all files in the s3_path directory
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
Will this return folders as well? If only files are returned, then we are good.
Only files. The same holds for listing objects in our Java repo using Amplitude's S3 wrapper. The example job mentioned in the description has logs of what was discovered (note that the /meta folder exists at the point this method executes and is not listed).
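To illustrate the point above: when ListObjectsV2 is called with Delimiter='/', S3 reports folder-like entries under CommonPrefixes rather than Contents, so iterating Contents yields only objects at that level. A minimal sketch against a hand-built sample response (the keys and prefix names here are hypothetical, not from the actual job):

```python
# Sample shape of a ListObjectsV2 response when Delimiter='/' is used:
# "subfolders" show up as CommonPrefixes, not as Contents entries.
sample_response = {
    'Contents': [
        {'Key': 'exports/part-00000.parquet'},
        {'Key': 'exports/_SUCCESS'},
    ],
    'CommonPrefixes': [
        {'Prefix': 'exports/meta/'},  # folder-like entry, absent from Contents
    ],
}

def file_keys(response):
    """Return only object keys (files) from a ListObjectsV2-style response."""
    return [obj['Key'] for obj in response.get('Contents', [])]

print(file_keys(sample_response))
# ['exports/part-00000.parquet', 'exports/_SUCCESS']
```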
if '://' not in s3_uri_with_spark_metadata:
    raise ValueError(f'Invalid s3 URI: {s3_uri_with_spark_metadata}. Expected to contain "://".')
bucket, prefix = s3_uri_with_spark_metadata.split('://')[1].split('/', 1)
bucket = replace_double_slashes_with_single_slash(bucket)
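The parsing above can be exercised standalone. This sketch reuses the split logic from the diff; the `replace_double_slashes_with_single_slash` helper is assumed to collapse repeated slashes (its actual implementation is not shown in this diff):

```python
import re

def replace_double_slashes_with_single_slash(s):
    # Assumed behavior: collapse any run of slashes into a single slash.
    return re.sub(r'/{2,}', '/', s)

def parse_s3_uri(s3_uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix), as in the diff."""
    if '://' not in s3_uri:
        raise ValueError(f'Invalid s3 URI: {s3_uri}. Expected to contain "://".')
    bucket, prefix = s3_uri.split('://')[1].split('/', 1)
    return replace_double_slashes_with_single_slash(bucket), prefix

print(parse_s3_uri('s3://my-bucket/path/to/exports/'))
# ('my-bucket', 'path/to/exports/')
```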
curious where double slashes are coming from?
For the bucket, there shouldn't be any. It's just sanitization in case an unnormalized s3 URI is provided (e.g. s3:////bucket////prefix////).
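A minimal sketch of the sanitization helper discussed here, assuming it simply collapses runs of slashes (the real implementation is not shown in this diff):

```python
import re

def replace_double_slashes_with_single_slash(s):
    # Collapse any run of two or more slashes into a single slash.
    return re.sub(r'/{2,}', '/', s)

print(replace_double_slashes_with_single_slash('/path//to///file/'))
# '/path/to/file/'
```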
    expected_output = '/path/to/file/with/double/slashes/end/'
    self.assertEqual(expected_output, replace_double_slashes_with_single_slash(input_string))

def test_move_spark_metadata_to_separate_s3_folder(self):
thanks for adding tests!
LGTM. Let's hold on merge until AMP-96980 is approved so the other PR can still focus on event mutation stuff.
Description
Move spark transfer metadata to a subdirectory of the target export S3 location
Testing
DatabricksToS3WorkerServiceIntegrationTest.testUnloadData
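As a rough illustration of the move described above: Spark writes its streaming/commit tracking files under a _spark_metadata directory next to the output, and this PR relocates them into a metadata subfolder of the export location. The key-mapping below is a hypothetical sketch (the meta/ folder name is taken from the review discussion; in the real job each mapped key would be an S3 copy followed by a delete):

```python
def destination_key(key, metadata_dir='_spark_metadata', meta_folder='meta'):
    """Map a Spark metadata key into a meta/ subfolder; None if not metadata."""
    if f'/{metadata_dir}' not in f'/{key}':
        return None  # not a Spark metadata file; leave it in place
    prefix, _, rest = key.partition(metadata_dir)
    return f'{prefix}{meta_folder}/{metadata_dir}{rest}'

print(destination_key('exports/run1/_spark_metadata/0'))
# 'exports/run1/meta/_spark_metadata/0'
print(destination_key('exports/run1/part-00000.parquet'))
# None
```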