Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL-encode partition field names in file locations #1457

Merged
merged 13 commits into from
Jan 2, 2025

Conversation

smaheshwar-pltr
Copy link
Contributor

@smaheshwar-pltr smaheshwar-pltr commented Dec 20, 2024

Closes #1458
Closes #175

Comment on lines 726 to 749
# Test that special characters are URL-encoded
(
[PartitionField(source_id=15, field_id=1001, transform=IdentityTransform(), name="special#string#field")],
["special string"],
Record(**{"special#string#field": "special string"}), # type: ignore
"special%23string%23field=special%20string",
f"""CREATE TABLE {identifier} (
`special#string#field` string
)
USING iceberg
PARTITIONED BY (
identity(`special#string#field`)
)
""",
f"""INSERT INTO {identifier}
VALUES
('special string')
""",
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I thought this test would work but it fails, and I'm not sure why yet.

It fails

assert expected_hive_partition_path_slice in spark_path_for_justification

with

E AssertionError: assert 'special%23string%23field=special%20string' in 's3://warehouse/default/test_table/data/special#string#field=special+string/00000-57-0756f620-2b2e-4ffa-97f6-625343525c9b-00001.parquet'

and it fails

assert spark_partition_for_justification == expected_partition_record

with

E AssertionError: assert Record[specia...ecial string'] == Record[specia...ecial string']
E - Record[special#string#field='special string']
E + Record[special_x23string_x23field='special string']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special_x23string_x23field is related to #590

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kevinjqliu. And the first failure is because the Iceberg version on spark was before apache/iceberg#10329, so it's not URL-encoded (I think).

Given this, I've disabled justification with a message similar to other tests here where behaviour differs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is sufficient (given we're testing PartitionKey's to_path, it felt natural but I'm unsure)?

If not, happy to be pointed to somewhere where I can add an integration test similar to the one shown in the issue. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache/iceberg@795fea9
should be available starting in 1.6.0.

It looks like pyspark is using an older library version, can you add this change as see if the test pass?
#1462

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is sufficient (given we're testing PartitionKey's to_path, it felt natural but I'm unsure)?

yea we're testing partition_to_path, maybe add a test in tests/table/test_partitioning.py, which is not part of the integration test suite.

The integration test is a nice to have though. lets see if upgrading the iceberg library helps

Copy link
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion - bumping made the path test fail because quote was being used instead of quote_plus for encoding (the Java implementation encodes spaces to + which quote doesn't do).

For consistency, I've made the change to match Java behaviour, but can revert that if consistency isn't so important - what do you think?.

A unit test sounds good (and integration for justification checks would be great).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! thats why we like integration tests :)
let me merge #1462 first, and then we can rebase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the Record comparison still fails because of the non-optional sanitisation-transformation described in apache/iceberg#10120. And, as it stands, the provided Record param is used to check key.partition so can't be changed because that should be unsanitised, IIUC.

Think some test rewiring might be required - maybe providing a separate record param that's by default the other one, just for this justification check, but I wonder if we're really just testing spark behaviour then.

@kevinjqliu
Copy link
Contributor

BTW #1462 is merged, could you rebase this PR?

spec_id=3,
)

record = Record(**{"my#str%bucket": "my+str", "other str+bucket": "( )", "my!int:bucket": 10}) # type: ignore
Copy link
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy complains here and elsewhere but I think it's fine


# Both partition names fields and values should be URL encoded, with spaces mapping to plus signs, to match the Java
# behaviour: https://github.com/apache/iceberg/blob/ca3db931b0f024f0412084751ac85dd4ef2da7e7/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L198-L204
assert spec.partition_to_path(record, schema) == "my%23str%25bucket=my%2Bstr/other+str%2Bbucket=%28+%29/my%21int%3Abucket=10"
Copy link
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-checked with Java implementation (integration tests will do this eventually), in particular WRT to ' ' and '+' encoding. It is consistent.

@smaheshwar-pltr
Copy link
Contributor Author

Done, @kevinjqliu. Fails due to #1457 (comment) but will think over it.

FYI, am away for a little bit now so will pick this back up shortly in the new year! (Feel free to take over if v urgent 😄)

@kevinjqliu
Copy link
Contributor

Thanks for the PR! I've dug into the test failure a bit. Heres what I found.

There's a subtle difference between PartitionKey.partition and DataFile.partition. In most cases, these are the same value. For strings with special characters, DataFile.partition is sanitized but PartitionKey.partition is not.

DataFile.partition is sanitized according to apache/iceberg/#10120 this is to match the column value stored in the underlying parquet file.
PartitionKey.partition uses the value from the PartitionSpec which stores the un-sanitized value.

You can verify this by looking up the table partition spec.

iceberg_table.metadata.spec()
iceberg_table.metadata.specs()

The integration test assumes that the value for PartitionKey.partition and DataFile.partition is the same.
One possible solution is to sanitize the given Record before comparison

After spark_path_for_justification,

        # Special characters in partition value are sanitized when written to the data file's partition field
        # Use `make_compatible_name` to match the sanitize behavior
        sanitized_record = Record(**{make_compatible_name(k): v for k, v in vars(expected_partition_record).items()})
        assert spark_partition_for_justification == sanitized_record

@smaheshwar-pltr
Copy link
Contributor Author

Thanks a lot for this explanation and suggestion @kevinjqliu! It sounds good.

Had some time so I've made this change so tests pass - using make_compatible_name as a param that can be specified on a per-case basis (instead of porting over all this logic into tests). I also made it Optional so each subtest doesn't have to specify the identity transform but None.

@@ -234,9 +234,11 @@ def partition_to_path(self, data: Record, schema: Schema) -> str:
partition_field = self.fields[pos]
value_str = partition_field.transform.to_human_string(field_types[pos].field_type, value=data[pos])

value_str = quote(value_str, safe="")
value_str = quote_plus(value_str, safe="")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It defaults to utf-8, so that's good 👍

Comment on lines +240 to +241
field_str = quote_plus(partition_field.name, safe="")
field_strs.append(field_str)
Copy link
Contributor

@Fokko Fokko Dec 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, I would just collapse these:

Suggested change
field_str = quote_plus(partition_field.name, safe="")
field_strs.append(field_str)
field_strs.append(quote_plus(partition_field.name, safe=""))

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smaheshwar-pltr for picking this up, and thanks for @kevinjqliu the review. I'll leave this one open in case Kevin has any further comments.

VALUES
('special string')
""",
lambda name: name.replace("#", "_x23").replace("+", "_x2B"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we just reuse the make_compatible_name function here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was conflicted about this:

  • this sanitisation felt unique to this test instance so a parameter seemed best
  • alternatively, given the schema with these two special characters is specified at the top of the file (so all the test instances of this test use that schema), it's reasonable to use the same sanitisation for them all. Maybe having it as a top-level function beside the schema definition would best highlight this

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concern was around providing context on "why" this lambda is here and implemented the way it is. This comment refers to this function which is also used in the write path

to me, it make sense to use the same function to show that we're doing the same thing as what the underlying system is doing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, I missed this function completely 😅 . Now it makes more sense 👍

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for working on this @smaheshwar-pltr
A few nit comments left on this PR but not blocking. Going to merge this PR as is and we can deal with nit comment as a followup

@kevinjqliu kevinjqliu merged commit 5da1f4d into apache:main Jan 2, 2025
7 checks passed
@smaheshwar-pltr smaheshwar-pltr deleted the encode-partition-names branch January 8, 2025 17:18
@smaheshwar-pltr
Copy link
Contributor Author

Sounds good, thanks Kevin for your excellent review here!

@smaheshwar-pltr
Copy link
Contributor Author

Going to merge this PR as is and we can deal with nit comment as a followup

I've put up #1499 for the nits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants