Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading of proto Uint32Value #3113

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

0x26res
Copy link

@0x26res 0x26res commented Jan 6, 2025

Rationale for this change

Fix for #3112

What changes are included in this PR?

  • Uint32Value primitive / physical type switched to INT32 (from INT64)
  • Reading value for UInt32Value now uses int instead of long. In order to be backward compatible with previous parquet files (that were using long), I've left the addLong implementation of the ProtoUInt32ValueConverter

Are these changes tested?

There was extensive test coverage of this feature for round trip reading. The main change is the physical / underlying type change from int32 to int64 which is tested.

There are however no test for backward compatibility.

Are there any user-facing changes?

The parquet file generated from protobuf using Uint32Value will now use int32 instead of int64.

Related MR:

Closes #3112

@@ -260,7 +260,7 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addF
return builder.primitive(INT32, getRepetition(descriptor));
}
if (messageType.equals(UInt32Value.getDescriptor())) {
return builder.primitive(INT64, getRepetition(descriptor));
return builder.primitive(INT32, getRepetition(descriptor));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An unsigned 32-bit integer might overflow a signed 32-bit integer, right? It would be good to test this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added a test using the min and max value of Integer and doing a round trip to make sure we get it back. It looks a bit weird because uint32 are represented as signed 32 bits integer in java, but it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test really test the out of bounds case. The Integer.MAX_VALUE is 2^31-1 = 2147483647, but the unsigned integer could go up to 2^32-1 = 4294967295. Maybe the original author @mwong38 has an opinion on this (I'm not a protobuf expert myself).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my understanding.

uint32 is not a java concept. Therefore a protobuf field of type uint32, in the java generated code, will be treated as signed 32 bit integer. What java calls an int.

Something like:

message TestMessage {
  uint32 value = 1;
}

Will generate setters and getters that use an int:

 public Builder setValue(int value) // ....

However if I set the value to a negative value, for example builder.setValue(-1) and call toString, it will correctly interpret it as 4294967295.

Again what's important to note is that the underlying representation of a uint32 protobuf type in java is (32bit) int. Say I am to store, from a language that support unsigned integers (eg python), the value of a uint32 to 4294967295. Then if I read it in java I would get -1. But if I serialize the message again and pass it back to C or python, I would get back 4294967295 which is what I want.

I've extended the test to illustrate the behaviour better (adding calls to toString). Also I've added -1 on top of Integer.MIN_VALUE and Integer.MAX_VALUE to correctly test the full spectrum of unsigned integer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fail to read google.protobuf.UInt32Value from parquet
2 participants