GH-3116: Implement the Variant binary encoding #3117

gene-db · 2025-01-07T21:14:58Z

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Fokko

Thanks for working on this @gene-db! I left some comments, but this is looking good.

Fokko · 2025-01-14T15:20:18Z

parquet-variant/pom.xml

+      <version>${slf4j.version}</version>
+      <scope>test</scope>
+    </dependency>
+    <dependency>


How about moving this one up next to the Jackson dependency, so that we group the scopes together?

Fokko · 2025-01-20T15:52:28Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+import static java.time.temporal.ChronoField.*;
+import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
+import static org.apache.parquet.variant.VariantUtil.*;


We try to avoid * imports. Even better would be to get rid of the static imports altogether.

Fokko · 2025-01-20T15:53:45Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    this.pos = pos;
+    // There is currently only one allowed version.
+    if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) {
+      throw malformedVariant();


How about mentioning which version was found instead?

I agree. It would be nice to have an error message like "Unsupported variant metadata version: %s".
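A minimal sketch of what such a check could look like. The constant values (`VERSION = 1`, `VERSION_MASK = 0x0F`) and the use of `IllegalArgumentException` are assumptions for illustration; the PR defines its own constants and its own `malformedVariant()` helper.

```java
final class MetadataVersionCheck {
  // Assumed values for illustration only; the PR keeps its own constants elsewhere.
  static final int VERSION = 1;
  static final int VERSION_MASK = 0x0F;

  /** Validates the metadata header byte and reports the version that was actually found. */
  static void checkVersion(byte[] metadata) {
    if (metadata.length < 1) {
      throw new IllegalArgumentException("Malformed variant metadata: empty buffer");
    }
    int found = metadata[0] & VERSION_MASK;
    if (found != VERSION) {
      // Stand-in exception type; the message format follows the review suggestion.
      throw new IllegalArgumentException(
          String.format("Unsupported variant metadata version: %s", found));
    }
  }
}
```

Splitting the empty-buffer case from the version mismatch keeps both messages specific.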

Fokko · 2025-01-20T15:54:57Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
+      // Use linear search for a short list. Switch to binary search when the length reaches
+      // `BINARY_SEARCH_THRESHOLD`.
+      final int BINARY_SEARCH_THRESHOLD = 32;


Move this one to the class level? We can use it in the tests as well to ensure we test both branches.
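To illustrate the idea, here is a hedged sketch with the threshold hoisted to a class-level constant so a test can size its input just below and exactly at the switch-over point. `FieldLookup`, `indexOf`, and `sortedKeys` are hypothetical names, and the lookup works on a plain sorted `String[]` rather than the PR's encoded object layout.

```java
import java.util.Arrays;
import java.util.Locale;

final class FieldLookup {
  // Class-level constant, visible to tests that want to exercise both branches.
  static final int BINARY_SEARCH_THRESHOLD = 32;

  /** Returns the index of key in sortedKeys, or -1 if it is absent. */
  static int indexOf(String[] sortedKeys, String key) {
    if (sortedKeys.length < BINARY_SEARCH_THRESHOLD) {
      // Linear search for short lists.
      for (int i = 0; i < sortedKeys.length; i++) {
        if (sortedKeys[i].equals(key)) {
          return i;
        }
      }
      return -1;
    }
    // Binary search once the list reaches the threshold.
    int found = Arrays.binarySearch(sortedKeys, key);
    return found >= 0 ? found : -1;
  }

  /** Test helper: builds n keys in sorted order (k000, k001, ...). */
  static String[] sortedKeys(int n) {
    String[] keys = new String[n];
    for (int i = 0; i < n; i++) {
      keys[i] = String.format(Locale.ROOT, "k%03d", i);
    }
    return keys;
  }
}
```

A test can then call `indexOf(sortedKeys(BINARY_SEARCH_THRESHOLD - 1), ...)` and `indexOf(sortedKeys(BINARY_SEARCH_THRESHOLD), ...)` to cover both branches without hard-coding 32.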

Fokko · 2025-01-20T15:56:17Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+      if (index < 0 || index >= size) {
+        throw malformedVariant();
+      }


This looks inconsistent with getFieldAtIndex, where we return null. Let's raise an exception at line 220 as well.

Fokko · 2025-01-23T10:44:21Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+    if (value <= U8_MAX) return 1;
+    if (value <= U16_MAX) return 2;
+    return U24_SIZE;


Suggested change

-    if (value <= U8_MAX) return 1;
-    if (value <= U16_MAX) return 2;
+    if (value <= U8_MAX) return U8_SIZE;
+    if (value <= U16_MAX) return U16_SIZE;
     return U24_SIZE;

Fokko · 2025-01-23T10:46:16Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


I think this is lossy, and I'd rather raise an exception.
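A small sketch of why the floating-point fallback loses information, and what a strict alternative might look like. `StrictNumberParsing` and both method names are hypothetical, and the precision limit of 38 is an assumption for illustration rather than something taken from this PR.

```java
import java.math.BigDecimal;

final class StrictNumberParsing {
  // Assumed maximum decimal precision, for illustration only.
  static final int MAX_DECIMAL_PRECISION = 38;

  /** Parses text exactly, or throws instead of silently rounding. */
  static BigDecimal parseExact(String text) {
    BigDecimal d = new BigDecimal(text);
    if (d.precision() > MAX_DECIMAL_PRECISION) {
      throw new ArithmeticException(
          "Number does not fit a decimal of precision " + MAX_DECIMAL_PRECISION + ": " + text);
    }
    return d;
  }

  /** Returns true only if converting text to a double preserves its exact value. */
  static boolean roundTripsThroughDouble(String text) {
    double asDouble = Double.parseDouble(text);
    if (Double.isInfinite(asDouble) || Double.isNaN(asDouble)) {
      return false;
    }
    // new BigDecimal(double) yields the exact binary value of the double.
    return new BigDecimal(text).compareTo(new BigDecimal(asDouble)) == 0;
  }
}
```

For example, `9223372036854775809` (one past `2^63`) does not fit a `long` and does not survive the trip through `double`, but it can still be kept exactly as a decimal.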

Fokko · 2025-01-23T10:50:48Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+  public int addKey(String key) {
+    int id;
+    if (dictionary.containsKey(key)) {
+      id = dictionary.get(key);
+    } else {
+      id = dictionaryKeys.size();
+      dictionary.put(key, id);
+      dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
+    }
+    return id;
+  }


Suggested change

-  public int addKey(String key) {
-    int id;
-    if (dictionary.containsKey(key)) {
-      id = dictionary.get(key);
-    } else {
-      id = dictionaryKeys.size();
-      dictionary.put(key, id);
-      dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
-    }
-    return id;
-  }
+  public int addKey(String key) {
+    return dictionary.computeIfAbsent(key, newKey -> {
+      int id = dictionaryKeys.size();
+      dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8));
+      return id;
+    });
+  }

Fokko · 2025-01-23T12:57:11Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+ * Builder for creating Variant value and metadata.
+ */
+public class VariantBuilder {
+  public VariantBuilder(boolean allowDuplicateKeys) {


Why would we allow this? This isn't allowed by the spec.

Fokko · 2025-01-23T12:58:53Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+   * @param l the long value to append
+   */
+  public void appendLong(long l) {
+    checkCapacity(1 + 8);


Shouldn't we base the capacity check on what we actually write? Same for the decimal below.

Wouldn't it make more sense to do this check in writeLong?
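One way to read this suggestion, sketched with a hypothetical `GrowableBuffer` class: the write method grows the buffer by exactly the number of bytes it is about to emit, so callers no longer need to pre-compute sizes such as `1 + 8`. The little-endian byte order and the doubling growth policy are assumptions made for the example.

```java
import java.util.Arrays;

final class GrowableBuffer {
  private byte[] buf = new byte[4];
  private int pos = 0;

  /** Grows the backing array if fewer than `additional` bytes remain. */
  private void checkCapacity(int additional) {
    int required = pos + additional;
    if (required > buf.length) {
      buf = Arrays.copyOf(buf, Math.max(required, buf.length * 2));
    }
  }

  /** Writes the low numBytes bytes of value, little-endian, checking capacity itself. */
  void writeLong(long value, int numBytes) {
    checkCapacity(numBytes);
    for (int i = 0; i < numBytes; i++) {
      buf[pos++] = (byte) (value >>> (8 * i));
    }
  }

  int size() {
    return pos;
  }

  byte[] toByteArray() {
    return Arrays.copyOf(buf, pos);
  }
}
```

With the check inside `writeLong`, the requested capacity cannot drift out of sync with the bytes actually written.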

rdblue · 2025-01-23T23:42:40Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  public byte[] getValue() {
+    if (pos == 0) return value;


Why assume that the size is correct when pos is 0? Is it that we don't care about extra bytes unless we are going to copy? If so, maybe mention it in a comment.

Also, in Parquet I think that we always use curly braces even if they are unnecessary.

rdblue · 2025-01-23T23:46:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.
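A hedged sketch of that pattern: a hypothetical `ByteSlice` that pairs an array with an explicit offset and length, so metadata and value can share one backing array or live in separate arrays without the reader guessing where each one ends.

```java
import java.util.Arrays;

final class ByteSlice {
  private final byte[] array;
  private final int offset;
  private final int length;

  ByteSlice(byte[] array, int offset, int length) {
    // Written to avoid int overflow in offset + length.
    if (offset < 0 || length < 0 || offset > array.length - length) {
      throw new IndexOutOfBoundsException(
          "offset=" + offset + ", length=" + length + ", array.length=" + array.length);
    }
    this.array = array;
    this.offset = offset;
    this.length = length;
  }

  int length() {
    return length;
  }

  /** Reads a byte relative to the start of the slice. */
  byte get(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index=" + index + ", length=" + length);
    }
    return array[offset + index];
  }

  /** Copies exactly the bytes covered by this slice. */
  byte[] toByteArray() {
    return Arrays.copyOfRange(array, offset, offset + length);
  }
}
```

The JDK's `ByteBuffer.wrap(array, offset, length)` offers a similar view if a dedicated class is not wanted.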

rdblue · 2025-01-23T23:47:20Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  /**
+   * @return the type info bits from a variant value


What does "type info" mean? It is not a term from the encoding spec.

rdblue · 2025-01-23T23:52:02Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+
+  // Get the object field at the `index` slot. Return null if `index` is out of the bound of
+  // `[0, objectSize())`.
+  // It is only legal to call it when `getType()` is `Type.OBJECT`.


Duplicate comment?

rdblue · 2025-01-23T23:57:49Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+
+  /**
+   * @param zoneId The ZoneId to use for formatting timestamps
+   * @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps


I don't think this is allowed by the JSON conversion spec either.

rdblue · 2025-01-23T23:59:44Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  private static void toJsonImpl(
+      byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) {


Because this already relies on Jackson's generator, I think it would be far safer to use the generator rather than a string builder.

rdblue · 2025-01-24T00:02:07Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+          sb.append('{');
+          for (int i = 0; i < size; ++i) {
+            int id = readUnsigned(value, idStart + idSize * i, idSize);
+            int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);


The logic here is copied in multiple places. I think it would be better to avoid copying. Instead, why not use an approach similar to getFieldAtIndex combined with handleObject? You could either use an Iterator or accept a lambda for each field.
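A rough sketch of the lambda-per-field variant. `ObjectFields`, `FieldHandler`, and `forEachField` are hypothetical names; the parameters mirror the ones visible in the snippet above (`idStart`, `idSize`, `offsetStart`, `offsetSize`), and `readUnsigned` here is a simplified little-endian stand-in for the PR's helper.

```java
final class ObjectFields {
  /** Callback invoked once per object field. */
  @FunctionalInterface
  interface FieldHandler {
    void accept(int index, int id, int offset);
  }

  /** Simplified little-endian read; assumes numBytes is small enough to stay non-negative. */
  static int readUnsigned(byte[] bytes, int pos, int numBytes) {
    int result = 0;
    for (int i = 0; i < numBytes; i++) {
      result |= (bytes[pos + i] & 0xFF) << (8 * i);
    }
    return result;
  }

  /** Decodes each field's id and offset in one place and hands them to the handler. */
  static void forEachField(
      byte[] value,
      int size,
      int idStart,
      int idSize,
      int offsetStart,
      int offsetSize,
      FieldHandler handler) {
    for (int i = 0; i < size; i++) {
      int id = readUnsigned(value, idStart + idSize * i, idSize);
      int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);
      handler.accept(i, id, offset);
    }
  }
}
```

JSON conversion, field lookup, and any other consumer could then pass a lambda instead of repeating the two `readUnsigned` calls.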

rdblue · 2025-01-24T00:04:13Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+    // in case of pathological data.
+    long maxSize = Math.max(dictionaryStringSize, numKeys);
+    if (maxSize > sizeLimitBytes) {
+      throw new VariantSizeLimitException();


I think this should have a good error message with the estimated size.
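A possible shape for such a message, as a sketch only. The exception name comes from the snippet above, but the two-argument constructor, the `SizeLimitCheck` wrapper, and the message wording are assumptions.

```java
final class SizeLimitCheck {
  /** Stand-in for the PR's exception, extended with a descriptive message. */
  static final class VariantSizeLimitException extends RuntimeException {
    VariantSizeLimitException(long estimatedBytes, long limitBytes) {
      super("Variant size limit exceeded: estimated at least " + estimatedBytes
          + " bytes, limit is " + limitBytes + " bytes");
    }
  }

  /** Throws if the estimated metadata size exceeds the configured limit. */
  static void check(long dictionaryStringSize, long numKeys, long sizeLimitBytes) {
    long maxSize = Math.max(dictionaryStringSize, numKeys);
    if (maxSize > sizeLimitBytes) {
      throw new VariantSizeLimitException(maxSize, sizeLimitBytes);
    }
  }
}
```

Including both the estimate and the limit lets a user tell whether the input is slightly over or wildly over the configured bound.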

gene-db added 8 commits January 6, 2025 13:21

c3c71b7 Implement Variant encoding
c5d19e6 remove optional
0086b34 split test
5af337f cleanup
5997732 cleanup comment
de96bac Run mvn spotless:apply
848ddcb Fix dependencies
1a448ea Fix tests for older jdk versions

Fokko reviewed Jan 23, 2025


rdblue reviewed Jan 23, 2025


rdblue reviewed Jan 24, 2025
