Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-3116: Implement the Variant binary encoding #3117

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

gene-db
Copy link

@gene-db gene-db commented Jan 7, 2025

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @gene-db! I left some comments, but this is looking good

<version>${slf4j.version}</version>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this one up with jackson we group the scopes together.

Comment on lines 38 to 40
import static java.time.temporal.ChronoField.*;
import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
import static org.apache.parquet.variant.VariantUtil.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid * imports. Even better would be to get rid of the static imports altogether.

this.pos = pos;
// There is currently only one allowed version.
if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) {
throw malformedVariant();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about mentioning which version was found instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It would be nice to have an error message like "Unsupported variant metadata version: %s".

return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
// Use linear search for a short list. Switch to binary search when the length reaches
// `BINARY_SEARCH_THRESHOLD`.
final int BINARY_SEARCH_THRESHOLD = 32;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this one to the class level? We can use it in the tests as well to ensure we test both branches.

Comment on lines +238 to +240
if (index < 0 || index >= size) {
throw malformedVariant();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.

Comment on lines +580 to +582
if (value <= U8_MAX) return 1;
if (value <= U16_MAX) return 2;
return U24_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (value <= U8_MAX) return 1;
if (value <= U16_MAX) return 2;
return U24_SIZE;
if (value <= U8_MAX) return U8_SIZE;
if (value <= U16_MAX) return U16_SIZE;
return U24_SIZE;

Comment on lines +551 to +552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is lossy, and I'd rather raise an exception

Comment on lines +267 to +277
public int addKey(String key) {
int id;
if (dictionary.containsKey(key)) {
id = dictionary.get(key);
} else {
id = dictionaryKeys.size();
dictionary.put(key, id);
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
}
return id;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public int addKey(String key) {
int id;
if (dictionary.containsKey(key)) {
id = dictionary.get(key);
} else {
id = dictionaryKeys.size();
dictionary.put(key, id);
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
}
return id;
}
public int addKey(String key) {
return dictionary.computeIfAbsent(key, newKey -> {
int id = dictionaryKeys.size();
dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8));
return id;
});
}

* Builder for creating Variant value and metadata.
*/
public class VariantBuilder {
public VariantBuilder(boolean allowDuplicateKeys) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we allow this? This isn't allowed by the spec

* @param l the long value to append
*/
public void appendLong(long l) {
checkCapacity(1 + 8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we make the check-capacity based on what we write? Same for the decimal below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to do this check in writeLong?

}

public byte[] getValue() {
if (pos == 0) return value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why assume that the size is correct when pos is 0? Is it that we don't care about extra bytes unless we are going to copy? If so, maybe mention it in a comment.

Also, in Parquet I think that we always use curly braces even if they are unnecessary.

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

}

/**
* @return the type info bits from a variant value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "type info" mean? It is not a term from the encoding spec.


// Get the object field at the `index` slot. Return null if `index` is out of the bound of
// `[0, objectSize())`.
// It is only legal to call it when `getType()` is `Type.OBJECT`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate comment?


/**
* @param zoneId The ZoneId to use for formatting timestamps
* @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is allowed by the JSON conversion spec either.

}

private static void toJsonImpl(
byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this already relies on Jackson's generator, I think it would be far safer to use the generator rather than a string builder.

sb.append('{');
for (int i = 0; i < size; ++i) {
int id = readUnsigned(value, idStart + idSize * i, idSize);
int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is copied in multiple places. I think it would be better to avoid copying. Instead, why not use an approach similar to getFieldAtIndex combined with handleObject? You could either use an Iterator or accept a lambda for each field.

// in case of pathological data.
long maxSize = Math.max(dictionaryStringSize, numKeys);
if (maxSize > sizeLimitBytes) {
throw new VariantSizeLimitException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should have a good error message with the estimated size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the Variant binary encoding
3 participants