-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-3116: Implement the Variant binary encoding #3117
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @gene-db! I left some comments, but this is looking good
parquet-variant/pom.xml
Outdated
<version>${slf4j.version}</version> | ||
<scope>test</scope> | ||
</dependency> | ||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this one up with jackson we group the scopes together.
import static java.time.temporal.ChronoField.*; | ||
import static java.time.temporal.ChronoField.SECOND_OF_MINUTE; | ||
import static org.apache.parquet.variant.VariantUtil.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to avoid *
imports. Even better would be to get rid of the static imports altogether.
this.pos = pos; | ||
// There is currently only one allowed version. | ||
if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) { | ||
throw malformedVariant(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about mentioning which version was found instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It would be nice to have an error message like "Unsupported variant metadata version: %s"
.
return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> { | ||
// Use linear search for a short list. Switch to binary search when the length reaches | ||
// `BINARY_SEARCH_THRESHOLD`. | ||
final int BINARY_SEARCH_THRESHOLD = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this one to the class level? We can use it in the tests as well to ensure we test both branches.
if (index < 0 || index >= size) { | ||
throw malformedVariant(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks inconsistent with the getFieldAtIndex
where we return a null
. Let's raise an exception at line 220 as well.
if (value <= U8_MAX) return 1; | ||
if (value <= U16_MAX) return 2; | ||
return U24_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (value <= U8_MAX) return 1; | |
if (value <= U16_MAX) return 2; | |
return U24_SIZE; | |
if (value <= U8_MAX) return U8_SIZE; | |
if (value <= U16_MAX) return U16_SIZE; | |
return U24_SIZE; |
// If the value doesn't fit any integer type, parse it as decimal or floating instead. | ||
parseAndAppendFloatingPoint(parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is lossy, and I'd rather raise an exception
public int addKey(String key) { | ||
int id; | ||
if (dictionary.containsKey(key)) { | ||
id = dictionary.get(key); | ||
} else { | ||
id = dictionaryKeys.size(); | ||
dictionary.put(key, id); | ||
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8)); | ||
} | ||
return id; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public int addKey(String key) { | |
int id; | |
if (dictionary.containsKey(key)) { | |
id = dictionary.get(key); | |
} else { | |
id = dictionaryKeys.size(); | |
dictionary.put(key, id); | |
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8)); | |
} | |
return id; | |
} | |
public int addKey(String key) { | |
return dictionary.computeIfAbsent(key, newKey -> { | |
int id = dictionaryKeys.size(); | |
dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8)); | |
return id; | |
}); | |
} |
* Builder for creating Variant value and metadata. | ||
*/ | ||
public class VariantBuilder { | ||
public VariantBuilder(boolean allowDuplicateKeys) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we allow this? This isn't allowed by the spec
* @param l the long value to append | ||
*/ | ||
public void appendLong(long l) { | ||
checkCapacity(1 + 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we make the check-capacity based on what we write? Same for the decimal below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make more sense to do this check in writeLong
?
} | ||
|
||
public byte[] getValue() { | ||
if (pos == 0) return value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why assume that the size is correct when pos
is 0? Is it that we don't care about extra bytes unless we are going to copy? If so, maybe mention it in a comment.
Also, in Parquet I think that we always use curly braces even if they are unnecessary.
return Arrays.copyOfRange(value, pos, pos + size); | ||
} | ||
|
||
public byte[] getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of byte[]
seems awkward given the assumptions that are made. It looks like the intent is for value
and metadata
to either be two separate arrays starting at offset 0, or a single array with metadata
coming first followed by value
at pos
(but in this case, the array is passed to the constructor twice).
A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.
} | ||
|
||
/** | ||
* @return the type info bits from a variant value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "type info" mean? It is not a term from the encoding spec.
|
||
// Get the object field at the `index` slot. Return null if `index` is out of the bound of | ||
// `[0, objectSize())`. | ||
// It is only legal to call it when `getType()` is `Type.OBJECT`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate comment?
|
||
/** | ||
* @param zoneId The ZoneId to use for formatting timestamps | ||
* @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is allowed by the JSON conversion spec either.
} | ||
|
||
private static void toJsonImpl( | ||
byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this already relies on Jackson's generator, I think it would be far safer to use the generator rather than a string builder.
sb.append('{'); | ||
for (int i = 0; i < size; ++i) { | ||
int id = readUnsigned(value, idStart + idSize * i, idSize); | ||
int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is copied in multiple places. I think it would be better to avoid copying. Instead, why not use an approach similar to getFieldAtIndex
combined with handleObject
? You could either use an Iterator
or accept a lambda for each field.
// in case of pathological data. | ||
long maxSize = Math.max(dictionaryStringSize, numKeys); | ||
if (maxSize > sizeLimitBytes) { | ||
throw new VariantSizeLimitException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should have a good error message with the estimated size.
Rationale for this change
This is a reference implementation for the Variant binary format.
What changes are included in this PR?
A new module for encoding/decoding the Variant binary format.
Are these changes tested?
Added unit tests
Are there any user-facing changes?
No
Closes #3116