[Question] binary without schema embedded #87
FWIW, the Avro spec does not specify a format for this. Typical behavior is to write segments of 10 or so rows with the embedded schema at the front (github.com/linkedin/goavro), to write a "schema ID" at the start of emitted rows (bottledwater; presumably the schema ID is the 128-bit MD5 specified in the RPC handshake section of the spec), or to just write out JSON rows (e.g. the Confluent Kafka REST proxy and at least one database -> Kafka CDC tool I looked at). This is because, unlike Thrift, Protocol Buffers, etc., Avro's binary format is not forward compatible. This "saves space" on larger files and "forces everyone to implement the schema protocol", or something like that.
The Avro object container file format includes the schema so that a reader in the future can parse the file even if schemas have changed since. Strictly speaking, though, the schema is not necessary as long as the receiving/reading end knows the schema of what it's getting. That schema could simply be hard-coded or agreed upon by the communicating ends, or communicated some way other than sending an object container file, such as by sending the MD5 hash of the schema before sending the record (which is what the Avro RPC protocol does, for example). How you implement that is not handled by any formal part of the Avro spec.

Important note: Avro binary serialization is not inherently forwards or backwards compatible unless the reader knows the exact schema the record was encoded with. Any change, including adding fields, adding defaults, adding new options to a type union, or even adding entries to an enum, produces a new and different schema, and a reader that doesn't know it is looking at a different schema is likely to fail.
I don't dispute any of that. However, I should issue a correction I've discovered: the Confluent platform has invented its own Avro binary format for efficient binary representation of a single row. I thought it was writing JSON, but it appears I read the Java sources wrong. The row format consists of a null byte, a 32-bit schema ID, and then binary data column by column. I'm not sure how the 32-bit schema ID is generated; it's nothing canonical (it might be an identifier allocated by the Kafka Schema Registry).
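The framing described above is simple to reproduce with the standard library. A sketch, assuming a zero magic byte and a big-endian 32-bit ID (the ID value `7` below is made up; in practice it would come from a schema registry):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frame prepends the single-row header described above: one zero "magic"
// byte, a big-endian 32-bit schema ID, then the raw Avro binary payload.
func frame(schemaID uint32, avroPayload []byte) []byte {
	buf := make([]byte, 5+len(avroPayload))
	buf[0] = 0 // magic byte
	binary.BigEndian.PutUint32(buf[1:5], schemaID)
	copy(buf[5:], avroPayload)
	return buf
}

// unframe strips the header back off, returning the schema ID the reader
// would use to look up the writer schema before decoding the payload.
func unframe(msg []byte) (uint32, []byte, error) {
	if len(msg) < 5 || msg[0] != 0 {
		return 0, nil, errors.New("not a framed Avro message")
	}
	return binary.BigEndian.Uint32(msg[1:5]), msg[5:], nil
}

func main() {
	msg := frame(7, []byte{0x08, 'M', 'a', 'r', 'c'})
	id, payload, err := unframe(msg)
	fmt.Println(id, len(payload), err) // 7 5 <nil>
}
```

The 5-byte header costs far less than embedding the full schema per row, at the price of requiring every reader to resolve IDs against the same registry.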
Side note: I was trying to avoid advertising, but this project has stopped responding to PRs for a year now, and when I contacted the original maintainer last year he said he is no longer able to access it (https://github.com/go-avro/avro#about-this-fork). The new Go import path is
Hi,
I was wondering whether it is possible to encode in binary without embedding the schema within the message?
Many thanks,
Marc