
[Question] binary without schema embedded #87

Open · MarcMagnin opened this issue Mar 14, 2017 · 4 comments
Hi,

I was wondering if it's possible to encode in binary without embedding the schema within the message?

Many thanks,
Marc

samv commented Apr 23, 2017

FWIW, the Avro spec does not specify a format for this. Typical behaviors are:

- write segments of ten or so rows with the schema embedded at the front (github.com/linkedin/goavro);
- write a "schema ID" at the start of each emitted row (Bottled Water; presumably the schema ID is the 128-bit MD5 specified in the RPC handshake section of the spec); or
- just write out JSON rows (e.g. the Confluent Kafka REST proxy and at least one Database -> Kafka CDC tool I looked at).

This is because, unlike Thrift, Protocol Buffers, etc., Avro's binary format is not forward compatible: the encoded payload carries no field tags, so a reader must know the exact writer schema to make sense of the bytes. This "saves space" on larger files and "forces everyone to implement the schema protocol", or something like that.
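The schema-ID-prefix approach above can be sketched in Go with just the standard library. Everything here is hypothetical glue (the `frameRecord`/`splitFrame` helpers are not from goavro or Bottled Water); it assumes the record has already been Avro-encoded, and it uses the 128-bit MD5 of the schema JSON as the ID:

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// frameRecord prefixes an already-encoded Avro binary record with the
// 16-byte MD5 of the writer schema, so a reader can look the schema up.
// (Hypothetical helper, not part of any of the libraries mentioned.)
func frameRecord(schemaJSON string, avroBinary []byte) []byte {
	id := md5.Sum([]byte(schemaJSON))
	return append(id[:], avroBinary...)
}

// splitFrame undoes frameRecord: returns the schema ID and the payload.
func splitFrame(frame []byte) (schemaID [16]byte, payload []byte) {
	copy(schemaID[:], frame[:16])
	return schemaID, frame[16:]
}

func main() {
	schema := `{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}`
	record := []byte{0x08, 'M', 'a', 'r', 'c'} // Avro string: zigzag length 4, then bytes
	frame := frameRecord(schema, record)

	id, payload := splitFrame(frame)
	fmt.Println(hex.EncodeToString(id[:]))    // the 128-bit schema ID
	fmt.Println(bytes.Equal(payload, record)) // true
}
```

A reader holding a map from MD5 to schema can then decode the payload; resolving that map is exactly the part the spec leaves unspecified.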

crast commented Apr 25, 2017

The Avro object container file format includes the schema, so that a reader can still parse the file in the future, even if schemas change.

As a matter of course, though, the schema is not technically necessary so long as the receiving/reading end knows the schema of what it's getting. The schema could simply be hard-coded or agreed upon by the communicating ends, or communicated some other way than an object container file, such as sending the MD5 hash of the schema before sending the record (which is what the Avro-RPC protocol does, for example). How you implement that, though, is not covered by any formal part of the Avro spec.

Important note: Avro binary serialization is not inherently forwards or backwards compatible unless the reader knows the exact schema the record was encoded with. This means that any change, including adding fields, adding defaults, adding new options to a type union, or even adding entries to an enum, produces a new and different schema; without knowing it is dealing with a different schema, the reader is likely to fail.
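To make the "agreed upon by the communicating ends" option concrete, here is a toy Go sketch: both sides keep an in-memory map from the MD5 of the schema JSON to the schema text, and any edit to the schema, however small, produces a different key. The `registry` type and its methods are illustrative inventions, not part of any Avro library:

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// registry is a toy in-memory schema store: both ends register every
// schema version they know, keyed by the MD5 of the schema JSON.
// (Illustrative only; real systems use Avro-RPC's handshake or an
// external registry service for this.)
type registry map[[16]byte]string

func (r registry) register(schemaJSON string) [16]byte {
	id := md5.Sum([]byte(schemaJSON))
	r[id] = schemaJSON
	return id
}

func (r registry) lookup(id [16]byte) (string, bool) {
	s, ok := r[id]
	return s, ok
}

func main() {
	r := registry{}
	v1 := `{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}`
	// Adding even one defaulted field yields a different schema with a different ID:
	v2 := `{"type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"age","type":"int","default":0}]}`

	id1, id2 := r.register(v1), r.register(v2)
	fmt.Println(id1 != id2) // true: any change is a new schema

	s, ok := r.lookup(id1)
	fmt.Println(ok, len(s) > 0) // true true
}
```

A reader that receives `id2` but only has `v1` registered gets a lookup miss instead of silently misreading bytes, which is the failure mode described above.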


samv commented Apr 26, 2017

I don't dispute any of that. However, I should issue a correction I've discovered: the Confluent platform has invented its own Avro binary format for efficient binary representation of a single row. I thought it was writing JSON, but it appears I read the Java sources wrong. The row format consists of a null byte, a 32-bit schema ID, and then the binary data column by column. I'm not sure how the 32-bit schema ID is generated; it's nothing canonical (it might be a Kafka Schema Registry-allocated identifier).
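Based on that description, the framing could be reproduced in Go roughly like this (stdlib only; `encodeConfluent`/`decodeConfluent` are hypothetical names, and the schema ID is treated as an opaque 32-bit integer handed out by a registry):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeConfluent frames an Avro-encoded record as described above:
// a zero "magic" byte, a 4-byte big-endian schema ID (a registry-assigned
// integer, not a hash), then the raw Avro binary.
func encodeConfluent(schemaID uint32, avroBinary []byte) []byte {
	buf := make([]byte, 5+len(avroBinary))
	buf[0] = 0 // magic byte
	binary.BigEndian.PutUint32(buf[1:5], schemaID)
	copy(buf[5:], avroBinary)
	return buf
}

// decodeConfluent splits such a frame back into schema ID and payload.
func decodeConfluent(frame []byte) (uint32, []byte, error) {
	if len(frame) < 5 || frame[0] != 0 {
		return 0, nil, errors.New("not a framed single-row message")
	}
	return binary.BigEndian.Uint32(frame[1:5]), frame[5:], nil
}

func main() {
	payload := []byte{0x08, 'M', 'a', 'r', 'c'} // pre-encoded Avro binary
	frame := encodeConfluent(42, payload)

	id, body, err := decodeConfluent(frame)
	fmt.Println(id, err == nil, bytes.Equal(body, payload)) // 42 true true
}
```

The 5-byte header is the whole cost per row, which is why this is so much smaller than embedding the schema in every message.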


crast commented May 22, 2017

Side note: I was trying to avoid advertising, but this project has stopped responding to PRs for a year now, and when I contacted the original maintainer last year he said he is no longer able to access the elodina project. So I'll mention that I've forked this project here:

https://github.com/go-avro/avro#about-this-fork

The new Go import path is gopkg.in/avro.v0
