Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: update Kafka bindings to describe schema usage #55

Closed
wants to merge 6 commits into from

Conversation

dalelane
Copy link
Collaborator

Adding enough information about how schemas are being
used in Kafka messages to enable an application that
consumes messages serialized using schemas to be
implemented purely from an AsyncAPI spec.

Also introduces two new common, reusable definitions
that can be used as message traits, to make it easier
to quickly specify headers used to capture schema
details.

Contributes to: #40
Contributes to: #41

Signed-off-by: Dale Lane [email protected]

Adding enough information about how schemas are being
used in Kafka messages to enable an application that
consumes messages serialized using schemas to be
implemented purely from an AsyncAPI spec.

Also introduces two new common, reusable definitions
that can be used as message traits, to make it easier
to quickly specify headers used to capture schema
details.

Contributes to: asyncapi#40
Contributes to: asyncapi#41

Signed-off-by: Dale Lane <[email protected]>
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to AsyncAPI. Thanks a lot for creating your first pull request. Please check out our contributors guide and the instructions about a basic recommended setup useful for opening a pull request.

Keep in mind there are also other channels you can use to interact with AsyncAPI community. For more details check out this issue.

@dalelane
Copy link
Collaborator Author

I've put an illustration of how this could be used in https://gist.github.com/dalelane/3931c17b14c51fa4a1cf25496237d188

@dalelane dalelane changed the title docs: Update Kafka bindings to describe schema usage docs: update Kafka bindings to describe schema usage Feb 23, 2021
Copy link
Collaborator

@lbroudoux lbroudoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dalelane for great work. In my opinion, there's still the need for useRegistry flag dissociated from servers.

kafka/README.md Outdated Show resolved Hide resolved
Moving the message traits definitions into an examples folder to
make it clear that they are not part of the spec or bindings.

Contributes to: asyncapi#40

Signed-off-by: Dale Lane <[email protected]>
dalelane added 2 commits March 3, 2021 20:48
There is some debate about the best place to host and manage these,
and as these aren't directly required for this PR I'd like to move
them out so that they're not a blocker to this being merged.

Contributes to: asyncapi#40

Signed-off-by: Dale Lane <[email protected]>
It's easier to add fields in a future PR than to remove them, so
I'm erring on the cautious side for now and removing this (as I
think there is some duplication and potential for conflict with
useSchemaRegistry)

Contributes to: asyncapi#40

Signed-off-by: Dale Lane <[email protected]>
kafka/README.md Outdated Show resolved Hide resolved
Also sorted the list alphabetically to avoid indicating any other
significance to the ordering.

Signed-off-by: Dale Lane <[email protected]>
@derberg
Copy link
Member

derberg commented Apr 16, 2021

where are we with this PR? looks good to me

@M3lkior
Copy link
Contributor

M3lkior commented Apr 19, 2021

it is a good start ! some users (like me) needs to provide extra informations about topics naming strategy like this (confluent specific)

x-key.subject.name.strategy:
           type: string
           description: |
             We use the **RecordNameStrategy** to infer the messages schema.
             Use `io.confluent.kafka.serializers.subject.RecordNameStrategy` for your consumer.
         x-value.subject.name.strategy:
           type: string
           description: |
             We use the **RecordNameStrategy** to infer the messages schema.
             Use `io.confluent.kafka.serializers.subject.RecordNameStrategy` for your consumer.

So maybe it will be nice to allow bindings extension by x- prefixes

A Slack community discussion is started here : https://asyncapi.slack.com/archives/C34F2JV0U/p1618495734389300

@derberg
Copy link
Member

derberg commented Apr 20, 2021

@M3lkior where does it say extensions in bindings are not allowed? I mean that additional properties are not allowed?

when you do

    bindings:
      kafka:
        x-test: test

all is good, but for now of course because of the fact that bindings do not have schemas

@M3lkior
Copy link
Contributor

M3lkior commented Apr 20, 2021

@M3lkior where does it say extensions in bindings are not allowed? I mean that additional properties are not allowed?

when you do

    bindings:
      kafka:
        x-test: test

all is good, but for now of course because of the fact that bindings do not have schemas

i try to set bindings with x- prefixes, but they are not displayed or returned by the bindings() method

@derberg
Copy link
Member

derberg commented Apr 20, 2021

@M3lkior you sure? I recently updated playground with latest HTML where you contributed bindings info rendering and I can see the extension is rendered 👇🏼

Screenshot 2021-04-20 at 09 40 19

If spec doesn't say extensions are not allowed, then we are good. Bindings object can be extended and protocol-specific binding objects too. There is some work on schemas started #60

Can I suggest you create a separate issue in the spec repo to clarify if we should not have explicitly described in the spec that specific bindings must always be extendable? Instead of having it as a decision per binding?

@M3lkior
Copy link
Contributor

M3lkior commented Apr 20, 2021

@M3lkior you sure? I recently updated playground with latest HTML where you contributed bindings info rendering and I can see the extension is rendered 👇🏼

Screenshot 2021-04-20 at 09 40 19

If spec doesn't say extensions are not allowed, then we are good. Bindings object can be extended and protocol-specific binding objects too. There is some work on schemas started #60

Can I suggest you create a separate issue in the spec repo to clarify if we should not have explicitly described in the spec that specific bindings must always be extendable? Instead of having it as a decision per binding?

yes sorry :

image

(by the way, there is a rendering issue with too long text in the declaration)

@derberg
Copy link
Member

derberg commented Apr 20, 2021

@M3lkior no worries. Fix for the bug asyncapi/html-template#183

@dalelane dalelane marked this pull request as draft April 20, 2021 19:56
@dalelane
Copy link
Collaborator Author

where are we with this PR? looks good to me

@derberg Sorry, been super busy recently and this one got away from me.

I've changed the state to Draft to reflect that there are a few bits of feedback that I need to consider and reflect in the PR:

  • Apicurio pointed out that they have different approaches to encoding schema info in the body that I'm not adequately capturing
  • @M3lkior's pointed out that I haven't captured Confluent strategies

@lbroudoux
Copy link
Collaborator

Hi there!

Had a review this morning and I'm struggling with the useSchemaRegistry and schemaIdLocation properties when this later value is none.

From the binding README, definition of useSchemaRegistry is clear but the examples from @dalelane Gist confuse me as they seems to favor the schemaIdLocation=none approach. Especially the first one here: https://gist.github.com/dalelane/3931c17b14c51fa4a1cf25496237d188#binary-encoded-avro

Also the default value for schemaIdLocation is none which - as I understand from the Gist - means useSchemaRegistry is equivalent to false - but this one has no default value specified.

I think we should solve this ambiguity and I see 2 options at the moment:

  1. Keep both properties and have useSchemaRegistry default to false and restrict schemaIdLocation to either payload - the default because the most common - or header,

  2. Remove the useSchemaRegistry and just keep schemaIdLocation with none as default, + the payload and header values.

From my perspective, I prefer option 1. as I find it more easy to use for non-expert of SerDes library and more straightforward for most common cases. That way, you may have something like:

channels:
  user-signedup:
    publish:
      bindings:
        kafka:
          groupId:
            type: string
            enum: ['myGroupId']
          clientId:
            type: string
            enum: ['myClientId']
          bindingVersion: '0.2.0'

and this is telling that this operation/channel is not using any registry. Then we can enable registry usage explicitly:

channels:
  user-signedup:
    publish:
      bindings:
        kafka:
          groupId:
            type: string
            enum: ['myGroupId']
          clientId:
            type: string
            enum: ['myClientId']
          useSchemaRegistry: true
          bindingVersion: '0.2.0'

and this mean that a Schema registry is used and consumer will have to pick the SerDes implementation corresponding to registry vendor. The Schema ID Location will thus be the payload of the message. Then we can also refine things adding:

channels:
  user-signedup:
    publish:
      bindings:
        kafka:
          groupId:
            type: string
            enum: ['myGroupId']
          clientId:
            type: string
            enum: ['myClientId']
          useSchemaRegistry: true
          bindingVersion: '0.2.0'
      message:
        bindings:
          kafka:
            schemaIdLocation: 'header'
            bindingVersion: '0.2.0'

and this is necessary only is we're not sticking with most common usage.

What do you think?

@nictownsend
Copy link

@lbroudoux I think there are two use cases this is looking to handle:

  1. Inform API developers to enable them to pick the right vendor serdes
  2. Inform API developers how to bypass if schema registry is unavailable

For use case 1 - schemaRegistryURL + schemaRegistryVendor is sufficient to pick a SERDES library. useSchemaRegistry flags if the SERDES library should be used for the operation (if not, assumption is message has been produced without using a SERDES library (e.g basic producer using AVRO).

Missing:

  • Naming strategies in each operation allow additional SERDES config for Confluent + Apicurio. These maybe just need to be class names so that the key can be vendor agnostic (naming/lookupStrategy)?
  • Apicurio ID encoding types

For use case 2 - to deserialize from the payload without using a SERDES library you need to know:

  • schemaIdLocation - so that you know if you have to skip any bytes (Apicurio looks for a magic byte to automatically determine if in header or not - confluent only use payload, Event Streams uses header only). Maybe this just needs to be schemaIdInPayload = true/false
  • schemaRegistryVendor - so that for each wire format e.g Confluent, Apicurio (find in page 'magic byte') - you know how many bytes to skip (magic + schema id)
  • Apicurio ID encoding type to skip the right number of bytes for the schema ID

I think this means that there are two sets of properties - with overlap but no conflict. You could combine the two sets to describe both use cases on the same spec without any clashes (though for code generation purposes I guess you'd generate using SERDES if all required SERDES properties are described).

I think the only change to the constraint is that schemaRegistryVendor is required if either schemaIdInPayload or schemaRegistryURL are supplied.

Does that clear up the ambiguity by separating properties into the two different concerns?

@lbroudoux
Copy link
Collaborator

Thanks @nictownsend : this makes things clearer.

However, I'd thought that schemaIdLocation is still useful for use case 1 - enabling developers to correctly setup SERDES properties. I especially think to the Apicurio usage of headers for ID.

props.putIfAbsent(SerdeConfig.ENABLE_HEADERS, "true")

Is it correct?

+1 for the missing lookup strategies!

@nictownsend
Copy link

Thanks @nictownsend : this makes things clearer.

However, I'd thought that schemaIdLocation is still useful for use case 1 - enabling developers to correctly setup SERDES properties. I especially think to the Apicurio usage of headers for ID.

Good point, yes for Apicurio case - schemaIdLocation is needed for producer configuration (I had stupidly focused on consume only).

So do we then have the following properties:

  • schemaRegistryURL - url for the registry (use case 1 only)
  • schemaRegistryVendor - SERDES library to use (use case 1 only)
  • schemaIdLocation - where to find the ID (and thus using this property means useSchemaRegistry is no longer needed as use of this is stating "this operation uses SERDES") (both use cases)
  • schemaIdPayloadEncoding - length of bytes or vendor specific values (e.g confluent/apicurio-legacy/apicurio-new) (both use cases)
  • schemaLookupStrategy - freeform string for any naming strategy class to use (use case 1 only) and clients default to the vendor default if not supplied

Only query is whether it can be assumed that every vendor uses a magic byte when writing schema to payload. If not, then schemaRegistryVendor becomes use case 2 as well.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity 😴
It will be closed in 60 days if no further activity occurs. To unstale this pull request, add a comment with detailed explanation.
Thank you for your contributions ❤️

@github-actions github-actions bot added the stale label Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants