Merge branch 'main' into tsung-julii/sc-23516/hive-crawler-views-materialized-views
usefulalgorithm authored Jan 8, 2024
2 parents 43cdc18 + 2653602 commit b1d24ca
Showing 1 changed file with 59 additions and 81 deletions: metaphor/kafka/README.md
This connector extracts technical metadata from Kafka using Confluent's Python Client.

## Setup

To run a Kafka cluster locally, follow the instructions below:

1. Start a Kafka cluster (broker + schema registry + REST proxy) locally via docker-compose:
```shell
$ docker-compose --file metaphor/kafka/docker-compose.yml up -d
```
- Broker is on port 9092.
- Schema registry is on port 8081.
- REST proxy is on port 8082.
2. Find the cluster ID:
```shell
$ curl -X GET --silent http://localhost:8082/v3/clusters/ | jq '.data[].cluster_id'
```
3. Register a new topic via the REST proxy:
```shell
curl -X POST -H "Content-Type: application/json" http://localhost:8082/v3/clusters/<YOUR CLUSTER ID>/topics -d '{"topic_name": "<YOUR TOPIC NAME>"}'| jq .
```
4. Register a schema to the registry:
```shell
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": <SCHEMA AS STRING>}' http://localhost:8081/subjects/<YOUR TOPIC NAME>-<key|value>/version
```
- It is possible to register a schema under a subject whose name differs from the topic name. See the `Topic to Schema Subject Mapping` section below for more info. A worked example of steps 3 and 4 follows this list.
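
For example, assuming a hypothetical topic named `orders`, steps 3 and 4 might look like the following. Note that the Avro schema is embedded as an escaped JSON string:

```shell
# Step 3: create the topic (substitute the cluster ID found in step 2).
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8082/v3/clusters/<YOUR CLUSTER ID>/topics \
  -d '{"topic_name": "orders"}' | jq .

# Step 4: register an Avro schema for the topic's message values under
# the subject "orders-value".
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  http://localhost:8081/subjects/orders-value/versions \
  --data '{"schema": "{\"type\": \"record\", \"name\": \"Order\", \"fields\": [{\"name\": \"id\", \"type\": \"string\"}]}"}'
```
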
If [ACL](https://docs.confluent.io/platform/current/security/rbac/authorization-acl-with-mds.html) is enabled, the credentials used by the crawler must be allowed to perform [Describe operation](https://docs.confluent.io/platform/current/kafka/authorization.html#topic-resource-type-operations) on the topics of interest.
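
For example, using Apache Kafka's stock `kafka-acls.sh` tool, granting that permission to a hypothetical `metaphor-crawler` principal might look like this sketch (assuming an authorizer is already configured on the cluster):

```shell
# Allow the crawler's principal to perform Describe on all topics.
kafka-acls.sh --bootstrap-server localhost:9092 \
  --add \
  --allow-principal User:metaphor-crawler \
  --operation Describe \
  --topic '*'
```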

## Config File

Create a YAML config file based on the following template.

### Required Configurations

You must specify at least one bootstrap server, i.e. a pair of host and port pointing to a Kafka broker instance. You must also specify a URL for the schema registry.

```yaml
bootstrap_servers:
  - host: <host>
    port: <port>
schema_registry_url: <schema_registry_url>
output:
  file:
    directory: <output_directory>
```
To use HTTP basic authentication for the schema registry, specify the credentials in `schema_registry_url` using the format `https://<username>:<password>@host:port`.
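
For example, a registry at a hypothetical `registry.example.com:8081` with username `metaphor`:

```yaml
schema_registry_url: https://metaphor:<password>@registry.example.com:8081
```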

See [Output Config](../common/docs/output.md) for more information on `output`.

### Optional Configurations

#### SASL Authentication

You can optionally authenticate against the brokers by adding the following SASL configurations:

```yaml
sasl_config:
  # SASL mechanism, e.g. GSSAPI, PLAIN, SCRAM-SHA-256, etc.
  mechanism: <mechanism>
  # SASL username & password for PLAIN, SCRAM-* mechanisms
  username: <username>
  password: <password>
```
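
For example, a minimal sketch of a config authenticating with `SCRAM-SHA-256` (hostname and credentials are hypothetical):

```yaml
bootstrap_servers:
  - host: broker-1.example.com
    port: 9092
schema_registry_url: http://registry.example.com:8081
sasl_config:
  mechanism: SCRAM-SHA-256
  username: crawler
  password: <password>
output:
  file:
    directory: ./output
```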

Some mechanisms (e.g. `GSSAPI` (Kerberos) and `OAUTHBEARER`) require additional configs, which can be specified using the `extra_admin_client_config` field:

```yaml
extra_admin_client_config:
  sasl.kerberos.service.name: "kafka"
  sasl.kerberos.principal: "kafkaclient"
  ...
```

See [librdkafka's CONFIGURATION.md](https://github.com/confluentinc/librdkafka/blob/master/CONFIGURATION.md) for a complete list of available client configurations.
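
Depending on your cluster, you may also need librdkafka's standard `security.protocol` option, which can be passed through the same field. A sketch for SASL over TLS:

```yaml
extra_admin_client_config:
  security.protocol: SASL_SSL
```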

#### Filtering

By default the following topics are excluded:
- `_schema`
- `__consumer_offsets`
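
To exclude additional topics, a rough sketch might look like the following. The `filter` field name and its include/exclude shape are assumptions here, not confirmed by this diff; check the connector's filter documentation:

```yaml
filter:
  # Hypothetical field names; see the connector's filter docs.
  excludes:
    - my-internal-topic
```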

#### Topic <-> Schema Subject Mapping
#### Topic to Schema Subject Mapping

Kafka messages are sent as key / value pairs, and both can have their schemas defined in the schema registry. There are three strategies to map topic to schema subjects:
Kafka messages can have key and value schemas defined in the schema registry. There are three strategies to map topics to schema subjects from the schema registry:

##### Strategies
| Subject Name Strategy | Key Schema Subject | Value Schema Subject |
| :------------------------------------- | :----------------- | :------------------- |
| `TOPIC_NAME_STRATEGY` (Default)        | `<topic>-key`      | `<topic>-value`      |
| `RECORD_NAME_STRATEGY` | `*-key` | `*-value` |
| `TOPIC_RECORD_NAME_STRATEGY` | `<topic>-*-key` | `<topic>-*-value` |

where `<topic>` is the topic name, and `*` matches either all strings or a set of values specified in the config.

##### Example: TOPIC_NAME_STRATEGY

The following is the default config, which assumes all messages for a topic `topic` have a `topic-key` key schema and a `topic-value` value schema:

```yaml
default_subject_name_strategy: TOPIC_NAME_STRATEGY
```

##### Example: RECORD_NAME_STRATEGY

The following config specifies that topic `topic` has two types of key-value schemas, `(type1-key, type1-value)` and `(type2-key, type2-value)`:

```yaml
default_subject_name_strategy: RECORD_NAME_STRATEGY
topic_naming_strategies:
  topic:
    records:
      - type1
      - type2
```

This means the connector looks for the following subjects for topic `topic`:

- `type1-key`
- `type1-value`
- `type2-key`
- `type2-value`

##### Example: TOPIC_RECORD_NAME_STRATEGY

This is similar to `RECORD_NAME_STRATEGY`, except the schema subjects are prefixed with the topic name. For example, the following config specifies that topic `topic` has two types of key-value schemas, `(topic-type1-key, topic-type1-value)` and `(topic-type2-key, topic-type2-value)`:

```yaml
default_subject_name_strategy: TOPIC_RECORD_NAME_STRATEGY
topic_naming_strategies:
  topic:
    records:
      - type1
      - type2
```

For topic `topic`, the connector looks for the subjects `topic-type1-key`, `topic-type1-value`, `topic-type2-key`, and `topic-type2-value`.

Instead of explicitly enumerating the type values, you can specify an empty list to match all possible values, i.e. `(topic-*-key, topic-*-value)`:

```yaml
default_subject_name_strategy: TOPIC_RECORD_NAME_STRATEGY
topic_naming_strategies:
  topic:
    records: []
```

##### Example: Overriding Strategy for Specific Topics

It is possible to override the subject name strategy for specific topics, e.g.:

```yaml
default_subject_name_strategy: RECORD_NAME_STRATEGY
topic_naming_strategies:
  topic1:
    records:
      - type1
      - type2
  topic2:
    override_subject_name_strategy: TOPIC_NAME_STRATEGY
```

This results in the following schemas:

- `topic1`: `(type1-key, type1-value)`, `(type2-key, type2-value)`
- `topic2`: `(topic2-key, topic2-value)`
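
Putting it together, a complete config file might look like the following sketch (broker, registry, topic, and record names are hypothetical):

```yaml
bootstrap_servers:
  - host: broker-1.example.com
    port: 9092
schema_registry_url: http://registry.example.com:8081
default_subject_name_strategy: RECORD_NAME_STRATEGY
topic_naming_strategies:
  orders:
    records:
      - created
      - shipped
output:
  file:
    directory: ./output
```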

## Testing

To test the connector locally, change the config file to output to a local path and run the following command:

```shell
metaphor kafka <config_file>
```

Manually verify the output after the run finishes.
