Kafka cheat sheet

position in architecture

Kafka guarantees

  • messages sent to a particular topic partition are appended in the order they were sent
  • a consumer sees messages in the order they were written to the log
  • "At-least-once" delivery is guaranteed - a consumer that crashed before committing its offset will read the same messages again after restart
  • "At-most-once" delivery ( custom implementation ) - the consumer never reads the same message twice, even if it crashed before processing it
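A minimal Java sketch of how the placement of the offset commit produces the two semantics ( an illustration, not the project's code; it assumes a KafkaConsumer with enable.auto.commit=false, and process() is a hypothetical handler ):

 ConsumerRecords<String, String> records = consumer.poll(100);

 // option 1, at-most-once: commit BEFORE processing - if the consumer crashes while
 // processing, the offset is already stored and those messages are never read again
 consumer.commitSync();
 for (ConsumerRecord<String, String> record : records) {
     process(record); // hypothetical processing method
 }

 // option 2, at-least-once: commit AFTER processing - if the consumer crashes before
 // the commit, the same messages are delivered again on restart ( duplicates possible )
 for (ConsumerRecord<String, String> record : records) {
     process(record);
 }
 consumer.commitSync();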

Best practice

  • it is better to have many small messages than one big message
  • integration tests should use real-world messages
git clone https://github.com/apache/kafka.git kafka

or kafka download

main concepts

  • Topic - a category of messages, consists of Partitions
  • Partition ( Leader and Followers ) - a part of the Topic, can be replicated ( replication factor ) across Brokers; has exactly one Leader and 0..* Followers. When you send a message it is stored in one of the partitions, selected by: explicit partition number | hash of the key | round robin
    partition size calculator
    partitions
  • Leader - the main replica of a partition at a given point in time; maintains the InSyncReplicas - the list of Followers that are currently alive and in sync
  • Committed Message - a message that all InSyncReplicas have written; only then can a Consumer read it, and the Producer can choose whether to wait for that acknowledgement or not
  • Broker - one of the Kafka servers ( one server of the cluster )
  • Producer - a process that publishes messages into a specific topic
  • Consumer - a subscriber to topics
  • Consumer Group - a group of consumers with one load balancer per group; a consumer instance from a different group receives its own copy of each message ( one delivery per group, see the console example below )
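A quick way to see the "one delivery per group" behaviour from the command line ( a sketch; broker address, topic and group names are placeholders ):

# two consumers in the same group split the partitions of the topic between them
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --group group-a
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --group group-a
# a consumer in a different group receives its own full copy of every message
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --group group-b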

( diagrams: concepts, workflow, consumer, consumer group, producer, partitions, recommendations )

Error Handling

ZooKeeper ( one instance per cluster )

  • must be started before using Kafka ( zookeeper-server-start.sh, kafka-server-start.sh )
  • cluster membership
  • electing a controller
  • topic configuration: which topics exist, who the leader is
  • Quotas
  • ACLs
    ./kafka-acls.sh
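For example, granting read access on a topic might look roughly like this ( a sketch; principal, topic and ZooKeeper address are placeholders ):

./kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
  --add --allow-principal User:alice --operation Read --topic mytopic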

scripts

start Kafka's Broker

zookeeper-server-start.sh config/zookeeper.properties
kafka-server-start.sh config/server.properties

ksql

flowchart LR

ksql --> ks["kafka \n stream"]
ks --> cp[consumer\nproducer]
@startuml

[ksql] as ksql 
rectangle "kafka stream jar" as stream #lightgreen
[app]  as app 

[consumer \n producer] as consumer
[kafka] as kafka

ksql -right--> stream : use
app o-- stream  : aggregate

stream -right--> consumer
consumer -up-> kafka

@enduml

ksql ( MapR )

create stream

# create stream
maprcli stream create -path sample-stream -produceperm p -consumeperm p -topicperm p

# generate dummy data 
/opt/mapr/ksql/ksql-4.1.1/bin/ksql-datagen quickstart=pageviews format=delimited topic=sample-stream:pageviews  maxInterval=5000

create table for stream

/opt/mapr/ksql/ksql-4.1.1/bin/ksql http://ubs000130.vantage.org:8084

ksqldb what is

it is a store of messages with the ability to query over a ( time based ) window using Confluent KSQL

ksqldb pillars

  • stream processing
  • connectors
  • materialized views

ksqldb queries

  • pull query - asks for the current state and terminates
  • push query - subscribes and keeps receiving changes ( see the sketch after the example below )
create table pageviews_original_table (viewtime bigint, userid varchar, pageid varchar) with (kafka_topic='sample-stream:pageviews', value_format='DELIMITED', key='viewtime')
select * from pageviews_original_table;
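In newer ksqlDB syntax ( an assumption - the MapR KSQL 4.1.1 used above predates it ), the two query kinds look roughly like this:

-- push query: stays open and emits every change to the table as it happens
select * from pageviews_original_table emit changes;
-- pull query: returns the current value for a key and terminates
select * from pageviews_original_table where viewtime = 1;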

topic create

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mytopic
# check the result of the creation
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic mytopic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --config retention.ms=360000 --topic mytopic

or just enable "autocreation"

auto.create.topics.enable=true

topic delete

can be marked "for deletion"

bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic mytopic

topics list

bin/kafka-topics.sh --list --zookeeper localhost:2181

topics describe

kafka-topics --describe --zookeeper localhost:2181 --topic mytopicname

topic update

bin/kafka-topics.sh --alter --zookeeper localhost:2181 --partitions 5 --topic mytopic
bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic mytopic --config retention.ms=72000
bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic mytopic --delete-config retention.ms
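Newer Kafka versions prefer kafka-configs.sh for topic-level config changes; a sketch of the equivalent commands ( flags assume a ZooKeeper-based cluster ):

bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name mytopic --add-config retention.ms=72000
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name mytopic --delete-config retention.ms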

producer console

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mytopic

PRODUCER_CONFIG=/path/to/config.properties
TOPIC_NAME=my-topic
BROKER=192.168.1.140:9988
bin/kafka-console-producer.sh --producer.config $PRODUCER_CONFIG \
--broker-list $BROKER --topic $TOPIC_NAME

java producer example

 Properties props = new Properties();
 props.put("bootstrap.servers", "localhost:4242");
 props.put("acks", "all");  // 0 - don't wait; 1 - leader writes to its local log; all - leader waits for acks from the full set of InSyncReplicas
 props.put("client.id", "unique_client_id"); // nice to have
 props.put("retries", 0);           // retries can change the ordering of messages
 props.put("batch.size", 16384);    // collect messages into a batch
 props.put("linger.ms", 1);         // additional wait time before sending a batch
 props.put("compression.type", "none"); // type of compression: none, gzip, snappy, lz4
 props.put("buffer.memory", 33554432);
 props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
 props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
 Producer<String, String> producer = new KafkaProducer<>(props);
 producer.metrics(); // internal producer metrics
 for (int i = 0; i < 100; i++) {
     producer.send(new ProducerRecord<String, String>("mytopic", Integer.toString(i), Integer.toString(i)));
 }
 producer.flush(); // send immediately, even if 'linger.ms' is greater than 0
 producer.partitionsFor("mytopic"); // partition metadata of the topic
 producer.close();

how the partition will be selected
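A short sketch of the three options via the ProducerRecord constructors ( topic, keys and values are placeholders ):

 // explicit partition number: the record goes straight to partition 0
 producer.send(new ProducerRecord<String, String>("mytopic", 0, "my-key", "value-1"));
 // key without partition: partition = hash of the key, so the same key always lands in the same partition
 producer.send(new ProducerRecord<String, String>("mytopic", "my-key", "value-2"));
 // no key: the default partitioner spreads records across partitions ( round robin / sticky, depending on the client version )
 producer.send(new ProducerRecord<String, String>("mytopic", "value-3"));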

Consumer

consumer console ( console consumer )

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning --consumer.config my_own_config.properties

bin/kafka-console-consumer.sh --bootstrap-server mus07.mueq.adac.com:9092 --new-consumer --topic session-ingest-stage-1 --offset 20 --partition 0  --consumer.config kafka-log4j.properties
bin/kafka-console-consumer.sh --bootstrap-server mus07.mueq.adac.com:9092 --group my-consumer-2 --topic session-ingest-stage-1 --from-beginning  --consumer.config kafka-log4j.properties

# read information about partitions
java kafka.tools.GetOffsetShell --broker-list musnn071001:9092 --topic session-ingest-stage-1
# get number of messages in partitions, partitions messages count
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic session-ingest-stage-1

consumer group console

bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group mytopic-consumer-group

consumer offset

  • automatic commit offset (enable.auto.commit=true) with period (auto.commit.interval.ms=1000)
  • manual offset commit (enable.auto.commit=false)
  • property "auto.offset.reset=latest" - when no committed offset exists, start consuming only messages that appear in the topic after the consumer connects
 Properties props = new Properties();
 props.put("bootstrap.servers", "localhost:4242"); // list of host/port pairs to connect to the cluster
 props.put("client.id", "unique_client_id");       // nice to have
 props.put("group.id", "unique_group_id");         // required when subscribing to topics / committing offsets
 props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
 props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
 props.put("fetch.min.bytes", 1);              // with value 1 the fetch returns as soon as any data is available
 props.put("enable.auto.commit", "true");      // commit offsets automatically every auto.commit.interval.ms
 // timeout for detecting consumer failures: the Kafka group coordinator must receive a heartbeat
 // from the consumer within this period, otherwise the consumer is removed from the group
 props.put("session.timeout.ms", "10000");
 // expected time between heartbeats to the consumer coordinator;
 // keeps the consumer session active and
 // facilitates rebalancing when consumers join/leave the group,
 // must be set lower than *session.timeout.ms*
 props.put("heartbeat.interval.ms", "3000");

consumer java

KafkaConsumer is NOT thread safe !!!

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
ConsumerRecords<String, String> records = consumer.poll(100); // poll timeout in ms
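A minimal poll loop sketch ( topic name is a placeholder; subscribing is covered in the next section ):

consumer.subscribe(Arrays.asList("mytopic"));
while (true) {
    ConsumerRecords<String, String> batch = consumer.poll(100);
    for (ConsumerRecord<String, String> record : batch) {
        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
            record.partition(), record.offset(), record.key(), record.value());
    }
    consumer.commitSync(); // only needed when enable.auto.commit=false
}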

consumer consume messages

  • by topic
consumer.subscribe(Arrays.asList("mytopic_1", "mytopic_2"));
  • by partition
TopicPartition partition0 = new TopicPartition("mytopic_1", 0);
TopicPartition partition1 = new TopicPartition("mytopic_1", 1);
consumer.assign(Arrays.asList(partition0, partition1));
  • seek to position
consumer.seek(partition0, 1024);
consumer.seekToBeginning(Arrays.asList(partition0, partition1));
consumer.seekToEnd(Arrays.asList(partition0, partition1));

Kafka Stream State Stores

:TODO: local key-value store, backed by RocksDB by default ( an in-memory store is also available )
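A hedged sketch of a Streams topology that ends up with a local, RocksDB-backed state store ( topic, application id and store name are placeholders ):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mytopic-counter");   // also used as the consumer group id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

StreamsBuilder builder = new StreamsBuilder();
builder.stream("mytopic", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .count(Materialized.as("counts-store")); // state store, persisted locally with RocksDB by default

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();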

Kafka connect

  • manages copying data between Kafka and another system
  • a connector is either a source or a sink
  • a connector can split its "job" into "tasks" ( each task copies a subset of the data )
  • partitioned streams for source/sink; each record in them is [key, value, offset]
  • standalone/distributed mode
  • two ways of working with a Stream:
    • KSQL (KSQLDB)
    • Flink

      an engine for running queries on a cluster

Kafka connect standalone

start connect

bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties

connect settings

name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=my_test_file.txt
topic=topic_for_me

after execution you can check the topic

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topic_for_me --from-beginning
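The sink direction is configured the same way; a sketch of a file sink config ( the file name is a placeholder, the connector class is the stock FileStreamSinkConnector; note that sinks use "topics" instead of "topic" ):

name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
file=my_sink_file.txt
topics=topic_for_me

# assuming the config above is saved as config/my-file-sink.properties
bin/connect-standalone.sh config/connect-standalone.properties config/my-file-sink.properties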

additional tools

kafka cli (producer & consumer)

installation

apt-get install kafkacat

docker run

docker run -it --network=host edenhill/kcat:1.7.1

commands

minimal command
BROKER_HOST=192.168.1.150
BROKER_PORT=3388
TOPIC=my-topic
kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC
# -X security.protocol=sasl_ssl \
# -X sasl.mechanisms=PLAIN      \
# -X sasl.username=$SASL_USER   \
# -X sasl.password=$SASL_PASS   \
read all messages, read messages from the beginning
kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -o beginning

read last messages, read messages from the end

kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -o -5

Consume messages and stop

kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -c 5
# Print messages with a specific output
kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -c 5 -f 'Key: %k, message: %s \n'
# more complex output
kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -c 5 -f '\nKey (%K bytes): %k\t\nValue (%S bytes): %s\nTimestamp: %T\tPartition: %p\tOffset: %o\nHeaders: %h\n--\n' -e

read messages in the time range, read messages between two datetimes

# kcat expects the -o s@/e@ offsets as timestamps in milliseconds since the epoch
date_start=`date +%s%3N --date="2 hour ago"`
date_end=`date +%s%3N`

kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -o s@$date_start -o e@$date_end

read key of the message

kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -K$'\t'

read message with specific key

kafkacat -C -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -o beginning -K$'\t' | grep $MESSAGE_KEY
# shrink time of the scan from "beginning" to something more expected

write/produce message

kafkacat -P -b $BROKER_HOST:$BROKER_PORT -t $TOPIC -l /path/to/file