AWS Data Specialty

General

Collection

Kinesis

  • If you read Kinesis Data Streams, think Kafka topics
  • Producers can be apps (SDK), clients (written using the Kinesis Producer Library, KPL) or Kinesis Agents
  • Consumers can be apps (SDK), clients (written using the Kinesis Client Library, KCL), Lambda functions, Kinesis Data Firehose or Kinesis Data Analytics
  • Records sent to data streams contain a partition key and a data blob
  • The partition key determines which shard a record is written to
  • Retention can be between 1 and 365 days
  • Kinesis Data Streams can be created in provisioned mode (define shards upfront and pay per shard per hour) or on-demand mode (automatic scaling based on the peak throughput of the last 30 days, paid per stream per hour plus data in/out per GB)
  • If you can plan capacity in advance, go for provisioned mode
  • Kinesis Data Streams support VPC endpoints
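
The mapping from partition key to shard can be sketched in a few lines. Kinesis hashes the partition key with MD5 into a 128-bit integer and writes the record to the shard whose hash key range contains that value. The sketch below assumes the hash space is split evenly across shards; the helper names `shard_ranges` and `shard_for_key` are made up for illustration and are not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_ranges(shard_count):
    """Split the 128-bit hash space into contiguous ranges, one per shard.

    Assumption: ranges are equal in size, as for a freshly created stream.
    """
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains MD5(partition_key)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash value outside of all shard ranges")


ranges = shard_ranges(4)
for key in ["device-1", "device-2", "device-1"]:
    # The same key always lands on the same shard.
    print(key, "-> shard", shard_for_key(key, ranges))
```

Because the mapping is deterministic, all records sharing a partition key end up on one shard, which preserves their order but also concentrates their load there.
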

Producers

SDK
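  • Use the SDK for low-throughput producers where higher latency is acceptable
  • The SDK exposes the PutRecord (single record) and PutRecords (batched records) operations
  • ProvisionedThroughputExceededException is thrown when the data volume or record count per second exceeds a shard's write limit (1 MB/s or 1,000 records/s per shard)
  • Choose a partition key with many distinct, evenly distributed values so that writes spread across shards and no single shard becomes hot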
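
A minimal sketch of batched writes with retries, under a few assumptions: `client` is any object with a boto3-style `put_records(StreamName=..., Records=[...])` method whose response contains `FailedRecordCount` and a per-record `Records` list in request order, and the helper names `chunks` and `put_all` are hypothetical. PutRecords can partially fail, so only the entries that carry an `ErrorCode` (for example when a shard is throttled) are resent, with exponential backoff between attempts.

```python
import time

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def put_all(client, stream_name, records, max_attempts=5, base_delay=0.1):
    """Send records in batches and retry only the entries that failed.

    `records` is a list of {"Data": bytes, "PartitionKey": str} dicts.
    Returns the records that still failed after `max_attempts` tries.
    """
    failed = []
    for batch in chunks(records, MAX_RECORDS_PER_CALL):
        pending = batch
        for attempt in range(max_attempts):
            response = client.put_records(StreamName=stream_name, Records=pending)
            if response["FailedRecordCount"] == 0:
                pending = []
                break
            # Result entries are in request order; failed ones carry an ErrorCode.
            pending = [
                record
                for record, result in zip(pending, response["Records"])
                if "ErrorCode" in result
            ]
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        failed.extend(pending)
    return failed
```

With boto3 installed and credentials configured, `client` would typically be `boto3.client("kinesis")`; the retry limits and delays here are illustrative defaults, not recommended values.
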
KPL
  • Java/C++ library
  • Use KPL to build high throughput, long-running producers
  • Supports synchronous and asynchronous APIs -> if a question mentions sending data to Kinesis asynchronously, think KPL
  • Submits metrics to CloudWatch
  • Supports batching via collect (write to multiple shards in one PutRecords API call) and aggregate (nest multiple user records in a single Kinesis record)
  • Compression is not supported out of the box
  • KPL-aggregated records must be de-aggregated on the consumer side, using the KCL or a de-aggregation helper library
  • Don't use the KPL if low latency is required (buffering adds delay) or if only the latest events are of interest
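
The difference between aggregation and collection can be illustrated with a small sketch. It assumes a made-up length-prefixed framing rather than the real KPL wire format, and the names `aggregate`, `deaggregate` and `collect` are hypothetical. Aggregation packs many small user records into one Kinesis record; collection groups several Kinesis records into one PutRecords call.

```python
import struct


def aggregate(user_records, max_bytes):
    """Pack many small user records into as few blobs as possible.

    Each user record is stored as a 4-byte big-endian length followed by its
    payload. This framing is invented for illustration; it is not the KPL format.
    """
    blobs, current = [], b""
    for payload in user_records:
        framed = struct.pack(">I", len(payload)) + payload
        if current and len(current) + len(framed) > max_bytes:
            blobs.append(current)
            current = b""
        current += framed
    if current:
        blobs.append(current)
    return blobs


def deaggregate(blob):
    """Reverse of aggregate(): split one blob back into its user records."""
    records, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        records.append(blob[offset:offset + length])
        offset += length
    return records


def collect(blobs, batch_size=500):
    """Group Kinesis records into batches, one batch per PutRecords call."""
    entries = [
        {"Data": blob, "PartitionKey": "key-%d" % i}  # hypothetical key scheme
        for i, blob in enumerate(blobs)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


events = [b'{"id": %d}' % i for i in range(1000)]
blobs = aggregate(events, max_bytes=1024)   # far fewer Kinesis records than events
batches = collect(blobs)                    # each batch maps to one PutRecords call
restored = [record for blob in blobs for record in deaggregate(blob)]
```

The `deaggregate` step is why a consumer that reads aggregated records needs matching logic: without it, each Kinesis record looks like one opaque blob instead of many user records.
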
Agent
  • Built on top of the KPL
  • Watches files/directories and can send to multiple streams
  • Can also preprocess and convert data before sending it
  • Supports file rotation, checkpointing and retries
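
As a rough illustration of the points above, the sketch below builds an agent configuration with two flows, one of which converts CSV lines to JSON before sending. The key names (`flows`, `filePattern`, `kinesisStream`, `dataProcessingOptions`, `CSVTOJSON`) follow the agent's JSON configuration format as I understand it and should be checked against the agent documentation; the file paths, stream names and field names are hypothetical.

```python
import json

agent_config = {
    "cloudwatch.emitMetrics": True,
    "flows": [
        {
            # Hypothetical log location and stream name.
            "filePattern": "/var/log/app/*.log",
            "kinesisStream": "app-logs-stream",
            # Preprocessing: turn each CSV line into a JSON object.
            "dataProcessingOptions": [
                {
                    "optionName": "CSVTOJSON",
                    "customFieldNames": ["timestamp", "level", "message"],
                }
            ],
        },
        {
            # A second flow shows one agent feeding multiple streams.
            "filePattern": "/var/log/audit/*.log",
            "kinesisStream": "audit-stream",
        },
    ],
}

# The agent typically reads its configuration from /etc/aws-kinesis/agent.json.
config_text = json.dumps(agent_config, indent=2)
print(config_text)
```
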