
Research

Approach

I need to be purposeful about my research. There's a lot to learn from reading, but more to learn from building. Unfortunately, it won't be worthwhile to try to learn everything about distributed systems and data engineering before getting started - I need to focus on a subset of the information.

My approach will be to focus on Netflix's data architecture. They have world-class engineering and have operated at scale long enough to have accumulated plenty of lessons. I'll study up on their decisions, wins, and mistakes, select some interesting technologies to use and challenging problems to solve, and get to work. I may still look into other companies' data stacks, but only at a high level.

Resources

The checkbox indicates whether I've reviewed the article/technology.

Questions

The checkbox indicates whether I've found a reasonable answer to the question.

  • Q.1: designing direct communication between microservices, outside the regular data pipeline flow
    • Q.1.1: fan-out
    • Q.1.2: communication protocols
    • Q.1.3: data joins/enrichment
  • Q.2: Kappa versus Lambda architecture
  • Q.3: Delta Lake or other OSS for CDC and backfills
  • Q.4: Star schema vs snowflake schema
  • Q.5: How to run performance testing
  • Q.6: Need k8s? Rancher?
  • Q.7: Avro-RPC vs gRPC
  • Q.8: Is using something like Ibis or Fugue a good idea for this project?

Notes

  • Numbers (R.31)
    • 200 million users at Netflix (2023) (R.17)
    • 700 billion events, ~1.3 PB per day (2016) (R.4, R.6)
    • 8 million events, ~24 GB per second during peak hours (2016) (R.6)
    • Daily data loss rate of less than 0.01% (2016) (R.4)
    • Watch time for most Netflix titles available here
  • Analysis
    • Take rate: number of times a title is played / number of times it is shown to the user (R.24)
    • Sessionization: ends after 30 mins of inactivity (R.27)
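
A minimal sketch of both computations in Python (field names and helpers are my own, not Netflix's implementation):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # session ends after 30 min of inactivity (R.27)

def take_rate(plays: int, impressions: int) -> float:
    """Take rate: times a title was played / times it was shown (R.24)."""
    return plays / impressions if impressions else 0.0

def sessionize(timestamps: list[datetime]) -> list[list[datetime]]:
    """Group one user's event timestamps into sessions split on 30-minute gaps."""
    sessions: list[list[datetime]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions
```
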
  • Keystone pipeline streams
    • Video viewing activities (R.6)
    • UI activities (R.6)
    • Error logs (R.6)
    • Performance events (R.6)
    • Troubleshooting and diagnostic events (R.6)
  • Performance metrics
    • End-to-end latency, from production of an event until it reaches all sinks (R.11)
    • Processing lag for every consumer (R.11)
    • Payload sizes (R.11)
    • Compute resource utilization efficiency (R.11)
    • Checkpointing and failure recovery (R.11)
    • Ability to provide backpressure to sources without crashing (R.11)
    • Handling event bursts (R.11)
  • Operational metrics flow through a different pipeline than the Keystone pipeline (R.6)
  • The Keystone pipeline has been replaced with Data Mesh at Netflix (R.36)
  • Real-time is defined as sub-minute latency (R.6)
  • WAP (write-audit-publish) pattern: write to a hidden Iceberg snapshot, audit it, then publish (R.17)
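
A hedged sketch of this flow using Iceberg's WAP support in Spark SQL. It assumes a live `spark` session and a prepared `df`; the catalog/table names, the `wap.id` value, and the `audit_checks` helper are placeholders, and details vary across Iceberg versions:

```python
# Stage writes instead of publishing them directly (Iceberg WAP).
spark.sql("ALTER TABLE cat.db.events SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.conf.set("spark.wap.id", "etl-run-42")   # tags this job's snapshot as staged

df.writeTo("cat.db.events").append()           # snapshot exists, but readers don't see it

# Locate the staged snapshot via the snapshots metadata table.
staged_id = spark.sql(
    "SELECT snapshot_id FROM cat.db.events.snapshots "
    "WHERE summary['wap.id'] = 'etl-run-42'"
).first()[0]

# Audit: run checks (nulls, counts, distributions) against the hidden snapshot.
audit_checks(spark.sql(f"SELECT * FROM cat.db.events VERSION AS OF {staged_id}"))

# Publish: cherry-pick the audited snapshot into the table's current state.
spark.sql(f"CALL cat.system.cherrypick_snapshot('db.events', {staged_id})")
```
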
  • System events are treated as their own data streams (R.17)
  • Keystone stream: Kafka topic with a Flink "router" that is configured by the control plane to output to one or more sinks (R.17)
  • Sync data to Iceberg between steps for offline statistical analysis and backfilling (R.17)
  • Backfilling: retroactively processing historical records
    • Batch Kafka events by time and send to Iceberg table (R.24)
    • Basic technique: replay events from all batches at the same time (can break if application relies on ordering) (R.24)
    • Advanced technique: replay events from batches using a lateness tolerance level and local watermark + global watermark (R.24)
    • Two Flink stacks: production (real-time) and backfill (R.24)
    • Production (Kafka) / backfill (Iceberg) sources | application (Flink) | Kafka | Iceberg (R.24 @ 23:00)
    • Netflix sees 99.9% consistency with production after backfilling (R.24)
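
A toy Python model of the watermark-based replay. The five-minute lateness tolerance is an assumed value, and Netflix's Flink implementation is far more involved; this only shows the buffering idea:

```python
import heapq
from datetime import datetime, timedelta
from itertools import count

LATENESS = timedelta(minutes=5)  # assumed tolerance for out-of-order events

def replay(batches):
    """Replay events from time-ordered batches (e.g., hourly Iceberg partitions).
    Events are buffered until the watermark passes them, so an application that
    relies on ordering sees roughly the same sequence as in production."""
    buffer = []            # min-heap keyed on event time
    tiebreak = count()     # keeps heap comparisons away from the event dicts
    watermark = datetime.min
    for batch in batches:
        for event in batch:
            heapq.heappush(buffer, (event["ts"], next(tiebreak), event))
            watermark = max(watermark, event["ts"] - LATENESS)
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)[2]  # emit in event-time order
    while buffer:                           # flush whatever remains at the end
        yield heapq.heappop(buffer)[2]
```
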
  • A workflow is a set of tasks; a task can be a job, and a job can be an ETL job (R.27)
  • Table ready signal: point in time up to which data in a table is valid and ready (within the context of batch processing) (R.27)
  • Concurrent incremental pattern
    • Lower latency than hourly ETL (e.g., 10 minutes) (R.27)
    • Use a watermark for processed data in the source table (R.27)
    • A queuing service fires periodically and enqueues new partitions to trigger processing on them (R.27)
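
A rough sketch of that trigger loop; the `source` and `queue` objects are hypothetical stand-ins for a table client and the queuing service:

```python
import time
from datetime import datetime

def incremental_trigger_loop(source, queue, watermark: datetime,
                             interval_s: int = 600) -> None:
    """Every ~10 minutes, enqueue partitions that arrived after the watermark,
    then advance the watermark so each partition is processed exactly once."""
    while True:
        new_parts = source.partitions_after(watermark)  # hypothetical call
        for part in new_parts:
            queue.send(part)      # downstream workers process this partition
        if new_parts:
            watermark = max(p.created_at for p in new_parts)
        time.sleep(interval_s)
```
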
  • Extractor pattern: large dataset with many consumers interested in different subsets (also applies to streams)
    • Offer YAML DSL to combine multiple extraction jobs into one (R.27)
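
A sketch of what such a DSL might look like; the schema below (source/extracts/filter/columns) is invented for illustration:

```python
import yaml  # pip install pyyaml

# One scan of the large source dataset serves every declared extract.
SPEC = """
source: playback_events
extracts:
  - name: errors_only
    filter: "event_type == 'error'"
    columns: [ts, device_id, error_code]
  - name: ui_clicks
    filter: "event_type == 'ui_click'"
    columns: [ts, page, element]
"""

spec = yaml.safe_load(SPEC)
for extract in spec["extracts"]:
    # A real implementation would compile these into one combined job.
    print(extract["name"], "->", extract["filter"], extract["columns"])
```
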
  • Intermediate state: to handle unbounded windows, rollover unresolved state to the next window (e.g., sessionization) (R.27)
  • Flink is used to dice/prepare data streams for downstream, real-time data services (R.17)
  • Fronting Kafka clusters receive data from all producers and pass it through Flink routers to sinks (including secondary/consumer Kafka) (R.4, R.6)
  • If a message cannot be delivered by a producer after retries, it is dropped (R.4)
  • Archaius is used for dynamically configuring Kafka destinations in producers, but non-Java clients use a REST proxy to relay messages to Kafka clusters (R.4)
  • Downstream data services do not consume directly from fronting Kafka clusters, which keeps the load on those clusters predictable (R.4)
  • A dedicated ZooKeeper cluster is used for each Kafka cluster (R.4)
  • Kafka deployment configuration (R.4)
  • Backend services communicate through Kafka pub/sub (R.11)
  • Events have a standard format: UUID, type (CRUD), timestamp, payload (R.11)
  • Change data capture (CDC) propagates database changes as events (R.11)
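
A minimal version of that standard envelope as a Python dataclass:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Literal

@dataclass
class ChangeEvent:
    """Standard event format: UUID, CRUD type, timestamp, payload (R.11)."""
    event_type: Literal["CREATE", "READ", "UPDATE", "DELETE"]
    payload: dict[str, Any]
    event_id: uuid.UUID = field(default_factory=uuid.uuid4)
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```
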
  • Enrichers consume from Kafka, join the data with additional data from GraphQL/gRPC calls to other services, and then place the enriched data onto another Kafka topic (R.11)
  • Enrichers are created using Flink, RocksDB, and ksqlDB (R.11)
    • ksqlDB is more accessible than Flink but is not open source (it is source-available under the Confluent Community License) (R.2)
  • To avoid misordering events, producers send only the primary ID of the resource that changed. During the enrichment process, the source service is queried to get the up-to-date payload (delayed materialization) (R.11).
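
A schematic enricher loop; `consumer`, `producer`, and `fetch_current` are stand-ins for Kafka clients and the GraphQL/gRPC call to the owning service:

```python
def enrich_loop(consumer, producer, fetch_current) -> None:
    """Consume change events that carry only a primary ID, fetch the current
    payload from the source service (delayed materialization, so a misordered
    event can't apply a stale payload), and publish the enriched record."""
    for event in consumer:                   # e.g., messages from a CDC topic
        entity = fetch_current(event["id"])  # always reflects the latest state
        producer.send("enriched-topic", {**event, "entity": entity})
```
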
  • Hive for auditing (R.11)
  • Consumers must be idempotent and use a distributed cache with expiry to avoid repeating computation (R.11)
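
One common way to get that behavior is a set-if-absent key with a TTL in a shared cache; a sketch with Redis (the key prefix and TTL are arbitrary choices):

```python
import redis

cache = redis.Redis()  # distributed cache shared by all consumer instances

def handle_once(event_id: str, process, ttl_s: int = 86400) -> bool:
    """Run `process` at most once per event within the TTL window: SET with
    nx=True records the ID only if unseen, so redeliveries are skipped."""
    if cache.set(f"seen:{event_id}", 1, nx=True, ex=ttl_s):
        process()
        return True
    return False  # duplicate delivery: computation already done
```
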
  • Kappa architecture is becoming dominant (R.37)
  • Batch processing is now a downstream process in the streaming pipeline (R.37)
  • Sources: event sourcing (from applications) or CDC (from DBs) (R.37)
  • Principles: storage and compute separation, data platform component composability, single source of truth, cloud native (R.18)
  • CDC from OLTP stores (Cassandra, Amazon RDS, EVCache, CockroachDB) through Flink and Kafka to OLAP warehouse in S3 (R.18, R.39)
  • Iceberg tables stored with an S3 prefix matching the table name (R.18)
  • Producers should never block when sending to Kafka (R.31)
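
With confluent-kafka, for example, `produce()` is asynchronous and raises `BufferError` when the local queue is full; a non-blocking producer can catch that and drop rather than stall (the settings below are illustrative):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # Bound total delivery time (incl. retries); after that the message is
    # failed via the callback instead of blocking the application.
    "delivery.timeout.ms": 10000,
})

def on_delivery(err, msg):
    if err is not None:
        pass  # count/log the drop; Netflix tolerates <0.01% daily loss (R.4)

try:
    producer.produce("events", value=b"payload", on_delivery=on_delivery)
except BufferError:
    pass  # local queue full: drop rather than block the service (R.4, R.31)
producer.poll(0)  # serve delivery callbacks without blocking
```
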
  • Lambda architecture
    • Idea: data flows through two paths, batch and streaming (R.25)
    • Batch layer: complete, accurate, idempotent, pre-computes views (S3, Spark) (R.25)
    • Streaming layer: (Kafka, Flink, Spark Streaming, Storm) (R.25)
    • Serving layer: API/facade for the batch and streaming layer results (Cassandra, Redis, ZooKeeper) (R.25)
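
A toy illustration of how the serving layer might combine the two paths (names and data shapes are invented):

```python
def serve(key: str, batch_view: dict, stream_delta: dict) -> int:
    """Lambda serving layer sketch: answer = pre-computed batch view plus the
    speed layer's increments for data the batch run hasn't covered yet."""
    return batch_view.get(key, 0) + stream_delta.get(key, 0)
```
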
  • Kappa architecture
    • All processing done on a single real-time stream (no batch layer like in Lambda architecture) (R.26)
    • Data sources | stream processing | data store | applications (R.26)
    • Historical data is treated as a real-time stream (R.26)
  • Microservices architecture
    • Services > libraries as they are individually deployable (R.33)
    • A service may consist of multiple processes that are developed and deployed together (R.33)
    • Organized around business capability / domain rather than technology (R.33)
    • Split based on independent replacement and upgradeability (R.33)
    • If two services often change together, consider merging them (R.33)
    • Services may differ in their domain models; in other words, each has its own bounded context (R.42)
    • Product (entire lifecycle) teams rather than project teams (just development) (R.33)
    • Often eventual consistency over transactions (R.33)
    • CI/CD (R.33)
    • Design for failure of services (R.33)
    • Monitor for load and failures (R.33)
    • Be conservative in what you do, and liberal in what you accept from others (R.40)
  • Tooling technologies
    • Big data querying UI (R.17)
    • Maestro workflow and job scheduler (R.17)
    • Streaming platform as a service (control plane) (R.17)
    • Mantis for running ad-hoc queries against raw event data (observability) (R.17)
    • Atlas for telemetry (R.6)
    • Archaius library for static/dynamic configuration management (R.4)
    • Jenkins (R.5)
    • Spinnaker (R.17)
  • Data platform technologies
    • Iceberg (R.16, R.17, R.18)
    • Spark (SQL, Python, Scala) for ETL/batch pipelines (R.16, R.17, R.18)
    • Trino for interactive analytics (R.12, R.16, R.17, R.18)
    • Druid for real-time aggregated dashboards (R.16, R.17, R.18)
    • Snowflake (R.17)
    • Flink (R.17, R.18, R.36)
    • Titus (R.17)
    • Kafka (R.17, R.18, R.36)
    • Elasticsearch (R.17, R.36)
    • Cassandra (R.17, R.18)
    • ZooKeeper (R.4, R.31)
    • EVCache (R.5, R.18)
    • CockroachDB (R.5, R.18)
    • MySQL (R.5)
    • S3 (R.5, R.16)
    • RocksDB (R.11)
    • ksqlDB (R.11)
    • Avro (R.11, R.36)
    • Confluent Schema Registry (R.11)
    • Hive (replaced by Trino) (R.11, R.34)
    • Polars for single-node data manipulation (R.34)
    • Beam (R.37)
    • Amazon RDS (R.18)
  • Backend services technologies
    • gRPC (R.17)
    • Spring Boot (R.17)
    • Zuul (R.5)
    • Eureka (R.5)
  • Analytics technologies
    • Tableau (R.5, R.16)
    • Jupyter (R.18)
    • Apache Superset (R.38)
  • Testing
    • Native unit test libraries for UDFs (R.17, R.29)
    • Dataflow mocking tool for creating sampled inputs for unit tests (R.17)
    • Data Auditor (R.17)
    • Chaos Monkey (R.5)
    • Spark unit test library (R.17)
    • ScalaCheck / Hypothesis (R.29)
    • Base classes for writing tests with Spark (R.29)
    • Supports property testing (if unable to sample production data to generate test input) (R.29)
    • Audit the data: nulls, distributions, uniqueness, counts (R.29)
    • TensorFlow Data Validation (R.29)
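
For the audit step above (nulls, distributions, uniqueness, counts), a small Polars helper along these lines would cover the basics; the key-column check is an assumed convention:

```python
import polars as pl

def audit(df: pl.DataFrame, key: str) -> dict:
    """Basic data audit (R.29): row count, per-column null counts, key
    uniqueness, and summary statistics for the distributions."""
    return {
        "rows": df.height,
        "nulls": df.null_count().to_dicts()[0],
        "key_is_unique": df[key].n_unique() == df.height,
        "summary": df.describe(),
    }

# Example: flags the duplicate id and the null in column x.
print(audit(pl.DataFrame({"id": [1, 2, 2], "x": [1.0, None, 3.0]}), key="id"))
```
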

Ideas

  • Notifications
  • Recommendations, clustering, etc.
  • Combine multiple review sources