[TOC]
API
input: ProcessEvents(Id, eventtype, timerange)
-
[extend] event types: likes, views, comments
-
[extend] time range
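A minimal sketch of what the query API might look like; the Python names, types, and return value are assumptions layered on top of ProcessEvents(Id, eventtype, timerange):

```python
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):          # [extend]-able set of event types
    VIEW = "view"
    LIKE = "like"
    COMMENT = "comment"

@dataclass
class TimeRange:
    start_ts: int  # inclusive, epoch seconds
    end_ts: int    # exclusive, epoch seconds

def process_events(video_id: str, event_type: EventType, time_range: TimeRange) -> int:
    """Hypothetical signature mirroring ProcessEvents(Id, eventtype, timerange).

    Returns the aggregated count of `event_type` for `video_id` within `time_range`.
    """
    raise NotImplementedError  # served by the query path described below
```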
How will it be used? Real-time queries or a periodic/batch system? e.g. p99 latency, write-to-read data delay
consistency: eventual consistency
QPS
latency: real-time vs. periodic, e.g. p99 latency, write-to-read data delay
availability: important
read/write ratio: both read- and write-heavy
Scalability
Traffic Spike
MISC - depends on the specific system: for example, in financial/trading systems idempotency is very important and should be mentioned; in user-facing systems latency is very important.
Design Principles:
- Scalability: scale write/read
- Latency: make both writes and reads fast
- Consistency: what level of consistency
- Durability: not lose data in case of hardware faults and network partitions, and how to recover
- Extensible for data model changes in the future
Focus on scalability and availability (replication).
**Load balancer**: ZooKeeper provides the registry service - it monitors the health of the master and replicas and handles master failover.
Replicas provide data replication and increase read throughput.
The shard proxy caches query results, monitors instance health, publishes metrics, and terminates queries that take too long to return.
SQL can store the relations between tables, reduce data duplication, and avoid inconsistent data.
Features (Cassandra): fault-tolerant, scalable (read/write throughput increases linearly as new machines are added), multi-datacenter replication, works well with time-series data.
Gossip protocol: every second, each node shares its state with up to 3 other nodes, so every node learns about the whole cluster -> no dedicated proxy is needed.
Quorum reads/writes; the quorum level is configurable.
When a client queries a node (e.g. the nearest or fastest one), that node coordinates the request and returns the value once the quorum is reached.
Cassandra fields support thread-safe list and map types, so new data can keep being appended to the list.
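A hedged sketch of configurable quorum reads/writes using the DataStax Python driver; the hosts, keyspace `video_stats`, table `view_counts`, and its counter column `view_count` are assumptions:

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["cassandra-node-1", "cassandra-node-2"])  # assumed contact points
session = cluster.connect("video_stats")                     # assumed keyspace

# Write with QUORUM: a majority of replicas must acknowledge the increment.
write = SimpleStatement(
    "UPDATE view_counts SET view_count = view_count + 1 WHERE video_id = %s AND hour = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("video123", 2024010112))

# Read with QUORUM: the coordinator returns the value once a majority agrees.
read = SimpleStatement(
    "SELECT view_count FROM view_counts WHERE video_id = %s AND hour = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("video123", 2024010112)).one()
```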
The first approach keeps updating the DB as soon as it receives each event.
The second approach stores the data in memory and updates the DB later (risk: data loss if that single node fails).
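A minimal sketch of the second approach, assuming a `db_write` callable and a flush interval; counts accumulate in memory and are flushed periodically, so an unflushed snapshot is lost if the process dies:

```python
import threading
import time
from collections import defaultdict

class InMemoryCounter:
    """Pre-aggregate events in memory, flush to the DB on an interval (approach 2)."""

    def __init__(self, db_write, flush_interval_s: float = 5.0):
        self._counts = defaultdict(int)       # video_id -> count since last flush
        self._lock = threading.Lock()
        self._db_write = db_write             # assumed callable(dict) that persists a snapshot
        self._flush_interval_s = flush_interval_s

    def record(self, video_id: str) -> None:
        with self._lock:
            self._counts[video_id] += 1       # count the event instead of writing it through

    def run_flusher(self) -> None:
        while True:
            time.sleep(self._flush_interval_s)
            with self._lock:
                snapshot, self._counts = self._counts, defaultdict(int)
            if snapshot:
                self._db_write(snapshot)      # data buffered here is lost if the node fails first
```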
- Push data directly into the server: if the server fails, the data may be lost.
- The server pulls data from an MQ (message queue), so it can recalculate by replaying from an offset.
The consumer keeps a cache (with a lease) of the last 10 minutes of events so it can deduplicate them.
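A sketch of that dedup cache: remember the ids seen in the last 10 minutes with an expiry ("lease") and drop repeats; the event-id field and eviction strategy are assumptions:

```python
import time

class DedupCache:
    """Keep recently seen event ids with an expiry so duplicate events can be dropped."""

    def __init__(self, ttl_s: int = 600):      # 10-minute lease, per the note above
        self._ttl_s = ttl_s
        self._expiry_by_id = {}                # event_id -> expiry timestamp

    def is_duplicate(self, event_id: str) -> bool:
        now = time.time()
        # Lazy eviction of expired entries; a production cache would bound memory harder.
        self._expiry_by_id = {eid: exp for eid, exp in self._expiry_by_id.items() if exp > now}
        if event_id in self._expiry_by_id:
            return True
        self._expiry_by_id[event_id] = now + self._ttl_s
        return False
```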
The aggregator groups the events and calculates the sums.
The internal queue decouples the aggregator from the DB: writes to the database happen asynchronously, which improves throughput (multiple threads in the database writer).
A dead-letter queue helps when the database is unreachable or slow; alternatively, the database writer can store data locally (on disk).
State store: save the aggregator state as a checkpoint to avoid reprocessing a lot of data after a failure.
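A rough sketch tying these pieces together: the aggregator hands results to an internal queue, a pool of database-writer threads drains it, failed writes go to a dead-letter queue, and aggregator state plus the MQ offset are checkpointed. Queue sizes, thread count, and the `write_to_db` / `state_store_save` callables are assumptions.

```python
import queue
import threading

internal_queue = queue.Queue(maxsize=10_000)   # decouples the aggregator from the database
dead_letter_queue = queue.Queue()              # parks records the writer could not persist

def database_writer(write_to_db):
    """Drain the internal queue; on failure, move the record to the dead-letter queue."""
    while True:
        record = internal_queue.get()
        try:
            write_to_db(record)                # assumed callable that writes one record
        except Exception:
            dead_letter_queue.put(record)      # replayed later once the DB is healthy again
        finally:
            internal_queue.task_done()

def start_writers(write_to_db, num_threads: int = 4):
    # Multiple writer threads raise write throughput ("multiple threads in the database writer").
    for _ in range(num_threads):
        threading.Thread(target=database_writer, args=(write_to_db,), daemon=True).start()

def checkpoint(state_store_save, aggregator_state, consumer_offset):
    # Persist aggregator state + MQ offset so a restart does not reprocess everything.
    state_store_save({"state": aggregator_state, "offset": consumer_offset})
```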
Strategies to choose from, with pros and cons:
Blocking vs non-blocking I/O:
- Blocking: one thread per request, synchronous, easy to debug/trace, lower throughput.
- Non-blocking (multiplexing): one thread handles many piled-up requests; higher throughput, but it increases the complexity of the system.
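A tiny illustration of the two models, assuming hypothetical `handle(request)` / `handle_async(request)` functions:

```python
import asyncio
import threading

# Blocking: one thread per request; simple to reason about, limited by the thread count.
def serve_blocking(requests, handle):
    for req in requests:
        threading.Thread(target=handle, args=(req,)).start()

# Non-blocking: one event loop multiplexes many in-flight requests on a single thread.
async def serve_non_blocking(requests, handle_async):
    await asyncio.gather(*(handle_async(req) for req in requests))
```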
Buffering and batching: done at the gateway to reduce traffic and improve efficiency. But it increases complexity, and data may be unrecoverable if batch processing fails in the middle.
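A sketch of gateway-side buffering: events are buffered and sent as one batch when a size or time threshold is hit; `send_batch` and the thresholds are assumptions, and anything still in the buffer is lost if the process crashes mid-batch (the trade-off noted above).

```python
import time

class BatchingGateway:
    def __init__(self, send_batch, max_batch_size: int = 100, max_delay_s: float = 1.0):
        self._send_batch = send_batch          # assumed callable(list_of_events)
        self._max_batch_size = max_batch_size
        self._max_delay_s = max_delay_s
        self._buffer = []
        self._last_flush = time.time()

    def submit(self, event) -> None:
        self._buffer.append(event)
        if (len(self._buffer) >= self._max_batch_size
                or time.time() - self._last_flush >= self._max_delay_s):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._send_batch(self._buffer)     # one batched request instead of many small ones
        self._buffer = []
        self._last_flush = time.time()
```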
Timeout: Connection timeout and request timeout:
- Retry after timeout
- Change to another available machine
- Retry storm: when many clients retry at the same time, they can overload the server
- Improvements (sketched below):
- Exponential backoff and jitter (randomly spread out the retry intervals).
- Circuit breaker: stops a client from repeatedly trying an operation that is likely to fail. The idea is to count how many recent requests failed (within a time window); if that exceeds the threshold, stop calling the service.
- Drawbacks: harder to test, hard to set the error threshold and timers properly.
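A hedged sketch of both improvements; attempt counts, delays, thresholds, and the `operation` callable are assumptions:

```python
import random
import time

def call_with_retries(operation, max_attempts: int = 5, base_delay_s: float = 0.1):
    """Retry with exponential backoff plus full jitter to avoid a retry storm."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            backoff = base_delay_s * (2 ** attempt)
            time.sleep(random.uniform(0, backoff))     # jitter: randomize the retry interval

class CircuitBreaker:
    """Stop calling an operation that has failed too many times recently."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if time.time() - self._opened_at < self._reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self._opened_at = None                     # half-open: allow one probe call
            self._failures = 0
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = time.time()          # trip the breaker
            raise
        self._failures = 0
        return result
```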
-
Hardware
- Pros: powerful, can handle high throughput
- Cons: Expensive
-
Software
- Pros: open source, free, many types/algorithms to choose from
- TCP: cannot inspect the packet contents
- HTTP: can inspect the content and make decisions based on it
- Algorithms: round robin, least live connections, fastest response time, hash-based.
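Toy versions of two of the listed algorithms, round robin and hash-based selection; the backend list is a placeholder:

```python
import hashlib
import itertools

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # placeholder backend pool

# Round robin: cycle through the servers in order.
_rr = itertools.cycle(servers)
def pick_round_robin() -> str:
    return next(_rr)

# Hash-based: the same key (e.g. a video id) always lands on the same server.
def pick_by_hash(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]
```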
-
DNS
- The partitioner service is registered in DNS
- Monitor the health of the partitioner
- Primary and secondary nodes reside in different data centers to improve availability: the primary takes requests, the secondary works as a backup.
Hot partitions, with unbalanced load
- hashing by video id or event time alone is not a good choice.
- split a hot partition into multiple new partitions (see the sketch below).
- allocate dedicated partitions just for certain hot videos.
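A sketch of splitting a hot key across several partitions by appending a small random suffix; readers must then merge the sub-partitions. The hot-video list and split factor are assumptions:

```python
import hashlib
import random

NUM_PARTITIONS = 64
HOT_KEY_SPLIT = 8                       # sub-partitions a hot video is spread over
HOT_VIDEO_IDS = {"superbowl-clip"}      # assumed list of known hot videos

def partition_for(video_id: str) -> int:
    key = video_id
    if video_id in HOT_VIDEO_IDS:
        # Spread the hot key over several partitions; readers query all suffixes and merge.
        key = f"{video_id}#{random.randrange(HOT_KEY_SPLIT)}"
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % NUM_PARTITIONS
```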
Service discovery -> ZooKeeper
- maintains the replication info
- monitors health and replaces failed nodes with backups.
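A hedged sketch of registering an instance in ZooKeeper with an ephemeral node via the kazoo client; the paths, host strings, and instance address are assumptions. The node disappears automatically when the instance's session dies, which is what health monitoring and backup replacement build on.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")        # assumed ZooKeeper ensemble
zk.start()

# Ephemeral, sequential node: removed automatically if this instance's session dies.
zk.ensure_path("/services/aggregator")
zk.create("/services/aggregator/instance-", b"10.0.0.5:9090",
          ephemeral=True, sequence=True)

# Watch the membership list to react to failures and bring in backups.
@zk.ChildrenWatch("/services/aggregator")
def on_members_changed(children):
    print("live aggregator instances:", children)
```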
Replication: leader and follower.
Message format
- Textual formats (XML, CSV, JSON): simple and readable, but they waste some space encoding the field names.
- Binary formats: save the space spent encoding field names, but both the server and the client must know the data structure in advance.
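A quick illustration of the trade-off: JSON repeats the field names in every message, while a fixed binary layout (here via `struct`) is smaller but assumes both sides already agree on the schema. The field names and layout are made up for the example:

```python
import json
import struct

event = {"video_id": 123456, "event_type": 1, "timestamp": 1700000000}

textual = json.dumps(event).encode()        # field names travel with every message
binary = struct.pack("!QBI", event["video_id"], event["event_type"], event["timestamp"])

print(len(textual), "bytes as JSON")                 # roughly 60 bytes
print(len(binary), "bytes as fixed binary layout")   # 13 bytes, but the schema must be known upfront
```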
Extension:
- Save the historical data for analysis
- CDN to speed up delivery
- Pre-cache some results (with a lease).
Save the historical data for analysis
-
Roll up the granularity of the tables for time-series data:
hour -> day -> month -> year
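A sketch of rolling hourly rows up into daily rows so older time-series data is kept at coarser granularity; the row shape and the `write_daily` callable are assumptions:

```python
from collections import defaultdict
from datetime import datetime, timezone

def roll_up_hour_to_day(hourly_rows, write_daily):
    """hourly_rows: iterable of (video_id, hour_ts, count); emits one row per (video_id, day)."""
    daily = defaultdict(int)
    for video_id, hour_ts, count in hourly_rows:
        day = datetime.fromtimestamp(hour_ts, tz=timezone.utc).date()
        daily[(video_id, day)] += count
    for (video_id, day), count in daily.items():
        write_daily(video_id, day, count)   # hourly rows can then be dropped or archived
```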
-
Store the cold data in object storage such as AWS S3; cold storage holds archived and infrequently accessed data.