[TOC]
API
input: ProcessEvents(Id, eventtype, timerange)
-
[extend] event types: likes, views, comments
-
[extend] time range
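A minimal sketch of what the query API might look like; the Python names, types, and return value are assumptions layered on top of ProcessEvents(Id, eventtype, timerange):

```python
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):          # [extend]-able set of event types
    VIEW = "view"
    LIKE = "like"
    COMMENT = "comment"

@dataclass
class TimeRange:
    start_ts: int  # inclusive, epoch seconds
    end_ts: int    # exclusive, epoch seconds

def process_events(video_id: str, event_type: EventType, time_range: TimeRange) -> int:
    """Hypothetical signature mirroring ProcessEvents(Id, eventtype, timerange).

    Returns the aggregated count of `event_type` for `video_id` within `time_range`.
    """
    raise NotImplementedError  # served by the query path described below
```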
How will it be used? Real-time queries or a periodic/batch system? e.g. p99 latency, write-to-read data delay
consistency: eventual consistency
QPS
latency: real-time vs. periodic, e.g. p99 latency, write-to-read data delay
availability: important
read/write ratio: both read- and write-heavy
Scalability
Traffic Spike
MISC - depends on the specific system: for example, in financial/trading systems idempotency is very important and should be mentioned; in user-facing systems latency is very important.
Design Principles:
- Scalability: scale write/read
- Latency: make both writes and reads fast
- Consistency: what level of consistency
- Durability: not lose data in case of hardware faults and network partitions, and how to recover
- Extensible for data model changes in the future
Focus on scalability and availability (replication).
**Load balancer**: ZooKeeper provides the registry service - it monitors the health of the master and replicas and handles master failover.
Replicas provide data replication and increase read throughput.
The shard proxy caches query results, monitors instance health, publishes metrics, and terminates queries that take too long to return.
SQL can store the relations between tables, reduce data duplication, and avoid inconsistent data.
Features (Cassandra): fault-tolerant, scalable (read/write throughput increases linearly as new machines are added), multi-datacenter replication, works well with time-series data.
Gossip protocol: every second, each node shares its state with up to 3 other nodes, so every node learns about the whole cluster -> no dedicated proxy is needed.
Quorum reads/writes; the quorum level is configurable.
When a client queries a node (e.g. the nearest or fastest one), that node coordinates the request and returns the value once the quorum is reached.
Cassandra fields support thread-safe list and map types, so new data can keep being appended to the list.
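A hedged sketch of configurable quorum reads/writes using the DataStax Python driver; the hosts, keyspace `video_stats`, table `view_counts`, and its counter column `view_count` are assumptions:

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["cassandra-node-1", "cassandra-node-2"])  # assumed contact points
session = cluster.connect("video_stats")                     # assumed keyspace

# Write with QUORUM: a majority of replicas must acknowledge the increment.
write = SimpleStatement(
    "UPDATE view_counts SET view_count = view_count + 1 WHERE video_id = %s AND hour = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("video123", 2024010112))

# Read with QUORUM: the coordinator returns the value once a majority agrees.
read = SimpleStatement(
    "SELECT view_count FROM view_counts WHERE video_id = %s AND hour = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("video123", 2024010112)).one()
```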
The first approach keeps updating the DB as soon as it receives each event.
The second approach stores the data in memory and updates the DB later (risk: data loss if that single node fails).
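A minimal sketch of the second approach, assuming a `db_write` callable and a flush interval; counts accumulate in memory and are flushed periodically, so an unflushed snapshot is lost if the process dies:

```python
import threading
import time
from collections import defaultdict

class InMemoryCounter:
    """Pre-aggregate events in memory, flush to the DB on an interval (approach 2)."""

    def __init__(self, db_write, flush_interval_s: float = 5.0):
        self._counts = defaultdict(int)       # video_id -> count since last flush
        self._lock = threading.Lock()
        self._db_write = db_write             # assumed callable(dict) that persists a snapshot
        self._flush_interval_s = flush_interval_s

    def record(self, video_id: str) -> None:
        with self._lock:
            self._counts[video_id] += 1       # count the event instead of writing it through

    def run_flusher(self) -> None:
        while True:
            time.sleep(self._flush_interval_s)
            with self._lock:
                snapshot, self._counts = self._counts, defaultdict(int)
            if snapshot:
                self._db_write(snapshot)      # data buffered here is lost if the node fails first
```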
- Push data directly into the server: if the server fails, the data may be lost.
- The server pulls data from an MQ (message queue), so it can recalculate by replaying from an offset.
The consumer keeps a cache (with a lease) of the last 10 minutes of events so it can deduplicate them.
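A sketch of that dedup cache: remember the ids seen in the last 10 minutes with an expiry ("lease") and drop repeats; the event-id field and eviction strategy are assumptions:

```python
import time

class DedupCache:
    """Keep recently seen event ids with an expiry so duplicate events can be dropped."""

    def __init__(self, ttl_s: int = 600):      # 10-minute lease, per the note above
        self._ttl_s = ttl_s
        self._expiry_by_id = {}                # event_id -> expiry timestamp

    def is_duplicate(self, event_id: str) -> bool:
        now = time.time()
        # Lazy eviction of expired entries; a production cache would bound memory harder.
        self._expiry_by_id = {eid: exp for eid, exp in self._expiry_by_id.items() if exp > now}
        if event_id in self._expiry_by_id:
            return True
        self._expiry_by_id[event_id] = now + self._ttl_s
        return False
```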
The aggregator groups the events and calculates the sums.
The internal queue decouples the aggregator from the DB: writes to the database happen asynchronously, which improves throughput (multiple threads in the database writer).
A dead-letter queue helps when the database is unreachable or slow; alternatively, the database writer can store data locally (on disk).
State store: save the aggregator state as a checkpoint to avoid reprocessing a lot of data after a failure.
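A rough sketch tying these pieces together: the aggregator hands results to an internal queue, a pool of database-writer threads drains it, failed writes go to a dead-letter queue, and aggregator state plus the MQ offset are checkpointed. Queue sizes, thread count, and the `write_to_db` / `state_store_save` callables are assumptions.

```python
import queue
import threading

internal_queue = queue.Queue(maxsize=10_000)   # decouples the aggregator from the database
dead_letter_queue = queue.Queue()              # parks records the writer could not persist

def database_writer(write_to_db):
    """Drain the internal queue; on failure, move the record to the dead-letter queue."""
    while True:
        record = internal_queue.get()
        try:
            write_to_db(record)                # assumed callable that writes one record
        except Exception:
            dead_letter_queue.put(record)      # replayed later once the DB is healthy again
        finally:
            internal_queue.task_done()

def start_writers(write_to_db, num_threads: int = 4):
    # Multiple writer threads raise write throughput ("multiple threads in the database writer").
    for _ in range(num_threads):
        threading.Thread(target=database_writer, args=(write_to_db,), daemon=True).start()

def checkpoint(state_store_save, aggregator_state, consumer_offset):
    # Persist aggregator state + MQ offset so a restart does not reprocess everything.
    state_store_save({"state": aggregator_state, "offset": consumer_offset})
```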
Strategies to choose from, with pros and cons:
Blocking vs non-blocking I/O:
- Blocking: one thread per request, synchronous, easy to debug/trace, lower throughput.
- Non-blocking (multiplexing): one thread handles many piled-up requests; higher throughput, but it increases the complexity of the system.
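A tiny illustration of the two models, assuming hypothetical `handle(request)` / `handle_async(request)` functions:

```python
import asyncio
import threading

# Blocking: one thread per request; simple to reason about, limited by the thread count.
def serve_blocking(requests, handle):
    for req in requests:
        threading.Thread(target=handle, args=(req,)).start()

# Non-blocking: one event loop multiplexes many in-flight requests on a single thread.
async def serve_non_blocking(requests, handle_async):
    await asyncio.gather(*(handle_async(req) for req in requests))
```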
Buffering and batching: done at the gateway to reduce traffic and improve efficiency. But it increases complexity, and data may be unrecoverable if batch processing fails in the middle.
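A sketch of gateway-side buffering: events are buffered and sent as one batch when a size or time threshold is hit; `send_batch` and the thresholds are assumptions, and anything still in the buffer is lost if the process crashes mid-batch (the trade-off noted above).

```python
import time

class BatchingGateway:
    def __init__(self, send_batch, max_batch_size: int = 100, max_delay_s: float = 1.0):
        self._send_batch = send_batch          # assumed callable(list_of_events)
        self._max_batch_size = max_batch_size
        self._max_delay_s = max_delay_s
        self._buffer = []
        self._last_flush = time.time()

    def submit(self, event) -> None:
        self._buffer.append(event)
        if (len(self._buffer) >= self._max_batch_size
                or time.time() - self._last_flush >= self._max_delay_s):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._send_batch(self._buffer)     # one batched request instead of many small ones
        self._buffer = []
        self._last_flush = time.time()
```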
Timeout: Connection timeout and request timeout:
- Retry after timeout
- Change to another available machine
- Retry storm: when many clients retry at the same time, they can overload the server
- Improvements (sketched below):
- Exponential backoff and jitter (randomly spread out the retry intervals).
- Circuit breaker: stops a client from repeatedly trying an operation that is likely to fail. The idea is to count how many recent requests failed (within a time window); if that exceeds the threshold, stop calling the service.
- Drawbacks: harder to test, hard to set the error threshold and timers properly.
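A hedged sketch of both improvements; attempt counts, delays, thresholds, and the `operation` callable are assumptions:

```python
import random
import time

def call_with_retries(operation, max_attempts: int = 5, base_delay_s: float = 0.1):
    """Retry with exponential backoff plus full jitter to avoid a retry storm."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            backoff = base_delay_s * (2 ** attempt)
            time.sleep(random.uniform(0, backoff))     # jitter: randomize the retry interval

class CircuitBreaker:
    """Stop calling an operation that has failed too many times recently."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if time.time() - self._opened_at < self._reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self._opened_at = None                     # half-open: allow one probe call
            self._failures = 0
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = time.time()          # trip the breaker
            raise
        self._failures = 0
        return result
```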
-
Hardware
- Pros: powerful, can handle high throughput
- Cons: Expensive
-
Software
- Pros: open source, free, many types/algorithms to choose from
- TCP: cannot inspect the packet contents
- HTTP: can inspect the content and make decisions based on it
- Algorithms: round robin, least live connections, fastest response time, hash-based.
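Toy versions of two of the listed algorithms, round robin and hash-based selection; the backend list is a placeholder:

```python
import hashlib
import itertools

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # placeholder backend pool

# Round robin: cycle through the servers in order.
_rr = itertools.cycle(servers)
def pick_round_robin() -> str:
    return next(_rr)

# Hash-based: the same key (e.g. a video id) always lands on the same server.
def pick_by_hash(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]
```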
-
DNS
- The partitioner service is registered in DNS
- Monitor the health of the partitioner
- Primary and secondary nodes reside in different data centers to improve availability: the primary takes requests, the secondary works as a backup.
Hot partitions, with unbalanced load
- hashing by video id or event time alone is not a good choice.
- split a hot partition into multiple new partitions (see the sketch below).
- allocate dedicated partitions just for certain hot videos.
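A sketch of splitting a hot key across several partitions by appending a small random suffix; readers must then merge the sub-partitions. The hot-video list and split factor are assumptions:

```python
import hashlib
import random

NUM_PARTITIONS = 64
HOT_KEY_SPLIT = 8                       # sub-partitions a hot video is spread over
HOT_VIDEO_IDS = {"superbowl-clip"}      # assumed list of known hot videos

def partition_for(video_id: str) -> int:
    key = video_id
    if video_id in HOT_VIDEO_IDS:
        # Spread the hot key over several partitions; readers query all suffixes and merge.
        key = f"{video_id}#{random.randrange(HOT_KEY_SPLIT)}"
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % NUM_PARTITIONS
```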
Service discovery -> ZooKeeper
- maintains the replication info
- monitors health and replaces failed nodes with backups.
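A hedged sketch of registering an instance in ZooKeeper with an ephemeral node via the kazoo client; the paths, host strings, and instance address are assumptions. The node disappears automatically when the instance's session dies, which is what health monitoring and backup replacement build on.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")        # assumed ZooKeeper ensemble
zk.start()

# Ephemeral, sequential node: removed automatically if this instance's session dies.
zk.ensure_path("/services/aggregator")
zk.create("/services/aggregator/instance-", b"10.0.0.5:9090",
          ephemeral=True, sequence=True)

# Watch the membership list to react to failures and bring in backups.
@zk.ChildrenWatch("/services/aggregator")
def on_members_changed(children):
    print("live aggregator instances:", children)
```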
Replication: leader and follower.
Message format
- Textual formats (XML, CSV, JSON): simple and readable, but they waste some space encoding the field names.
- Binary formats: save the space spent encoding field names, but both the server and the client must know the data structure in advance.
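A quick illustration of the trade-off: JSON repeats the field names in every message, while a fixed binary layout (here via `struct`) is smaller but assumes both sides already agree on the schema. The field names and layout are made up for the example:

```python
import json
import struct

event = {"video_id": 123456, "event_type": 1, "timestamp": 1700000000}

textual = json.dumps(event).encode()        # field names travel with every message
binary = struct.pack("!QBI", event["video_id"], event["event_type"], event["timestamp"])

print(len(textual), "bytes as JSON")                 # roughly 60 bytes
print(len(binary), "bytes as fixed binary layout")   # 13 bytes, but the schema must be known upfront
```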
Extension:
- Save the historical data for analysis
- CDN to speed up delivery
- Pre-cache some results (with a lease).
Save the historical data for analysis
-
Roll up the granularity of the tables for time-series data:
hour -> day -> month -> year
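A sketch of rolling hourly rows up into daily rows so older time-series data is kept at coarser granularity; the row shape and the `write_daily` callable are assumptions:

```python
from collections import defaultdict
from datetime import datetime, timezone

def roll_up_hour_to_day(hourly_rows, write_daily):
    """hourly_rows: iterable of (video_id, hour_ts, count); emits one row per (video_id, day)."""
    daily = defaultdict(int)
    for video_id, hour_ts, count in hourly_rows:
        day = datetime.fromtimestamp(hour_ts, tz=timezone.utc).date()
        daily[(video_id, day)] += count
    for (video_id, day), count in daily.items():
        write_daily(video_id, day, count)   # hourly rows can then be dropped or archived
```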
-
Store the cold data in object storage such as AWS S3; cold storage holds archived and infrequently accessed data.