- How big is the data
- Key ( Latitude 37.40, Longtitude -122.09 )
- Each key size < 20B
- Total key size = 200GB
- Value ( pic and all the building name on this pic )
- Each value size = 100KB
- Total value size = 1PB
- Key ( Latitude 37.40, Longtitude -122.09 )
- App client + Web servers + Storage service
- Hashmap
- Only in memory
- Database (SQL, noSQL)
- Good but no perfect
- Usually optimized for writing
- GFS
- Cannot support key, value lookup
- Only a single file sorted by key stored in GFS
- Memory: index and file address.
- Chunk index table (Key, Chunk index)
- Given a key How do we know which chunk we should read
- 20B * 10 billion = 200G. Can be stored inside memory.
- Cache
- Check index for the given key
- Binary search within the file
- Master has consistent hashmap
- Shard the key according to latitude/longtitude
- Actual do not need the master because consistent hashmap could be stored directly in the web server.
- Slave
- Client sends lookup request key K to web server.
- Web server checks its local consistent hashmap and finds the slave server Id.
- Web server sends the request key K to the slave server.
- Slave server looks up its chunk table (Key, Chunk) table by with Key K and get the chunk index.
- Slave server checks the cache to see whether the specific chunk is already inside the cache.
- If not inside the cache, the slave server asks the specific chunk from GFS by chunk index.
- Read or write intensive
- Whether to optimize read operations
- Large amounts of data
- Whether needs sharding
- value get(Key)
- set(key, value)
- Modify existing entry (key, value)
- Create new entry (key, value)
- Sorted file with (Key, Value) entries
- Disk-based binary search based read O(lgn)
- Linear read operations write O(n)
- Unsorted file with (Key, Value) entries
- Linear read operations O(n)
- Constant time write O(1)
- Combine append-only write and binary search read
- Break the large table into a list of smaller tables 0~N
- 0~N-1 th tables are all stored in disk in sorted order as File 0 ~ File N-1.
- Nth table is stored in disk unsorted as File N.
- Have a in-memory table mapping mapping tables/files to its address.
- Break the large table into a list of smaller tables 0~N
- Write: O(1)
- Write directly goes to the Nth table/file.
- If the Nth table is full, sort it and write it to disk. And then create a new table/file.
- Read: O(n)
- Linearly scan through the Nth table.
- If cannot find, perform binary search on N-1, N-2, ..., 0th.
- Disk-based approach vs in-memory approach
- Disk-based approach: All data Once disk reading + disk writing + in-memory sorting
- In-memory approach: All data Once disk writing + in-memory sorting
- What if memory is lost?
- Problem: Nth in memory table is lost.
- Write ahead log / WAL: The WAL is the lifeline that is needed when disaster strikes. Similar to a BIN log in MySQL it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails the whole operation must be considered a failure. Have a balance between between latency and durability.
- Consume too much disk space due to repetitive entries (Key, Value)
- Have a background process doing K-way merge for the sorted tables regularly
- Each sorted table should have an index inside memory.
- The index is a sketch of key value pairs
- More advanced way to build index with B tree.
- Each sorted table should have a bloomfilter inside memory.
- Accuracy of bloom filter
- Number of hash functions
- Length of bit vector
- Number of stored entries
- In-memory table: In-memory skip list
- 1~N-1th disk-based tables: Sstable
- Tablet server: Slave server
- First check the Key inside in-memory skip list.
- Check the bloom filter for each file and decide which file might have this key.
- Use the index to find the value for the key.
- Read and return key, value pair.
- Record the write operation inside write ahead log.
- Write directly goes to the in-memory skip list.
- If the in-memory skip list reaches its maximum capacity, sort it and write it to disk as a Sstable. At the same time create index and bloom filter for it.
- Then create a new table/file.
- Master has the hashmap [Key, server address]
- Slave is responsible for storing data
- Client sends request of reading Key K to master server.
- Master returns the server index by checking its consistent hashmap.
- Client sends request of Key to slave server.
- First check the Key pair inside memory.
- Check the bloom filter for each file and decide which file might have this key.
- Use the index to find the value for the key.
- Read and return key, value pair
- Clients send request of writing pair K,V to master server.
- Master returns the server index
- Clients send request of writing pair K,V to slave server.
- Slave records the write operation inside write ahead log.
- Slave writes directly go to the in-memory skip list.
- If the in-memory skip list reaches its maximum capacity, sort it and write it to disk as a Sstable. At the same time create index and bloom filter for it.
- Then create a new table/file.
- Replace local disk with GFS for
- Disk size
- Replica
- Failure and recovery
- Write ahead log and SsTable are all stored inside GFS.
- How to write SsTable to GFS
- Divide SsTable into multiple chunks (64MB) and store each chunk inside GFS.
- How to write SsTable to GFS
- GFS is added as an additional layer
- Master server also has a distributed lock (such as Chubby/Zookeeper)
- Distributed lock
- Consistent hashmap is stored inside the lock server
- Client sends request of reading Key K to master server.
- Master server locks the key. Returns the server index by checking its consistent hashmap.
- Client sends request of Key to slave server.
- First check the Key pair inside memory.
- Check the bloom filter for each file and decide which file might have this key.
- Use the index to find the value for the key.
- Read and return key, value pair
- Read process finishes. Slave notifies the client.
- The client notifies the master server to unlock the key.
- Clients send request of writing pair K,V to master server.
- Master server locks the key. Returns the server index.
- Clients send request of writing pair K,V to slave server.
- Slave records the write operation inside write ahead log.
- Slave writes directly go to the in-memory skip list.
- If the in-memory skip list reaches its maximum capacity, sort it and write it to disk as a Sstable. At the same time create index and bloom filter for it.
- Then create a new table/file.
- Write process finishes. Slave notifies the client.
- The client notifies the master server to unlock the key.