Questions about the implementation of HSE 3.0, e.g., data structures #585
Replies: 3 comments
-
c0 and cn are references to Log-Structured Merge Tree components. c0 is the in-memory data, and cn is the on on-media component. cn is composed of a tree where data gets ingested into the root node and N (1024 might be the max) leaf nodes. Each node is composed of kvsets. Kvsets are composed of 32MB "mblocks" in the form of key blocks (kblocks), value blocks (vblocks), and a single header block (hblock). I believe an mblock is analogous to an SSTable in RocksDB. So to answer the first question, cn isn't an abbreviation. |
Beta Was this translation helpful? Give feedback.
-
Got it. Another question: |
Beta Was this translation helpful? Give feedback.
-
A few clarifications. HSE is based on LSM concepts, and we use some of the terminology from the original LSM paper, but the implementation is very different. An active (open) HSE KVDB consists of c0 in-memory data and cN on-media (persistent) data, as defined below, along with an associated set of WAL files. An inactive (closed) KVDB has no c0 component. HSE c0 is the in-memory portion that holds newly ingested data. The term c0 comes from the original LSM paper, and it is the logical equivalent of memtables in RocksDB. c0 is organized as a collection of Bonsai trees, which is an RCU-friendly structure, and we use RCU to coordinate accesses and updates to c0. See this paper for a description of Bonsai trees. HSE cN is the on-media portion that holds comitted persistent data. The term cN is loosely based on the original LSM paper which refers to c1, c2, ..., cK trees on media so we chose to call our complete on-media structure cN. HSE cN is logically equivalent to the complete collection of sstables in all levels in RocksDB. cN is organized as a collection of two-level trees, with one such tree per KVS. Each cN tree has a root node with a variable number of child nodes, where this number can grow and shrink based on the amount of data stored. The lexicographic key ranges of child nodes in a cN tree are non-overlapping. I.e., the range of keys in Child(J) of a given cN tree never overlaps with the range of keys in Child(K) of that same tree. Each cN node is organized as a collection of kvsets, where each kvset is the logical equivalent of an sstable in RocksDB. And as @tristan957 notes above, each kvset is organized as a collection of key blocks (kblocks) and value blocks (vblocks) which we refer to generically as media blocks (mblocks). HSE has to do compaction to keep data organized and eliminate garbage, like RocksDB and all other systems based on LSM concepts. HSE has several compaction methods that are applied in different circumstances, and we have a lot of flexibility in how we do compactions because we separate keys and values on media. Just one example is what we call K-compaction where we compact (read/rewrite) the keys across multiple kvsets in a cN node but do not compact the associated values (which of course leaves value garbage in place). K-compaction improves look-up performance for the target kvsets with minimal write/read-amplificaiton. This is a very high-level description, but hopefully it gives you a mental image that will facilitate reading the code. |
Beta Was this translation helpful? Give feedback.
-
What dose cn mean and what is its full name in the HSE implementation?
RocksDB use memtable and leveln SST to implement its LSM structure. It seems that c0 works like the memtable and cn the leveln SST in the HSE.
Beta Was this translation helpful? Give feedback.
All reactions