[WIP] Tianmu In Memory Engine
This is a part of #436.
The in-memory engine is introduced in StoneDB version 2.0. In version 1.0, StoneDB only supported Tianmu as an on-disk, column-based engine, which acts as another primary engine parallel to InnoDB. In StoneDB version 2.0, we use the secondary engine, a feature introduced in MySQL 8.0.2.
Before we describe the Tianmu in-memory engine, we want to say a few more words on this topic. Several solutions are already available; for example, version 1.0 achieves this by running two primary engines side by side.
- Firstly, the secondary engine is a MySQL framework used to provide multi-engine capability. With its universal interfaces and framework, MySQL can route each workload to the corresponding engine according to the workload type, leveraging the strengths of each engine to deliver excellent service. The secondary engine is also an opportunity for MySQL to enhance its multi-model capability, for example by making ClickHouse a secondary engine to provide analytical services.
- Secondly, it is natural to route some of the sub-work to the secondary engine while the main work is done in the primary engine.
- Thirdly, as we know, the secondary engine feature is already used by Oracle in its online service, MySQL HeatWave, an in-memory query accelerator with built-in ML.
It increases MySQL performance by orders of magnitude for analytics and mixed workloads, without any changes to current applications. With HeatWave enabled, MySQL HeatWave is 6.5X faster than Amazon Redshift at half the cost, 7X faster than Snowflake at one-fifth the cost, and 1,400X faster than Amazon Aurora at half the cost. Customers run analytics on data that's stored in MySQL databases without a separate analytics database and ETL duplication. (https://www.oracle.com/mysql/heatwave/)
Before giving the answer, let's first talk about some of the challenges an analytical processing system faces.
Traditionally, obtaining good performance for analytic queries meant satisfying several requirements. In a typical data warehouse or mixed-use database, requirements include the following: (1) You must understand user access patterns; (2) You must provide good performance, which typically requires creating indexes, materialized views, and OLAP cubes.
To improve the performance of StoneDB, version 1.0 uses a column-based data format. The column-based format organizes data by column, not by row. For example, in a large sales table, the sales IDs reside in one column, and the sales regions reside in a different column.
Analytical workloads usually access only a few columns while scanning, but a scan operation fetches the entire data set. For this reason, the column-based format is the most efficient for analytical workloads: because the columns are stored separately, an analytical query can access only the columns it needs and avoid reading inessential data. For instance, a report on sales totals by region can rapidly process many rows while accessing only a few columns.
Database systems typically force users to choose between a column-based and row-based format. For example, if the data format is column-based, then the database stores data in column-based format both in memory and on disk. Gaining the advantages of one format means losing the advantages of the alternate format.
Hence, applications either achieve rapid analytics or rapid transactions, but not both. The performance problems of mixed-workload databases are not solved by storing data in just one format.
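To make the trade-off concrete, the following minimal C++ sketch (illustrative only; these names are not from the StoneDB code base) contrasts a row-oriented layout with a column-oriented layout for the sales example above. Summing the amount column only touches one contiguous array in the columnar layout, while the row layout drags every column through memory.

#include <cstdint>
#include <numeric>
#include <vector>

// Row-oriented layout: all columns of a row are stored together.
struct SaleRow {
  int64_t sale_id;
  int32_t region_id;
  double amount;
};
using RowStore = std::vector<SaleRow>;  // scanning one column reads all of them

// Column-oriented layout: each column is a separate contiguous array.
struct SaleColumns {
  std::vector<int64_t> sale_id;
  std::vector<int32_t> region_id;
  std::vector<double> amount;
};

// An analytic aggregate such as total sales only reads the `amount` column.
double total_sales(const SaleColumns &cols) {
  return std::accumulate(cols.amount.begin(), cols.amount.end(), 0.0);
}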
Based on what we discussed above, in version 2.0 we introduce a new data-format engine, an in-memory column-based store, for analytical workloads. This in-memory column-based engine is also called Tianmu, as in version 1.0.
The in-memory feature set includes the IM column store, advanced query optimizations, and availability solutions. These features accelerate analytic query performance by orders of magnitude without sacrificing OLTP performance or availability.
In #436, some brief descriptions of the in-memory column-based engine are given. The data is compressed and encoded before being loaded into the in-memory column-based engine. Not all types of data are suitable for encoding and compression. In #423, we define which types of data can be encoded and compressed.
The in-memory column-based engine holds copies of tables, partitions, or columns in a compressed columnar format that is optimized for scan operations.
In MySQL, the InnoDB buffer pool is typically in the multi-gigabyte range, and its memory is distributed across different NUMA nodes. Cross-NUMA access is a performance bottleneck on multi-core systems, so the memory allocation algorithm (or policy) across NUMA nodes should be chosen carefully. In innobase/buf/buf0buf.cc, the buf_block_alloc function allocates a buffer block and makes sure allocations are spread over all buffer pool instances.
buf_block_t *buf_block_alloc(
    buf_pool_t *buf_pool) /*!< in/out: buffer pool instance,
                          or NULL for round-robin selection
                          of the buffer pool */
{
  buf_block_t *block;
  ulint index;
  static ulint buf_pool_index;

  if (buf_pool == nullptr) {
    /* We are allocating memory from any buffer pool, ensure
    we spread the grace on all buffer pool instances. */
    index = buf_pool_index++ % srv_buf_pool_instances;
    buf_pool = buf_pool_from_array(index);
  }

  block = buf_LRU_get_free_block(buf_pool);
  buf_block_set_state(block, BUF_BLOCK_MEMORY);

  return (block);
}
Refer to the NUMA memory allocation policies documentation from Red Hat. In MySQL, we recommend using the interleave memory allocation policy on NUMA architectures. The struct set_numa_interleave_t is used to set the memory allocation policy to MPOL_INTERLEAVE.
struct set_numa_interleave_t {
  set_numa_interleave_t() {
    if (srv_numa_interleave) {
      ib::info(ER_IB_MSG_47) << "Setting NUMA memory policy to"
                                " MPOL_INTERLEAVE";
      struct bitmask *numa_nodes = numa_get_mems_allowed();
      if (set_mempolicy(MPOL_INTERLEAVE, numa_nodes->maskp, numa_nodes->size) !=
          0) {
        ib::warn(ER_IB_MSG_48) << "Failed to set NUMA memory"
                                  " policy to MPOL_INTERLEAVE: "
                               << strerror(errno);
      }
      numa_bitmask_free(numa_nodes);
    }
  }

  ~set_numa_interleave_t() {
    if (srv_numa_interleave) {
      ib::info(ER_IB_MSG_49) << "Setting NUMA memory policy to"
                                " MPOL_DEFAULT";
      if (set_mempolicy(MPOL_DEFAULT, nullptr, 0) != 0) {
        ib::warn(ER_IB_MSG_50) << "Failed to set NUMA memory"
                                  " policy to MPOL_DEFAULT: "
                               << strerror(errno);
      }
    }
  }
};
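The struct works as an RAII scope guard: constructing it switches the policy to MPOL_INTERLEAVE, and the destructor restores MPOL_DEFAULT when the scope ends. Roughly, buf0buf.cc exposes it through a macro when MySQL is built with libnuma support, so that buf_pool_init can enable interleaving only for the duration of buffer pool allocation. The sketch below shows the pattern; the caller function name is hypothetical, not MySQL code.

// Rough sketch of how the guard is exposed in buf0buf.cc when built with
// libnuma support; without NUMA support the macro expands to nothing.
#ifdef HAVE_LIBNUMA
#define NUMA_MEMPOLICY_INTERLEAVE_IN_SCOPE set_numa_interleave_t scoped_numa
#else
#define NUMA_MEMPOLICY_INTERLEAVE_IN_SCOPE
#endif

// Hypothetical caller: interleaving is active only inside this scope.
void allocate_large_pool() {
  NUMA_MEMPOLICY_INTERLEAVE_IN_SCOPE;
  /* ... large allocations here are spread across all allowed NUMA nodes ... */
}  // the guard's destructor restores MPOL_DEFAULT here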
In general, MySQL creates more than one buffer pool instance, and the instance count is governed by innodb_buffer_pool_instances. The number of buffer pool instances should be adjusted according to the size of the buffer pool: for systems with buffer pools in the multi-gigabyte range, dividing the buffer pool into separate instances can improve concurrency by reducing contention as different threads read and write cached pages.
The MySQL buffer pool consists of buffer blocks and control blocks; index pages, data pages, undo pages, the insert buffer, the AHI (adaptive hash index), lock information, the data dictionary, and so on all reside in the buffer pool.
The buffer pool size is set by innodb_buffer_pool_size. The size of each buffer pool instance satisfies:
size_of_an_instance = innodb_buffer_pool_size / innodb_buffer_pool_instances
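For example, with innodb_buffer_pool_size = 16G and innodb_buffer_pool_instances = 8, each instance manages 2G of the pool.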
In storage/innobase/buf/buf0buf.cc, the function buf_pool_init creates the buffer pool when MySQL starts.
/** Creates the buffer pool.
@param[in]  total_size   Size of the total pool in bytes.
@param[in]  n_instances  Number of buffer pool instances to create.
@return DB_SUCCESS if success, DB_ERROR if not enough memory or error */
dberr_t buf_pool_init(ulint total_size, ulint n_instances) {
  ulint i;
  const ulint size = total_size / n_instances;
  ...
  NUMA_MEMPOLICY_INTERLEAVE_IN_SCOPE;

  /* Usually buf_pool_should_madvise is protected by buf_pool_t::chunk_mutex-es,
  but at this point in time there is no buf_pool_t instances yet, and no risk of
  race condition with sys_var modifications or buffer pool resizing because we
  have just started initializing the buffer pool.*/
  buf_pool_should_madvise = innobase_should_madvise_buf_pool();

  buf_pool_resizing = false;

  buf_pool_ptr =
      (buf_pool_t *)ut_zalloc_nokey(n_instances * sizeof *buf_pool_ptr);

  buf_chunk_map_reg = UT_NEW_NOKEY(buf_pool_chunk_map_t());

  std::vector<dberr_t> errs;
  errs.assign(n_instances, DB_SUCCESS);

#ifdef UNIV_LINUX
  ulint n_cores = sysconf(_SC_NPROCESSORS_ONLN);

  /* Magic nuber 8 is from empirical testing on a
  4 socket x 10 Cores x 2 HT host. 128G / 16 instances
  takes about 4 secs, compared to 10 secs without this
  optimisation.. */
  if (n_cores > 8) {
    n_cores = 8;
  }
#else
  ulint n_cores = 4;
#endif /* UNIV_LINUX */

  dberr_t err = DB_SUCCESS;

  for (i = 0; i < n_instances; /* no op */) { // initialize every instance, using multi-thread.
    ulint n = i + n_cores;

    if (n > n_instances) {
      n = n_instances;
    }

    std::vector<std::thread> threads;
    std::mutex m;

    for (ulint id = i; id < n; ++id) { // create threads to do initialization concurrently.
      threads.emplace_back(std::thread(buf_pool_create, &buf_pool_ptr[id], size,
                                       id, &m, std::ref(errs[id])));
    }

    ...

    /* Do the next block of instances */
    i = n;
  }

  buf_pool_set_sizes();
  buf_LRU_old_ratio_update(100 * 3 / 8, FALSE);

  btr_search_sys_create(buf_pool_get_curr_size() / sizeof(void *) / 64);

  buf_stat_per_index =
      UT_NEW(buf_stat_per_index_t(), mem_key_buf_stat_per_index_t);

  return (DB_SUCCESS);
}
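The loop above creates the buffer pool instances in batches of n_cores threads rather than all at once (the joins are elided by the "..."). The following self-contained sketch, which is not MySQL or StoneDB code, shows the same batching pattern; a Tianmu memory pool initializer could reuse it.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Initialize n_instances objects, running at most `batch` initializers at a
// time, mirroring the batching done by buf_pool_init() above.
void init_in_batches(std::size_t n_instances, std::size_t batch,
                     const std::function<void(std::size_t)> &init_one) {
  for (std::size_t i = 0; i < n_instances;) {
    const std::size_t end = std::min(i + batch, n_instances);
    std::vector<std::thread> threads;
    for (std::size_t id = i; id < end; ++id) {
      threads.emplace_back(init_one, id);  // one thread per instance in this batch
    }
    for (auto &t : threads) {
      t.join();  // wait for the whole batch before starting the next one
    }
    i = end;  // do the next block of instances
  }
}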
The other buffer pool operation functions can be found in the same file, storage/innobase/buf/buf0buf.cc.
The change buffer is a special buffer that caches changes made to secondary index pages when those pages are not in the buffer pool. When DML operations occur, the changes are written into the change buffer and merged later, when the pages are loaded into the buffer pool by other read operations.
The change buffer improves the performance of DML operations by reorganizing random IO into sequential IO, so fewer IO operations are needed. The change buffer is part of the buffer pool in memory, and it is written to the system tablespace when MySQL is shut down.
storage/innobase/ibuf/ibuf0ibuf.cc contains all the functions related to the change buffer.
The layout of an ibuf record is as follows:
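The following is a rough summary based on the IBUF_REC_FIELD_* constants and comments in ibuf0ibuf.cc; consult the source for the authoritative layout.
- IBUF_REC_FIELD_SPACE (field 0): the tablespace ID of the page the buffered change applies to
- IBUF_REC_FIELD_MARKER (field 1): a single zero byte marking the record format
- IBUF_REC_FIELD_PAGE (field 2): the page number of the target page
- IBUF_REC_FIELD_METADATA (field 3): counter, operation type, flags, and type information for the user fields
- IBUF_REC_FIELD_USER (fields 4 and up): the fields of the buffered secondary index entry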
The log buffer is a memory object in MySQL that stores the data to be written to the log files on disk. The default size of the log buffer is 16 MB, and its contents are periodically written to disk. A large log buffer lets long transactions run without having to write redo log data to disk before they commit. Therefore, if you are running transactions that update, insert, or delete many rows, you can benefit from increasing the size of the log buffer, which reduces disk IO costs.
In storage/innobase/log/log0buf.cc, the functions for log buffer operations are defined. For example,
lsn_t log_buffer_write(log_t &log, const Log_handle &handle, const byte *str, size_t str_len, lsn_t start_lsn)
writes data to the redo log buffer.
In StoneDB version 2.0, the Tianmu in-memory engine is employed. First of all, there are two questions we should answer: where does the Tianmu in-memory engine reside, and how large should it be? As discussed above, we think the buffer pool is a good place to host the Tianmu in-memory engine: when the server boots up, we can take a part of the memory away from the buffer pool to serve as the Tianmu in-memory engine's memory pool. We think this is a good and convenient way to allocate system memory for the Tianmu in-memory engine. Having answered the first question, the second one is how much memory the Tianmu in-memory engine should take. As far as we know, the larger the buffer pool, the better the performance we can achieve, but memory capacity has its limits.
We add two system variables to control the memory usage of the Tianmu in-memory engine: TIANMU_IN_MEM_SIZE and TIANMU_IN_MEM_SIZE_PCT.
TIANMU_IN_MEM_SIZE is the memory size of the Tianmu in-memory engine. It should be less than innodb_buffer_pool_size.
TIANMU_IN_MEM_SIZE_PCT describes the Tianmu in-memory engine's memory size as a fraction of the MySQL buffer pool. The value of TIANMU_IN_MEM_SIZE_PCT should be less than 0.5, which means that the Tianmu in-memory engine SHOULD NOT take more than half of the MySQL buffer pool. In the mysqld section of my.cnf, add the following lines to configure the memory usage of the Tianmu in-memory engine.
[mysqld]
#configure the mem usage of tianmu in-memory engine
TIANMU_IN_MEM_SIZE = 100GB
TIANMU_IN_MEM_SIZE_PCT = 0.4
You can use the SHOW VARIABLES command to view the values of these variables.
The in-memory space is divided into two subpools: one for columnar data and one for metadata.
- Columnar Data Pool: stores the in-memory column-based units, which were first briefly described in #436.
- Metadata Pool: stores the metadata of the column-based units.
The size of each subpool is set by StoneDB itself, not by the user.
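As a rough illustration of this split (the names and the sizing rule below are assumptions for the sketch, not the final StoneDB implementation), the in-memory space could be carved up like this:

#include <cstddef>

// Hypothetical sketch: carve the Tianmu in-memory space into the two subpools
// described above. The 90/10 split is an arbitrary placeholder; StoneDB
// chooses the real ratio internally.
struct TianmuMemPools {
  std::size_t columnar_data_bytes;  // columnar data subpool (IMCU storage)
  std::size_t metadata_bytes;       // metadata subpool (IMCU_Header objects, etc.)
};

TianmuMemPools carve_subpools(std::size_t tianmu_in_mem_size) {
  TianmuMemPools pools{};
  pools.metadata_bytes = tianmu_in_mem_size / 10;  // placeholder: ~10% for metadata
  pools.columnar_data_bytes = tianmu_in_mem_size - pools.metadata_bytes;
  return pools;
}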
In StoneDB version 2.0, all of this data is stored in memory; the memory itself was discussed above. Now, let's dive into the details.
As in StoneDB version 1.0, all table data is stored separately in column format: one column, one file. Unlike the on-disk format, the way data is organized in memory changes slightly.
The column data subpool is divided into N partitions (known as in-memory column-based units, IMCUs), where N depends on the size of the loaded table. That means all data of a table is loaded into memory as partitions. As described in #423, if a column is defined as TIANM_COLUMN, that column is loaded into the column data pool in StoneDB version 2.0. Therefore, the partition is where the data is loaded into. Each partition stores exactly one database object (table). For example, if two tables, table_a and table_b, are loaded into Tianmu, the data of table_a is stored only in partition 1 and the data of table_b is stored only in partition 2; the data of table_a and the data of table_b are never mixed in the same partition.
Memory involves sophisticated management mechanisms, such as allocation, deallocation, and alignment, as well as the cross-NUMA access problem, so we MUST design the data organization carefully; otherwise, performance will suffer.
Each partition contains several in-memory column-based data packs (IMCDPs, also called Chunks). Every chunk contains 65,535 rows of data. How many chunks (IMCDPs) are stored in a partition depends on the total size of the table loaded into the Tianmu in-memory engine.
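As a small worked sketch (the helper name is illustrative, not StoneDB code), the number of chunks a loaded table needs follows directly from the chunk size:

#include <cstdint>

// Rows held by one IMCDP (chunk), as stated above.
constexpr std::uint64_t kRowsPerChunk = 65535;

// Number of chunks a table of row_count rows occupies in its partition
// (ceiling division), e.g. a table with 100,000 rows needs 2 chunks.
std::uint64_t chunks_for_table(std::uint64_t row_count) {
  return (row_count + kRowsPerChunk - 1) / kRowsPerChunk;
}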
- Metadata of IMCUs: before we discuss the IMCU further, we will first talk about the metadata management of IMCUs, which we consider very important. StoneDB can process gigabytes or even terabytes of data, which means there will be a large number of IMCUs. How to locate an IMCU and how to manage all these IMCUs becomes a very challenging problem.
Every IMCU has a metadata object, IMCU_Header, describing its metadata, such as its address, number of columns, number of IMCDPs, table ID, ID of the buffer pool instance, and so on.
typedef struct IMCU_Header_t {
  ...
  int64_t id;             // id of this IMCU
  OID table_id;           // table this IMCU belongs to
  short num_of_columns;   // number of columns stored in this IMCU
  int id_of_buffer_pool;  // id of the buffer pool instance backing this IMCU
  ...
} IMCU_Header;
All IMCU_Header objects reside in the metadata pool described in section 4.3.2.
- In-memory column-based Data Pack (IMCDP)
Data in the in-memory column-oriented store is organized hierarchically. At the top is the IMCU; an IMCU contains some IMCDPs, and an IMCDP contains some Data Packs. In short, Data Packs make up an IMCDP, IMCDPs make up an IMCU, and IMCUs make up the Tianmu in-memory column-oriented store.
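A minimal sketch of this hierarchy follows; the type and member names are illustrative assumptions, not the actual StoneDB definitions.

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative sketch of the hierarchy described above.
struct DataPack {
  std::vector<std::uint8_t> encoded_values;  // encoded/compressed values of one column
};

struct IMCDP {                  // one chunk: up to 65,535 rows
  std::vector<DataPack> packs;  // one Data Pack per loaded column
};

struct IMCU {                   // one partition of a loaded table
  IMCU_Header *header;          // metadata object, kept in the metadata subpool (see above)
  std::vector<IMCDP> chunks;
};

// The Tianmu in-memory column-oriented store is, in effect, the collection of
// IMCUs belonging to all loaded tables.
using TianmuInMemStore = std::vector<std::unique_ptr<IMCU>>;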