From 7cd9d9ddd6d750d3f32d51d3f24c45edb73250f0 Mon Sep 17 00:00:00 2001 From: Xinyi Zou Date: Fri, 3 Nov 2023 22:44:25 +0800 Subject: [PATCH] [docs](cache) Refactor query-cache docs --- .../advanced/cache/partition-cache-manual.md | 160 +++++++++++ .../en/docs/advanced/cache/partition-cache.md | 268 ------------------ docs/en/docs/advanced/cache/query-cache.md | 118 ++++++++ .../docs/advanced/cache/sql-cache-manual.md | 74 +++++ .../advanced/cache/partition-cache-manual.md | 160 +++++++++++ .../docs/advanced/cache/partition-cache.md | 261 ----------------- docs/zh-CN/docs/advanced/cache/query-cache.md | 146 ++++++++++ .../docs/advanced/cache/sql-cache-manual.md | 74 +++++ 8 files changed, 732 insertions(+), 529 deletions(-) create mode 100644 docs/en/docs/advanced/cache/partition-cache-manual.md delete mode 100644 docs/en/docs/advanced/cache/partition-cache.md create mode 100644 docs/en/docs/advanced/cache/sql-cache-manual.md create mode 100644 docs/zh-CN/docs/advanced/cache/partition-cache-manual.md delete mode 100644 docs/zh-CN/docs/advanced/cache/partition-cache.md create mode 100644 docs/zh-CN/docs/advanced/cache/query-cache.md create mode 100644 docs/zh-CN/docs/advanced/cache/sql-cache-manual.md diff --git a/docs/en/docs/advanced/cache/partition-cache-manual.md b/docs/en/docs/advanced/cache/partition-cache-manual.md new file mode 100644 index 000000000000000..b88fa65b46ba5cc --- /dev/null +++ b/docs/en/docs/advanced/cache/partition-cache-manual.md @@ -0,0 +1,160 @@ +--- +{ + "title": "Partition Cache", + "language": "en" +} +--- + + + +# Partition Cache + +Cache hits can occur when multiple SQLs use the same table partition. + +``` +**Partition Cache is an experimental feature and is not well maintained. Use it with caution** +``` + +## Demand scenarios & solutions + +See query-cache.md. + +## Design principles + +1. SQL can be split in parallel, Q = Q1 ∪ Q2 ... ∪ Qn, R= R1 ∪ R2 ... ∪ Rn, Q is the query statement, R is the result set + +2. 
Split into read-only partitions and updatable partitions: read-only partitions are cached, updatable partitions are not.

Take the query above, the number of users per day over the last 7 days, on a table partitioned by date: data is written only to the current day's partition, and the data in all other partitions is fixed. For the same query SQL, the indicators read from any partition that is no longer updated are therefore also fixed. For example, when the last 7 days are queried on 2020-03-09, the data for 2020-03-03 through 2020-03-07 comes from the cache; 2020-03-08 comes from the partition on the first query and from the cache afterwards; and 2020-03-09 always comes from the partition, because that day's data is still being written.

Therefore, when querying N days of data of which only the most recent D days are still being updated, daily queries that differ only in their date range need to read just D partitions, with the rest served from the cache. This can effectively reduce cluster load and query time.

Some additional restrictions:

- Only grouping by the partition field is supported, not grouping by other fields. Data in groups formed on other fields may be updated, which would invalidate the whole cache.

- Cache hits are only supported for the first half, the second half, or all of the result set. A result set that the cached data would split into several segments is not supported.
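The read-only/updatable split above can be sketched as follows. This is an illustrative model only, not Doris code: the function and variable names (`split_partitions`, `fetch` results) are hypothetical, and the counts are taken from the example in this document.

```python
from datetime import date, timedelta

def split_partitions(query_dates, updatable_dates, cache):
    """Return (dates served from cache, dates that must be read from BE).

    A partition is served from cache only if it is read-only (not in the
    updatable set) and a cached entry for it already exists.
    """
    from_cache, from_be = [], []
    for d in query_dates:
        if d not in updatable_dates and d in cache:
            from_cache.append(d)
        else:
            from_be.append(d)
    return from_cache, from_be

# Query 2020-03-03 .. 2020-03-09 on 2020-03-09; the last two days may still
# receive writes, everything older is read-only and already cached.
days = [date(2020, 3, 3) + timedelta(i) for i in range(7)]
cache = {d: 5 * (i + 3) for i, d in enumerate(days[:5])}  # 15, 20, 25, 30, 35
updatable = {date(2020, 3, 8), date(2020, 3, 9)}

cached, be = split_partitions(days, updatable, cache)
result = {d: cache[d] for d in cached}
# Pretend BE returned fresh counts for the remaining D=2 partitions.
result.update({date(2020, 3, 8): 40, date(2020, 3, 9): 25})
```

Only the two updatable partitions hit the backend; the merged result covers all seven days, matching the flow described in this document.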

## Usage

Make sure cache_enable_partition_mode=true in fe.conf (default is true)

```text
vim fe/conf/fe.conf
cache_enable_partition_mode=true
```

Set variables in MySQL command line

```sql
MySQL [(none)]> set [global] enable_partition_cache=true;
```

If both caching strategies are enabled at the same time, pay attention to the following parameter:

```text
cache_last_version_interval_second=900
```

If the time elapsed since the partition's latest version is greater than cache_last_version_interval_second, the entire query result is cached first. If it is within this interval and the query meets the conditions of PartitionCache, PartitionCache is used.

## Related parameters

For a detailed parameter introduction, see query-cache.md

## Implementation principle example

```sql
MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
+----------------+-----------------+
| eventdate | count(`userid`) |
+----------------+-----------------+
| 2020-03-03 | 15 |
| 2020-03-04 | 20 |
| 2020-03-05 | 25 |
| 2020-03-06 | 30 |
| 2020-03-07 | 35 |
| 2020-03-08 | 40 | //The first time comes from the partition, subsequent reads come from the cache
| 2020-03-09 | 25 | //From the partition
+----------------+-----------------+
7 rows in set (0.02 sec)
```

In PartitionCache, the cached first-level key is the 128-bit MD5 signature of the SQL after removing the partition conditions. The following is the rewritten SQL to be signed:

```sql
SELECT eventdate,count(userid) FROM testdb.appevent GROUP BY eventdate ORDER BY eventdate;
```

The cached second-level key is the content of the partition field of the query result set, such as the content of the eventdate column of the query result above. 
The ancillary information attached to the second-level key is the partition's version number and version update time.

The following demonstrates the process of executing the above SQL for the first time on 2020-03-09:

1. Get data from the cache

```text
+----------------+-----------------+
| 2020-03-03 | 15 |
| 2020-03-04 | 20 |
| 2020-03-05 | 25 |
| 2020-03-06 | 30 |
| 2020-03-07 | 35 |
+----------------+-----------------+
```

2. Rewrite the SQL and fetch the remaining data from BE

```sql
SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-08" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;

+----------------+-----------------+
| 2020-03-08 | 40 |
+----------------+-----------------+
| 2020-03-09 | 25 |
+----------------+-----------------+
```

3. Merge and send the final data to the client

```text
+----------------+-----------------+
| eventdate | count(`userid`) |
+----------------+-----------------+
| 2020-03-03 | 15 |
| 2020-03-04 | 20 |
| 2020-03-05 | 25 |
| 2020-03-06 | 30 |
| 2020-03-07 | 35 |
| 2020-03-08 | 40 |
| 2020-03-09 | 25 |
+----------------+-----------------+
```

4. Send the newly fetched data to the cache

```text
+----------------+-----------------+
| 2020-03-08 | 40 |
+----------------+-----------------+
```

Partition cache is suitable for tables partitioned by date, where only some partitions are updated in real time and the query SQL is relatively fixed.

The partition field can also be another field, but it must be ensured that only a small number of partitions are updated.
diff --git a/docs/en/docs/advanced/cache/partition-cache.md b/docs/en/docs/advanced/cache/partition-cache.md
deleted file mode 100644
index ccf9675c2cca6e6..000000000000000
--- a/docs/en/docs/advanced/cache/partition-cache.md
+++ /dev/null
@@ -1,268 +0,0 @@
---
{
    "title": "Partition Cache",
    "language": "en"
}
---

# Partition Cache

## Demand Scenarios

Most data analysis scenarios involve writing little and reading a lot. 
Data is written once and read many times. For example, the dimensions and indicators involved in a report are calculated once in the early morning, but the page is accessed hundreds or even thousands of times every day, so the result set is very suitable for caching. In data analysis or BI applications, the following business scenarios exist:

- **High concurrency scenario**, Doris can support high concurrency well, but a single server cannot carry too high a QPS
- **Dashboards with complex charts**, complex Dashboard or large-screen applications, where the data comes from multiple tables and each page runs dozens of queries; although each query takes only tens of milliseconds, the overall query time adds up to several seconds
- **Trend analysis**, queries over a given date range with indicators displayed by day, such as querying the trend of the number of users over the last 7 days; this type of query covers a large amount of data over a wide range, and often takes tens of seconds
- **Repeated user queries**, if the product has no anti-reload mechanism, users who refresh the page repeatedly by mistake or for other reasons submit a large number of duplicate SQL statements

In the above four scenarios, the application-layer solution is to put the query results in Redis and update the cache periodically or let the user refresh it manually, but this solution has the following problems:

- **Data inconsistency**, updates to the data cannot be detected, so users often see stale data
- **Low hit rate**, the entire query result is cached; if data is written in real time, the cache is invalidated frequently, the hit rate is low, and the system load is heavy
- **Additional cost**, introducing an external cache component adds system complexity and extra cost

## Solution

This partitioned caching strategy can solve the above problems, giving priority to ensuring data consistency. 
On this basis, the cache granularity is refined and the hit rate is improved. Therefore, it has the following characteristics: - -- Users do not need to worry about data consistency, cache invalidation is controlled by version, and the cached data is consistent with the data queried from BE -- No additional components and costs, cached results are stored in BE's memory, users can adjust the cache memory size as needed -- Implemented two caching strategies, SQLCache and PartitionCache, the latter has a finer cache granularity -- Use consistent hashing to solve the problem of BE nodes going online and offline. The caching algorithm in BE is an improved LRU - -## SQL Cache - -SQLCache stores and retrieves the cache according to the SQL signature, the partition ID of the queried table, and the latest version of the partition. The combination of the three determines a cached data set. If any one changes, such as SQL changes, such as query fields or conditions are different, or the version changes after the data is updated, the cache will not be hit. - -If multiple tables are joined, use the latest updated partition ID and the latest version number. If one of the tables is updated, the partition ID or version number will be different, and the cache will also not be hit. - -SQLCache is more suitable for T+1 update scenarios. Data is updated in the early morning. The results obtained from the BE for the first query are put into the cache, and subsequent identical queries are obtained from the cache. Real-time update data can also be used, but there may be a problem of low hit rate. You can refer to the following PartitionCache. - -## Partition Cache - -### Design Principles - -1. SQL can be split in parallel, Q = Q1 ∪ Q2 ... ∪ Qn, R= R1 ∪ R2 ... ∪ Rn, Q is the query statement, R is the result set -2. 
Split into read-only partitions and updatable partitions: read-only partitions are cached, and updatable partitions are not cached.

As above, consider querying the number of users per day over the last 7 days on a table partitioned by date: data is written only to the current day's partition, and the data in every other partition is fixed. Under the same query SQL, the indicators read from any partition that is not updated are also fixed. For example, when the last 7 days are queried on 2020-03-09, the data from 2020-03-03 to 2020-03-07 comes from the cache; 2020-03-08 comes from the partition on the first query and from the cache afterwards; and 2020-03-09 comes from the partition, because data is constantly being written that day.

Therefore, when querying N days of data of which the most recent D days are still being updated, similar daily queries that differ only in their date range need to read only D partitions; the rest comes from the cache, which can effectively reduce cluster load and query time.

```sql
MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
+------------+-----------------+
| eventdate | count(`userid`) |
+------------+-----------------+
| 2020-03-03 | 15 |
| 2020-03-04 | 20 |
| 2020-03-05 | 25 |
| 2020-03-06 | 30 |
| 2020-03-07 | 35 |
| 2020-03-08 | 40 | //First from partition, subsequent from cache
| 2020-03-09 | 25 | //from partition
+------------+-----------------+
7 rows in set (0.02 sec)
```

In PartitionCache, the first-level key of the cache is the 128-bit MD5 signature of the SQL after the partition condition is removed. 
The following is the rewritten SQL to be signed:

```sql
SELECT eventdate,count(userid) FROM testdb.appevent GROUP BY eventdate ORDER BY eventdate;
```

The cached second-level key is the content of the partition field of the query result set, such as the content of the eventdate column of the query result above, and the auxiliary information of the second-level key is the version number and version update time of the partition.

The following demonstrates the process of executing the above SQL for the first time on 2020-03-09:

1. Get data from the cache

```text
+------------+-----------------+
| 2020-03-03 | 15 |
| 2020-03-04 | 20 |
| 2020-03-05 | 25 |
| 2020-03-06 | 30 |
| 2020-03-07 | 35 |
+------------+-----------------+
```

2. Rewrite the SQL and fetch the remaining data from BE

```sql
SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-08" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;

+------------+-----------------+
| 2020-03-08 | 40 |
+------------+-----------------+
| 2020-03-09 | 25 |
+------------+-----------------+
```

3. Merge and send the final data to the client

```text
+------------+-----------------+
| eventdate | count(`userid`) |
+------------+-----------------+
| 2020-03-03 | 15 |
| 2020-03-04 | 20 |
| 2020-03-05 | 25 |
| 2020-03-06 | 30 |
| 2020-03-07 | 35 |
| 2020-03-08 | 40 |
| 2020-03-09 | 25 |
+------------+-----------------+
```

4. Send the newly fetched data to the cache

```text
+------------+-----------------+
| 2020-03-08 | 40 |
+------------+-----------------+
```

Partition cache is suitable for tables partitioned by date, where some partitions are updated in real time and the query SQL is relatively fixed.

The partition field can also be another field, but it must be ensured that only a small number of partitions are updated. 
- -### Some Restrictions - -- Only OlapTable is supported, other tables such as MySQL have no version information and cannot sense whether the data is updated -- Only supports grouping by partition field, does not support grouping by other fields, grouping by other fields, the grouped data may be updated, which will cause the cache to be invalid -- Only the first half of the result set, the second half of the result set and all cache hits are supported, and the result set is not supported to be divided into several parts by the cached data - -## How to Use - -> NOTE: -> -> In the following scenarios, the cache result is wrong -> 1. Use session variable: default_order_by_limit, sql_select_limit -> 2. Use var = cur_date(), var = random() functions that generate random values -> -> There may be other cases where the cache result is wrong, so it is recommended to enable it only in controllable scenarios such as reports. - -### Enable SQLCache - -Make sure cache_enable_sql_mode=true in fe.conf (default is true) - -```text -vim fe/conf/fe.conf -cache_enable_sql_mode=true -``` - -Setting variables in MySQL command line - -```sql -MySQL [(none)]> set [global] enable_sql_cache=true; -``` - -Note: global is a global variable, not referring to the current session variable - -### Enable Partition Cache - -Make sure cache_enable_partition_mode=true in fe.conf (default is true) - -```text -vim fe/conf/fe.conf -cache_enable_partition_mode=true -``` - -Setting variables in MySQL command line - -```sql -MySQL [(none)]> set [global] enable_partition_cache=true; -``` - -If two caching strategies are enabled at the same time, the following parameters need to be paid attention to: - -```text -cache_last_version_interval_second=900 -``` - -If the interval between the latest version of the partition is greater than cache_last_version_interval_second, the entire query result will be cached first. 
If it is within this interval and the query meets the conditions of PartitionCache, PartitionCache is used.

### Monitoring

FE monitoring items:

```text
query_table //Number of tables in Query
query_olap_table //Number of Olap tables in Query
cache_mode_sql //Identify the number of queries whose cache mode is sql
cache_hit_sql //The number of Cache hits by Query with mode sql
query_mode_partition //Identify the number of queries whose cache mode is Partition
cache_hit_partition //Number of queries hit by Partition
partition_all //All partitions scanned in Query
partition_hit //Number of partitions hit by Cache

Cache hit ratio = (cache_hit_sql + cache_hit_partition) / query_olap_table
Partition hit ratio = partition_hit / partition_all
```

BE's monitoring items:

```text
query_cache_memory_total_byte //Cache memory size
query_query_cache_sql_total_count //Number of SQL in Cache
query_cache_partition_total_count //Number of Cache partitions

SQL average data size = cache_memory_total / cache_sql_total
Partition average data size = cache_memory_total / cache_partition_total
```

Other monitoring: You can view the CPU and memory indicators of the BE node, the Query Percentile and other indicators in the Query statistics from Grafana, and adjust the Cache parameters to achieve business goals.

### Optimization Parameters

The FE configuration item cache_result_max_row_count sets the maximum number of rows of a query result set that can be put into the cache, and the FE configuration item cache_result_max_data_size sets the maximum data size of a query result set that can be put into the cache. Both can be adjusted as needed, but it is recommended not to set them too large, to avoid taking up too much memory; result sets exceeding these limits will not be cached. 
- -```text -vim fe/conf/fe.conf -cache_result_max_row_count=3000 -``` - -The maximum number of partitions in BE cache_max_partition_count refers to the maximum number of partitions corresponding to each SQL. If it is partitioned by date, it can cache data for more than 2 years. If you want to keep the cache for a longer time, please set this parameter to a larger value and modify it at the same time. Parameter of cache_result_max_row_count and cache_result_max_data_size. - -```text -vim be/conf/be.conf -cache_max_partition_count=1024 -``` - -The cache memory setting in BE consists of two parameters, query_cache_max_size and query_cache_elasticity_size (in MB). If the memory exceeds query_cache_max_size + cache_elasticity_size, it will start to clean up and control the memory to below query_cache_max_size. These two parameters can be set according to the number of BE nodes, node memory size, and cache hit rate. - -```text -query_cache_max_size_mb=256 -query_cache_elasticity_size_mb=128 -``` - -Calculation method: - -If 10000 queries are cached, each query caches 1000 rows, each row is 128 bytes, distributed on 10 BEs, then each BE requires about 128M memory (10000 * 1000 * 128/10). - -## Unfinished Matters - -- Can the data of T+1 also be cached by Partition? Currently not supported -- Similar SQL, 2 indicators were queried before, but now 3 indicators are queried. Can the cache of 2 indicators be used? Not currently supported -- Partition by date, but need to aggregate data by week dimension, is PartitionCache available? Not currently supported diff --git a/docs/en/docs/advanced/cache/query-cache.md b/docs/en/docs/advanced/cache/query-cache.md index d6fa122b869cd81..b41b32a8011792b 100644 --- a/docs/en/docs/advanced/cache/query-cache.md +++ b/docs/en/docs/advanced/cache/query-cache.md @@ -26,3 +26,121 @@ under the License. # Query Cache +## Demand scenario + +Most data analysis scenarios are to write less and read more. 
Data is written once but read many times. For example, the dimensions and indicators involved in a report are calculated once in the early morning, but the page is accessed hundreds or even thousands of times a day, so the result set is very suitable for caching. In data analysis or BI applications, the following business scenarios exist:

- **High concurrency scenario**, Doris can support high concurrency well, but a single server cannot carry too high a QPS
- **Complex chart dashboards**, complex Dashboard or large-screen applications, where data comes from multiple tables and each page runs dozens of queries. Although each query only takes tens of milliseconds, the overall query time adds up to several seconds.
- **Trend analysis**, queries over a given date range with indicators displayed on a daily basis, such as querying the trend of the number of users over the last 7 days. This type of query covers a large amount of data over a wide range, and often takes tens of seconds.
- **Repeated user queries**, if the product has no anti-refresh mechanism, users who repeatedly refresh the page by mistake or for other reasons submit a large number of duplicate SQL statements.

In the above four scenarios, the application-layer solution is to put the query results into Redis and periodically update the cache or have the user refresh it manually. However, this solution has the following problems:

- **Inconsistent data**, data updates cannot be detected, so users often see stale data
- **Low hit rate**, the entire query result is cached; if data is written in real time, the cache is invalidated frequently, the hit rate is low, and the system load is heavy.
- **Additional cost**, introducing an external cache component adds system complexity and extra cost. 

## Solution

This partition cache strategy can solve the above problems, giving priority to ensuring data consistency; on that basis it refines the cache granularity and improves the hit rate, so it has the following characteristics:

- Users do not need to worry about data consistency. Cache invalidation is controlled through versioning, and the cached data is consistent with the data queried from BE.
- There are no additional components or costs: the cache results are stored in BE's memory, and users can adjust the cache memory size as needed
- Two caching strategies are implemented, SQLCache and PartitionCache; the latter has a finer cache granularity
- Consistent hashing is used to solve the problem of BE nodes going online and offline; the caching algorithm in BE is an improved LRU

## Usage scenarios

Two methods are currently supported, SQL Cache and Partition Cache, for both the OlapTable internal table and the Hive external table.

SQL Cache: only SQL statements that are textually identical will hit the cache. For details, see: sql-cache-manual.md

Partition Cache: multiple SQLs can hit the cache by sharing the same table partitions, so it has a higher hit rate than SQL Cache. 
For details, see: partition-cache-manual.md + +## Monitoring + +FE monitoring items: + +```text +query_table //The number of tables in Query +query_olap_table //The number of Olap tables in Query +cache_mode_sql //Identify the number of Query whose cache mode is sql +cache_hit_sql //The number of Query hits in Cache with mode sql +query_mode_partition //The number of queries that identify the cache mode as Partition +cache_hit_partition //The number of Query hits through Partition +partition_all //All partitions scanned in Query +partition_hit //Number of partitions hit through Cache + +Cache hit rate = (cache_hit_sql + cache_hit_partition) / query_olap_table +Partition hit rate = partition_hit / partition_all +``` + +BE monitoring items: + +```text +query_cache_memory_total_byte //Cache memory size +query_query_cache_sql_total_count //The number of SQL cached +query_cache_partition_total_count //Number of Cache partitions + +SQL average data size = cache_memory_total / cache_sql_total +Partition average data size = cache_memory_total / cache_partition_total +``` + +Other monitoring: You can view the CPU and memory indicators of the BE node, Query Percentile and other indicators in the Query statistics from Grafana, and adjust the Cache parameters to achieve business goals. + +## Related parameters + +1. cache_result_max_row_count + +The maximum number of rows that the query result set can put into the cache. The default is 3000. + +```text +vim fe/conf/fe.conf +cache_result_max_row_count=3000 +``` + +2. cache_result_max_data_size + +The maximum data size of the query result set placed in the cache is 30M by default. It can be adjusted according to the actual situation, but it is recommended not to set it too large to avoid excessive memory usage. Result sets exceeding this size will not be cached. + +```text +vim fe/conf/fe.conf +cache_result_max_data_size=31457280 +``` + +3. 
cache_last_version_interval_second

The minimum time interval between the latest version of the queried partition and now. Only the results of queries on partitions whose latest version is older than this interval, i.e. partitions that have not been updated recently, will be cached. The default is 900, in seconds.

```text
vim fe/conf/fe.conf
cache_last_version_interval_second=900
```

4. query_cache_max_size_mb and query_cache_elasticity_size_mb

query_cache_max_size_mb is the upper memory limit of the cache, and query_cache_elasticity_size_mb is the extra memory the cache may temporarily grow into. When the total cache size on BE exceeds query_cache_max_size_mb + query_cache_elasticity_size_mb, cleanup starts and memory is brought back below query_cache_max_size_mb.

These two parameters can be set according to the number of BE nodes, node memory size, and cache hit rate. Calculation method: if 10,000 Queries are cached, each Query caches 1,000 rows, each row is 128 bytes, and they are distributed on 10 BEs, then each BE requires about 128M of memory (10,000 * 1,000 * 128 / 10).

```text
vim be/conf/be.conf
query_cache_max_size_mb=256
query_cache_elasticity_size_mb=128
```

5. cache_max_partition_count

A parameter unique to Partition Cache: the maximum number of partitions cached on BE, i.e. the maximum number of partitions corresponding to each SQL. If the table is partitioned by date, the default can cache more than 2 years of data. If you want to keep the cache longer, set this parameter larger, and adjust cache_result_max_row_count and cache_result_max_data_size at the same time.

```text
vim be/conf/be.conf
cache_max_partition_count=1024
```

## Unfinished business

- Can T+1 data also be cached using Partition? Currently not supported.
- For similar SQL, where 2 indicators were queried before and 3 indicators are queried now, can the cache of the 2 indicators be used? Currently not supported.
- Partition by date, but the data needs to be summarized by week dimension. 
Can PartitionCache be used? Not currently supported diff --git a/docs/en/docs/advanced/cache/sql-cache-manual.md b/docs/en/docs/advanced/cache/sql-cache-manual.md new file mode 100644 index 000000000000000..1966ce68b3b7b88 --- /dev/null +++ b/docs/en/docs/advanced/cache/sql-cache-manual.md @@ -0,0 +1,74 @@ +--- +{ + "title": "SQL Cache", + "language": "en" +} +--- + + + +# SQL Cache + +The SQL statement will hit the cache if it is completely consistent. + +## Demand scenarios & solutions + +See query-cache.md. + +## Design principles + +SQLCache stores and obtains the cache based on the SQL signature, the partition ID of the queried table, and the latest version of the partition. The combination of the three determines a cached data set. If any one of them changes, such as the SQL changes, the query fields or conditions are different, or the version changes after the data is updated, the cache will not be hit. + +If multiple tables are joined, the most recently updated partition ID and latest version number are used. If one of the tables is updated, the partition ID or version number will be different, and the cache will not be hit. + +SQLCache is more suitable for T+1 update scenarios. Data is updated in the early morning. The first query obtains the results from BE and puts them into the cache. Subsequent queries of the same nature obtain the results from the cache. Real-time update data can also be used, but there may be a problem of low hit rate. + +Currently supports OlapTable internal table and Hive external table. + +## Usage + +Make sure cache_enable_sql_mode=true in fe.conf (default is true) + +```text +vim fe/conf/fe.conf +cache_enable_sql_mode=true +``` + +Set variables in MySQL command line + +```sql +MySQL [(none)]> set [global] enable_sql_cache=true; +``` + +Note: global is a global variable and does not refer to the current session variable. 
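To make the design principle above concrete, here is a minimal sketch of how the three factors — SQL signature, partition ID, and partition version — can combine into a single cache key so that a change to any one of them misses the cache. This is an assumption-laden illustration, not the actual Doris implementation; the key layout and function name are invented for this example.

```python
import hashlib

def sql_cache_key(sql: str, partition_id: int, version: int) -> str:
    """Hypothetical SQLCache key: any change to the SQL text, the queried
    partition, or the partition's version yields a different key, so the
    previously cached result set is not hit."""
    payload = f"{sql}|{partition_id}|{version}"
    return hashlib.md5(payload.encode()).hexdigest()  # 128-bit MD5 signature

k1 = sql_cache_key("SELECT count(*) FROM t", 1001, 7)
k2 = sql_cache_key("SELECT count(*) FROM t", 1001, 8)  # data updated -> new version
assert k1 != k2  # the stale cached entry can no longer be hit
```

This mirrors the behavior described above: after a data load bumps the partition version, the first query goes to BE and repopulates the cache under the new key.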
+
+## Cache conditions
+
+After the first execution of a query, its result will be cached if the following three conditions are met:
+
+1. (Current time - the last update time of the queried partition) is greater than cache_last_version_interval_second in fe.conf.
+
+2. The number of rows in the query result is less than cache_result_max_row_count in fe.conf.
+
+3. The size of the query result in bytes is less than cache_result_max_data_size in fe.conf.
+
+## Related parameters
+
+For a detailed description of the parameters, see query-cache.md.
diff --git a/docs/zh-CN/docs/advanced/cache/partition-cache-manual.md b/docs/zh-CN/docs/advanced/cache/partition-cache-manual.md
new file mode 100644
index 000000000000000..0896df49d950149
--- /dev/null
+++ b/docs/zh-CN/docs/advanced/cache/partition-cache-manual.md
@@ -0,0 +1,160 @@
+---
+{
+    "title": "Partition Cache",
+    "language": "zh-CN"
+}
+---
+
+
+
+# Partition Cache
+
+多个 SQL 使用相同的表分区时可命中缓存。
+
+```
+**Partition Cache是个试验性功能,没有得到很好的维护,请谨慎使用**
+```
+
+## 需求场景 & 解决方案
+
+见 query-cache.md。
+
+## 设计原理
+
+1. SQL可以并行拆分,Q = Q1 ∪ Q2 ... ∪ Qn,R = R1 ∪ R2 ... ∪ Rn,Q为查询语句,R为结果集
+
+2. 拆分为只读分区和可更新分区:只读分区缓存,可更新分区不缓存
+
+例如,查询最近7天的每天用户数,如按日期分区,数据只写当天分区,当天之外的其他分区的数据都是固定不变的。在相同的查询SQL下,查询某个不更新分区的指标都是固定的。如下,在2020-03-09当天查询前7天的用户数:2020-03-03至2020-03-07的数据来自缓存;2020-03-08第一次查询来自分区,后续的查询来自缓存;2020-03-09因为当天在不停写入,所以来自分区。
+
+因此,查询N天的数据,其中最近D天的数据会更新,每天只是日期范围不一样的相似查询,只需要查询D个分区即可,其他部分都来自缓存,可以有效降低集群负载,减少查询时间。
+
+此外还有一些限制:
+
+- 只支持按分区字段分组,不支持按其他字段分组。按其他字段分组时,该分组数据都有可能被更新,会导致缓存全部失效
+
+- 只支持结果集的前半部分、后半部分以及全部命中缓存,不支持结果集被缓存数据分割成几个部分
+
+## 使用方式
+
+确保 fe.conf 的 cache_enable_partition_mode=true(默认是true)
+
+```text
+vim fe/conf/fe.conf
+cache_enable_partition_mode=true
+```
+
+在MySQL命令行中设置变量
+
+```sql
+MySQL [(none)]> set [global] enable_partition_cache=true;
+```
+
+如果同时开启了两种缓存策略,需要注意下面的参数:
+
+```text
+cache_last_version_interval_second=900
+```
+
+如果分区的最新版本的时间离现在的间隔大于cache_last_version_interval_second,则会优先把整个查询结果缓存;如果小于这个间隔,且符合PartitionCache的条件,则按PartitionCache方式缓存数据。
+
+## 相关参数
+
+具体参数介绍见 query-cache.md。
+
+## 实现原理示例
+
+```sql
+MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
++------------+-----------------+
+| eventdate  | count(`userid`) |
++------------+-----------------+
+| 2020-03-03 | 15 |
+| 2020-03-04 | 20 |
+| 2020-03-05 | 25 |
+| 2020-03-06 | 30 |
+| 2020-03-07 | 35 |
+| 2020-03-08 | 40 | //第一次来自分区,后续来自缓存
+| 2020-03-09 | 25 | //来自分区
++------------+-----------------+
+7 rows in set (0.02 sec)
+```
+
+在PartitionCache中,缓存第一级Key是去掉了分区条件后的SQL的128位MD5签名,下面是改写后的待签名的SQL:
+
+```sql
+SELECT eventdate,count(userid) FROM testdb.appevent GROUP BY eventdate ORDER BY eventdate;
+```
+
+缓存的第二级Key是查询结果集的分区字段的内容,比如上面查询结果的eventdate列的内容;二级Key的附属信息是分区的版本号和版本更新时间。
+
+下面演示上面SQL在2020-03-09当天第一次执行的流程:
+
+1. 从缓存中获取数据
+
+```text
++------------+-----------------+
+| 2020-03-03 | 15 |
+| 2020-03-04 | 20 |
+| 2020-03-05 | 25 |
+| 2020-03-06 | 30 |
+| 2020-03-07 | 35 |
++------------+-----------------+
+```
+
+2. 从BE中获取数据的SQL和数据
+
+```sql
+SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-08" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
+
++------------+-----------------+
+| 2020-03-08 | 40 |
++------------+-----------------+
+| 2020-03-09 | 25 |
++------------+-----------------+
+```
+
+3. 最后发送给终端的数据
+
+```text
++------------+-----------------+
+| eventdate  | count(`userid`) |
++------------+-----------------+
+| 2020-03-03 | 15 |
+| 2020-03-04 | 20 |
+| 2020-03-05 | 25 |
+| 2020-03-06 | 30 |
+| 2020-03-07 | 35 |
+| 2020-03-08 | 40 |
+| 2020-03-09 | 25 |
++------------+-----------------+
+```
+
+4. 发送给缓存的数据
+
+```text
++------------+-----------------+
+| 2020-03-08 | 40 |
++------------+-----------------+
+```
+
+Partition缓存适合按日期分区、部分分区实时更新、查询SQL较为固定的场景。
+
+分区字段也可以是其他字段,但是需要保证只有少量分区更新。
diff --git a/docs/zh-CN/docs/advanced/cache/partition-cache.md b/docs/zh-CN/docs/advanced/cache/partition-cache.md
deleted file mode 100644
index 778c27a37ad4cf2..000000000000000
--- a/docs/zh-CN/docs/advanced/cache/partition-cache.md
+++ /dev/null
@@ -1,261 +0,0 @@
----
-{
-    "title": "分区缓存",
-    "language": "zh-CN"
-}
----
-
-
-
-# 分区缓存
-
-## 需求场景
-
-大部分数据分析场景是写少读多,数据写入一次,多次频繁读取,比如一张报表涉及的维度和指标,数据在凌晨一次性计算好,但每天有数百甚至数千次的页面访问,因此非常适合把结果集缓存起来。在数据分析或BI应用中,存在下面的业务场景:
-
-- **高并发场景**,Doris可以较好的支持高并发,但单台服务器无法承载太高的QPS
-- **复杂图表的看板**,复杂的Dashboard或者大屏类应用,数据来自多张表,每个页面有数十个查询,虽然每个查询只有数十毫秒,但是总体查询时间会在数秒
-- **趋势分析**,给定日期范围的查询,指标按日显示,比如查询最近7天内的用户数的趋势,这类查询数据量大,查询范围广,查询时间往往需要数十秒
-- **用户重复查询**,如果产品没有防重刷机制,用户因手误或其他原因重复刷新页面,导致提交大量的重复的SQL
-
-以上四种场景,在应用层的解决方案,把查询结果放到Redis中,周期性的更新缓存或者用户手工刷新缓存,但是这个方案有如下问题:
-
-- **数据不一致**,无法感知数据的更新,导致用户经常看到旧的数据
-- **命中率低**,缓存整个查询结果,如果数据实时写入,缓存频繁失效,命中率低且系统负载较重
-- **额外成本**,引入外部缓存组件,会带来系统复杂度,增加额外成本
-
-## 解决方案
-
-本分区缓存策略可以解决上面的问题,优先保证数据一致性,在此基础上细化缓存粒度,提升命中率,因此有如下特点:
-
-- 用户无需担心数据一致性,通过版本来控制缓存失效,缓存的数据和从BE中查询的数据是一致的
-- 没有额外的组件和成本,缓存结果存储在BE的内存中,用户可以根据需要调整缓存内存大小
-- 实现了两种缓存策略,SQLCache和PartitionCache,后者缓存粒度更细
-- 用一致性哈希解决BE节点上下线的问题,BE中的缓存算法是改进的LRU
-
-## SQLCache
-
-SQLCache按SQL的签名、查询的表的分区ID、分区最新版本来存储和获取缓存。三者组合确定一个缓存数据集,任何一个变化了,如SQL有变化,如查询字段或条件不一样,或数据更新后版本变化了,会导致命中不了缓存。
-
-如果多张表Join,使用最近更新的分区ID和最新的版本号,如果其中一张表更新了,会导致分区ID或版本号不一样,也一样命中不了缓存。
-
-SQLCache,更适合T+1更新的场景,凌晨数据更新,首次查询从BE中获取结果放入到缓存中,后续相同查询从缓存中获取。实时更新数据也可以使用,但是可能存在命中率低的问题,可以参考如下PartitionCache。
-
-## PartitionCache
-
-### 设计原理
-
-1. SQL可以并行拆分,Q = Q1 ∪ Q2 ... ∪ Qn,R= R1 ∪ R2 ... ∪ Rn,Q为查询语句,R为结果集
-2. 拆分为只读分区和可更新分区,只读分区缓存,更新分区不缓存
-
-如上,查询最近7天的每天用户数,如按日期分区,数据只写当天分区,当天之外的其他分区的数据,都是固定不变的,在相同的查询SQL下,查询某个不更新分区的指标都是固定的。如下,在2020-03-09当天查询前7天的用户数,2020-03-03至2020-03-07的数据来自缓存,2020-03-08第一次查询来自分区,后续的查询来自缓存,2020-03-09因为当天在不停写入,所以来自分区。
-
-因此,查询N天的数据,数据更新最近的D天,每天只是日期范围不一样相似的查询,只需要查询D个分区即可,其他部分都来自缓存,可以有效降低集群负载,减少查询时间。
-
-```sql
-MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-03" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
-+------------+-----------------+
-| eventdate  | count(`userid`) |
-+------------+-----------------+
-| 2020-03-03 | 15 |
-| 2020-03-04 | 20 |
-| 2020-03-05 | 25 |
-| 2020-03-06 | 30 |
-| 2020-03-07 | 35 |
-| 2020-03-08 | 40 | //第一次来自分区,后续来自缓存
-| 2020-03-09 | 25 | //来自分区
-+------------+-----------------+
-7 rows in set (0.02 sec)
-```
-
-在PartitionCache中,缓存第一级Key是去掉了分区条件后的SQL的128位MD5签名,下面是改写后的待签名的SQL:
-
-```sql
-SELECT eventdate,count(userid) FROM testdb.appevent GROUP BY eventdate ORDER BY eventdate;
-```
-
-缓存的第二级Key是查询结果集的分区字段的内容,比如上面查询结果的eventdate列的内容,二级Key的附属信息是分区的版本号和版本更新时间。
-
-下面演示上面SQL在2020-03-09当天第一次执行的流程:
-
-1. 从缓存中获取数据
-
-```text
-+------------+-----------------+
-| 2020-03-03 | 15 |
-| 2020-03-04 | 20 |
-| 2020-03-05 | 25 |
-| 2020-03-06 | 30 |
-| 2020-03-07 | 35 |
-+------------+-----------------+
-```
-
-1. 从BE中获取数据的SQL和数据
-
-```sql
-SELECT eventdate,count(userid) FROM testdb.appevent WHERE eventdate>="2020-03-08" AND eventdate<="2020-03-09" GROUP BY eventdate ORDER BY eventdate;
-
-+------------+-----------------+
-| 2020-03-08 | 40 |
-+------------+-----------------+
-| 2020-03-09 | 25 |
-+------------+-----------------+
-```
-
-1. 最后发送给终端的数据
-
-```text
-+------------+-----------------+
-| eventdate  | count(`userid`) |
-+------------+-----------------+
-| 2020-03-03 | 15 |
-| 2020-03-04 | 20 |
-| 2020-03-05 | 25 |
-| 2020-03-06 | 30 |
-| 2020-03-07 | 35 |
-| 2020-03-08 | 40 |
-| 2020-03-09 | 25 |
-+------------+-----------------+
-```
-
-1. 发送给缓存的数据
-
-```text
-+------------+-----------------+
-| 2020-03-08 | 40 |
-+------------+-----------------+
-```
-
-Partition缓存,适合按日期分区,部分分区实时更新,查询SQL较为固定。
-
-分区字段也可以是其他字段,但是需要保证只有少量分区更新。
-
-### 一些限制
-
-- 只支持OlapTable,其他存储如MySQL的表没有版本信息,无法感知数据是否更新
-- 只支持按分区字段分组,不支持按其他字段分组,按其他字段分组,该分组数据都有可能被更新,会导致缓存都失效
-- 只支持结果集的前半部分、后半部分以及全部命中缓存,不支持结果集被缓存数据分割成几个部分
-
-## 使用方式
-
-### 开启SQLCache
-
-确保fe.conf的cache_enable_sql_mode=true(默认是true)
-
-```text
-vim fe/conf/fe.conf
-cache_enable_sql_mode=true
-```
-
-在MySQL命令行中设置变量
-
-```sql
-MySQL [(none)]> set [global] enable_sql_cache=true;
-```
-
-注:global是全局变量,不加指当前会话变量
-
-### 开启PartitionCache
-
-确保fe.conf的cache_enable_partition_mode=true(默认是true)
-
-```text
-vim fe/conf/fe.conf
-cache_enable_partition_mode=true
-```
-
-在MySQL命令行中设置变量
-
-```sql
-MySQL [(none)]> set [global] enable_partition_cache=true;
-```
-
-如果同时开启了两个缓存策略,下面的参数,需要注意一下:
-
-```text
-cache_last_version_interval_second=900
-```
-
-如果分区的最新版本的时间离现在的间隔,大于cache_last_version_interval_second,则会优先把整个查询结果缓存。如果小于这个间隔,如果符合PartitionCache的条件,则按PartitionCache数据。
-
-### 监控
-
-FE的监控项:
-
-```text
-query_table          //Query中有表的数量
-query_olap_table     //Query中有Olap表的数量
-cache_mode_sql       //识别缓存模式为sql的Query数量
-cache_hit_sql        //模式为sql的Query命中Cache的数量
-query_mode_partition //识别缓存模式为Partition的Query数量
-cache_hit_partition  //通过Partition命中的Query数量
-partition_all        //Query中扫描的所有分区
-partition_hit        //通过Cache命中的分区数量
-
-Cache命中率 = (cache_hit_sql + cache_hit_partition) / query_olap_table
-Partition命中率 = partition_hit / partition_all
-```
-
-BE的监控项:
-
-```text
-query_cache_memory_total_byte      //Cache内存大小
-query_query_cache_sql_total_count  //Cache的SQL的数量
-query_cache_partition_total_count  //Cache分区数量
-
-SQL平均数据大小 = cache_memory_total / cache_sql_total
-Partition平均数据大小 = cache_memory_total / cache_partition_total
-```
-
-其他监控: 可以从Grafana中查看BE节点的CPU和内存指标,Query统计中的Query Percentile等指标,配合Cache参数的调整来达成业务目标。
-
-### 优化参数
-
-FE的配置项cache_result_max_row_count,查询结果集放入缓存的最大行数,FE的配置项cache_result_max_data_size,查询结果集放入缓存的最大数据大小,可以根据实际情况调整,但建议不要设置过大,避免过多占用内存,超过这个大小的结果集不会被缓存。
-
-```text
-vim fe/conf/fe.conf
-cache_result_max_row_count=3000
-```
-
-BE最大分区数量cache_max_partition_count,指每个SQL对应的最大分区数,如果是按日期分区,能缓存2年多的数据,假如想保留更长时间的缓存,请把这个参数设置得更大,同时修改cache_result_max_row_count和cache_result_max_data_size的参数。
-
-```text
-vim be/conf/be.conf
-cache_max_partition_count=1024
-```
-
-BE中缓存内存设置,有两个参数query_cache_max_size和query_cache_elasticity_size两部分组成(单位MB),内存超过query_cache_max_size + cache_elasticity_size会开始清理,并把内存控制到query_cache_max_size以下。可以根据BE节点数量,节点内存大小,和缓存命中率来设置这两个参数。
-
-```text
-query_cache_max_size_mb=256
-query_cache_elasticity_size_mb=128
-```
-
-计算方法:
-
-假如缓存10000个Query,每个Query缓存1000行,每行是128个字节,分布在10台BE上,则每个BE需要约128M内存
-(10000 * 1000 * 128/10)。
-
-## 未尽事项
-
-- T+1的数据,是否也可以用Partition缓存? 目前不支持
-- 类似的SQL,之前查询了2个指标,现在查询3个指标,是否可以利用2个指标的缓存? 目前不支持
-- 按日期分区,但是需要按周维度汇总数据,是否可用PartitionCache? 目前不支持
diff --git a/docs/zh-CN/docs/advanced/cache/query-cache.md b/docs/zh-CN/docs/advanced/cache/query-cache.md
new file mode 100644
index 000000000000000..5139f2dde5e6065
--- /dev/null
+++ b/docs/zh-CN/docs/advanced/cache/query-cache.md
@@ -0,0 +1,146 @@
+---
+{
+    "title": "Query Cache",
+    "language": "zh-CN"
+}
+---
+
+
+
+# Query Cache
+
+## 需求场景
+
+大部分数据分析场景是写少读多:数据写入一次,多次频繁读取,比如一张报表涉及的维度和指标,数据在凌晨一次性计算好,但每天有数百甚至数千次的页面访问,因此非常适合把结果集缓存起来。在数据分析或BI应用中,存在下面的业务场景:
+
+- **高并发场景**,Doris可以较好地支持高并发,但单台服务器无法承载太高的QPS
+- **复杂图表的看板**,复杂的Dashboard或者大屏类应用,数据来自多张表,每个页面有数十个查询,虽然每个查询只有数十毫秒,但是总体查询时间会在数秒
+- **趋势分析**,给定日期范围的查询,指标按日显示,比如查询最近7天内的用户数的趋势,这类查询数据量大,查询范围广,查询时间往往需要数十秒
+- **用户重复查询**,如果产品没有防重刷机制,用户因手误或其他原因重复刷新页面,导致提交大量重复的SQL
+
+针对以上四种场景,应用层的解决方案是把查询结果放到Redis中,周期性地更新缓存或者由用户手工刷新缓存,但是这个方案有如下问题:
+
+- **数据不一致**,无法感知数据的更新,导致用户经常看到旧的数据
+- **命中率低**,缓存整个查询结果,如果数据实时写入,缓存频繁失效,命中率低且系统负载较重
+- **额外成本**,引入外部缓存组件,会带来系统复杂度,增加额外成本
+
+## 解决方案
+
+本缓存策略可以解决上面的问题,优先保证数据一致性,在此基础上细化缓存粒度,提升命中率,因此有如下特点:
+
+- 用户无需担心数据一致性,通过版本来控制缓存失效,缓存的数据和从BE中查询的数据是一致的
+- 没有额外的组件和成本,缓存结果存储在BE的内存中,用户可以根据需要调整缓存内存大小
+- 实现了两种缓存策略,SQLCache和PartitionCache,后者缓存粒度更细
+- 用一致性哈希解决BE节点上下线的问题,BE中的缓存算法是改进的LRU
+
+## 使用场景
+
+当前支持 SQL Cache 和 Partition Cache 两种方式,支持 OlapTable 内表和 Hive 外表。
+
+SQL Cache:只有 SQL 语句完全一致才会命中缓存,详情见 sql-cache-manual.md。
+
+Partition Cache:多个 SQL 使用相同的表分区即可命中缓存,所以相比 SQL Cache 有更高的命中率,详情见 partition-cache-manual.md。
+
+## 监控
+
+FE的监控项:
+
+```text
+query_table          //Query中有表的数量
+query_olap_table     //Query中有Olap表的数量
+cache_mode_sql       //识别缓存模式为sql的Query数量
+cache_hit_sql        //模式为sql的Query命中Cache的数量
+query_mode_partition //识别缓存模式为Partition的Query数量
+cache_hit_partition  //通过Partition命中的Query数量
+partition_all        //Query中扫描的所有分区
+partition_hit        //通过Cache命中的分区数量
+
+Cache命中率 = (cache_hit_sql + cache_hit_partition) / query_olap_table
+Partition命中率 = partition_hit / partition_all
+```
+
+BE的监控项:
+
+```text
+query_cache_memory_total_byte      //Cache内存大小
+query_query_cache_sql_total_count  //Cache的SQL的数量
+query_cache_partition_total_count  //Cache分区数量
+
+SQL平均数据大小 = query_cache_memory_total_byte / query_query_cache_sql_total_count
+Partition平均数据大小 = query_cache_memory_total_byte / query_cache_partition_total_count
+```
+
+其他监控:可以从Grafana中查看BE节点的CPU和内存指标,以及Query统计中的Query Percentile等指标,配合Cache参数的调整来达成业务目标。
+
+## 相关参数
+
+1. cache_result_max_row_count
+
+查询结果集放入缓存的最大行数,默认 3000。
+
+```text
+vim fe/conf/fe.conf
+cache_result_max_row_count=3000
+```
+
+2. cache_result_max_data_size
+
+查询结果集放入缓存的最大数据大小,默认 30M(31457280 字节),可以根据实际情况调整,但建议不要设置过大,避免过多占用内存;超过这个大小的结果集不会被缓存。
+
+```text
+vim fe/conf/fe.conf
+cache_result_max_data_size=31457280
+```
+
+3. cache_last_version_interval_second
+
+缓存的查询分区最新版本离现在的最小时间间隔,只有超过这个间隔未被更新的分区,其查询结果才会被缓存。默认 900,单位秒。
+
+```text
+vim fe/conf/fe.conf
+cache_last_version_interval_second=900
+```
+
+4. query_cache_max_size_mb 和 query_cache_elasticity_size_mb
+
+query_cache_max_size_mb 是缓存的内存上限,query_cache_elasticity_size_mb 是缓存可拉伸的内存大小。BE上的缓存总大小超过 query_cache_max_size_mb + query_cache_elasticity_size_mb 后会开始清理,并把内存控制到 query_cache_max_size_mb 以下。
+
+可以根据BE节点数量、节点内存大小和缓存命中率来设置这两个参数。计算方法:假如缓存10000个Query,每个Query缓存1000行,每行是128个字节,分布在10台BE上,则每个BE需要约128M内存(10000 * 1000 * 128 / 10)。
+
+```text
+vim be/conf/be.conf
+query_cache_max_size_mb=256
+query_cache_elasticity_size_mb=128
+```
+
+5. cache_max_partition_count
+
+Partition Cache 独有的参数,指BE上每个SQL对应的最大缓存分区数。如果是按日期分区,能缓存2年多的数据;假如想保留更长时间的缓存,请把这个参数设置得更大,同时修改参数 cache_result_max_row_count 和 cache_result_max_data_size。
+
+```text
+vim be/conf/be.conf
+cache_max_partition_count=1024
+```
+
+## 未尽事项
+
+- T+1的数据,是否也可以用Partition缓存?目前不支持
+- 类似的SQL,之前查询了2个指标,现在查询3个指标,是否可以利用2个指标的缓存?目前不支持
+- 按日期分区,但是需要按周维度汇总数据,是否可用PartitionCache?目前不支持
diff --git a/docs/zh-CN/docs/advanced/cache/sql-cache-manual.md b/docs/zh-CN/docs/advanced/cache/sql-cache-manual.md
new file mode 100644
index 000000000000000..c9e6b68ff756364
--- /dev/null
+++ b/docs/zh-CN/docs/advanced/cache/sql-cache-manual.md
@@ -0,0 +1,74 @@
+---
+{
+    "title": "SQL Cache",
+    "language": "zh-CN"
+}
+---
+
+
+
+# SQL Cache
+
+SQL 语句完全一致时将命中缓存。
+
+## 需求场景 & 解决方案
+
+见 query-cache.md。
+
+## 设计原理
+
+SQLCache按SQL的签名、查询的表的分区ID、分区最新版本来存储和获取缓存。三者组合确定一个缓存数据集,任何一个发生变化,如SQL有变化(查询字段或条件不一样),或数据更新后版本变化了,都会导致命中不了缓存。
+
+如果多张表Join,使用最近更新的分区ID和最新的版本号,只要其中一张表更新了,就会导致分区ID或版本号不一样,同样命中不了缓存。
+
+SQLCache更适合T+1更新的场景:凌晨数据更新,首次查询从BE中获取结果放入缓存,后续相同查询从缓存中获取。实时更新的数据也可以使用,但是可能存在命中率低的问题。
+
+当前支持 OlapTable 内表和 Hive 外表。
+
+## 使用方式
+
+确保 fe.conf 的 cache_enable_sql_mode=true(默认是true)
+
+```text
+vim fe/conf/fe.conf
+cache_enable_sql_mode=true
+```
+
+在MySQL命令行中设置变量
+
+```sql
+MySQL [(none)]> set [global] enable_sql_cache=true;
+```
+
+注:global 是全局变量,不加则指当前会话变量
+
+## 缓存条件
+
+第一次查询后,如果满足下面三个条件,查询结果就会被缓存:
+
+1. (当前时间 - 查询的分区最后更新时间)大于 fe.conf 中的 cache_last_version_interval_second。
+
+2. 查询结果行数小于 fe.conf 中的 cache_result_max_row_count。
+
+3. 查询结果字节数小于 fe.conf 中的 cache_result_max_data_size。
+
+## 相关参数
+
+具体参数介绍见 query-cache.md。
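+
+## 使用示例
+
+下面是一个开启 SQL Cache 后重复执行同一条 SQL 的示意流程,其中的库表名 testdb.appevent 与数据仅为假设的示例:
+
+```sql
+MySQL [(none)]> set enable_sql_cache=true;
+MySQL [(none)]> SELECT eventdate,count(userid) FROM testdb.appevent GROUP BY eventdate;
+```
+
+第一次执行时结果来自BE,满足上述缓存条件后结果会写入缓存;之后再次执行完全相同的SQL,且相关分区没有新的数据导入时,结果将直接来自缓存,延迟通常明显降低。注意SQL必须逐字符一致:增减空格、改变字段顺序等任何文本差异都会生成不同的签名,从而无法命中缓存。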