Search before asking
I searched in the issues and found nothing similar.
Motivation
We have recently introduced bitmap indexes. There are some optimisations we can do when the user queries only against the bitmap-indexed columns.
Suppose we have a table usershop_behavior with a bitmap index on the gender column and a BSI index on the gmv column.
The bitmap index and the BSI index can be used not only for filtering, but also for some simple aggregations, such as:
SELECT
gender,
COUNT(*) AS total,
SUM(gmv) AS total_gmv,
AVG(gmv) AS avg_gmv
FROM usershop_behavior
GROUP BY gender;

SELECT
gender,
COUNT(*) AS total,
SUM(gmv) AS total_gmv,
AVG(gmv) AS avg_gmv
FROM usershop_behavior
WHERE gender = 'M'
GROUP BY gender;
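For intuition, here is a minimal sketch of how these aggregates could be answered from the indexes alone, assuming a RoaringBitmap posting list per gender value and one RoaringBitmap per bit slice of gmv (genderM, gmvSlices and the sample data are made up for illustration, not existing Paimon code):

import org.roaringbitmap.RoaringBitmap;

public class BitmapAggregationSketch {
    public static void main(String[] args) {
        // Rows 0..3 with gmv = 5, 3, 7, 2; rows 0 and 2 have gender = 'M'.
        RoaringBitmap genderM = RoaringBitmap.bitmapOf(0, 2);
        RoaringBitmap[] gmvSlices = {
                RoaringBitmap.bitmapOf(0, 1, 2), // bit 0 of gmv is set for rows 0, 1, 2
                RoaringBitmap.bitmapOf(1, 2, 3), // bit 1 of gmv is set for rows 1, 2, 3
                RoaringBitmap.bitmapOf(0, 2)     // bit 2 of gmv is set for rows 0, 2
        };

        // COUNT(*) WHERE gender = 'M' is just the cardinality of the posting bitmap.
        long count = genderM.getLongCardinality();

        // SUM(gmv) WHERE gender = 'M' from the BSI: SUM = sum_i 2^i * |slice_i AND genderM|.
        long sum = 0;
        for (int i = 0; i < gmvSlices.length; i++) {
            sum += (1L << i) * RoaringBitmap.and(gmvSlices[i], genderM).getLongCardinality();
        }
        double avg = count == 0 ? 0.0 : (double) sum / count;

        System.out.println(count + ", " + sum + ", " + avg); // 2, 12, 6.0
    }
}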
The BSI index can also be useful in top-k scenarios:
SELECT * FROM usershop_behavior
ORDER BY gmv DESC
LIMIT 10;

SELECT * FROM usershop_behavior
WHERE gender = 'M'
ORDER BY gmv DESC
LIMIT 10;
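A hedged sketch of the classic bit-sliced top-k selection (in the style of O'Neil & Quass) over the same kind of BSI slices; topK, filter and slices are illustrative names, not an existing Paimon API:

import org.roaringbitmap.RoaringBitmap;

public class BsiTopKSketch {
    // Returns a bitmap of (at most) k row ids with the largest values, restricted to `filter`.
    // slices[i] holds the rows whose i-th value bit is set; index 0 is the least significant bit.
    static RoaringBitmap topK(RoaringBitmap filter, RoaringBitmap[] slices, int k) {
        RoaringBitmap found = new RoaringBitmap();   // rows proven to be in the top k
        RoaringBitmap candidates = filter.clone();   // rows still undecided
        for (int i = slices.length - 1; i >= 0; i--) {   // walk slices from MSB to LSB
            RoaringBitmap x = RoaringBitmap.or(found, RoaringBitmap.and(candidates, slices[i]));
            long c = x.getLongCardinality();
            if (c > k) {
                candidates.and(slices[i]);           // too many: drop candidates without this bit
            } else if (c < k) {
                found = x;                           // everything in x is certainly in the top k
                candidates.andNot(slices[i]);        // keep looking among rows without this bit
            } else {
                return x;                            // exactly k rows found
            }
        }
        // Fill the remaining slots with arbitrary candidates (ties on the low bits).
        for (int row : candidates) {
            if (found.getLongCardinality() >= k) {
                break;
            }
            found.add(row);
        }
        return found;
    }
}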
Solution
Apache Flink and Apache Spark already provide some interfaces for this, e.g.
Apache Flink:
org.apache.flink.table.connector.source.abilities.SupportsAggregatePushDown
Apache Spark:
org.apache.spark.sql.connector.read.SupportsPushDownTopN
org.apache.spark.sql.connector.read.SupportsPushDownAggregates
When a query matches the bitmap index rules, we can rewrite the TableScan into a BitmapIndexScan.
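A rough sketch of what the Flink side could look like, assuming the SupportsAggregatePushDown signature of recent Flink versions (the class BitmapAwareTableSource and the useBitmapIndexScan flag are hypothetical, not existing Paimon code):

import java.util.List;
import org.apache.flink.table.connector.source.abilities.SupportsAggregatePushDown;
import org.apache.flink.table.expressions.AggregateExpression;
import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
import org.apache.flink.table.types.DataType;

public class BitmapAwareTableSource implements SupportsAggregatePushDown {

    private boolean useBitmapIndexScan = false;

    @Override
    public boolean applyAggregates(
            List<int[]> groupingSets,
            List<AggregateExpression> aggregateExpressions,
            DataType producedDataType) {
        // Accept the push-down only when every aggregate can be answered from the
        // bitmap/BSI index alone; otherwise let the planner keep the aggregation.
        for (AggregateExpression agg : aggregateExpressions) {
            boolean supported =
                    agg.getFunctionDefinition() == BuiltInFunctionDefinitions.COUNT
                            || agg.getFunctionDefinition() == BuiltInFunctionDefinitions.SUM
                            || agg.getFunctionDefinition() == BuiltInFunctionDefinitions.AVG;
            if (!supported || agg.isDistinct()) {
                return false;
            }
        }
        // TODO: also check that every grouping column has a bitmap index and every
        // aggregated column has a BSI index before switching the scan.
        this.useBitmapIndexScan = true; // later: rewrite TableScan to BitmapIndexScan
        return true;
    }
}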
Anything else?
Currently our indexes are designed to be used only for data skipping, and they are not as reliable as filtering on partition keys. (We can't tell the Flink and Spark engines that filtering with indexes is reliable.)
This is because creating an index is split into several steps:
1. stop the ingesting task
2. use ALTER TABLE to add the index options
3. call the rewrite index procedure
4. restart the ingesting task
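For concreteness, a hedged sketch of steps 2 and 3 from a Flink SQL session (the option key 'file-index.bitmap.columns' and the sys.rewrite_file_index procedure are what I believe Paimon exposes today; please check the docs for the exact names and arguments):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class AddBitmapIndexSketch {
    public static void main(String[] args) {
        // Step 1 (stopping the ingesting job) happens outside of this session;
        // catalog registration is omitted here.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());
        // Step 2: add the index option to the table.
        tEnv.executeSql(
                "ALTER TABLE usershop_behavior SET ('file-index.bitmap.columns' = 'gender')");
        // Step 3: rebuild the file index for existing data files.
        tEnv.executeSql("CALL sys.rewrite_file_index('default.usershop_behavior')");
        // Step 4: restart the ingesting job.
    }
}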
We need to find a way to make indexes as reliable as partition keys (e.g. throw an exception when the index is empty?).
Otherwise it is hard for our indexes to do their job.
Are you willing to submit a PR?
I'm willing to submit a PR!
This is a good idea. As for making indexes reliable, maybe we can do it with a fallback. For example, if a file does not contain the index, we can fall back to doing the computation in Paimon itself.
Sounds a little bit more complicated to implement, but it is a very good idea.
If file-index.read.enabled is true and the index can satisfy the predicate, push it down directly. If the index turns out to be empty at execution time, codegen the filter and aggregate operators and run them in the Paimon source.
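Roughly something like this per file/split, just to make the fallback concrete (every type and method below is a hypothetical placeholder, not existing Paimon code):

import java.util.ArrayList;
import java.util.List;

public class IndexFallbackSketch {

    interface FileHandle {
        boolean hasUsableIndex(); // the queried columns have a non-empty bitmap/BSI index
    }

    interface Result {}

    // Answer the pushed-down filter/aggregate from the index alone, no row reads.
    static Result answerFromIndex(FileHandle file) { return new Result() {}; }

    // Fallback: read the rows and run the codegen'd filter/aggregate in the Paimon source.
    static Result scanAndAggregate(FileHandle file) { return new Result() {}; }

    static List<Result> execute(List<FileHandle> files) {
        List<Result> results = new ArrayList<>();
        for (FileHandle file : files) {
            results.add(file.hasUsableIndex() ? answerFromIndex(file) : scanAndAggregate(file));
        }
        return results;
    }
}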