Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Using bitmap index to accelerate the query #4530

Open
1 of 2 tasks
Tan-JiaLiang opened this issue Nov 14, 2024 · 2 comments
Open
1 of 2 tasks

[Feature] Using bitmap index to accelerate the query #4530

Tan-JiaLiang opened this issue Nov 14, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Tan-JiaLiang
Copy link
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Currently we have introduced bitmap indexes. There are some optimisations that we can do when the user queries only against the bitmap indexed columns.

Suppose we have a table usershop_behavior with a bitmap index on the gender column and a bsi index on the gmv column.

CREATE TABLE usershop_behavior (
    uid BIGINT
    gender STRING,
    gmv BIGINT
) WITH (
    'file-index.bitmap.columns' = 'gender',
    'file-index.bsi.columns' = 'gmv'
);

The bitmap index and bsi index can be used not only for filtering, but also for some simple aggregation like:

SELECT 
    gender,
    COUNT(*) AS total,
    SUM(gmv) AS total_gmv,
    AVG(gmv) AS avg_gmv
FROM usershop_behavior
GROUP BY gender

SELECT 
    gender,
    COUNT(*) AS total,
    SUM(gmv) AS total_gmv,
    AVG(gmv) AS avg_gmv
FROM usershop_behavior
WHERE gender='M'
GROUP BY gender

BSI index can be useful in topk scenarios

SELECT *
FROM usershop_behavior
ORDER BY gmv DESC
LIMIT 10;

SELECT *
FROM usershop_behavior
WHERE gender='M'
ORDER BY gmv DESC
LIMIT 10;

Solution

Apache Flink and Apache Spark are already provides some interfaces. e.g.

Apache Flink:
org.apache.flink.table.connector.source.abilities.SupportsAggregatePushDown

Apache Spark:
org.apache.spark.sql.connector.read.SupportsPushDownTopN
org.apache.spark.sql.connector.read.SupportsPushDownAggregates

When queries hit the bitmap index rules, we can rewrite TableScan to BitmapIndexScan.

Anything else?

Currently our index is designed to be used only for Data Skipping and it is not as reliable as filtering using partitioned keys. (We can't tell the flink&spark engine that filtering with indexes is reliable.)

This is because creating the index is split into several steps.

  1. stop the ingesting task
  2. using alter table to add index options
  3. call rewrite index procedure
  4. restart the ingesting task

We need to find a way to make indexes reliable, like partition keys. (e.g. throw exception when index is empty?)
Otherwise it's hard for our index to do its job.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@Tan-JiaLiang Tan-JiaLiang added the enhancement New feature or request label Nov 14, 2024
@JingsongLi
Copy link
Contributor

This is a good idea, about make indexes reliable, maybe we can do it by fallback. For example, if the file does not contain index, we can fallback it to do computation, I mean, do the computation by Paimon.

@Tan-JiaLiang
Copy link
Contributor Author

Sounds a little bit more complicated to implement, but it is a very good idea.

If file-index.read.enabled is true and the index satisfies the predicate, push it down directly. If the index is empty on execution, codegen the filter and aggregate operators and use them in the paimon source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants