Explore using zorder for the layout of some data in NDS queries #131

Open
revans2 opened this issue Oct 21, 2022 · 0 comments
Labels
performance Related to plugin performance improvements

Comments

revans2 commented Oct 21, 2022

This is a follow on issue to #130.

I did some analysis on what columns are impacted by predicate push down in various NDS queries and some swags about how many rows we might be able to skip if we could make the push down perfect.
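
As a rough illustration of that kind of estimate, here is a minimal sketch of how the skippable fraction could be computed from per-row-group min/max statistics for a range predicate. The function name, the tuple layout, and the sample numbers are all made up for illustration; this is not the analysis that was actually run.

```python
from typing import List, Tuple

# (min_value, max_value, row_count) for one row group; hypothetical layout
RowGroupStats = Tuple[float, float, int]


def skippable_fraction(row_groups: List[RowGroupStats], lo: float, hi: float) -> float:
    """Fraction of rows in row groups that cannot match lo <= col <= hi.

    A row group can be skipped only when its [min, max] range does not
    overlap the predicate range at all.
    """
    total = sum(count for _, _, count in row_groups)
    skipped = sum(
        count for mn, mx, count in row_groups if mx < lo or mn > hi
    )
    return skipped / total if total else 0.0


# Unsorted data: every row group spans the full value range, nothing is skipped.
unsorted_groups = [(0, 99, 1000)] * 4
# Sorted data: each row group covers a narrow slice of the value range.
sorted_groups = [(0, 24, 1000), (25, 49, 1000), (50, 74, 1000), (75, 99, 1000)]

unsorted_result = skippable_fraction(unsorted_groups, 30, 40)
sorted_result = skippable_fraction(sorted_groups, 30, 40)
```

With these made-up statistics, a predicate of 30 to 40 skips nothing in the unsorted layout and three of the four row groups in the sorted one.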

For a few tables, catalog_returns and inventory, only one column was ever impacted by predicate push down. I personally think that these should stay as just being sorted. We could do an experiment to see what happens if we just do range partitioning instead of sorting, but that would be a trade-off between the time to do the ingest/transcode and the time it takes to do the actual queries.

There are also a number of other small fact tables that I don't think we should look at, because they are small enough that they are almost always in a single row group anyways so there would be little to no savings.

For these others it would be nice to see what happens if we try to zorder the data. Unfortunately, out of the box we can only do this with deltalake 2.0 and above. If the numbers look good we might be able to do something similar with iceberg once we support zorder for it. We also could write our own utility that would let us do zorder how we wanted. Because this would only work for deltalake, we need to make sure that the maintenance phase does not undo the ordering that we did before. It is known to do this in some cases. We might need to do the zorder optimizations as a part of maintenance.
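
For anyone unfamiliar with the idea, here is a minimal sketch of what zordering two columns means: map each column value to its rank, interleave the bits of the ranks into a single key, and order the rows by that key. The helper names are hypothetical, and this is only the general technique, not a claim about how deltalake implements it internally.

```python
from typing import List, Tuple


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Build a Morton (z-order) key by interleaving the low `bits` bits of x and y.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def zorder_rows(rows: List[Tuple[float, float]], bits: int = 16) -> List[Tuple[float, float]]:
    """Order (a, b) rows along a z-order curve.

    Each column is first replaced by the rank of its value among the distinct
    values, so that arbitrary numeric types become small non-negative ints.
    """
    a_rank = {v: r for r, v in enumerate(sorted({a for a, _ in rows}))}
    b_rank = {v: r for r, v in enumerate(sorted({b for _, b in rows}))}
    return sorted(
        rows, key=lambda r: interleave_bits(a_rank[r[0]], b_rank[r[1]], bits)
    )


# Tiny made-up example: a 2x2 grid of (net_profit, sales_price) style values.
sample = [(20.0, 7.0), (10.0, 7.0), (20.0, 5.0), (10.0, 5.0)]
ordered = zorder_rows(sample)
```

Because neither column dominates the key, rows that are close in both columns tend to end up near each other, which is what lets min/max statistics prune on either column.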

For web_returns there were three columns that were involved in a predicate push down, but only two of them really appeared to have a decent sized impact. I would like to see a comparison for the following.

  1. web_sales
    a. zorder by ws_net_profit and ws_sales_price
    b. just order by ws_net_profit
  2. web_returns
    a. zorder by wr_return_amt and wr_returned_date_sk
    b. zorder by wr_return_amt, wr_returned_date_sk and wr_returning_addr_sk
    c. just order by wr_return_amt
  3. catalog_sales
    a. zorder by cs_ship_addr_sk and cs_net_profit
    b. zorder by cs_ship_addr_sk, cs_net_profit and cs_sold_date_sk
    c. zorder by cs_ship_addr_sk, cs_net_profit, cs_sold_date_sk and cs_bill_customer_sk
    d. just sort by cs_ship_addr_sk
  4. store_returns
    a. zorder by sr_return_amt and sr_returned_date_sk
    b. zorder by sr_return_amt, sr_returned_date_sk and sr_customer_sk
    c. zorder by sr_return_amt, sr_returned_date_sk, sr_customer_sk and sr_store_sk
    d. just sort by sr_return_amt
  5. store_sales: this one is more complicated; there are a number of different problems.
    a. There are 14 different columns that have some impact on the queries, but 14 columns is way too many for zorder to work well with.
    b. The column we care the most about ss_quantity has a low cardinality (100) which does not work well with the deltalake zorder implementation.
    c. I am not 100% sure what happens when you optimize a partitioned deltalake table with zorder, unless you optimize each partition individually, which would be a real pain to deal with.
    d. deltalake zorder only clusters the data into files. It does not actually sort the data, so unless there are multiple gigabytes of compressed data under each ss_quantity partition it is not going to show any benefit at all.
    e. Because of all of this I would like to see just one experiment: partition by ss_quantity and zorder by ss_wholesale_cost, ss_list_price and ss_coupon_amt vs. just partition by ss_quantity and order by ss_wholesale_cost, paying special attention to query 28, which is the one most likely to see a performance improvement here.
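
The zorder variants above could be driven by generated `OPTIMIZE ... ZORDER BY` statements. The sketch below only builds the SQL strings; the helper name is hypothetical, it assumes the tables are registered under their plain NDS names, and it assumes ss_quantity partition values run from 1 to 100 (the cardinality mentioned above), ignoring any null partition. The per-partition loop is one way the "optimize each partition individually" approach from 5c might look.

```python
from typing import Optional, Sequence


def optimize_zorder_sql(
    table: str,
    zorder_cols: Sequence[str],
    partition_filter: Optional[str] = None,
) -> str:
    """Build a deltalake OPTIMIZE statement that zorders by the given columns.

    partition_filter, when given, is placed in the WHERE clause and is
    expected to reference partition columns only.
    """
    if not zorder_cols:
        raise ValueError("at least one zorder column is required")
    where = f" WHERE {partition_filter}" if partition_filter else ""
    return f"OPTIMIZE {table}{where} ZORDER BY ({', '.join(zorder_cols)})"


# Experiment 3a: whole-table zorder on catalog_sales.
catalog_sales_3a = optimize_zorder_sql(
    "catalog_sales", ["cs_ship_addr_sk", "cs_net_profit"]
)

# Experiment 5e: one statement per ss_quantity partition.
# Assumption: partition values are the integers 1..100.
store_sales_5e = [
    optimize_zorder_sql(
        "store_sales",
        ["ss_wholesale_cost", "ss_list_price", "ss_coupon_amt"],
        partition_filter=f"ss_quantity = {q}",
    )
    for q in range(1, 101)
]
```

Each string could then be passed to `spark.sql(...)` on a cluster with deltalake 2.0 or above; the "just order by" baselines would instead be produced by sorting at write time.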
@mattahrens mattahrens changed the title Exlore using zored for the layout of some data in NDS quereis. Exlore using zorder for the layout of some data in NDS quereis. Oct 21, 2022
@mattahrens mattahrens changed the title Exlore using zorder for the layout of some data in NDS quereis. Exlore using zorder for the layout of some data in NDS queries Oct 21, 2022
@mattahrens mattahrens changed the title Exlore using zorder for the layout of some data in NDS queries Explore using zorder for the layout of some data in NDS queries Oct 25, 2022
@mattahrens mattahrens added the performance Related to plugin performance improvements label Oct 25, 2022