Explore using zorder for the layout of some data in NDS queries
I did some analysis on which columns are impacted by predicate push down in various NDS queries, along with some rough estimates (SWAGs) of how many rows we might be able to skip if we could make the push down perfect.
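To make the "rows we might skip" idea concrete, here is a minimal, self-contained sketch (made-up uniform data, not the real NDS tables) of how per-row-group min/max statistics let a reader skip data, and why sorting on the filtered column matters:

```python
# Hypothetical sketch: estimate how many row groups a perfect predicate
# push down could skip, using per-row-group min/max stats (the same idea
# Parquet readers rely on). The column values and predicate are invented.
import random

def row_group_stats(values, group_size):
    """Split a column into row groups and record (min, max) per group."""
    return [
        (min(values[i:i + group_size]), max(values[i:i + group_size]))
        for i in range(0, len(values), group_size)
    ]

def groups_skipped(stats, lo, hi):
    """Count row groups whose [min, max] range cannot satisfy lo <= v <= hi."""
    return sum(1 for mn, mx in stats if mx < lo or mn > hi)

random.seed(0)
col = [random.randint(0, 10_000) for _ in range(100_000)]

unsorted_stats = row_group_stats(col, 10_000)
sorted_stats = row_group_stats(sorted(col), 10_000)

# A selective predicate: v BETWEEN 4000 AND 4500.
print(groups_skipped(unsorted_stats, 4000, 4500))  # likely 0: every group spans the range
print(groups_skipped(sorted_stats, 4000, 4500))    # most groups skipped
```

The tighter each row group's min/max range is on a filtered column, the more groups a perfect push down can skip, which is what the layout experiments below are trying to buy.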
For a few tables, catalog_returns and inventory, there was only one column that was ever impacted by predicate push down. I personally think these should stay as just being sorted. We could do an experiment to see what happens if we do range partitioning instead of sorting, but that would be a trade-off between the time to do the ingest/transcode and the time it takes to run the actual queries.
There are also a number of other small fact tables that I don't think we should look at, because they are small enough that they almost always fit in a single row group anyway, so there would be little to no savings.
For the others it would be nice to see what happens if we try to zorder the data. Unfortunately, out of the box we can only do this with Delta Lake 2.0 and above. If the numbers look good we might be able to do something similar with Iceberg once we support zorder for it. We could also write our own utility that would let us do zorder however we wanted. Because this would only work for Delta Lake, we need to make sure that the maintenance phase does not undo the ordering we did before; it is known to do this in some cases. We might need to do the zorder optimizations as part of maintenance.
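For anyone unfamiliar with what zordering actually does, here is a rough sketch of the underlying idea (bit interleaving onto a space-filling curve); this is an illustration, not Delta Lake's actual implementation:

```python
# Rough sketch of the idea behind z-ordering: interleave the bits of each
# column's value into one sort key, so rows that are close on *any* of the
# columns tend to land near each other in the final order. That keeps
# per-file min/max ranges reasonably tight on every zordered column,
# instead of only the first column of a plain sort.

def z_value(values, bits=16):
    """Interleave the bits of several non-negative ints into one sort key."""
    key = 0
    for bit in range(bits):
        for i, v in enumerate(values):
            key |= ((v >> bit) & 1) << (bit * len(values) + i)
    return key

rows = [(3, 7), (3, 6), (100, 7), (2, 7)]
rows.sort(key=lambda r: z_value(r))
print(rows)  # (100, 7) sorts last: its high bits dominate the key
```

In Delta Lake 2.0+ this is exposed through `OPTIMIZE <table> ZORDER BY (col1, col2)` rather than anything we would hand-roll, but the interleaving above is the reason multiple columns can all benefit at once.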
For web_returns there were three columns involved in a predicate push down, but only two of them appeared to have a decent-sized impact. I would like to see a comparison of the following.
web_sales
a. zorder by ws_net_profit and ws_sales_price
b. just order by ws_net_profit
web_returns
a. zorder by wr_return_amt and wr_returned_date
b. zorder by wr_return_amt, wr_returned_date, wr_returning_addr_sk
c. just order by wr_return_amt
catalog_sales
a. zorder by cs_ship_addr_sk and cs_net_profit
b. zorder by cs_ship_addr_sk, cs_net_profit and cs_sold_date_sk
c. zorder by cs_ship_addr_sk, cs_net_profit, cs_sold_date_sk and cs_bill_customer_sk
d. just sort by cs_ship_addr_sk
store_returns
a. zorder by sr_return_amt and sr_returned_date_sk
b. zorder by sr_return_amt, sr_returned_date_sk and sr_customer_sk
c. zorder by sr_return_amt, sr_returned_date_sk, sr_customer_sk and sr_store_sk
d. just sort by sr_return_amt
store_sales is more complicated; there are a number of different problems.
a. There are 14 different columns that have some impact on the queries, but 14 columns is far too many for zorder to work well with.
b. The column we care the most about, ss_quantity, has a low cardinality (100), which does not work well with the Delta Lake zorder implementation.
c. I am not 100% sure what happens when you optimize a partitioned Delta Lake table with zorder, short of optimizing each partition individually, which would be a real pain to deal with.
d. Delta Lake zorder only clusters the data into files. It does not actually sort the data, so unless there are multiple gigabytes of compressed data under each ss_quantity partition it is not going to show any benefit at all.
e. Because of all of this I would like to see just one experiment: partition by ss_quantity and zorder by ss_wholesale_cost, ss_list_price, and ss_coupon_amt vs. just partition by ss_quantity and order by ss_wholesale_cost, paying special attention to query 28, which is the one most likely to see a performance improvement here.
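Point (d) above can be illustrated with a toy simulation (assumed behavior, made-up uniform data): clustering rows into files keeps file-level min/max stats useful, but once a partition is small enough to fit in one file there is nothing left to skip.

```python
# Toy illustration, not Delta internals: OPTIMIZE-style clustering puts
# similar rows into the same *file* but does not sort rows inside a file.
# File-level min/max stats can still prune files -- unless the partition
# only has one file, in which case no pruning is possible.
import random

random.seed(42)
values = [random.randint(0, 9_999) for _ in range(40_000)]

def skippable_files(vals, files, lo, hi):
    """Range-cluster vals into `files` files, shuffle rows inside each file,
    then count files whose min/max stats rule out lo <= v <= hi."""
    ordered = sorted(vals)                       # the clustering step
    per_file = len(ordered) // files
    chunks = [ordered[i:i + per_file] for i in range(0, len(ordered), per_file)]
    for c in chunks:
        random.shuffle(c)                        # rows inside a file stay unsorted
    return sum(1 for c in chunks if max(c) < lo or min(c) > hi)

print(skippable_files(values, 8, 1_000, 1_500))  # several of 8 files skippable
print(skippable_files(values, 1, 1_000, 1_500))  # a single file: nothing skippable
```

This is why a small amount of data under each ss_quantity partition would wash out any zorder benefit entirely.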
This is a follow-on issue to #130.