[fix](catalog) opt the count pushdown rule for iceberg/paimon/hive scan node (#44038) #45564

morningman · 2024-12-17T23:35:39Z

…an node (apache#44038)

1. Optimize the parallelism when doing count push down optimization

   Count push down optimization is used to optimize queries such as `select count(*) from table`. In this scenario, we can obtain the number of rows directly from the row count statistics of the external table, or from the metadata of the Parquet/ORC file, without reading the actual file content, thereby speeding up such queries.

   Currently, we support count push down optimization for Hive, Iceberg, and Paimon tables. There are two ways to obtain the number of rows:

   1. Obtain directly from statistics

      For Iceberg tables, we can obtain the number of rows directly from statistics. However, due to historical issues in Iceberg, if the table contains position or equality deletes, this method cannot be used, because it would return an incorrect row count. In this case, it falls back to obtaining the count from the metadata of the file.

   2. Obtain from the metadata of the file

      For Hive, Paimon, and some Iceberg tables, the number of rows can be obtained directly from the metadata of the Parquet/ORC file. For Text format tables, efficiency can also be improved by performing only row separation, without column separation.

   In the task splitting logic, for count push down optimization, the number of split tasks should take into account the file format, number of files, parallelism, number of BE nodes, and Local Shuffle:

   1. Count push down optimization should avoid Local Shuffle, so the number of split tasks should be greater than or equal to `parallelism * number of BE nodes`.

2. Fix the incorrect logic of count push down optimization

   In the previous code, count push down optimization did not take effect for Iceberg and Paimon tables, because we did not push the CountPushDown information to the FileFormatReader inside the TableFormatReader. This PR fixes this problem.

3. Store the SessionVariable in FileQueryScanNode

   SessionVariable is a variable in ConnectionContext, and ConnectionContext is a ThreadLocal variable. In FileQueryScanNode, SessionVariable may be accessed from other threads in some cases, where the ThreadLocal variable may not be available. Therefore, a reference to the SessionVariable is stored in FileQueryScanNode to prevent illegal access.

4. Extract an independent FileSplitter class

   The FileSplitter class is a utility class that allows users to split a `Split` according to different strategies. This PR does not modify the splitting strategy; it only extracts this part of the logic so that it can be optimized later.
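
Two of the points above can be illustrated together: the lower bound on the number of split tasks, and why the session settings are captured in the scan node instead of being read from a thread-local at use time. The following is a minimal, self-contained sketch and not the code from this PR. `SessionSettings`, `CURRENT`, `ScanNode`, `minSplitCount`, and `runOnWorker` are hypothetical names invented for illustration, and clamping the result to at least 1 is an assumption.

```java
import java.util.Objects;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CountPushDownSketch {
    // Hypothetical stand-in for per-connection session settings.
    static final class SessionSettings {
        final int parallelism;

        SessionSettings(int parallelism) {
            this.parallelism = parallelism;
        }
    }

    // Hypothetical stand-in for a thread-local connection context.
    static final ThreadLocal<SessionSettings> CURRENT = new ThreadLocal<>();

    // Hypothetical scan node that captures the session settings on the
    // thread that constructs it, so later calls do not touch the thread-local.
    static final class ScanNode {
        private final SessionSettings session;

        ScanNode() {
            this.session = Objects.requireNonNull(
                    CURRENT.get(), "no session bound to the constructing thread");
        }

        // Lower bound on the number of split tasks:
        // parallelism * number of BE nodes (clamped to 1 as an assumption).
        int minSplitCount(int numBackends) {
            return Math.max(1, session.parallelism * numBackends);
        }
    }

    // Runs a task on a fresh worker thread and returns its result.
    static <T> T runOnWorker(Callable<T> task) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        CURRENT.set(new SessionSettings(8));
        ScanNode node = new ScanNode();

        // The worker thread has no value bound to the thread-local...
        boolean visibleInWorker = runOnWorker(() -> CURRENT.get() != null);
        // ...but the node still works there, because it holds its own reference.
        int splits = runOnWorker(() -> node.minSplitCount(3));

        System.out.println("thread-local visible in worker: " + visibleInWorker);
        System.out.println("minimum split count: " + splits);
    }
}
```

With a parallelism of 8 and 3 BE nodes, the sketch reports a minimum of 24 split tasks, and the worker thread sees no thread-local value even though the node it uses was built from one.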

Thearas · 2024-12-17T23:35:45Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

github-actions · 2024-12-17T23:43:10Z

clang-tidy review says "All clean, LGTM! 👍"

morningman · 2024-12-17T23:44:00Z

run buildall

github-actions · 2024-12-17T23:50:55Z

clang-tidy review says "All clean, LGTM! 👍"

doris-robot · 2024-12-18T00:20:59Z

TeamCity be ut coverage result:
Function Coverage: 36.46% (9568/26241)
Line Coverage: 27.91% (78617/281650)
Region Coverage: 26.59% (40364/151803)
Branch Coverage: 23.35% (20441/87560)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6290c6e810f774c00814d36fec3e786c00ee01eb_6290c6e810f774c00814d36fec3e786c00ee01eb/report/index.html

morningman closed this Dec 17, 2024

morningman reopened this Dec 17, 2024

morningman merged commit 855e9a5 into apache:branch-2.1 Dec 18, 2024
35 of 38 checks passed
