
[core][format] Optimize manifest reading performance, add pushdown for manifest and orc. #4497

Open · wants to merge 32 commits into base: master

Conversation

@ranxianglei (Contributor) commented Nov 11, 2024

Purpose

Optimize manifest reading performance and format object creation performance; in practical tests this reduces total manifest reading time to under 3 ms (with further optimization possible, down to under 1 ms). With metadata caching enabled, ORC pushdown enabled, and the metadata format switched to ORC, this can serve high-concurrency (QPS above 10,000), low-latency (overall RT under 50 ms) scenarios.
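
For reference, switching the metadata format to ORC as described above goes through Paimon's `manifest.format` table option (default avro). A minimal sketch, with the `Options` usage purely illustrative and not part of this PR's diff:

```java
import org.apache.paimon.options.Options;

// Minimal sketch: store manifest files as ORC instead of the default Avro so
// that ORC pushdown can apply to metadata reads as well.
Options options = new Options();
options.setString("manifest.format", "orc");
```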

Linked issue: close #xxx

Tests

API and Format

Documentation

@ranxianglei (Contributor, Author)

To be used together with #4231.

@ranxianglei ranxianglei changed the title [core][format] Optimize manifest reading performance,add pushdown for manifest . [core][format] Optimize manifest reading performance,add pushdown for manifest and orc. Nov 12, 2024
@ranxianglei ranxianglei reopened this Nov 13, 2024
@ranxianglei ranxianglei reopened this Nov 15, 2024
@ranxianglei ranxianglei reopened this Nov 18, 2024
@ranxianglei (Contributor, Author)

Note: since the cache code related to manifest and FileFormat has been withdrawn from this PR and deferred to the next PR, this PR alone does not yet reach the performance described in the Purpose.


return Optional.empty();
FileFormatFactory fileFormatFactory =
FactoryUtil.discoverFactory(
@JingsongLi (Contributor)

Can you just create a PR for FileFormatFactory?

@ranxianglei (Contributor, Author)

@JingsongLi Of course, but I'll change it in a few days; I've been a bit busy lately.

* Read the corresponding entries for the currently required buckets, pushing the filter down
* into the file format.
*/
private static List<Predicate> createPushDownFilter(Collection<Integer> buckets) {
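
As an aside for readers of this hunk, here is a minimal sketch of what a bucket pushdown filter of this shape could look like. The row type and bucket column index below are assumptions for illustration (the real schema comes from ManifestEntry), not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.paimon.predicate.Predicate;
import org.apache.paimon.predicate.PredicateBuilder;
import org.apache.paimon.types.DataTypes;
import org.apache.paimon.types.RowType;

public class BucketPushDownSketch {

    // Hypothetical stand-in for the manifest entry row type; in Paimon the
    // real schema (and the bucket column's position) comes from ManifestEntry.
    private static final RowType ENTRY_TYPE =
            RowType.of(DataTypes.TINYINT(), DataTypes.BYTES(), DataTypes.INT());
    private static final int BUCKET_FIELD_INDEX = 2; // assumption for this sketch

    static List<Predicate> createPushDownFilter(Collection<Integer> buckets) {
        if (buckets == null || buckets.isEmpty()) {
            return null; // no filter: read all entries
        }
        // bucket IN (...) lets an ORC reader skip stripes/row groups whose
        // bucket statistics cannot match any requested bucket.
        PredicateBuilder builder = new PredicateBuilder(ENTRY_TYPE);
        Predicate inBuckets =
                builder.in(BUCKET_FIELD_INDEX, new ArrayList<Object>(buckets));
        return Collections.singletonList(inBuckets);
    }
}
```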
@Aitozi (Contributor)

Does the query performance gain come mainly from pushing the bucket field down into the ORC manifest file?

@ranxianglei (Contributor, Author)

More than half of the improvement comes from the ORC pushdown on the manifest; another part comes from optimizing OrcFileFormat creation, and the rest from caching some expensive object operations in Scan. @Aitozi

@ranxianglei (Contributor, Author)

Together with #4231, bucket data is read with ORC pushdown. For test results see issue #4586; the current ORC implementation is more than 10x faster than Parquet. @Aitozi

entryType,
fileFormat.createReaderFactory(entryType),
fileFormat.createReaderFactory(entryType, filters),
@Aitozi (Contributor) commented Nov 28, 2024

If we enable the reader filter and the manifest cache, will we miss data from other buckets when reading data from bucket-x? Previously, data was stored in ObjectCache after passing through the loadFilter, but now it must pass through this filter first.

@ranxianglei (Contributor, Author)

If ObjectCache is enabled and withBuckets pushdown is used, the problem you mention can indeed occur. I originally planned to add a Filter condition to ObjectCache, but the change was too complicated and I didn't have the time, so for now I only did the withBuckets pushdown. In most scenarios there is no problem: in Flink or Spark, as far as I can see, withBuckets is never called at all; and in OLAP queries where the corresponding bucket is read per segment, the bucket-to-segment mapping stays fixed, so no problem arises.
Were it not for this consideration, I would suggest pushing the partition down as well.
If you feel the risk is too great, you can even turn off the manifest's metadata cache; performance still improves significantly. @Aitozi
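
For what it's worth, one way the "Filter condition on ObjectCache" idea above could avoid the cross-bucket problem is to fold the bucket set into the cache key, so filtered and unfiltered reads never share an entry. Everything below is a hypothetical sketch, not Paimon's ObjectCache API:

```java
import java.util.Objects;
import java.util.Set;

// Hypothetical composite cache key: the same manifest file read with different
// bucket filters lands in different cache slots, so a bucket-filtered result
// can never be served to a read that wants other buckets.
final class ManifestCacheKey {
    private final String manifestPath;
    private final Set<Integer> buckets; // null = no bucket filter (full read)

    ManifestCacheKey(String manifestPath, Set<Integer> buckets) {
        this.manifestPath = manifestPath;
        this.buckets = buckets;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ManifestCacheKey)) {
            return false;
        }
        ManifestCacheKey that = (ManifestCacheKey) o;
        return manifestPath.equals(that.manifestPath)
                && Objects.equals(buckets, that.buckets);
    }

    @Override
    public int hashCode() {
        return Objects.hash(manifestPath, buckets);
    }
}
```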

@Aitozi (Contributor)

Thanks for your explanation. If we cannot handle the pushdown correctly when the cache is enabled, I think we can disable the filter pushdown whenever the cache is enabled.
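
A minimal sketch of that guard (the `manifestCacheEnabled` flag is hypothetical, and `createPushDownFilter` refers to the illustrative sketch earlier in this thread):

```java
import java.util.Collection;
import java.util.List;

import org.apache.paimon.predicate.Predicate;

class PushDownGuardSketch {
    // Sketch: never push bucket filters into the file format while the
    // manifest cache is on, so cached entries are never a filtered subset.
    static List<Predicate> effectiveFilters(
            boolean manifestCacheEnabled, Collection<Integer> buckets) {
        return manifestCacheEnabled
                ? null // load everything; filter in memory after the cache
                : BucketPushDownSketch.createPushDownFilter(buckets);
    }
}
```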

@ranxianglei (Contributor, Author)

Between metadata caching and manifest pushdown, I recommend the latter. The performance of Paimon's ObjectCache implementation is very low; in testing it is sometimes not even as fast as manifest pushdown. I will submit a later PR to fix ObjectCache's performance. @Aitozi

@JingsongLi (Contributor)

@Aitozi This is a scenario quite different from mainstream usage in the community. The author's internal analysis engine has no central planning node; each compute node plans for itself and cares only about its own bucket.

Actually, this works more like a manifest cache in the writer node than like the current design.

@Aitozi (Contributor) commented Dec 2, 2024

@JingsongLi In the writer node, it may still need to read more than one bucket's entries from the manifest if the parallelism is lower than the number of buckets.

@JingsongLi (Contributor)

@Aitozi That is true; there are problems in this PR's implementation.

@ranxianglei (Contributor, Author)

[screenshot]
Great! Reading more than 2 GB of metadata in one go.

@Aitozi (Contributor) commented Nov 28, 2024

@ranxianglei Thanks for your work; happy to see some effort to improve manifest file reading performance. I left two comments.

@JingsongLi (Contributor)

Hi @ranxianglei,
You can create multiple PRs to complete multiple optimizations, but currently various changes are still mixed together, and each change requires a lot of discussion, performance testing, and evaluation of behavior changes.

The purpose of a PR and its review is not to achieve great accomplishments within a single PR, but to deliver higher quality code and better architecture.

@ranxianglei (Contributor, Author)

I've been quite busy lately. I'll split the PR when I'm done. @JingsongLi
