[ntuple] Fixes lookup & searching in the descriptor #17004

Open
wants to merge 8 commits into
base: master
from

Conversation

jblomer
Contributor

@jblomer jblomer commented Nov 21, 2024

Changes several lookups in the descriptor from linear to logarithmic complexity. As a result, many of the limit test results improve significantly, so much so that I think we can enable most of them on a regular basis.

Relies on #16986
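
For illustration, a minimal sketch of the kind of logarithmic lookup the description refers to, assuming the clusters are available sorted by their first entry index. `ClusterRange`, `kInvalidId` and `FindClusterIdSketch` are hypothetical names, not the ROOT API; judging from the diff fragments further down, the actual change appears to search over sorted cluster group IDs first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a cluster descriptor; not the ROOT class.
struct ClusterRange {
   std::uint64_t id;         // descriptor ID (no ordering assumed)
   std::uint64_t firstEntry; // first entry covered by the cluster
   std::uint64_t nEntries;   // number of entries in the cluster
};

constexpr std::uint64_t kInvalidId = std::uint64_t(-1);

// Returns the ID of the cluster that contains `entry`, or kInvalidId.
// Precondition (assumed): `clusters` is sorted by firstEntry and non-overlapping.
inline std::uint64_t FindClusterIdSketch(const std::vector<ClusterRange> &clusters, std::uint64_t entry)
{
   assert(std::is_sorted(clusters.begin(), clusters.end(),
                         [](const ClusterRange &a, const ClusterRange &b) { return a.firstEntry < b.firstEntry; }));
   // Binary search for the first cluster that starts strictly after `entry`.
   auto it = std::upper_bound(clusters.begin(), clusters.end(), entry,
                              [](std::uint64_t e, const ClusterRange &c) { return e < c.firstEntry; });
   if (it == clusters.begin())
      return kInvalidId; // entry lies before the first cluster
   --it; // candidate: last cluster starting at or before `entry`
   if (entry >= it->firstEntry + it->nEntries)
      return kInvalidId; // entry falls into a gap or past the last cluster
   return it->id;
}
```

The search needs O(log n) comparisons per lookup instead of a scan over all clusters, which is what makes the difference for descriptors with very many clusters.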

@jblomer jblomer self-assigned this Nov 21, 2024
@jblomer jblomer force-pushed the ntuple-fix-descriptor-search branch from 81ca30f to e9dd2d5 Compare November 21, 2024 14:00
/// May contain only a subset of all the available clusters, e.g. the clusters of the current file
/// from a chain of files
std::unordered_map<DescriptorId_t, RClusterDescriptor> fClusterDescriptors;
std::vector<RExtraTypeInfoDescriptor> fExtraTypeInfoDescriptors;
std::unique_ptr<RHeaderExtension> fHeaderExtension;

// We don't expose this publicy because when we add sharded clusters, this interface does not make sense anymore
Contributor

Typo: publicy
Also, it should probably be a TODO?

Contributor Author

I don't think this is a TODO. We never plan to make this a public interface, because with sharded clusters it would be confusing (multiple possible clusters for the same entry number).


github-actions bot commented Nov 21, 2024

Test Results

    18 files      18 suites   4d 8h 12m 53s ⏱️
 2 665 tests  2 662 ✅ 0 💤 3 ❌
46 252 runs  46 244 ✅ 0 💤 8 ❌

For more details on these failures, see this check.

Results for commit d5eae81.

♻️ This comment has been updated with latest results.

Member

@hahnjo hahnjo left a comment

For the limits test, we should also check the memory consumption in addition to the time...
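
One possible way to record memory next to the timing is sketched here, assuming a POSIX system. `PeakRssRaw` is a hypothetical helper, not part of the test suite; `getrusage` reports the peak resident set size in kilobytes on Linux and in bytes on macOS, so the unit has to be handled per platform.

```cpp
#include <sys/resource.h>

// Hypothetical helper: peak resident set size of the current process,
// in the platform's native unit (kilobytes on Linux, bytes on macOS).
// Returns -1 if the value cannot be obtained.
inline long PeakRssRaw()
{
   rusage usage{};
   if (getrusage(RUSAGE_SELF, &usage) != 0)
      return -1;
   return usage.ru_maxrss;
}
```

Since the value is a high-water mark for the whole process, it is most meaningful when each limit test runs in its own process.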

Comment on lines +384 to +396
const auto &clusterIds = GetClusterGroupDescriptor(fSortedClusterGroupIds[cgMidpoint]).GetClusterIds();
R__ASSERT(!clusterIds.empty());
Member

unrelated (for now): this requires deserializing all page lists to populate all cluster group descriptors. In the future, we may first want to search loaded cluster groups under the assumption that by loading the (global) entry first, we already have the necessary information...

Contributor Author

Indeed. When we implement partial loading of page lists, we need to modify this, e.g. to first look into the available page lists and then load the remaining ones or so.

Comment on lines +487 to +499
const auto firstEntryInNextCluster = clusterDesc.GetFirstEntryIndex() + clusterDesc.GetNEntries();
return FindClusterId(firstEntryInNextCluster);
Member

Is it worth shortcutting the common case here by checking whether clusterId + 1 contains firstEntryInNextCluster?

Contributor Author

That's a good question. The "problem" is that I think currently that shortcut will always trigger. However, I also don't want to rely on descriptor ID ordering... I need to think about it.
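
A sketch of how the shortcut discussed above could avoid relying on descriptor ID ordering: treat `clusterId + 1` only as a guess, verify that it really contains the wanted entry, and otherwise fall back to the general search. All names here (`ClusterInfo`, `ClusterMap`, `FindByEntry`, `FindNextClusterIdSketch`) are hypothetical, and the fallback is a plain linear scan only to keep the example short; it stands in for the logarithmic `FindClusterId` of the PR.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins; not the ROOT classes.
struct ClusterInfo {
   std::uint64_t firstEntry;
   std::uint64_t nEntries;
};
using ClusterMap = std::unordered_map<std::uint64_t, ClusterInfo>;
constexpr std::uint64_t kNoCluster = std::uint64_t(-1);

inline bool Contains(const ClusterInfo &c, std::uint64_t entry)
{
   return entry >= c.firstEntry && entry < c.firstEntry + c.nEntries;
}

// Placeholder for the general lookup (linear here, logarithmic in the PR).
inline std::uint64_t FindByEntry(const ClusterMap &clusters, std::uint64_t entry)
{
   for (const auto &kv : clusters) {
      if (Contains(kv.second, entry))
         return kv.first;
   }
   return kNoCluster;
}

inline std::uint64_t FindNextClusterIdSketch(const ClusterMap &clusters, std::uint64_t clusterId)
{
   const auto cur = clusters.find(clusterId);
   if (cur == clusters.end())
      return kNoCluster;
   const auto firstEntryInNext = cur->second.firstEntry + cur->second.nEntries;
   // Shortcut: the next ID is only a guess and is verified before use,
   // so correctness does not depend on how IDs are assigned.
   const auto guess = clusters.find(clusterId + 1);
   if (guess != clusters.end() && Contains(guess->second, firstEntryInNext))
      return guess->first;
   return FindByEntry(clusters, firstEntryInNext);
}
```

With this shape the shortcut is purely an optimization: if IDs ever stop being consecutive in entry order, the result is still correct and only the fast path is lost.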

return kInvalidDescriptorId;
if (clusterDesc.GetFirstEntryIndex() == 0)
return kInvalidDescriptorId;
return FindClusterId(clusterDesc.GetFirstEntryIndex() - 1);
Member

same question here

@jblomer jblomer force-pushed the ntuple-fix-descriptor-search branch from e9dd2d5 to 2c5a443 Compare November 26, 2024 22:06
@hahnjo
Member

hahnjo commented Nov 27, 2024

Limits_ManyPagesOneEntry fails in the CI: it fills a std::vector<int> with 100 million elements and writes it as a single page. That probably means memory usage on the order of GBs, so we should probably leave it disabled.
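
A rough back-of-the-envelope check of that estimate, assuming a 4-byte `int` (the names below are hypothetical and only serve the calculation). The vector payload alone is 400 MB; how many further full-size copies exist at peak (for example serialization or compression buffers) is not stated in this thread, but a few of them would bring the total into the GB range.

```cpp
#include <cstddef>
#include <cstdint>

// Element count taken from the comment above.
constexpr std::size_t kNElements = 100000000;

// Assumption: int is 4 bytes, modeled here with std::int32_t.
constexpr std::size_t PayloadBytes(std::size_t nElements)
{
   return nElements * sizeof(std::int32_t);
}

constexpr double ToMiB(std::size_t bytes)
{
   return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

static_assert(PayloadBytes(kNElements) == 400000000, "100 million 4-byte ints occupy 400 MB");
```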

@jblomer jblomer force-pushed the ntuple-fix-descriptor-search branch from 2c5a443 to c23b2fa Compare December 12, 2024 22:43
@jblomer jblomer requested a review from hahnjo December 12, 2024 22:43
Comment on lines -163 to +167
-TEST(RNTuple, DISABLED_Limits_ManyPagesOneEntry)
+TEST(RNTuple, Limits_ManyPagesOneEntry)
Member

This test fails in the CI so we should not enable it either

-TEST(RNTuple, DISABLED_Limits_LargePageOneEntry)
+TEST(RNTuple, Limits_LargePageOneEntry)
Member

maybe don't enable it to disable it again two commits later...

@jblomer jblomer force-pushed the ntuple-fix-descriptor-search branch from c23b2fa to d5eae81 Compare December 13, 2024 09:52
3 participants