-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ntuple] Fixes lookup & searching in the descriptor #17004
base: master
Are you sure you want to change the base?
Conversation
81ca30f
to
e9dd2d5
Compare
/// May contain only a subset of all the available clusters, e.g. the clusters of the current file | ||
/// from a chain of files | ||
std::unordered_map<DescriptorId_t, RClusterDescriptor> fClusterDescriptors; | ||
std::vector<RExtraTypeInfoDescriptor> fExtraTypeInfoDescriptors; | ||
std::unique_ptr<RHeaderExtension> fHeaderExtension; | ||
|
||
// We don't expose this publicy because when we add sharded clusters, this interface does not make sense anymore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: publicy
Also, it should probably be a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a TODO. We never plan to make this a public interface. Because with sharded clusters, this interface would be confusing (multiple possible clusters for the same entry number).
Test Results 18 files 18 suites 4d 8h 12m 53s ⏱️ For more details on these failures, see this check. Results for commit d5eae81. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the limits test, we should also check the memory consumption in addition to the time...
const auto &clusterIds = GetClusterGroupDescriptor(fSortedClusterGroupIds[cgMidpoint]).GetClusterIds(); | ||
R__ASSERT(!clusterIds.empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated (for now): this requires deserializing all page lists to populate all cluster group descriptors. In the future, we may first want to search loaded cluster groups under the assumption that by loading the (global) entry first, we already have the necessary information...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. When we implement partial loading of page lists, we need to modify this, e.g. to first look into the available page lists and then load the remaining ones or so.
const auto firstEntryInNextCluster = clusterDesc.GetFirstEntryIndex() + clusterDesc.GetNEntries(); | ||
return FindClusterId(firstEntryInNextCluster); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth shortcutting the common case here and check if clusterId + 1
contains firstEntryInNextCluster
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. The "problem" is that I think currently that shortcut will always trigger. However, I also don't want to rely on descriptor ID ordering... I need to think about it.
return kInvalidDescriptorId; | ||
if (clusterDesc.GetFirstEntryIndex() == 0) | ||
return kInvalidDescriptorId; | ||
return FindClusterId(clusterDesc.GetFirstEntryIndex() - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here
e9dd2d5
to
2c5a443
Compare
|
2c5a443
to
c23b2fa
Compare
TEST(RNTuple, DISABLED_Limits_ManyPagesOneEntry) | ||
TEST(RNTuple, Limits_ManyPagesOneEntry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails in the CI so we should not enable it either
TEST(RNTuple, DISABLED_Limits_LargePageOneEntry) | ||
TEST(RNTuple, Limits_LargePageOneEntry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe don't enable it to disable it again two commits later...
c23b2fa
to
d5eae81
Compare
Fixes several instances of lookups in the descriptor from linear to logarithmic complexity. As a result, many of the limit tests results improve significantly. So much so that I think we can turn on most of them on a regular basis.
Relies on #16986