[ntuple] Fixes lookup & searching in the descriptor #17004

jblomer · 2024-11-21T13:59:11Z

Improves several lookups in the descriptor from linear to logarithmic complexity. As a result, many of the limits test results improve significantly, so much so that I think we can run most of them on a regular basis.

Relies on #16986
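
To illustrate the kind of change described above, here is a minimal sketch contrasting a linear scan with a binary search over sorted, non-overlapping entry ranges. `RangeSketch`, `FindLinear` and `FindBinary` are hypothetical names chosen for this example; they are not the ROOT descriptor classes or the code in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for an entry range; not a real ROOT class.
struct RangeSketch {
   std::uint64_t fFirstEntry; // first entry covered by the range
   std::uint64_t fNEntries;   // number of entries in the range
};

// O(n): scan every range until one contains `entry`.
std::optional<std::size_t> FindLinear(const std::vector<RangeSketch> &ranges, std::uint64_t entry)
{
   for (std::size_t i = 0; i < ranges.size(); ++i) {
      if (entry >= ranges[i].fFirstEntry && entry < ranges[i].fFirstEntry + ranges[i].fNEntries)
         return i;
   }
   return std::nullopt;
}

// O(log n): assumes `ranges` is sorted by fFirstEntry and non-overlapping.
std::optional<std::size_t> FindBinary(const std::vector<RangeSketch> &ranges, std::uint64_t entry)
{
   // First range whose start lies strictly beyond `entry`.
   auto it = std::upper_bound(ranges.begin(), ranges.end(), entry,
                              [](std::uint64_t e, const RangeSketch &r) { return e < r.fFirstEntry; });
   if (it == ranges.begin())
      return std::nullopt; // `entry` precedes the first range
   --it;                   // last range starting at or before `entry`
   if (entry >= it->fFirstEntry + it->fNEntries)
      return std::nullopt; // `entry` falls into a gap or past the end
   return static_cast<std::size_t>(it - ranges.begin());
}
```

The binary variant only works if the ranges are kept sorted, which is presumably why the commit list includes sorting steps before the complexity improvements.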

silverweed · 2024-11-21T14:11:02Z

tree/ntuple/v7/inc/ROOT/RNTupleDescriptor.hxx

   /// May contain only a subset of all the available clusters, e.g. the clusters of the current file
   /// from a chain of files
   std::unordered_map<DescriptorId_t, RClusterDescriptor> fClusterDescriptors;
   std::vector<RExtraTypeInfoDescriptor> fExtraTypeInfoDescriptors;
   std::unique_ptr<RHeaderExtension> fHeaderExtension;

+   // We don't expose this publicy because when we add sharded clusters, this interface does not make sense anymore


Typo: publicy
Also, it should probably be a TODO?

jblomer · 2024-11-28

I don't think this is a TODO. We never plan to make this a public interface, because with sharded clusters it would be confusing (multiple possible clusters for the same entry number).

github-actions · 2024-11-21T16:03:54Z

Test Results

18 files 18 suites 4d 8h 12m 53s ⏱️
2 665 tests 2 662 ✅ 0 💤 3 ❌
46 252 runs 46 244 ✅ 0 💤 8 ❌

For more details on these failures, see this check.

Results for commit d5eae81.

♻️ This comment has been updated with latest results.

hahnjo

For the limits test, we should also check the memory consumption in addition to the time...
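
One possible way to record both quantities is sketched here, assuming a POSIX system where `getrusage` is available. `RunStatsSketch` and `MeasureSketch` are hypothetical helpers for this example, not part of the limits tests. Note that `ru_maxrss` is the high-water mark of the whole process (reported in kilobytes on Linux and in bytes on macOS), so it is only meaningful per test if each test runs in its own process.

```cpp
#include <sys/resource.h> // POSIX getrusage(); assumption: POSIX platform

#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result record for this sketch.
struct RunStatsSketch {
   double fSeconds;  // wall-clock time of the workload
   long fPeakRssRaw; // ru_maxrss as reported by the OS (platform-specific unit)
};

// Runs `workload`, then reports its wall time and the process peak RSS so far.
RunStatsSketch MeasureSketch(const std::function<void()> &workload)
{
   const auto start = std::chrono::steady_clock::now();
   workload();
   const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

   struct rusage usage {};
   getrusage(RUSAGE_SELF, &usage);
   return {elapsed.count(), usage.ru_maxrss};
}
```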

hahnjo · 2024-11-22T10:48:30Z

tree/ntuple/v7/src/RNTupleDescriptor.cxx

+      const auto &clusterIds = GetClusterGroupDescriptor(fSortedClusterGroupIds[cgMidpoint]).GetClusterIds();
+      R__ASSERT(!clusterIds.empty());


unrelated (for now): this requires deserializing all page lists to populate all cluster group descriptors. In the future, we may first want to search loaded cluster groups under the assumption that by loading the (global) entry first, we already have the necessary information...

jblomer · 2024-12-12

Indeed. When we implement partial loading of page lists, we will need to modify this, e.g. to first look into the available page lists and then load the remaining ones.
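
For readers following the thread, the lookup under discussion can be pictured as a two-level binary search: first over the sorted cluster groups, then over the sorted clusters of the selected group. The sketch below uses hypothetical types (`ClusterSketch`, `ClusterGroupSketch`) and assumes every group is fully populated, which is exactly the limitation raised in the comment; it is not the implementation in `RNTupleDescriptor.cxx`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins; not the real ROOT descriptor classes.
struct ClusterSketch {
   std::uint64_t fId;
   std::uint64_t fFirstEntry;
   std::uint64_t fNEntries;
};

struct ClusterGroupSketch {
   std::vector<ClusterSketch> fClusters; // sorted by fFirstEntry, must not be empty
};

// Index of the last element in [0, n) whose first entry is <= entry, if any.
template <typename FirstEntryOf>
std::optional<std::size_t> LastStartingAtOrBefore(std::size_t n, std::uint64_t entry, FirstEntryOf firstEntryOf)
{
   std::size_t lo = 0, hi = n;
   while (lo < hi) {
      const std::size_t mid = lo + (hi - lo) / 2;
      if (firstEntryOf(mid) <= entry)
         lo = mid + 1;
      else
         hi = mid;
   }
   if (lo == 0)
      return std::nullopt;
   return lo - 1;
}

// Assumes `groups` is sorted by the first entry of each group's first cluster.
std::optional<std::uint64_t> FindClusterIdSketch(const std::vector<ClusterGroupSketch> &groups, std::uint64_t entry)
{
   // Level 1: pick the cluster group that may contain `entry`.
   const auto g = LastStartingAtOrBefore(groups.size(), entry, [&](std::size_t i) {
      assert(!groups[i].fClusters.empty());
      return groups[i].fClusters.front().fFirstEntry;
   });
   if (!g)
      return std::nullopt;

   // Level 2: pick the cluster inside that group.
   const auto &clusters = groups[*g].fClusters;
   const auto c = LastStartingAtOrBefore(clusters.size(), entry,
                                         [&](std::size_t i) { return clusters[i].fFirstEntry; });
   if (!c)
      return std::nullopt;
   const auto &cluster = clusters[*c];
   if (entry >= cluster.fFirstEntry + cluster.fNEntries)
      return std::nullopt; // past the last entry of the candidate cluster
   return cluster.fId;
}
```

With partial loading, level 1 could no longer assume that every group knows its clusters, so the search would have to be restricted to loaded groups first.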

hahnjo · 2024-11-22T10:49:51Z

tree/ntuple/v7/src/RNTupleDescriptor.cxx

+   const auto firstEntryInNextCluster = clusterDesc.GetFirstEntryIndex() + clusterDesc.GetNEntries();
+   return FindClusterId(firstEntryInNextCluster);


Is it worth shortcutting the common case here and checking whether clusterId + 1 contains firstEntryInNextCluster?

jblomer · 2024-12-12

That's a good question. The "problem" is that I think that shortcut will currently always trigger. However, I also don't want to rely on descriptor ID ordering... I need to think about it.
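
One way to reconcile both concerns is to treat `clusterId + 1` purely as a guess that is verified before use, so correctness never depends on ID ordering. The following is a sketch under that assumption; `ClusterInfoSketch`, `ClusterMapSketch`, `FindNextSketch` and the `slowFind` callback (standing in for the logarithmic `FindClusterId`) are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Hypothetical stand-ins; not the real ROOT descriptor classes.
struct ClusterInfoSketch {
   std::uint64_t fFirstEntry;
   std::uint64_t fNEntries;
};
using ClusterMapSketch = std::unordered_map<std::uint64_t, ClusterInfoSketch>;
using SlowFindSketch = std::function<std::optional<std::uint64_t>(std::uint64_t)>;

std::optional<std::uint64_t>
FindNextSketch(const ClusterMapSketch &clusters, std::uint64_t clusterId, const SlowFindSketch &slowFind)
{
   assert(clusters.count(clusterId) == 1);
   const auto &current = clusters.at(clusterId);
   const auto firstEntryInNext = current.fFirstEntry + current.fNEntries;

   // Shortcut: guess that the next cluster has ID clusterId + 1, but only
   // accept the guess if that cluster really contains the target entry.
   const auto guess = clusters.find(clusterId + 1);
   if (guess != clusters.end() && firstEntryInNext >= guess->second.fFirstEntry &&
       firstEntryInNext < guess->second.fFirstEntry + guess->second.fNEntries)
      return clusterId + 1;

   // Fallback: the general search, independent of ID ordering.
   return slowFind(firstEntryInNext);
}
```

If the guess always succeeds in practice, the fallback path is rarely exercised, so it would still need its own test coverage.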

hahnjo · 2024-11-22T10:50:11Z

tree/ntuple/v7/src/RNTupleDescriptor.cxx

-   return kInvalidDescriptorId;
+   if (clusterDesc.GetFirstEntryIndex() == 0)
+      return kInvalidDescriptorId;
+   return FindClusterId(clusterDesc.GetFirstEntryIndex() - 1);


same question here
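
Independent of the shortcut question, the guard on a first entry index of 0 in the hunk above matters if the index type is unsigned: subtracting 1 from 0 would wrap around to the largest representable value instead of signalling "no previous cluster". A small sketch, assuming a 64-bit unsigned index and using the hypothetical helper `LastEntryOfPrevCluster`:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Returns the entry to look up in order to find the previous cluster,
// or nullopt if the current cluster is the first one.
// Assumption: entry indices are 64-bit unsigned integers.
inline std::optional<std::uint64_t> LastEntryOfPrevCluster(std::uint64_t firstEntryIndex)
{
   if (firstEntryIndex == 0)
      return std::nullopt; // without this guard, 0 - 1 would wrap around
   return firstEntryIndex - 1;
}

// Demonstrates the wrap-around that the guard avoids.
inline std::uint64_t UnguardedPrevEntry(std::uint64_t firstEntryIndex)
{
   return firstEntryIndex - 1; // well-defined modular arithmetic for unsigned types
}
```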

hahnjo · 2024-11-27T08:11:14Z

Limits_ManyPagesOneEntry fails in the CI: it fills a std::vector<int> with 100 million elements and writes it as a single page. That probably means memory usage on the order of GBs, so we should probably leave it disabled.
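
As a rough back-of-the-envelope check, the payload alone can be computed as follows. With a 4-byte `int` (an assumption that holds on common platforms), 100 million elements occupy 400,000,000 bytes, i.e. about 0.4 GB. How many full copies of that payload exist at the same time during writing (for example in page or compression buffers) is not stated in this thread, so the `copies` parameter is purely hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for the raw payload of `nElements` ints.
constexpr std::size_t PayloadBytesSketch(std::size_t nElements)
{
   return nElements * sizeof(int);
}

// Bytes needed if `copies` full copies of the payload are alive at once.
// `copies` is a hypothetical knob; the real number depends on the I/O path.
constexpr std::size_t TotalBytesSketch(std::size_t nElements, std::size_t copies)
{
   return PayloadBytesSketch(nElements) * copies;
}
```

Under these assumptions, three simultaneous copies would already exceed 1 GB.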

hahnjo · 2024-12-13T08:16:18Z

tree/ntuple/v7/test/ntuple_limits.cxx

-TEST(RNTuple, DISABLED_Limits_ManyPagesOneEntry)
+TEST(RNTuple, Limits_ManyPagesOneEntry)


This test fails in the CI, so we should not enable it either.

hahnjo · 2024-12-13T08:16:40Z

tree/ntuple/v7/test/ntuple_limits.cxx

-TEST(RNTuple, DISABLED_Limits_LargePageOneEntry)
+TEST(RNTuple, Limits_LargePageOneEntry)


maybe don't enable it to disable it again two commits later...

jblomer added the in:RNTuple label Nov 21, 2024

jblomer requested review from hahnjo, pcanal, silverweed and enirolf November 21, 2024 13:59

jblomer self-assigned this Nov 21, 2024

jblomer force-pushed the ntuple-fix-descriptor-search branch from 81ca30f to e9dd2d5 Compare November 21, 2024 14:00

silverweed reviewed Nov 21, 2024

hahnjo reviewed Nov 22, 2024

jblomer force-pushed the ntuple-fix-descriptor-search branch from e9dd2d5 to 2c5a443 Compare November 26, 2024 22:06

jblomer force-pushed the ntuple-fix-descriptor-search branch from 2c5a443 to c23b2fa Compare December 12, 2024 22:43

jblomer requested a review from hahnjo December 12, 2024 22:43

hahnjo reviewed Dec 13, 2024

jblomer added 8 commits December 13, 2024 10:51

[ntuple] sort cluster groups in descriptor

beb5dc7

[ntuple] sort clusters in cluster group descriptor

0c0ae40

[ntuple] improve FindClusterId() complexity from O(n) to O(logn)

bf99fa1

[ntuple] improve Find[Next|Prev]ClusterId() complexity from O(n) to O(logn)

31f1d94

[ntuple] improve RPageRange::Find() complexity from O(n) to O(logn)

0f6cc50

[ntuple] update and enable some limits tests

f3c9d40

[NFC][ntuple] fix typo in code comment

5bd2205

[NFC][ntuple] note down peak RSS of limits tests

d5eae81

jblomer force-pushed the ntuple-fix-descriptor-search branch from c23b2fa to d5eae81 Compare December 13, 2024 09:52
