Quick cache manager #3550
Conversation
Given that you lost access to Jenkins in the middle of this, are you sure that you still exist? 😆 The accesses to the RHEL RPM repo and the container registry seem like transient infrastructure (e.g., network) problems. (I think I've seen each of them on occasion, although never with this level of concentration....) The Tox/Python packaging issue is not one I've seen in steady state before...I wonder if some external package got updated.... 😒
I have an awkward question regarding this output: why are we referencing the 3.11 site-packages directory in what I think is a 3.9-based run? (Is that
First installment: I'll try to get to the tests on Monday.
OK, here's the next installment. (I'd like to get this in front of you in case there is time to address it today, and I have to go on a quick errand now....) I'll resume with the tests when I get back.
I think that we should consider tweaking the `LockRef` interface and reconsider what it means to "release" a lock reference. (My instinct is that "keep" should be specified on the instantiation rather than via a setter, "wait" should be a parameter to the acquisition operation, exclusive/shared might be a characteristic of the acquisition rather than of the instantiation, and that release shouldn't imply destruction.) I'm guessing that a lot of your choices have been driven by the desire to use the `LockRef` as a context manager, but perhaps we can sustain that while still providing a less opinionated interface.
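For concreteness, here's a rough sketch of the shape I have in mind; this is not the PR's code, and the `lockpath`/`keep` names and the `flock`-based locking are just illustrative assumptions:

```python
import fcntl
from pathlib import Path


class LockRef:
    """Rough sketch only: "keep" is fixed at construction, exclusive/shared
    and wait are per-acquisition choices, and release() drops the OS lock
    without destroying the reference."""

    def __init__(self, lockpath: Path, keep: bool = False):
        self.lockpath = lockpath
        self.keep = keep  # decided once, when the reference is created
        self.file = None

    def acquire(self, exclusive: bool = False, wait: bool = True) -> "LockRef":
        """Acquire the lock in the requested mode; wait selects blocking
        versus non-blocking acquisition."""
        if self.file is None:
            self.file = self.lockpath.open("w")
        cmd = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        if not wait:
            cmd |= fcntl.LOCK_NB
        fcntl.flock(self.file, cmd)
        return self

    def release(self) -> None:
        """Release the OS lock but leave the reference usable; closing the
        underlying lock file is a separate, explicit step."""
        if self.file:
            fcntl.flock(self.file, fcntl.LOCK_UN)

    def close(self) -> None:
        """Explicitly discard the lock file handle."""
        if self.file:
            self.file.close()
            self.file = None
```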
Apparently I'm going to have to deal with heading out to vacation with this unresolved, which is a bit depressing. I guess I'll make an attempt at a few changes, but even if I finish them (which seems unlikely at this point) I'm not sure that'll be enough. 😦
No blockers in the tests, but there's some low-level stuff which you might consider polishing.
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually having a usable cache. PR distributed-system-analysis#3550 introduces a functioning (if minimal) cache manager, and this PR layers on top of that.

Note that, despite distributed-system-analysis#3550 introducing a live cache, this PR represents the first actual use of the in-memory cache map, and some adjustments were necessary to make it work outside of the unit test environment.
This is looking really good now. However, I did find a potential bug (or a pair, or a trio) and a probably-missing assertion. I also have a bunch of pointed question-type suggestions and other nits.
Also, there's a lingering comment from my previous review -- did you want to consider removing a `try` block?
lib/pbench/server/cache_manager.py
```diff
 def __enter__(self) -> "LockManager":
     """Enter a lock context manager by acquiring the lock"""
-    return self.acquire()
+    self.lock.acquire(exclusive=self.exclusive, wait=self.wait)
+    return self
```
Should we be instantiating the value of `self.lock` here instead of doing that in the c'tor?
That is, because releasing the lock (by calling either `LockManager.release()` or `LockManager.__exit__()`) will render the `LockRef` unusable (because the underlying lock file will have been closed), when this function returns, the `LockManager` here becomes useless, and the caller will be forced to instantiate a new one in order to reacquire the lock. Deferring the instantiation of the `LockRef` to here would save the caller from having to reinstantiate the `LockManager` object.
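To illustrate, purely as a sketch (reusing the `LockRef` shape above and assuming its `acquire()` takes the same `exclusive`/`wait` keywords), deferring the instantiation might look something like this:

```python
from pathlib import Path


class LockManager:
    """Sketch only: building the LockRef in __enter__ (instead of the
    constructor) means each "with" entry gets a fresh, usable lock, so the
    same LockManager can be reused after release/__exit__ closes the lock."""

    def __init__(self, lockpath: Path, exclusive: bool = False, wait: bool = True):
        self.lockpath = lockpath
        self.exclusive = exclusive
        self.wait = wait
        self.lock = None

    def __enter__(self) -> "LockManager":
        # Defer LockRef creation to here so a previously released (and
        # therefore closed) lock doesn't strand this manager object.
        self.lock = LockRef(self.lockpath)
        self.lock.acquire(exclusive=self.exclusive, wait=self.wait)
        return self

    def __exit__(self, *exc):
        self.lock.release()
        self.lock.close()
        self.lock = None
```

With that shape, the same `LockManager` instance could be entered in multiple `with` blocks without being reconstructed between them.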
PBENCH-1249

On large datasets, our direct tarball extraction method can time out the API call. Unlike on a long intake, there is no persistent artifact, so a retry will always time out as well. This applies to any `get_inventory` call, and therefore to the `/inventory`, `/visualize`, and `/compare` APIs; and given the central importance of those APIs for our Server 1.0 story, that's not an acceptable failure mode.

This PR mitigates that problem with a "compromise" partial cache manager, leveraging the existing `unpack` method but adding a file lock to manage shared access. The idea is that any consumer of tarball contents (including the indexer) will unpack the entire tarball, but leave a "last reference" timestamp. A periodic timer service will check the cache unpack timestamps and delete the unpack directories which aren't currently locked and which haven't been referenced for longer than a set time period.

__NOTE__: I'm posting a draft mostly for coverage data after a lot of drift in the cache manager unit tests, to determine whether more work is necessary. The "last reference" and reclaim mechanism isn't yet implemented, though that should be the "easy part" now that I've got the server code working.
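As a sketch of the reclaim idea (the directory layout, file names, lock style, and threshold below are assumptions for illustration, not the PR's actual implementation):

```python
import fcntl
import shutil
import time
from pathlib import Path

CACHE_ROOT = Path("/srv/pbench/cache")   # hypothetical cache root
MAX_UNREFERENCED_SECONDS = 4 * 60 * 60   # hypothetical reclaim threshold


def reclaim_cache(now=None):
    """Delete unpack directories which aren't locked and haven't been
    referenced for longer than the threshold (illustrative layout:
    <cache>/<id>/{lock,last_ref,unpack})."""
    now = time.time() if now is None else now
    for entry in CACHE_ROOT.iterdir():
        last_ref = entry / "last_ref"
        unpack = entry / "unpack"
        if not unpack.exists() or not last_ref.exists():
            continue
        if now - last_ref.stat().st_mtime < MAX_UNREFERENCED_SECONDS:
            continue  # referenced recently; leave it in place
        with (entry / "lock").open("w") as lock:
            try:
                # Skip anything a consumer currently holds locked.
                fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                continue
            try:
                shutil.rmtree(unpack, ignore_errors=True)
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)
```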
I've verified that the timer service removes sufficiently old cache data and that the data is unpacked again on request. The reclaim operation is audited. I should probably audit the unpack as well, but haven't done that here. I'm still hoping for a successful CI run to check Cobertura coverage.
We probably won't want to audit cache load longer term, but right now it probably makes sense to keep track.
Allow holding lock from unpack to stream, and conversion between `EX` and `SH` lock modes.
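For reference, a minimal sketch of how an `EX`-to-`SH` conversion can work with `flock(2)`; this shows the general mechanism, not necessarily the PR's exact API. A second `flock()` call on the same open descriptor converts the existing lock to the new mode, so the unpack can run under `LOCK_EX` and then downgrade to `LOCK_SH` for streaming:

```python
import fcntl


def downgrade_to_shared(lock_file):
    """Convert an exclusive lock to shared on the same open lock file."""
    fcntl.flock(lock_file, fcntl.LOCK_SH)


def upgrade_to_exclusive(lock_file, wait: bool = True):
    """Convert a shared lock to exclusive, optionally without blocking."""
    cmd = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
    fcntl.flock(lock_file, cmd)
```

Note that the conversion isn't guaranteed to be atomic: the kernel may briefly release the old lock before granting the new one, so an upgrade can block behind (or lose a race to) other holders.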
I can't figure out why the default `ubi9` container configuration + EPEL is no longer finding `rsyslog-mmjsonparse`. I've found no relevant hits on searches nor any obvious workaround. For now, try changing `Pipeline.gy` to override the default `BASE_IMAGE` and use `centos:stream9` instead.
1. Fix `Inventory.close()` to always close the stream.
2. Make cache load more transparent by upgrading the lock if we need to unpack.
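A minimal sketch of the first fix, assuming `Inventory` wraps a readable stream plus an optional lock reference (the attribute names here are illustrative, not the PR's exact fields):

```python
class Inventory:
    """Sketch only: wrap the lock cleanup so the stream is always closed,
    even if releasing the lock raises."""

    def __init__(self, stream, lock=None):
        self.stream = stream
        self.lock = lock

    def close(self):
        try:
            if self.lock:
                self.lock.release()
        finally:
            # Regardless of what happened above, always close the stream.
            self.stream.close()
```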
I think that there is still an assertion missing from one of the tests; other than that, the below are just nits but you might want to consider addressing them. If not, I've approved....
Ship it!
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually having a usable cache. PR #3550 introduces a functioning (if minimal) cache manager, and this PR layers on top of that. The immediate motivation stems from an email exchange regarding Crucible, and the fact that Andrew would (not surprisingly) like to be able to access the contents of an archived tarball. Having TOC code rely on the Pbench-specific `run-toc` Elasticsearch index is not sustainable.

Note that, despite #3550 introducing a live cache, this PR represents the first actual use of the in-memory cache map, and some adjustments were necessary to make it work outside of the unit test environment.