Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick cache manager #3550

Merged
merged 14 commits into from
Oct 10, 2023
Merged

Conversation

dbutenhof
Copy link
Member

@dbutenhof dbutenhof commented Sep 7, 2023

PBENCH-1249

On large datasets, our direct tarball extraction method can time out the API call. Unlike on a long intake, there is no persistent artifact so a retry will always time out as well. This applies to any get_inventory call, and therefore to the /inventory, /visualize, and /compare APIs; and given the central importance of those APIs for our Server 1.0 story, that's not an acceptable failure mode.

This PR mitigates that problem with a "compromise" partial cache manager, leveraging the existing unpack method but adding a file lock to manage shared access. The idea is that any consumer of tarball contents (including the indexer) will unpack the entire tarball, but leave a "last reference" timestamp. A periodic timer service will check the cache unpack timestamps, and delete the unpack directories which aren't currently locked and which haven't been referenced for longer than a set time period.

@dbutenhof dbutenhof added Server DO NOT MERGE!!! API Of and relating to application programming interfaces to services and functions labels Sep 7, 2023
@dbutenhof dbutenhof requested a review from webbnh September 7, 2023 21:31
@dbutenhof dbutenhof self-assigned this Sep 7, 2023
@dbutenhof

This comment was marked as outdated.

@dbutenhof

This comment was marked as outdated.

@dbutenhof

This comment was marked as outdated.

@webbnh
Copy link
Member

webbnh commented Sep 8, 2023

Given that you lost access to Jenkins in the middle of this, are you sure that you still exist? 😆

The accesses to RHEL RPM repo and the container registry seem like transient infrastructure (e.g., network) problems. (I think I've seen each of them on occasion, although never with this level of concentration....) The Tox/Python packaging issue is not one I've seen in steady state before...I wonder if some external package got updated.... 😒

@webbnh
Copy link
Member

webbnh commented Sep 8, 2023

I have an awkward question regarding this output:

py39: exit 1 (232.49 seconds) /home/jenkins/pbench> bash -c '/home/jenkins/pbench/exec-tests /var/tmp/jenkins/tox/py39 ' pid=407
.pkg: _exit> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
py39: FAIL code 1 (285.15=setup[52.66]+cmd[232.49] seconds)
evaluation failed :( (285.22 seconds)

Why are we referencing the 3.11 site-packages directory in what I think is a 3.9-based run? (Is that py39 thing just a red-herring??)

@dbutenhof

This comment was marked as outdated.

@dbutenhof

This comment was marked as outdated.

@dbutenhof

This comment was marked as outdated.

@dbutenhof dbutenhof marked this pull request as ready for review September 8, 2023 18:46
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First installment: I'll try to get to the tests on Monday.

lib/pbench/cli/server/tree_manage.py Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
@dbutenhof dbutenhof marked this pull request as draft September 11, 2023 11:33
@dbutenhof

This comment was marked as outdated.

@webbnh

This comment was marked as resolved.

@dbutenhof dbutenhof marked this pull request as ready for review September 13, 2023 00:26
@dbutenhof

This comment was marked as outdated.

@dbutenhof

This comment was marked as outdated.

@dbutenhof dbutenhof requested a review from webbnh September 13, 2023 20:34
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, here's the next installment. (I'd like to get this in front of you in case there is time to address it today, and I have to go on a quick errand now....) I'll resume with the tests when I get back.

I think that we should consider tweaking the LockRef interface and reconsider what it means to "release" a lock reference. (My instinct is that "keep" should be specified on the instantiation rather than via a setter, "wait" should be a parameter to the acquisition operation, exclusive/shared might be a characteristic of the acquisition rather than instantiation, and that release shouldn't imply destruction.) I'm guessing that a lot of your choices have been driven by the desire to use the LockRef as a context manager, but perhaps we can sustain that while still providing a less opinionated interface.

lib/pbench/cli/server/tree_manage.py Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/indexing_tarballs.py Outdated Show resolved Hide resolved
Copy link
Member Author

@dbutenhof dbutenhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently I'm going to have to deal with heading out to vacation with this unresolved, which is a bit depressing. I guess I'll make an attempt at a few changes, but even if I finish them (which seems unlikely at this point) I'm not sure that'll be enough. 😦

lib/pbench/cli/server/tree_manage.py Show resolved Hide resolved
lib/pbench/server/cache_manager.py Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/indexing_tarballs.py Outdated Show resolved Hide resolved
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blockers in the tests, but there's some low-level stuff which you might consider polishing.

lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_indexing_tarballs.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_indexing_tarballs.py Outdated Show resolved Hide resolved
dbutenhof added a commit to dbutenhof/pbench that referenced this pull request Oct 5, 2023
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually
having a usable cache. PR distributed-system-analysis#3550 introduces a functioning (if minimal) cache
manager, and this PR layers on top of that.

Note that, despite distributed-system-analysis#3550 introducing a live cache, this PR represents the
first actual use of the in-memory cache map, and some adjustments were
necessary to make it work outside of the unit test environment.
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good now. However, I did find a potential bug (or a pair or trio) and a probably-missing assertion. I also have a bunch of pointed question-type suggestions and other nits.

Also, there's lingering comment from my previous review -- did you want to consider removing a try block?

lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
Comment on lines 272 to 288
def __enter__(self) -> "LockManager":
"""Enter a lock context manager by acquiring the lock"""
return self.acquire()
self.lock.acquire(exclusive=self.exclusive, wait=self.wait)
return self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be instantiating the value of self.lock here instead of doing that in the c'tor?

That is, because releasing the lock (by calling either LockManager.release() or LockManager.__exit__()) will render the LockRef unusable (because the underlying lock file will have been closed), when this function returns, the LockManager here becomes useless, and the caller will be forced to instantiate a new one in order to reacquire the lock. Deferring the instantiation of the LockRef to here would save the caller from having to reinstantiate the LockManager object.

lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/indexing_tarballs.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/cli/server/tree_manage.py Outdated Show resolved Hide resolved
PBENCH-1249

On large datasets, our direct tarball extraction method can time out the API
call. Unlike on a long intake, there is no persistent artifact so a retry will
always time out as well. This applies to any `get_inventory` call, and
therefore to the `/inventory`, `/visualize`, and `/compare` APIs; and given
the central importance of those APIs for our Server 1.0 story, that's not an
acceptable failure mode.

This PR mitigates that problem with a "compromise" partial cache manager,
leveraging the existing `unpack` method but adding a file lock to manage
shared access. The idea is that any consumer of tarball contents (including
the indexer) will unpack the entire tarball, but leave a "last reference"
timestamp. A periodic timer service will check the cache unpack timestamps,
and delete the unpack directories which aren't currently locked and which
haven't been referenced for longer than a set time period.

__NOTE__: I'm posting a draft mostly for coverage data after a lot of drift
in the cache manager unit tests, to determine whether more work is necessary.
The "last reference" and reclaim mechanism isn't yet implemented, though that
should be the "easy part" now that I've got the server code working.
I've verified that the timer service removes sufficiently old cache data and
that the data is unpacked again on request. The reclaim operation is audited.
I should probably audit the unpack a well, but haven't done that here. I'm
still hoping for a successful CI run to check cobertura coverage.
We probably won't want to audit cache load longer term, but right now it
probably makes sense to keep track.
Allow holding lock from unpack to stream, and conversion between `EX` and `SH`
lock modes.
I can't figure out why the default `ubi9` container configuration + EPEL is no
longer finding `rsyslog-mmjsonparse`. I've found no relevant hits on searches
nor any obvious workaround.

For now, try changing `Pipeline.gy` to override the default `BASE_IMAGE` and
use `centos:stream9` instead.
1. Fix Inventory.close() to always close the stream.
2. Make cache load more transparent by upgrading lock if we need to unpack.
webbnh
webbnh previously approved these changes Oct 10, 2023
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there is still an assertion missing from one of the tests; other than that, the below are just nits but you might want to consider addressing them. If not, I've approved....

lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Show resolved Hide resolved
lib/pbench/test/unit/server/test_cache_manager.py Outdated Show resolved Hide resolved
lib/pbench/server/cache_manager.py Show resolved Hide resolved
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it!

@dbutenhof dbutenhof merged commit 8d830a8 into distributed-system-analysis:main Oct 10, 2023
3 checks passed
@dbutenhof dbutenhof deleted the cache branch October 10, 2023 20:36
dbutenhof added a commit to dbutenhof/pbench that referenced this pull request Oct 11, 2023
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually
having a usable cache. PR distributed-system-analysis#3550 introduces a functioning (if minimal) cache
manager, and this PR layers on top of that.

Note that, despite distributed-system-analysis#3550 introducing a live cache, this PR represents the
first actual use of the in-memory cache map, and some adjustments were
necessary to make it work outside of the unit test environment.
dbutenhof added a commit to dbutenhof/pbench that referenced this pull request Oct 12, 2023
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually
having a usable cache. PR distributed-system-analysis#3550 introduces a functioning (if minimal) cache
manager, and this PR layers on top of that.

Note that, despite distributed-system-analysis#3550 introducing a live cache, this PR represents the
first actual use of the in-memory cache map, and some adjustments were
necessary to make it work outside of the unit test environment.
dbutenhof added a commit to dbutenhof/pbench that referenced this pull request Oct 16, 2023
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually
having a usable cache. PR distributed-system-analysis#3550 introduces a functioning (if minimal) cache
manager, and this PR layers on top of that.

Note that, despite distributed-system-analysis#3550 introducing a live cache, this PR represents the
first actual use of the in-memory cache map, and some adjustments were
necessary to make it work outside of the unit test environment.
dbutenhof added a commit that referenced this pull request Oct 16, 2023
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually having a usable cache. PR #3550 introduces a functioning (if minimal) cache manager, and this PR layers on top of that. The immediate motivation stems from an email exchange regarding Crucible, and the fact that Andrew would like (not surprisingly) to be able to access the contents of an archived tarball. Having TOC code relying on the Pbench-specific run-toc Elasticsearch index is not sustainable.

Note that, despite #3550 introducing a live cache, this PR represents the first actual use of the in-memory cache map, and some adjustments were necessary to make it work outside of the unit test environment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Of and relating to application programming interfaces to services and functions Server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants