Quick cache manager #3550
Conversation
Given that you lost access to Jenkins in the middle of this, are you sure that you still exist? 😆 The accesses to the RHEL RPM repo and the container registry seem like transient infrastructure (e.g., network) problems. (I think I've seen each of them on occasion, although never with this level of concentration....) The Tox/Python packaging issue is not one I've seen in steady state before...I wonder if some external package got updated.... 😒
I have an awkward question regarding this output: why are we referencing the 3.11 site-packages directory in what I think is a 3.9-based run? (Is that
First installment: I'll try to get to the tests on Monday.
OK, here's the next installment. (I'd like to get this in front of you in case there is time to address it today, and I have to go on a quick errand now....) I'll resume with the tests when I get back.
I think that we should consider tweaking the `LockRef` interface and reconsider what it means to "release" a lock reference. (My instinct is that "keep" should be specified on the instantiation rather than via a setter, "wait" should be a parameter to the acquisition operation, exclusive/shared might be a characteristic of the acquisition rather than of the instantiation, and that release shouldn't imply destruction.) I'm guessing that a lot of your choices have been driven by the desire to use the `LockRef` as a context manager, but perhaps we can sustain that while still providing a less opinionated interface.
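For concreteness, here's a rough sketch of the shape I have in mind; this is not the PR's code, and the `lockpath`/`keep` names and the `flock`-based locking are just illustrative assumptions:

```python
import fcntl
from pathlib import Path


class LockRef:
    """Rough sketch only: "keep" is fixed at construction, exclusive/shared
    and wait are per-acquisition choices, and release() drops the OS lock
    without destroying the reference."""

    def __init__(self, lockpath: Path, keep: bool = False):
        self.lockpath = lockpath
        self.keep = keep  # decided once, when the reference is created
        self.file = None

    def acquire(self, exclusive: bool = False, wait: bool = True) -> "LockRef":
        """Acquire the lock in the requested mode; wait selects blocking
        versus non-blocking acquisition."""
        if self.file is None:
            self.file = self.lockpath.open("w")
        cmd = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        if not wait:
            cmd |= fcntl.LOCK_NB
        fcntl.flock(self.file, cmd)
        return self

    def release(self) -> None:
        """Release the OS lock but leave the reference usable; closing the
        underlying lock file is a separate, explicit step."""
        if self.file:
            fcntl.flock(self.file, fcntl.LOCK_UN)

    def close(self) -> None:
        """Explicitly discard the lock file handle."""
        if self.file:
            self.file.close()
            self.file = None
```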
Apparently I'm going to have to deal with heading out to vacation with this unresolved, which is a bit depressing. I guess I'll make an attempt at a few changes, but even if I finish them (which seems unlikely at this point) I'm not sure that'll be enough. 😦
No blockers in the tests, but there's some low-level stuff which you might consider polishing.
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually having a usable cache. PR distributed-system-analysis#3550 introduces a functioning (if minimal) cache manager, and this PR layers on top of that.

Note that, despite distributed-system-analysis#3550 introducing a live cache, this PR represents the first actual use of the in-memory cache map, and some adjustments were necessary to make it work outside of the unit test environment.
This is looking really good now. However, I did find a potential bug (or a pair, or a trio) and a probably-missing assertion. I also have a bunch of pointed question-type suggestions and other nits.
Also, there's a lingering comment from my previous review -- did you want to consider removing a `try` block?
lib/pbench/server/cache_manager.py
```diff
 def __enter__(self) -> "LockManager":
     """Enter a lock context manager by acquiring the lock"""
-    return self.acquire()
+    self.lock.acquire(exclusive=self.exclusive, wait=self.wait)
+    return self
```
Should we be instantiating the value of `self.lock` here instead of doing that in the c'tor?
That is, because releasing the lock (by calling either `LockManager.release()` or `LockManager.__exit__()`) will render the `LockRef` unusable (because the underlying lock file will have been closed), when this function returns, the `LockManager` here becomes useless, and the caller will be forced to instantiate a new one in order to reacquire the lock. Deferring the instantiation of the `LockRef` to here would save the caller from having to reinstantiate the `LockManager` object.
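To illustrate, purely as a sketch (reusing the `LockRef` shape above and assuming its `acquire()` takes the same `exclusive`/`wait` keywords), deferring the instantiation might look something like this:

```python
from pathlib import Path


class LockManager:
    """Sketch only: building the LockRef in __enter__ (instead of the
    constructor) means each "with" entry gets a fresh, usable lock, so the
    same LockManager can be reused after release/__exit__ closes the lock."""

    def __init__(self, lockpath: Path, exclusive: bool = False, wait: bool = True):
        self.lockpath = lockpath
        self.exclusive = exclusive
        self.wait = wait
        self.lock = None

    def __enter__(self) -> "LockManager":
        # Defer LockRef creation to here so a previously released (and
        # therefore closed) lock doesn't strand this manager object.
        self.lock = LockRef(self.lockpath)
        self.lock.acquire(exclusive=self.exclusive, wait=self.wait)
        return self

    def __exit__(self, *exc):
        self.lock.release()
        self.lock.close()
        self.lock = None
```

With that shape, the same `LockManager` instance could be entered in multiple `with` blocks without being reconstructed between them.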
PBENCH-1249

On large datasets, our direct tarball extraction method can time out the API call. Unlike on a long intake, there is no persistent artifact, so a retry will always time out as well. This applies to any `get_inventory` call, and therefore to the `/inventory`, `/visualize`, and `/compare` APIs; and given the central importance of those APIs for our Server 1.0 story, that's not an acceptable failure mode.

This PR mitigates that problem with a "compromise" partial cache manager, leveraging the existing `unpack` method but adding a file lock to manage shared access. The idea is that any consumer of tarball contents (including the indexer) will unpack the entire tarball, but leave a "last reference" timestamp. A periodic timer service will check the cache unpack timestamps and delete the unpack directories which aren't currently locked and which haven't been referenced for longer than a set time period.

__NOTE__: I'm posting a draft mostly for coverage data after a lot of drift in the cache manager unit tests, to determine whether more work is necessary. The "last reference" and reclaim mechanism isn't yet implemented, though that should be the "easy part" now that I've got the server code working.
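As a sketch of the reclaim idea (the directory layout, file names, lock style, and threshold below are assumptions for illustration, not the PR's actual implementation):

```python
import fcntl
import shutil
import time
from pathlib import Path

CACHE_ROOT = Path("/srv/pbench/cache")   # hypothetical cache root
MAX_UNREFERENCED_SECONDS = 4 * 60 * 60   # hypothetical reclaim threshold


def reclaim_cache(now=None):
    """Delete unpack directories which aren't locked and haven't been
    referenced for longer than the threshold (illustrative layout:
    <cache>/<id>/{lock,last_ref,unpack})."""
    now = time.time() if now is None else now
    for entry in CACHE_ROOT.iterdir():
        last_ref = entry / "last_ref"
        unpack = entry / "unpack"
        if not unpack.exists() or not last_ref.exists():
            continue
        if now - last_ref.stat().st_mtime < MAX_UNREFERENCED_SECONDS:
            continue  # referenced recently; leave it in place
        with (entry / "lock").open("w") as lock:
            try:
                # Skip anything a consumer currently holds locked.
                fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                continue
            try:
                shutil.rmtree(unpack, ignore_errors=True)
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)
```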
I've verified that the timer service removes sufficiently old cache data and that the data is unpacked again on request. The reclaim operation is audited. I should probably audit the unpack as well, but haven't done that here. I'm still hoping for a successful CI run to check Cobertura coverage.
We probably won't want to audit cache load longer term, but right now it probably makes sense to keep track.
Allow holding lock from unpack to stream, and conversion between `EX` and `SH` lock modes.
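For reference, a minimal sketch of how an `EX`-to-`SH` conversion can work with `flock(2)`; this shows the general mechanism, not necessarily the PR's exact API. A second `flock()` call on the same open descriptor converts the existing lock to the new mode, so the unpack can run under `LOCK_EX` and then downgrade to `LOCK_SH` for streaming:

```python
import fcntl


def downgrade_to_shared(lock_file):
    """Convert an exclusive lock to shared on the same open lock file."""
    fcntl.flock(lock_file, fcntl.LOCK_SH)


def upgrade_to_exclusive(lock_file, wait: bool = True):
    """Convert a shared lock to exclusive, optionally without blocking."""
    cmd = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
    fcntl.flock(lock_file, cmd)
```

Note that the conversion isn't guaranteed to be atomic: the kernel may briefly release the old lock before granting the new one, so an upgrade can block behind (or lose a race to) other holders.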
I can't figure out why the default `ubi9` container configuration + EPEL is no longer finding `rsyslog-mmjsonparse`. I've found no relevant hits on searches nor any obvious workaround. For now, try changing `Pipeline.gy` to override the default `BASE_IMAGE` and use `centos:stream9` instead.
1. Fix `Inventory.close()` to always close the stream.
2. Make cache load more transparent by upgrading the lock if we need to unpack.
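A minimal sketch of the first fix, assuming `Inventory` wraps a readable stream plus an optional lock reference (the attribute names here are illustrative, not the PR's exact fields):

```python
class Inventory:
    """Sketch only: wrap the lock cleanup so the stream is always closed,
    even if releasing the lock raises."""

    def __init__(self, stream, lock=None):
        self.stream = stream
        self.lock = lock

    def close(self):
        try:
            if self.lock:
                self.lock.release()
        finally:
            # Regardless of what happened above, always close the stream.
            self.stream.close()
```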
I think that there is still an assertion missing from one of the tests; other than that, the below are just nits but you might want to consider addressing them. If not, I've approved....
Ship it!
PBENCH-1192

This has been on my wishlist for a while, but was blocked by not actually having a usable cache. PR #3550 introduces a functioning (if minimal) cache manager, and this PR layers on top of that. The immediate motivation stems from an email exchange regarding Crucible, and the fact that Andrew would (not surprisingly) like to be able to access the contents of an archived tarball. Having TOC code rely on the Pbench-specific `run-toc` Elasticsearch index is not sustainable.

Note that, despite #3550 introducing a live cache, this PR represents the first actual use of the in-memory cache map, and some adjustments were necessary to make it work outside of the unit test environment.