Use OpenSearch for locking with fallback to peer databag #211
If the locking index will be widely used, we should exclude it from a backup / recovery scenario. @carlcsaposs-canonical can you add the correct index name to this constant value here?
Not sure how to move forward with some of the integration test failures; looking for help
Traceback
For 2, my first guess is that the timeout needs to be increased now that the lock is (maybe) working correctly and each unit is starting one at a time. (Second guess: the lock is not getting released.)
Definitions:
- Reachable: Successful socket connection
- Online: Successful HTTP GET request to `/_nodes`

Fixes uncaught OpenSearchHttpError (status code 503): OpenSearch Security not initialized. (Fixes transient integration test failure: test 1 from #211 (comment))
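The difference between the two definitions can be sketched with the standard library alone. This is a minimal illustration, not the charm's code: the function names `is_reachable` and `is_online` are hypothetical, and it assumes plain HTTP without authentication, whereas a real OpenSearch node would normally need TLS and credentials.

```python
import http.client
import socket
import urllib.error
import urllib.request


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Reachable: a socket connection to the node succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_online(base_url: str, timeout: float = 2.0) -> bool:
    """Online: an HTTP GET request to `/_nodes` succeeds."""
    try:
        with urllib.request.urlopen(f"{base_url}/_nodes", timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTPError is a subclass of URLError, so a node that accepts
        # connections but answers 503 (for example while OpenSearch Security
        # is not initialized) is reported as not online instead of raising.
        return False
```

A node can therefore be reachable without being online, which is the state in which the 503 described above was previously left uncaught.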
Definitions:
- Reachable: Successful socket connection
- Online: Successful HTTP GET request to `/_nodes`

Fixes uncaught OpenSearchHttpError (status code 503): OpenSearch Security not initialized. (Fixes transient integration test failure: test 1 from #211 (comment))

Co-authored-by: Mehdi Bendriss
This reverts commit a428778.
This reverts commit cb0f179.
Fixes issue in the case where:
- unit 1 gets opensearch lock, begins to start
- all units of opensearch go offline
- unit 2 gets peer databag lock, begins to start
- 1+ units of opensearch go online

Neither unit 1 nor unit 2 can use the lock.

Context: https://chat.canonical.com/canonical/pl/c1gjp1z45jbr5mppe1gzr9o3kh
See #230 Implemented in a minimal, hacky way to proceed with testing. Will be implemented fully (i.e. unnecessary code removed) by @Mehdi-Bendriss in another PR
This reverts commit 2f82682.
This reverts commit 590410f.
This reverts commit c838c4f.
a193b43
to
7edf547
Compare
Issue
Fix issues with current lock implementation
Prepare for in-place upgrades
Options considered
no units online
less than 2 units online
Cons of each option:
opensearch-operator/lib/charms/opensearch/v0/opensearch_locking.py, lines 227 to 228 in b1b28c1
Pros of each option:
`juju refresh` will immediately roll back the highest unit even if the leader/other units are in error state
`juju refresh` will quickly roll back the highest unit even if the leader unit's charm code is in error state
More context:
Discussion: https://chat.canonical.com/canonical/pl/9fah5tfxd38ybnx3tq7zugxfyh
Option 1 in discussion is option 2 here, option 2 in discussion is option 1 here
Option chosen: Option 4
Opensearch index vs document for lock
Current "ops lock" implementation with opensearch index:
Each unit requests the lock by trying to create an index. If the index does not exist, the "lock" is granted.
However, if a unit requests the lock, its charm goes into error state, and the error state is then resolved (e.g. after rollback), the unit will not be able to use the lock: no unit will be aware that it holds the lock, and no unit will be able to release it.
Solution: use a document with ID 0 that stores "unit-name" as the lock
Discussion: https://chat.canonical.com/canonical/pl/biddxzzk3fbpjgbhmatzr8n6bw
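A minimal sketch of why storing the unit name helps, using an in-memory stand-in instead of a real OpenSearch client. `FakeLockStore`, `acquire`, and `release` are hypothetical names for illustration only; the sketch assumes the store offers an atomic "create only if absent" operation.

```python
from typing import Dict, Optional

LOCK_DOC_ID = "0"


class FakeLockStore:
    """In-memory stand-in for the lock index (not the OpenSearch API)."""

    def __init__(self) -> None:
        self._documents: Dict[str, dict] = {}

    def create(self, doc_id: str, body: dict) -> bool:
        """Create the document only if it does not exist; report success."""
        if doc_id in self._documents:
            return False
        self._documents[doc_id] = body
        return True

    def get(self, doc_id: str) -> Optional[dict]:
        return self._documents.get(doc_id)

    def delete(self, doc_id: str) -> None:
        self._documents.pop(doc_id, None)


def acquire(store: FakeLockStore, unit_name: str) -> bool:
    """Grant the lock if it is free or already held by this unit."""
    if store.create(LOCK_DOC_ID, {"unit-name": unit_name}):
        return True
    holder = store.get(LOCK_DOC_ID)
    # Because the document records its owner, a unit that lost its local
    # state (e.g. after an error state and rollback) can recognise its lock.
    return holder is not None and holder["unit-name"] == unit_name


def release(store: FakeLockStore, unit_name: str) -> None:
    """Release the lock only if this unit is the recorded holder."""
    holder = store.get(LOCK_DOC_ID)
    if holder is not None and holder["unit-name"] == unit_name:
        store.delete(LOCK_DOC_ID)
```

With a bare index as the lock, the second `acquire` call by the same unit would be indistinguishable from a request by any other unit; with the owner recorded in the document, it succeeds and the unit can later release the lock.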
Solution
Design
(Option 4): Use opensearch document as lock (for any (re)start, join cluster, leave cluster, or upgrade). Fallback to peer databag if all units offline.
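The fallback rule can be sketched as follows. This is an illustration under stated assumptions, not the charm's implementation: `Locks`, `acquire`, `release`, and the `lock-holder` key are hypothetical, both backends are modelled in memory, and the sketch ignores the transition cases where units move between offline and online while a lock is held.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Locks:
    # Stands in for the unit name stored in the OpenSearch lock document.
    opensearch_holder: Optional[str] = None
    # Stands in for the peer relation databag.
    peer_databag: Dict[str, str] = field(default_factory=dict)


def acquire(locks: Locks, unit: str, online_units: int) -> bool:
    """Use the OpenSearch lock when any unit is online, else the databag."""
    if online_units > 0:
        if locks.opensearch_holder in (None, unit):
            locks.opensearch_holder = unit
            return True
        return False
    # All units offline: fall back to the peer databag.
    holder = locks.peer_databag.setdefault("lock-holder", unit)
    return holder == unit


def release(locks: Locks, unit: str) -> None:
    """Release whichever lock(s) this unit currently holds."""
    if locks.opensearch_holder == unit:
        locks.opensearch_holder = None
    if locks.peer_databag.get("lock-holder") == unit:
        del locks.peer_databag["lock-holder"]
```

Releasing both backends in one call is a deliberate simplification here, so that a unit never leaves a stale entry behind in the backend it did not use last.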
Implementation
Create custom events `_StartOpensearch` and `_RestartOpensearch`
opensearch-operator/lib/charms/opensearch/v0/opensearch_base_charm.py, lines 121 to 132 in b1b28c1
When opensearch should be (re)started, emit the custom event.
Custom event requests the lock. If granted, it (re)starts opensearch.
Once opensearch is fully ready, the lock is released.
If opensearch fails to start, the lock is released.
While opensearch is starting or while the lock is not granted, the custom event will be continually deferred.
Note: the original event is not deferred, only the custom event. This is so that any logic that ran before the request to (re)start opensearch does not get re-run.
By requesting the lock within the custom event, and attempting to reacquire the lock each time the custom event runs (i.e. after every time it's deferred), we solve the design issue with rollingops and deferred events detailed in #183
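The flow above can be simulated without the `ops` framework. The sketch below is a toy model under stated assumptions: a `deque` stands in for the deferred-event queue, a shared dict stands in for the lock, `ticks_to_start` is a hypothetical counter for how many runs opensearch needs to become ready, and the failed-start path (which would also release the lock) is not modelled.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class Unit:
    def __init__(self, name: str, lock: Dict[str, Optional[str]], ticks_to_start: int):
        self.name = name
        self.lock = lock  # shared: {"holder": unit name or None}
        self.ticks_left = ticks_to_start
        self.started = False
        self.pre_start_runs = 0  # how often the original event's logic ran

    def on_original_event(self, queue: Deque) -> None:
        """Original event: runs its logic once, then emits the custom event."""
        self.pre_start_runs += 1
        queue.append(self.on_start_opensearch)

    def on_start_opensearch(self, queue: Deque) -> None:
        """Custom event: re-request the lock on every run, defer until done."""
        if self.lock["holder"] not in (None, self.name):
            queue.append(self.on_start_opensearch)  # lock not granted: defer
            return
        self.lock["holder"] = self.name
        if self.ticks_left > 0:
            self.ticks_left -= 1
            queue.append(self.on_start_opensearch)  # still starting: defer
            return
        self.started = True
        self.lock["holder"] = None  # fully ready: release the lock


def run(units: List[Unit]) -> List[str]:
    """Dispatch events until the queue is empty; return the start order."""
    queue: Deque = deque()
    for unit in units:
        unit.on_original_event(queue)
    order = []
    while queue:
        handler = queue.popleft()
        unit = handler.__self__
        was_started = unit.started
        handler(queue)
        if unit.started and not was_started:
            order.append(unit.name)
    return order
```

Running two units against the same lock starts them one at a time, while each unit's original-event logic runs exactly once, since only the custom event is ever re-queued.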