
Use OpenSearch for locking with fallback to peer databag #211

Merged: 45 commits, Apr 17, 2024

Conversation

@carlcsaposs-canonical (Contributor) commented Apr 2, 2024

Issue

Fix issues with current lock implementation

Prepare for in-place upgrades

  • To maintain HA, only one unit should start, restart, join the cluster (scale up), leave the cluster (scale down), or upgrade at a time
  • Upgrades add a further challenge: if the charm code is upgraded (in addition to the workload), the leader unit can go into error state, which prevents coordination of locks via the peer databag (the current rolling ops implementation) and can block rollback for other units

Options considered

  1. Use peer relation databag for lock
    • Unit requests lock by adding data to unit databag -> relation-changed event triggered on leader
    • Leader grants lock by adding data to app databag -> relation-changed event triggered on unit
    • Unit proceeds & releases lock from unit databag
  2. Don't use lock for upgrades. Make upgrade order deterministic (so that each unit can independently determine upgrade order & upgrade order does not change) and each unit upgrades when it sees the units before it have upgraded
  3. Use opensearch index/document for lock
  4. Option 3, but fall back to option 1 if fewer than 2 units are online

Cons of each option:

  1. if leader unit is in error state (raises uncaught exception), rollback will be blocked for all units
  2. a unit can restart for non-upgrade related reasons (role re-balancing from network cut or scale up) while another unit is restarting to upgrade
  3. doesn't work if all units offline (initial start, full cluster crash/network cut)
  4. implementation complexity; some edge cases not supported, e.g. if the lock is acquired via an OpenSearch document and all units then go offline, the document lock will not be released

Pros of each option:

  1. only one unit will (re)start at a time, for any reason
  2. rollback with juju refresh will immediately roll back the highest unit even if the leader/other units are in error state
  3. only one unit will (re)start at a time, for any reason, and rollback with juju refresh will quickly roll back the highest unit even if the leader unit's charm code is in error state
  4. same as 3, and locking works when all units are offline (except for the aforementioned edge cases)
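The backend selection in option 4 can be sketched as follows. This is an illustrative sketch, not the charm's actual API: the function name, string labels, and the exact "fewer than 2 online" threshold are assumptions based on the description above.

```python
def choose_lock_backend(online_units: int) -> str:
    """Illustrative sketch of option 4's fallback (not the charm's real API).

    The OpenSearch document lock needs a reachable cluster, so with fewer
    than 2 units online the charm falls back to the peer-databag lock
    coordinated by the Juju leader (option 1).
    """
    if online_units >= 2:
        return "opensearch-document"
    return "peer-databag"
```

This keeps the strong guarantees of the document lock whenever the cluster can serve requests, while still allowing initial start and recovery from a full cluster crash.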

More context:
Discussion: https://chat.canonical.com/canonical/pl/9fah5tfxd38ybnx3tq7zugxfyh
Option 1 in discussion is option 2 here, option 2 in discussion is option 1 here

Option chosen: Option 4

Opensearch index vs document for lock

Current "ops lock" implementation with opensearch index:
Each unit requests the lock by trying to create an index. If the index does not exist, the "lock" is granted.

However, if a unit requests the lock and the charm goes into error state, then once the error state is resolved (e.g. after rollback), the unit will not be able to use the lock: no unit will be aware that it holds the lock and no unit will be able to release it

Solution: use a document with id 0 that stores the holder's unit name as the lock
Discussion: https://chat.canonical.com/canonical/pl/biddxzzk3fbpjgbhmatzr8n6bw
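The document-as-lock idea can be sketched with an in-memory stand-in for the `.charm_node_lock` index (the real charm issues create-if-absent and delete requests against OpenSearch; the class and method names here are illustrative assumptions):

```python
class DocumentLock:
    """Sketch of a lock stored as document id 0 in an OpenSearch index.

    Storing the holder's unit name in the document (rather than relying on
    the mere existence of an index) lets a unit that errored and recovered
    see that it already holds the lock, and lets other units see who does.
    """

    _DOC_ID = 0

    def __init__(self):
        self._index = {}  # stand-in for the .charm_node_lock index

    def acquire(self, unit_name: str) -> bool:
        holder = self._index.get(self._DOC_ID)
        if holder is None:
            # Real charm: create-if-absent request; a conflict response
            # means another unit won the race.
            self._index[self._DOC_ID] = {"unit-name": unit_name}
            return True
        # Re-acquisition after an error state: the stored unit name tells
        # this unit that it already holds the lock.
        return holder["unit-name"] == unit_name

    def release(self, unit_name: str) -> None:
        holder = self._index.get(self._DOC_ID)
        if holder and holder["unit-name"] == unit_name:
            del self._index[self._DOC_ID]
```

The key difference from the index-existence scheme is the re-acquisition path: a unit that crashed mid-operation can call `acquire` again and get `True` instead of being locked out forever.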

Solution

Design

(Option 4): Use opensearch document as lock (for any (re)start, join cluster, leave cluster, or upgrade). Fallback to peer databag if all units offline.

Implementation

Create custom events _StartOpensearch and _RestartOpensearch

class _StartOpenSearch(EventBase):
    """Attempt to acquire lock & start OpenSearch.

    This event will be deferred until OpenSearch starts.
    """


class _RestartOpenSearch(EventBase):
    """Attempt to acquire lock & restart OpenSearch.

    This event will be deferred until OpenSearch stops. Then, `_StartOpenSearch` will be emitted.
    """

When opensearch should be (re)started, emit the custom event.
Custom event requests the lock. If granted, it (re)starts opensearch.
Once opensearch fully ready, the lock is released.
If opensearch fails to start, the lock is released.
While opensearch is starting or while the lock is not granted, the custom event will be continually deferred.

Note: the original event is not deferred, only the custom event, so that any logic that ran before the request to (re)start opensearch does not get re-run.

By requesting the lock within the custom event, and attempting to reacquire the lock each time the custom event runs (i.e. after every time it's deferred), we solve the design issue with rollingops and deferred events detailed in #183
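This flow can be sketched without the ops framework; `FakeEvent`, `FakeLock`, and `FakeWorkload` below are illustrative stand-ins for the real event, node lock, and OpenSearch workload, and the handler is a simplification of the charm's actual `_start_opensearch` logic:

```python
class FakeEvent:
    """Stand-in for an ops event that supports defer()."""
    def __init__(self):
        self.deferred = False
    def defer(self):
        self.deferred = True

class FakeLock:
    """Stand-in for the node lock; re-entrant for the current holder."""
    def __init__(self):
        self.holder = None
    def acquire(self, unit):
        if self.holder in (None, unit):
            self.holder = unit
            return True
        return False
    def release(self, unit):
        if self.holder == unit:
            self.holder = None

class FakeWorkload:
    """Starts on one dispatch, reports ready on a later one."""
    def __init__(self):
        self.state = "stopped"
    def start(self):
        self.state = "starting"
    def tick(self):  # time passing between dispatches
        if self.state == "starting":
            self.state = "ready"

def handle_start_opensearch(event, lock, unit, workload):
    # Lock acquisition is re-attempted on every (re-)dispatch of the
    # custom event, i.e. after every defer.
    if not lock.acquire(unit):
        event.defer()
        return
    if workload.state == "stopped":
        workload.start()
    if workload.state == "ready":
        lock.release(unit)   # fully started: next unit may proceed
    else:
        event.defer()        # still starting: keep the lock, retry later

# Drive the sketch: re-dispatch the deferred event until it completes.
lock, workload = FakeLock(), FakeWorkload()
for _ in range(5):
    event = FakeEvent()
    handle_start_opensearch(event, lock, "opensearch/0", workload)
    if not event.deferred:
        break
    workload.tick()
```

Because the acquisition attempt lives inside the handler rather than in the original hook, a lock granted while the event sat deferred is picked up on the next dispatch, which is exactly the gap in the rollingops approach.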

@carlcsaposs-canonical force-pushed the fix-lock branch 2 times, most recently from 6d149c7 to 5f53d7e, April 2, 2024 13:38
@phvalguima (Contributor) left a comment:
If the locking index will be widely used, we should exclude it from a backup / recovery scenario. @carlcsaposs-canonical can you add the correct index name to this constant value here?

@carlcsaposs-canonical changed the base branch from main to fix-request-error, April 3, 2024 07:47
@carlcsaposs-canonical force-pushed the fix-lock branch 4 times, most recently from 8f800fe to a395108, April 3, 2024 09:48
@carlcsaposs-canonical changed the base branch from fix-request-error to fix-request-error-old, April 3, 2024 11:32
@carlcsaposs-canonical (Contributor, author) commented:
Not sure how to move forward with some of the integration test failures; looking for help

  1. ha/test_ha.py fails with 503 opensearch HTTP request: OpenSearch Security not initialized.
Traceback
unit-opensearch-1: 13:05:01 ERROR unit.opensearch/1.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 261, in request
    resp = call(urls[0])
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 218, in call
    for attempt in Retrying(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 245, in call
    response.raise_for_status()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://10.16.209.166:9200/.charm_node_lock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 454, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 863, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 943, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 618, in _start_opensearch
    if not self.node_lock.acquired:
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 172, in acquired
    self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 271, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={}
self.response_text='OpenSearch Security not initialized.'
  2. ha/test_large_deployments.py Timeout waiting for opensearch to start
  3. At first glance, appears to be a juju error (not an uncaught charm exception): juju.worker.uniter.context cannot apply changes: creating secrets: Regular expression is invalid: nothing to repeat
  4. relations/test_opensearch_provider.py Timeout

For 2, my first guess is that maybe the timeout needs to be increased now that the lock is (maybe) working correctly and each unit is starting one at a time. (Second guess: lock not getting released)
For 3 and 4, planning to investigate more
For 1, I'm not sure how to proceed—what should the charm do in this situation? Should it ever be getting into this situation? I'm not familiar with how the opensearch security initialization works, when it happens, what depends on it, etc.

@carlcsaposs-canonical marked this pull request as ready for review April 3, 2024 14:28
@carlcsaposs-canonical changed the base branch from fix-request-error-old to fix-request-error April 4, 2024 13:02
Base automatically changed from fix-request-error to main April 5, 2024 11:37
carlcsaposs-canonical added a commit that referenced this pull request Apr 5, 2024
Definitions
Reachable: Successful socket connection
Online: Successful HTTP GET request to `/_nodes`

Fixes uncaught OpenSearchHttpError (status code 503): OpenSearch Security not initialized. (Fixes transient integration test failure—test 1 from #211 (comment))
carlcsaposs-canonical added a commit that referenced this pull request Apr 5, 2024
carlcsaposs-canonical added a commit that referenced this pull request Apr 5, 2024
carlcsaposs-canonical added a commit that referenced this pull request Apr 9, 2024
carlcsaposs-canonical added a commit that referenced this pull request Apr 12, 2024 (Co-authored-by: Mehdi Bendriss <[email protected]>)
@carlcsaposs-canonical merged commit 15c7368 into main Apr 17, 2024
22 of 26 checks passed
@carlcsaposs-canonical deleted the fix-lock branch April 17, 2024 12:49
Successfully merging this pull request may close these issues.

Deferring start/restart event causes lock to be prematurely released
3 participants