hook "leader-elected" fails when adding a unit after scale down to zero units #306

Closed
reneradoi opened this issue May 23, 2024 · 2 comments

Steps to reproduce

juju add-model opensearch
# apply the kernel parameters required for opensearch
juju model-config --file ./cloudinit-userdata.yaml
juju create-storage-pool opensearch-storage lxd volume-type=standard
juju deploy opensearch -n 2 --channel 2/edge --storage opensearch-data=opensearch-storage,1G,1
juju deploy self-signed-certificates
juju config self-signed-certificates ca-common-name="CN_CA"
juju relate self-signed-certificates opensearch
juju remove-unit opensearch/1
juju remove-unit opensearch/0
juju add-unit opensearch --attach-storage=opensearch-data/0
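
The steps above do not say whether both removals had finished before `juju add-unit` was run. For anyone scripting the reproduction, a minimal sketch of a wait helper follows. It assumes the JSON from `juju status --format=json` lists units under `applications.<app>.units`; the function names `unit_names` and `wait_for_no_units` are hypothetical and not part of the charm or of Juju.

```python
import json
import subprocess
import time
from typing import Optional


def unit_names(status: dict, app: str) -> list:
    """Return the sorted unit names of `app` found in a parsed status document.

    Assumption: units live under applications.<app>.units, and the key is
    missing (or empty) when the application has been scaled to zero.
    """
    units = status.get("applications", {}).get(app, {}).get("units") or {}
    return sorted(units)


def wait_for_no_units(app: str, model: Optional[str] = None,
                      timeout: float = 600.0, interval: float = 5.0) -> None:
    """Poll `juju status` until `app` has no units left, or raise TimeoutError."""
    cmd = ["juju", "status", "--format=json"]
    if model:
        cmd += ["--model", model]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
        if not unit_names(json.loads(out), app):
            return
        time.sleep(interval)
    raise TimeoutError(f"{app} still has units after {timeout} seconds")
```

Calling `wait_for_no_units("opensearch")` between the last `juju remove-unit` and `juju add-unit` would make the scale-to-zero state explicit in a scripted run.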

Expected behavior

The newly added unit should start up without error.

Actual behavior

$ juju status --storage
Model  Controller  Cloud/Region         Version  SLA          Timestamp
dev    opensearch  localhost/localhost  3.1.8    unsupported  06:52:18Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
opensearch                         active      1  opensearch                           1  no       
self-signed-certificates           active      1  self-signed-certificates  stable    72  no       

Unit                         Workload  Agent  Machine  Public address  Ports  Message
opensearch/2*                error     idle   5        10.27.170.244          hook failed: "leader-elected"
self-signed-certificates/0*  active    idle   2        10.27.170.141          

Machine  State    Address        Inst id        Base          AZ  Message
2        started  10.27.170.141  juju-622e8b-2  [email protected]      Running
5        started  10.27.170.244  juju-622e8b-5  [email protected]      Running

Storage Unit  Storage ID         Type        Pool                Mountpoint                   Size     Status    Message
              opensearch-data/1  filesystem  opensearch-storage                               1.0 GiB  detached  
opensearch/2  opensearch-data/0  filesystem  opensearch-storage  /var/snap/opensearch/common  1.0 GiB  attached  
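
The storage table above shows which volume was reattached and which one is still detached. As a rough illustration, the detached storage IDs could be picked out of that tabular text as sketched below. The helper name `detached_storage_ids` is hypothetical, and the parsing assumes columns are separated by at least two spaces, as in the output shown; a machine-readable output format would be sturdier than this.

```python
import re


def detached_storage_ids(storage_table: str) -> list:
    """Return storage IDs from rows whose status column reads 'detached'.

    Assumption: a detached row has an empty "Storage Unit" column, so the
    first non-blank field on the line is the storage ID.
    """
    ids = []
    for line in storage_table.splitlines():
        # Split on runs of two or more spaces so "1.0 GiB" stays one field.
        fields = re.split(r"\s{2,}", line.strip())
        if "detached" in fields:
            ids.append(fields[0])
    return ids
```

An ID returned this way is the kind of value that `juju add-unit --attach-storage=...` takes in the reproduction steps.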

Versions

Operating system: Ubuntu 24.04 LTS, Ubuntu 22.04 LTS
Juju CLI: 3.1.8-genericlinux-amd64
Juju agent: 3.1.8
Charm revision: 47
LXD: 5.21.1 LTS

Log output

unit-opensearch-2: 06:53:05 ERROR unit.opensearch/2.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-2/charm/./src/charm.py", line 267, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 302, in _on_leader_elected
    self._put_or_update_internal_user_leader(user)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 1244, in _put_or_update_internal_user_leader
    self.user_manager.update_user_password(user, hashed_pwd)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_users.py", line 268, in update_user_password
    resp = self.opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 266, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=None
self.response_text='Host 10.27.170.244:9200 and alternative_hosts: [] not reachable.'
unit-opensearch-4: 06:53:06 ERROR juju.worker.uniter.operation hook "leader-elected" (via hook dispatching script: dispatch) failed: exit status 1
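
The traceback ends with the charm being unable to reach `10.27.170.244:9200`. A quick, charm-independent way to confirm whether anything is listening on that port is a plain TCP connect, sketched here with the standard library only. The function name `is_port_open` is hypothetical; this checks only that the port accepts connections, not that OpenSearch is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

For example, `is_port_open("10.27.170.244", 9200)` run from a machine in the model would show whether the node had started at the time the hook fired.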

Additional context

I assume the issue is with security_index_initialised, which is no longer in the peer data:

$ jhack show-relation opensearch:opensearch-peers opensearch:opensearch-peers
                                                                                             relation data v0.6                                                                                             
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ peer relation (id: 2) ┃ opensearch                                                                                                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ type                  │ peer                                                                                                                                                                             │
│ interface             │ opensearch_peers                                                                                                                                                                 │
│ model                 │ the current model                                                                                                                                                                │
│ relation ID           │ 2                                                                                                                                                                                │
│ endpoint              │ opensearch-peers                                                                                                                                                                 │
│ leader unit           │ 2                                                                                                                                                                                │
├───────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ application data      │ ╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│                       │ │                                                                                                                                                                              │ │
│                       │ │  admin_user_initialized                     True                                                                                                                             │ │
│                       │ │  allocation-exclusions-to-delete            ,opensearch-2                                                                                                                    │ │
│                       │ │  delete-voting-exclusions                   True                                                                                                                             │ │
│                       │ │  deployment-description                     {"config": {"cluster_name": "opensearch-attz", "init_hold": false, "roles": [], "data_temperature": null}, "start":              │ │
│                       │ │                                             "start-with-generated-roles", "pending_directives": [], "typ": "main-orchestrator", "app": "opensearch", "state": {"value":      │ │
│                       │ │                                             "active", "message": ""}, "promotion_time": 1716446675.797672}                                                                   │ │
│                       │ │  opensearch:app:admin-password              secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7ebls8c16j9paghi7g                                                               │ │
│                       │ │  opensearch:app:admin-password-hash         secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7ebls8c16j9paghi80                                                               │ │
│                       │ │  opensearch:app:app-admin                   secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eblc8c16j9paghi50                                                               │ │
│                       │ │  opensearch:app:kibanaserver-password       secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eblk8c16j9paghi6g                                                               │ │
│                       │ │  opensearch:app:kibanaserver-password-hash  secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eblk8c16j9paghi70                                                               │ │
│                       │ │  opensearch:app:monitor-password            secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7ec248c16j9paghib0                                                               │ │
│                       │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ unit data             │ ╭─ opensearch/opensearch/2 ──────────────────────────────────────────────────────────────────────────────╮                                                                       │
│                       │ │                                                                                                        │                                                                       │
│                       │ │  opensearch:unit:2:unit-http       secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eevc8c16j9paghic0  │                                                                       │
│                       │ │  opensearch:unit:2:unit-transport  secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eevc8c16j9paghibg  │                                                                       │
│                       │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯                                                                       │
└───────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This is where an adjustment might be necessary: https://github.com/canonical/opensearch-operator/blob/main/lib/charms/opensearch/v0/opensearch_base_charm.py#L271
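
To illustrate the kind of adjustment meant here: the peer data above still has `admin_user_initialized` set to `True`, and the traceback shows the hook making an HTTP call to a node that is not reachable yet. A minimal sketch of a guard for that combination follows. It is not the charm's actual code: `LeaderAction` and `decide_leader_elected_action` are hypothetical names, and the assumption that relation data values arrive as strings such as `"True"` comes only from the table shown above.

```python
from enum import Enum


class LeaderAction(Enum):
    """Hypothetical outcomes for a leader-elected handler."""
    DEFER = "defer"                            # retry the event later
    UPDATE_VIA_API = "update-via-api"          # node is up, call the REST API
    WRITE_LOCAL_CONFIG = "write-local-config"  # fresh deployment, no API needed


def decide_leader_elected_action(app_data: dict, node_reachable: bool) -> LeaderAction:
    """Pick an action from the peer app databag and node reachability.

    Assumption: "admin_user_initialized" == "True" means the internal users
    were created by an earlier leader, so updating them needs a running node.
    """
    users_exist = app_data.get("admin_user_initialized") == "True"
    if not users_exist:
        return LeaderAction.WRITE_LOCAL_CONFIG
    if not node_reachable:
        # After a scale down to zero the flag survives but no node is running.
        return LeaderAction.DEFER
    return LeaderAction.UPDATE_VIA_API
```

With the databag from this report and an unreachable node, the sketch returns `LeaderAction.DEFER` instead of letting the HTTP error fail the hook.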

@reneradoi reneradoi added the bug Something isn't working label May 23, 2024
@reneradoi reneradoi self-assigned this May 23, 2024

reneradoi added a commit that referenced this issue Jun 11, 2024
## Issue
When attaching existing storage to a new unit, two issues occur:

- The snap install fails because of the permissions / ownership of directories
- snap_common gets completely deleted

## Solution
- bump the snap version to the fixed one (the fixed revision is 47; it
is already outdated, as a newer version of the snap is available and
was merged to main prior to this PR)
- enhance test coverage for integration tests

## Integration Testing
Tests for attaching existing storage can be found in
integration/ha/test_storage.py. There are now three test cases:
1. test_storage_reuse_after_scale_down: remove one unit from the
deployment, then add a new one reusing the storage from the removed
unit. Check that the continuous writes are OK and that a test file
created initially is still there.
2. test_storage_reuse_after_scale_to_zero: remove both units from the
deployment, keep the application, then add two new units reusing the
storage. Check the continuous writes.
3. test_storage_reuse_in_new_cluster_after_app_removal: from a cluster
of three units, remove all units and then the application. Deploy a
new application (with one unit) to the same model, attach the storage,
then add two more units with the other storage volumes. Check the
continuous writes.

## Other Issues
- As part of this PR, another issue is addressed: #306. It is resolved with commit 19f843c.
- Furthermore, this PR works around problems with acquiring the OpenSearch lock, especially when the shards of the locking index within OpenSearch are not assigned to a new primary after the former primary is removed. This was also reported in #243 and will be further investigated in #327.
@reneradoi

Resolved with #272
