Update opensearch to 2.9.0 #299

Merged: sjpb merged 7 commits into ci/SMS-fatimage from update/opensearch-2.9.0 on Aug 10, 2023

sjpb (Collaborator) commented on Aug 9, 2023

Updates opensearch to v2.9.0, required as opensearch 2.4.0 fails* on podman v4.4.1.

Also:

  • Pulls container before starting systemd service to eliminate unit startup timeouts on slow networks

  • Refactors the role into separate install and runtime task files to allow later speed optimisation.

  • Changes filebeat configuration to derive opensearch document IDs from the Slurm job id; this prevents duplicate records after an image-based upgrade where filebeat ingests the same records from slurm/sacct again. Note that when upgrading a cluster, opensearch data from before this PR (with unsafe document IDs) will be archived to /var/lib/state/opensearch/data-$TIMESTAMP. Filebeat will then reingest all jobs within the last year from slurm/sacct.

  • Reviewed relevant changelogs for any changes of significance

  • Checked that this works when performing image-based upgrades
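
The deduplication idea behind the document ID change can be sketched in Python. This is only an illustration, not the filebeat configuration from this PR, which is not shown in this thread: it assumes the ID is a SHA-256 digest over a few record fields, and the field names `JobID` and `Submit` are hypothetical. The 64-character hexadecimal IDs shown later in this thread are consistent with such a digest, but the thread does not say how they are produced.

```python
import hashlib
import json


def document_id(record, fields):
    """Return a SHA-256 hex digest of the chosen fields, usable as a stable document ID."""
    # Serialise deterministically: fixed field order and explicit separators.
    payload = json.dumps(
        [[name, record.get(name)] for name in fields], separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical sacct-style record; the field names are illustrative only.
FIELDS = ("JobID", "Submit")
job = {"JobID": "42", "Submit": "2023-08-10T15:45:00", "State": "COMPLETED"}

index = {}  # stands in for an index keyed by document ID
for _ in range(2):  # ingest the same record twice, as after a reimage
    index[document_id(job, FIELDS)] = job

print(len(index))  # 1: the second ingest overwrites rather than duplicates
```

Because the ID depends only on the record content, ingesting the same job again writes to the same document instead of creating a new one.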

* Container startup fails with:

Duplicate cpuset controllers detected.
...
Error: Could not find or load main class 

The actual problem is that /sys/fs/cgroup gets mounted twice inside the container with podman v4.4.1, which opensearch 2.4.0 cannot tolerate.

sjpb (Collaborator, Author) commented on Aug 9, 2023

Cancelled CI, need image build first.

Image build running in https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/5810024081/job/15750069870

edit: building image openhpc-230809-1401-2aa07061

sjpb (Collaborator, Author) commented on Aug 9, 2023

Image build running in https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/5811371127/job/15754445881

Built image openhpc-230809-1602-2250239e

@sjpb sjpb marked this pull request as ready for review August 10, 2023 10:53
@sjpb sjpb requested a review from a team as a code owner August 10, 2023 10:53
m-bull (Collaborator) left a comment

LGTM

sjpb (Collaborator, Author) commented on Aug 10, 2023

I've checked that upgrading a cluster from current main (e6645fd) to 2937725 works ok, in that:

  • pre-upgrade opensearch state gets archived:
[root@main-control rocky]# ls /var/lib/state/opensearch/
config  data  data-2023-08-10T15:45+00:00.tar.gz
[root@main-control rocky]# ls /var/lib/state/opensearch/data/
batch_metrics_enabled.conf  logging_enabled.conf  nodes  performance_analyzer_enabled.conf  rca_enabled.conf  slurm_jobid_index  thread_contention_monitoring_enabled.conf
  • slurm jobs dashboard shows no duplicate jobs

I also then reimaged the cluster again (at 2937725) to check the case where the slurm_jobid_index flag file does exist, reran site.yml, and checked that the opensearch document IDs did not change and monitoring was not duplicated.
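
The flag-file behaviour checked here could look roughly like the Python sketch below. It is not the role's actual Ansible implementation, which is not shown in this thread. It assumes that the presence of `slurm_jobid_index` means the data already uses the new document IDs, and that the old data directory is emptied after archiving; the function name `archive_if_unflagged` is hypothetical. The demonstration runs in a temporary directory rather than under /var/lib/state/opensearch.

```python
import shutil
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path

FLAG_NAME = "slurm_jobid_index"  # flag file name as seen in the listing above


def archive_if_unflagged(state_dir):
    """Archive state_dir/data unless the flag file is already present.

    Returns the archive path, or None if nothing was archived.
    """
    data = state_dir / "data"
    flag = data / FLAG_NAME
    if flag.exists():
        return None  # assumed: data already uses the new document IDs

    archive = None
    if data.is_dir() and any(data.iterdir()):
        # Timestamp format mirrors data-2023-08-10T15:45+00:00.tar.gz
        stamp = datetime.now(timezone.utc).isoformat(timespec="minutes")
        archive = state_dir / f"data-{stamp}.tar.gz"
        with tarfile.open(str(archive), "w:gz") as tar:
            tar.add(str(data), arcname="data")
        shutil.rmtree(data)  # assumed: start again from an empty data directory

    data.mkdir(parents=True, exist_ok=True)
    flag.touch()  # later runs see the flag and skip archiving
    return archive


# Demonstration in a temporary directory, not the real state path.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp)
    (state / "data" / "nodes").mkdir(parents=True)
    (state / "data" / "nodes" / "old.bin").write_text("pre-upgrade data")

    first = archive_if_unflagged(state)   # archives the data and sets the flag
    second = archive_if_unflagged(state)  # flag present, so nothing happens
    first_name = first.name
    first_existed = first.exists()
    flag_present = (state / "data" / FLAG_NAME).exists()
    old_data_gone = not (state / "data" / "nodes").exists()
```

The second call returning `None` corresponds to the reimage case above, where the flag file exists and the existing data is left in place.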

Note that document IDs are not slurm job ids (but are stable):

[root@main-control rocky]# curl -ks -u admin:${vault_elasticsearch_admin_password} https://localhost:9200/filebeat-7.12.1-2023.08.10/_search?pretty | grep id
        "_id" : "7add60a6e14c4a7c931b298885049ce202050131faeb42a1cdffdd8cbda18e15",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784",
        "_id" : "705983cec81172db226a753f22a1d2adf3667021c8acaf9e3441c47613652955",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784",
        "_id" : "07ca87294ea583986bf129b4ad84e2ed2539c8e7d1eabe6738bfe90d90dfe01d",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784"
        "_id" : "e252989764ecf0ebb95af485cd8741dccaf9fdd74d46020351a3ffe1cb05dafb",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784"
        "_id" : "c57f31eee7910a1c04dbbf0e4a2e96dffd46b48dc157d5a8d91bad4287e7a070",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784"

See comment in environments/common/files/filebeat/filebeat.yml for why they're not actual job IDs.
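
A check like "the document IDs did not change" can be automated by comparing the `_id` values from two `_search` responses, one taken before and one after the reimage. The sketch below is an assumption about how one might do this, not what was run here. The helper names are hypothetical, the sample IDs are copied from the output above, and note that `_search` returns only a limited number of hits by default, so a real comparison would need to page through the whole index.

```python
def hit_ids(search_response):
    """Collect the _id of every hit in a parsed _search response body."""
    return {hit["_id"] for hit in search_response["hits"]["hits"]}


def ids_unchanged(before, after):
    """True if both responses contain exactly the same set of document IDs."""
    return hit_ids(before) == hit_ids(after)


# Minimal stand-ins for parsed JSON responses, using IDs from the output above.
before = {
    "hits": {
        "hits": [
            {"_id": "7add60a6e14c4a7c931b298885049ce202050131faeb42a1cdffdd8cbda18e15"},
            {"_id": "705983cec81172db226a753f22a1d2adf3667021c8acaf9e3441c47613652955"},
        ]
    }
}
# The same documents returned in a different order after the reimage.
after = {"hits": {"hits": list(reversed(before["hits"]["hits"]))}}

print(ids_unchanged(before, after))  # True
```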

m-bull (Collaborator) left a comment

LGTM

@sjpb sjpb merged commit 7f0b3c0 into ci/SMS-fatimage Aug 10, 2023
1 check passed
@sjpb sjpb deleted the update/opensearch-2.9.0 branch August 10, 2023 16:30
@sjpb sjpb mentioned this pull request Aug 11, 2023
@sjpb sjpb restored the update/opensearch-2.9.0 branch August 11, 2023 13:18
@sjpb sjpb mentioned this pull request Aug 11, 2023