
ais-operator: cannot shutdown cluster #200

Open

eahydra opened this issue Jan 6, 2025 · 4 comments

eahydra commented Jan 6, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Hi, I originally wanted to report this in the ais-operator project, but the ais-operator project does not have issues enabled.
When the AIStoreSpec.ShutdownCluster field is set to true, the pods exit but are then restarted, so the AIStore stays stuck in the shutting-down state.

Expected Behavior

The operator should continue with the shutdown: scale the replicas to 0 and move the CR to the shutdown state.

Current Behavior

The AIStore CR remains stuck in the shutting-down state.

Steps To Reproduce

  1. Create an AIStore CR object
  2. When the AIStore is ready, set AIStoreSpec.ShutdownCluster to true (a sketch of this step follows below).
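
One way to toggle that field programmatically is sketched below. This is a minimal illustration, not part of the report: it assumes the AIStore CRD is served as group "ais.nvidia.com", version "v1beta1", resource "aistores" (assumptions to verify against the installed CRD), and it uses the CR name "test-aistore" and namespace "ais" taken from the logs further down.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig and build a dynamic client.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// GroupVersionResource for the AIStore CR (assumed values; check the installed CRD).
	gvr := schema.GroupVersionResource{Group: "ais.nvidia.com", Version: "v1beta1", Resource: "aistores"}

	// Merge-patch spec.shutdownCluster to true on the existing CR.
	patch := []byte(`{"spec":{"shutdownCluster":true}}`)
	obj, err := dyn.Resource(gvr).Namespace("ais").Patch(
		context.TODO(), "test-aistore", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched:", obj.GetName())
}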

Possible Solution

Could the operator just shut down the service but not exit the process, so that the AIS daemons stay in the shutting-down state?

Additional Information/Context

No response

AIStore build/version

main

Environment details (OS name and version, etc.)

Linux

eahydra added the bug label Jan 6, 2025
eahydra (Author) commented Jan 6, 2025

Here are some logs for reference

W 16:54:25.207456 config:1406 control and data share the same intra-cluster network: test-aistore-proxy-0.test-aistore-proxy.ais.svc.cluster.local
I 16:54:25.208251 metasync:444 p[am0nyh4yp9et0]: Conf v3, aism[set-config]
I 16:54:25.746829 kalive:655 Sending "suspend" on the control channel
I 16:54:27.157358 prxclu:737 p[am0nyh4yp9et0] node t[BQgiNuza] is already _in_ - nothing to do
W 16:54:27.747187 proxy:3432 Stopping p[am0nyh4yp9et0](primary): shutdown
I 16:54:27.747200 htrun:583 Shutting down HTTP
I 16:54:27.801128 prxclu:737 p[am0nyh4yp9et0] node p[mdta82v4o2zkc] is already _in_ - nothing to do
I 16:54:27.803244 prxclu:737 p[am0nyh4yp9et0] node t[DbEdbwdY] is already _in_ - nothing to do
I 16:54:28.247788 common_prom:335 Stopping proxystats, err: <nil>
I 16:54:28.247796 kalive:649 Stopping palive, err: <nil>
I 16:54:28.247799 metasync:232 Stopping metasyncer
I 16:54:28.247902 daemon:342 Terminated OK
tail: /var/log/ais/aisproxy.INFO has been replaced; following end of new file
Started up at 2025/01/06 16:54:29, host test-aistore, go1.23.4 for linux/amd64
W 16:54:29.113175 config:1406 control and data share the same intra-cluster network: test-aistore-proxy-0.test-aistore-proxy.ais.svc.cluster.local
I 16:54:29.113320 config:2000 log.dir: "/var/log/ais"; l4.proto: tcp; pub port: 3080; verbosity: 3
I 16:54:29.113326 config:2002 config: "/etc/ais/.ais.conf"; stats_time: 10s; authentication: false; backends: [gcp]
I 16:54:29.113341 daemon:315 Version 3.25.a7ac713, build 2024-12-09T21:29:31+0000, CPUs(30, runtime=30), containerized
I 16:54:29.113606 k8s:59 Checking pod: "test-aistore-proxy-0", node: "fargate-ip-10-0-83-128.ec2.internal"

aaronnw (Collaborator) commented Jan 6, 2025

the ais-operator project does not have issues enabled

This is intentional for now; it helps us keep everything in one place for reference.

Thanks for opening the issue! I'll try to replicate and report back.

aaronnw (Collaborator) commented Jan 6, 2025

I am able to replicate the issue.

The operator first attempts to shut down the AIS cluster gracefully before scaling down the StatefulSet, and it requires the cluster to stop responding to requests before it starts that scaling. But if Kubernetes restarts the unresponsive pods, they resume responding. AIS itself has no concept of a "shutdown" state, so when this happens the operator gets stuck waiting for the cluster to stop answering. The operator does not repeat the shutdown call (and if it did, we'd likely see the same thing), so it just keeps waiting on something that will never happen.
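
To make the failure mode concrete, here is a minimal, self-contained sketch of the wait described above (not the operator's actual code). It polls the proxy until the cluster stops answering; if Kubernetes keeps restarting the pods, the endpoint keeps coming back and the loop never exits. The service name and port come from the logs in this issue, while the /v1/health path is an assumption for illustration.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForClusterDown polls the proxy until it stops answering, mirroring the
// operator's wait before it scales the StatefulSet to zero. If Kubernetes
// restarts the pods, the endpoint comes back up and this loop never returns:
// the stuck state described above.
func waitForClusterDown(proxyURL string, interval time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		resp, err := client.Get(proxyURL + "/v1/health") // health path is an assumption
		if err != nil {
			return // cluster finally unreachable: safe to scale down
		}
		resp.Body.Close()
		fmt.Println("cluster still responding; retrying in", interval)
		time.Sleep(interval)
	}
}

func main() {
	// Service name and port taken from the logs above.
	waitForClusterDown("http://test-aistore-proxy.ais.svc.cluster.local:3080", 10*time.Second)
}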

To resolve this we need to:

  1. Move the targets to a more stable state, likely through individual node shutdowns, and check more intelligently whether they are ready to be scaled down.
  2. Handle the case where AIS is stuck in the ShuttingDown state but shutdownCluster is set back to false; this should resume regular reconciliation (see the sketch after this list).
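
A rough sketch of the decision logic those two fixes imply is below. The state names and the function are hypothetical, purely to illustrate the plan; they are not the operator's real types or code.

package main

import "fmt"

// decideAction is an illustrative decision table for the two fixes listed
// above. "ShuttingDown" and the returned action names are made up for this
// example; only the shutdownCluster field comes from the actual CRD spec.
func decideAction(statusState string, specShutdownCluster bool) string {
	switch {
	case statusState == "ShuttingDown" && !specShutdownCluster:
		// Fix 2: the shutdown was cancelled; return to regular reconciliation.
		return "resume-normal-reconcile"
	case statusState == "ShuttingDown" && specShutdownCluster:
		// Fix 1: shut nodes down individually, verify, then scale down.
		return "shutdown-nodes-then-scale-down"
	default:
		return "reconcile-normally"
	}
}

func main() {
	fmt.Println(decideAction("ShuttingDown", false)) // resume-normal-reconcile
	fmt.Println(decideAction("ShuttingDown", true))  // shutdown-nodes-then-scale-down
}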

I'll test out these changes and hopefully we can get a fix out in a release later this week or next.

eahydra (Author) commented Jan 7, 2025

Awesome! Thanks!
