
ais-operator: cannot shutdown cluster #200

Open

eahydra opened this issue Jan 6, 2025 · 4 comments

eahydra commented Jan 6, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Hi, I originally wanted to report this in the ais-operator project, but the ais-operator project does not have issues enabled.
When the AIStoreSpec.ShutdownCluster field is set to true, the pods exit but are then restarted, so the AIStore stays stuck in the shutting-down state.

Expected Behavior

The operator should continue with the shutdown: scale the replicas to 0 and move the CR to the shutdown state.

Current Behavior

The AIStore CR remains stuck in the shutting-down state.

Steps To Reproduce

  1. Create an AIStore CR object
  2. When the AIStore is ready, set AIStoreSpec.ShutdownCluster to true (a sketch of this step follows below).
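
One way to toggle that field programmatically is sketched below. This is a minimal illustration, not part of the report: it assumes the AIStore CRD is served as group "ais.nvidia.com", version "v1beta1", resource "aistores" (assumptions to verify against the installed CRD), and it uses the CR name "test-aistore" and namespace "ais" taken from the logs further down.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig and build a dynamic client.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// GroupVersionResource for the AIStore CR (assumed values; check the installed CRD).
	gvr := schema.GroupVersionResource{Group: "ais.nvidia.com", Version: "v1beta1", Resource: "aistores"}

	// Merge-patch spec.shutdownCluster to true on the existing CR.
	patch := []byte(`{"spec":{"shutdownCluster":true}}`)
	obj, err := dyn.Resource(gvr).Namespace("ais").Patch(
		context.TODO(), "test-aistore", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched:", obj.GetName())
}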

Possible Solution

Could the operator just shut down the service but not exit the process, so that the AIS daemons stay in the shutting-down state?

Additional Information/Context

No response

AIStore build/version

main

Environment details (OS name and version, etc.)

Linux

eahydra added the bug label Jan 6, 2025
eahydra (Author) commented Jan 6, 2025

Here are some logs for reference

W 16:54:25.207456 config:1406 control and data share the same intra-cluster network: test-aistore-proxy-0.test-aistore-proxy.ais.svc.cluster.local
I 16:54:25.208251 metasync:444 p[am0nyh4yp9et0]: Conf v3, aism[set-config]
I 16:54:25.746829 kalive:655 Sending "suspend" on the control channel
I 16:54:27.157358 prxclu:737 p[am0nyh4yp9et0] node t[BQgiNuza] is already _in_ - nothing to do
W 16:54:27.747187 proxy:3432 Stopping p[am0nyh4yp9et0](primary): shutdown
I 16:54:27.747200 htrun:583 Shutting down HTTP
I 16:54:27.801128 prxclu:737 p[am0nyh4yp9et0] node p[mdta82v4o2zkc] is already _in_ - nothing to do
I 16:54:27.803244 prxclu:737 p[am0nyh4yp9et0] node t[DbEdbwdY] is already _in_ - nothing to do
I 16:54:28.247788 common_prom:335 Stopping proxystats, err: <nil>
I 16:54:28.247796 kalive:649 Stopping palive, err: <nil>
I 16:54:28.247799 metasync:232 Stopping metasyncer
I 16:54:28.247902 daemon:342 Terminated OK
tail: /var/log/ais/aisproxy.INFO has been replaced; following end of new file
Started up at 2025/01/06 16:54:29, host test-aistore, go1.23.4 for linux/amd64
W 16:54:29.113175 config:1406 control and data share the same intra-cluster network: test-aistore-proxy-0.test-aistore-proxy.ais.svc.cluster.local
I 16:54:29.113320 config:2000 log.dir: "/var/log/ais"; l4.proto: tcp; pub port: 3080; verbosity: 3
I 16:54:29.113326 config:2002 config: "/etc/ais/.ais.conf"; stats_time: 10s; authentication: false; backends: [gcp]
I 16:54:29.113341 daemon:315 Version 3.25.a7ac713, build 2024-12-09T21:29:31+0000, CPUs(30, runtime=30), containerized
I 16:54:29.113606 k8s:59 Checking pod: "test-aistore-proxy-0", node: "fargate-ip-10-0-83-128.ec2.internal"

aaronnw (Collaborator) commented Jan 6, 2025

the ais-operator project does not have issues enabled

This is intentional for now; it helps us keep everything in one place for reference.

Thanks for opening the issue! I'll try to replicate and report back.

aaronnw (Collaborator) commented Jan 6, 2025

I am able to replicate the issue.

The operator first attempts to shut down the AIS cluster gracefully before scaling down the StatefulSet, and it requires the cluster to stop responding to requests before it starts that scaling. But if Kubernetes restarts the unresponsive pods, they resume responding. AIS itself has no concept of a "shutdown" state, so when this happens the operator gets stuck waiting for the cluster to stop answering. The operator does not repeat the shutdown call (and if it did, we'd likely see the same thing), so it just keeps waiting on something that will never happen.
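
To make the failure mode concrete, here is a minimal, self-contained sketch of the wait described above (not the operator's actual code). It polls the proxy until the cluster stops answering; if Kubernetes keeps restarting the pods, the endpoint keeps coming back and the loop never exits. The service name and port come from the logs in this issue, while the /v1/health path is an assumption for illustration.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForClusterDown polls the proxy until it stops answering, mirroring the
// operator's wait before it scales the StatefulSet to zero. If Kubernetes
// restarts the pods, the endpoint comes back up and this loop never returns:
// the stuck state described above.
func waitForClusterDown(proxyURL string, interval time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		resp, err := client.Get(proxyURL + "/v1/health") // health path is an assumption
		if err != nil {
			return // cluster finally unreachable: safe to scale down
		}
		resp.Body.Close()
		fmt.Println("cluster still responding; retrying in", interval)
		time.Sleep(interval)
	}
}

func main() {
	// Service name and port taken from the logs above.
	waitForClusterDown("http://test-aistore-proxy.ais.svc.cluster.local:3080", 10*time.Second)
}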

To resolve this we need to:

  1. Move the targets to a more stable state, likely through individual node shutdowns, and check more intelligently whether they are ready to be scaled down.
  2. Handle the case where AIS is stuck in the ShuttingDown state but shutdownCluster is set back to false; this should resume regular reconciliation (see the sketch after this list).
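
A rough sketch of the decision logic those two fixes imply is below. The state names and the function are hypothetical, purely to illustrate the plan; they are not the operator's real types or code.

package main

import "fmt"

// decideAction is an illustrative decision table for the two fixes listed
// above. "ShuttingDown" and the returned action names are made up for this
// example; only the shutdownCluster field comes from the actual CRD spec.
func decideAction(statusState string, specShutdownCluster bool) string {
	switch {
	case statusState == "ShuttingDown" && !specShutdownCluster:
		// Fix 2: the shutdown was cancelled; return to regular reconciliation.
		return "resume-normal-reconcile"
	case statusState == "ShuttingDown" && specShutdownCluster:
		// Fix 1: shut nodes down individually, verify, then scale down.
		return "shutdown-nodes-then-scale-down"
	default:
		return "reconcile-normally"
	}
}

func main() {
	fmt.Println(decideAction("ShuttingDown", false)) // resume-normal-reconcile
	fmt.Println(decideAction("ShuttingDown", true))  // shutdown-nodes-then-scale-down
}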

I'll test out these changes and hopefully we can get a fix out in a release later this week or next.

eahydra (Author) commented Jan 7, 2025

Awesome! Thanks!
