CosmosFullNode: Detect Crashloops and restore replicas #205
Comments
Examples of pods in a CrashLoop state
I'm backlogging and de-prioritizing this one; I feel the feature is too risky. The ideal would be a reliable way to detect data corruption. Also, this feature was meant to solve a problem with only one chain; the other chains do not randomly start crashing on start. The problem chain has a fix upstream that will eventually make it into a release. With the autoDataSource feature, the fix is much easier: simply delete the PVC and the operator recreates it from a recent VolumeSnapshot. It still requires human intervention, but takes only a minute or two. Our redundancy also gives us grace.
Closing because of the comment above. There needs to be a reliable way to detect "data corruption". Feeding the logs to an ML classifier could be one way.
Occasionally (some chains are worse than others), the data becomes corrupted and a replica continually crashes on start.
The typical workaround is for a human to delete the pod and PVC and restore from a VolumeSnapshot. This feature automates destroying the pod and PVC and, paired with other features like autoDataSource, should let the replica recover on its own.
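For illustration only, here is a minimal Python sketch of the detection half, not the operator's actual code. It assumes pods are available as plain dicts shaped like the Kubernetes Pod API object (`status.containerStatuses[].state.waiting.reason` and `restartCount`). The function names and the `min_restarts` threshold are hypothetical. Note that a crash loop alone does not prove data corruption, so this only flags candidates for restore.

```python
from typing import Any, List, Mapping

# Reason string Kubernetes reports for a container stuck restarting.
CRASH_LOOP_REASON = "CrashLoopBackOff"


def is_crash_looping(pod: Mapping[str, Any], min_restarts: int = 5) -> bool:
    """Return True if any container is waiting in CrashLoopBackOff
    and has restarted at least `min_restarts` times.

    `min_restarts` is an assumed threshold, not a value from the issue.
    """
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for cs in statuses:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        restarts = cs.get("restartCount", 0)
        if waiting.get("reason") == CRASH_LOOP_REASON and restarts >= min_restarts:
            return True
    return False


def replicas_to_restore(pods: List[Mapping[str, Any]], min_restarts: int = 5) -> List[str]:
    """Names of pods that are candidates for pod/PVC deletion and restore."""
    return [
        p["metadata"]["name"]
        for p in pods
        if is_crash_looping(p, min_restarts)
    ]
```

A controller would still have to delete the pod and PVC for each returned name and rely on something like autoDataSource to recreate the volume from a snapshot; that part is deliberately left out here.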