
CosmosFullNode: Detect Crashloops and restore replicas #205

Closed
DavidNix opened this issue Feb 16, 2023 · 3 comments
@DavidNix
Contributor

Occasionally (some chains are worse than others), the data becomes corrupted and a replica continually crashes on start.

The typical workaround is for a human operator to delete the pod and PVC and restore from a VolumeSnapshot. This feature would automate destroying the pod/PVC and, paired with other features like autoDataSource, should allow the node to recover on its own.
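A minimal sketch of what the detection side could look like. The `ContainerStatus` struct, `restartThreshold` constant, and `needsRecovery` function are all hypothetical names for illustration; the real operator would read `corev1.ContainerStatus` via client-go and would likely make the threshold configurable on the CosmosFullNode spec.

```go
package main

import "fmt"

// ContainerStatus captures the few fields this heuristic needs. In the real
// operator these would come from corev1.ContainerStatus.
type ContainerStatus struct {
	Name          string
	RestartCount  int32
	WaitingReason string // e.g. "CrashLoopBackOff"; empty if not waiting
}

// restartThreshold is an assumed cutoff, not a value from the operator.
const restartThreshold = 5

// needsRecovery reports whether a pod looks stuck in a crashloop: some
// container is waiting in CrashLoopBackOff and has restarted at least
// restartThreshold times. A controller could use this signal to delete the
// pod and PVC and restore from the most recent VolumeSnapshot.
func needsRecovery(statuses []ContainerStatus) bool {
	for _, s := range statuses {
		if s.WaitingReason == "CrashLoopBackOff" && s.RestartCount >= restartThreshold {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors the pod status pasted below: the node container has restarted
	// 100 times and is waiting in CrashLoopBackOff.
	statuses := []ContainerStatus{
		{Name: "healthcheck", RestartCount: 0, WaitingReason: ""},
		{Name: "node", RestartCount: 100, WaitingReason: "CrashLoopBackOff"},
	}
	fmt.Println(needsRecovery(statuses)) // prints "true"
}
```

Note the restart count alone is not enough: a pod that restarted many times in the past but is now running fine should not trigger recovery, which is why the heuristic also requires the container to currently be waiting in CrashLoopBackOff.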

@DavidNix
Contributor Author

An example of a pod stuck in a CrashLoopBackOff state:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    app.kubernetes.io/ordinal: "0"
  creationTimestamp: "2023-02-16T01:07:08Z"
  labels:
    app.kubernetes.io/component: CosmosFullNode
    app.kubernetes.io/created-by: cosmos-operator
    app.kubernetes.io/instance: agoric-mainnet-fullnode-0
    app.kubernetes.io/name: agoric-mainnet-fullnode
    app.kubernetes.io/revision: cdbe4958
    app.kubernetes.io/version: "30"
    cosmos.strange.love/network: mainnet
  name: agoric-mainnet-fullnode-0
  namespace: strangelove
  ownerReferences:
  - apiVersion: cosmos.strange.love/v1
    blockOwnerDeletion: true
    controller: true
    kind: CosmosFullNode
    name: agoric-mainnet-fullnode
    uid: e241b539-32f9-4b18-987a-8c2961ca89ff
  resourceVersion: "39963658"
  uid: c802446c-34c2-4549-a6ff-ce960f9d3abc
spec:
  containers:
  - args:
    - start
    - --home
    - /home/operator/cosmos
    - --x-crisis-skip-assert-invariants
    command:
    - agd
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/heighliner/agoric:30
    imagePullPolicy: IfNotPresent
    name: node
    ports:
    - containerPort: 1317
      name: api
      protocol: TCP
    - containerPort: 8080
      name: rosetta
      protocol: TCP
    - containerPort: 9090
      name: grpc
      protocol: TCP
    - containerPort: 26660
      name: prometheus
      protocol: TCP
    - containerPort: 26656
      name: p2p
      protocol: TCP
    - containerPort: 26657
      name: rpc
      protocol: TCP
    - containerPort: 9091
      name: grpc-web
      protocol: TCP
    readinessProbe:
      failureThreshold: 5
      httpGet:
        path: /health
        port: 26657
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: "1"
        memory: 16Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-vzffv
      readOnly: true
    workingDir: /home/operator
  - command:
    - /manager
    - healthcheck
    image: ghcr.io/strangelove-ventures/cosmos-operator:v0.7.0
    imagePullPolicy: IfNotPresent
    name: healthcheck
    ports:
    - containerPort: 1251
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 1251
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: 5m
        memory: 16Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-vzffv
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - -c
    - "\nset -eu\nif [ ! -d \"$CHAIN_HOME/data\" ]; then\n\techo \"Initializing chain...\"\n\tagd
      init agoric-mainnet-fullnode-0 --chain-id agoric-3 --home \"$CHAIN_HOME\"\n\t#
      Remove because downstream containers check the presence of this file.\n\trm
      \"$GENESIS_FILE\"\nelse\n\techo \"Skipping chain init; already initialized.\"\nfi\n\necho
      \"Initializing into tmp dir for downstream processing...\"\nagd init agoric-mainnet-fullnode-0
      --chain-id agoric-3 --home \"$HOME/.tmp\"\n"
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/heighliner/agoric:30
    imagePullPolicy: IfNotPresent
    name: chain-init
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-vzffv
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - "if [ -f \"$GENESIS_FILE\" ]; then\n\techo \"Genesis file $GENESIS_FILE already
      exists; skipping initialization.\"\n\texit 0\nfi\n\nset -eu\n\n# $GENESIS_FILE
      and $CONFIG_DIR already set via pod env vars.\n\nGENESIS_URL=\"$1\"\n\necho
      \"Downloading genesis file $GENESIS_URL to $GENESIS_FILE...\"\n\ndownload_json()
      {\n  echo \"Downloading plain json...\"\n  wget -c -O \"$GENESIS_FILE\" \"$GENESIS_URL\"\n}\n\ndownload_jsongz()
      {\n  echo \"Downloading json.gz...\"\n  wget -c -O - \"$GENESIS_URL\" | gunzip
      -c >\"$GENESIS_FILE\"\n}\n\ndownload_tar() {\n  echo \"Downloading and extracting
      tar...\"\n  wget -c -O - \"$GENESIS_URL\" | tar -x -C \"$CONFIG_DIR\"\n}\n\ndownload_targz()
      {\n  echo \"Downloading and extracting compressed tar...\"\n  wget -c -O - \"$GENESIS_URL\"
      | tar -xz -C \"$CONFIG_DIR\"\n}\n\ndownload_zip() {\n  echo \"Downloading and
      extracting zip...\"\n  wget -c -O tmp_genesis.zip \"$GENESIS_URL\"\n  unzip
      tmp_genesis.zip\n  rm tmp_genesis.zip\n  mv genesis.json \"$GENESIS_FILE\"\n}\n\nrm
      -f \"$GENESIS_FILE\"\n\ncase \"$GENESIS_URL\" in\n*.json.gz) download_jsongz
      ;;\n*.json) download_json ;;\n*.tar.gz) download_targz ;;\n*.tar.gzip) download_targz
      ;;\n*.tar) download_tar ;;\n*.zip) download_zip ;;\n*)\n  echo \"Unable to handle
      file extension for $GENESIS_URL\"\n  exit 1\n  ;;\nesac\n\necho \"Saved genesis
      file to $GENESIS_FILE.\"\necho \"Download genesis file complete.\"\n\n\necho
      \"Genesis $GENESIS_FILE initialized.\"\n"
    - -s
    - https://main.agoric.net/genesis.json
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: genesis-init
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-vzffv
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - |2

      set -eu
      CONFIG_DIR="$CHAIN_HOME/config"
      TMP_DIR="$HOME/.tmp/config"
      OVERLAY_DIR="$HOME/.config"
      echo "Merging config..."
      set -x
      config-merge -f toml "$TMP_DIR/config.toml" "$OVERLAY_DIR/config-overlay.toml" > "$CONFIG_DIR/config.toml"
      config-merge -f toml "$TMP_DIR/app.toml" "$OVERLAY_DIR/app-overlay.toml" > "$CONFIG_DIR/app.toml"
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: config-merge
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-vzffv
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - "set -eu\nif test -n \"$(find $DATA_DIR -maxdepth 1 -name '*.db' -print -quit)\";
      then\n\techo \"Databases in $DATA_DIR already exists; skipping initialization.\"\n\texit
      0\nfi\n\nset -eu\n\n# $CHAIN_HOME already set via pod env vars.\n\nSNAPSHOT_URL=\"$1\"\n\necho
      \"Downloading snapshot archive $SNAPSHOT_URL to $CHAIN_HOME...\"\n\ndownload_tar()
      {\n  echo \"Downloading and extracting tar...\"\n  wget -c -O - \"$SNAPSHOT_URL\"
      | tar -x -C \"$CHAIN_HOME\"\n}\n\ndownload_targz() {\n  echo \"Downloading and
      extracting compressed tar...\"\n  wget -c -O - \"$SNAPSHOT_URL\" | tar -xz -C
      \"$CHAIN_HOME\"\n}\n\ndownload_lz4() {\n  echo \"Downloading and extracting
      lz4...\"\n  wget -c -O - \"$SNAPSHOT_URL\" | lz4 -c -d | tar -x -C \"$CHAIN_HOME\"\n}\n\ncase
      \"$SNAPSHOT_URL\" in\n*.tar.lz4) download_lz4 ;;\n*.tar.gzip) download_targz
      ;;\n*.tar.gz) download_targz ;;\n*.tar) download_tar ;;\n*)\n  echo \"Unable
      to handle file extension for $SNAPSHOT_URL\"\n  exit 1\n  ;;\nesac\n\necho \"Download
      and extract snapshot complete.\"\n\n\necho \"$DATA_DIR initialized.\"\n"
    - -s
    - https://storage.googleapis.com/strangelove-agoric-mainnet-snapshots/latest/snapshot-20230109.tar.lz4
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: snapshot-restore
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-vzffv
      readOnly: true
    workingDir: /home/operator
  nodeName: gke-agoric-mainnet-fu-chain-node-pool-2a979ff8-3wc9
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  readinessGates:
  - conditionType: cloud.google.com/load-balancer-neg-ready
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1025
    fsGroupChangePolicy: OnRootMismatch
    runAsGroup: 1025
    runAsNonRoot: true
    runAsUser: 1025
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: vol-chain-home
    persistentVolumeClaim:
      claimName: pvc-agoric-mainnet-fullnode-0
  - emptyDir: {}
    name: vol-tmp
  - configMap:
      defaultMode: 420
      items:
      - key: config-overlay.toml
        path: config-overlay.toml
      - key: app-overlay.toml
        path: app-overlay.toml
      name: agoric-mainnet-fullnode-0
    name: vol-config
  - name: kube-api-access-vzffv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'Pod is in NEG "Key{\"k8s1-61594a5f-strangelov-agoric-mainnet-fullnode--2665-d22c41df\",
      zone: \"us-central1-a\"}". NEG is not attached to any BackendService with health
      checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.'
    reason: LoadBalancerNegWithoutHealthCheck
    status: "True"
    type: cloud.google.com/load-balancer-neg-ready
  - lastProbeTime: null
    lastTransitionTime: "2023-02-16T01:07:14Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-02-16T01:07:08Z"
    message: 'containers with unready status: [node healthcheck]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-02-16T01:07:08Z"
    message: 'containers with unready status: [node healthcheck]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-02-16T01:07:08Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://1a3d012f8bd700422ca83ef964adca1d17a6d75e2470bff717e89d5c48b99548
    image: ghcr.io/strangelove-ventures/cosmos-operator:v0.7.0
    imageID: ghcr.io/strangelove-ventures/cosmos-operator@sha256:638a7a2bba0c48673ef3efc2e694ad6eb48023b4e47871c1bd0bd72baead42a2
    lastState: {}
    name: healthcheck
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-02-16T01:07:14Z"
  - containerID: containerd://8f7a479044805f2deb40802f4c512fd20761b293b8f940a773866d8b2b98c9e3
    image: ghcr.io/strangelove-ventures/heighliner/agoric:30
    imageID: ghcr.io/strangelove-ventures/heighliner/agoric@sha256:70a01de6999da60bb8346dbd825e7e4934e862afad65997cfb665c38b4686c08
    lastState:
      terminated:
        containerID: containerd://8f7a479044805f2deb40802f4c512fd20761b293b8f940a773866d8b2b98c9e3
        exitCode: 2
        finishedAt: "2023-02-16T14:47:51Z"
        reason: Error
        startedAt: "2023-02-16T14:44:04Z"
    name: node
    ready: false
    restartCount: 100
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=node pod=agoric-mainnet-fullnode-0_strangelove(c802446c-34c2-4549-a6ff-ce960f9d3abc)
        reason: CrashLoopBackOff
  hostIP: 192.168.5.9
  initContainerStatuses:
  - containerID: containerd://c042226f1ec9ffd7b0d33a4a06d2ff09c409f980fc8f26eb841db61022f94779
    image: ghcr.io/strangelove-ventures/heighliner/agoric:30
    imageID: ghcr.io/strangelove-ventures/heighliner/agoric@sha256:70a01de6999da60bb8346dbd825e7e4934e862afad65997cfb665c38b4686c08
    lastState: {}
    name: chain-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://c042226f1ec9ffd7b0d33a4a06d2ff09c409f980fc8f26eb841db61022f94779
        exitCode: 0
        finishedAt: "2023-02-16T01:07:10Z"
        reason: Completed
        startedAt: "2023-02-16T01:07:10Z"
  - containerID: containerd://21145cf8f6458b2fa04fbf4cfc92243ffc6bb05cc60e555c4547f0b1baf67921
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: genesis-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://21145cf8f6458b2fa04fbf4cfc92243ffc6bb05cc60e555c4547f0b1baf67921
        exitCode: 0
        finishedAt: "2023-02-16T01:07:11Z"
        reason: Completed
        startedAt: "2023-02-16T01:07:11Z"
  - containerID: containerd://2d5797ae38ed114992735046a91831991f0901c279bb5dfa118742821dcc7c71
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: config-merge
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://2d5797ae38ed114992735046a91831991f0901c279bb5dfa118742821dcc7c71
        exitCode: 0
        finishedAt: "2023-02-16T01:07:12Z"
        reason: Completed
        startedAt: "2023-02-16T01:07:12Z"
  - containerID: containerd://af84478d45664661db16aebc6f9ed2b4a4e51347d0ae4b469fab4a2e76ca2570
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: snapshot-restore
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://af84478d45664661db16aebc6f9ed2b4a4e51347d0ae4b469fab4a2e76ca2570
        exitCode: 0
        finishedAt: "2023-02-16T01:07:13Z"
        reason: Completed
        startedAt: "2023-02-16T01:07:13Z"
  phase: Running
  podIP: 10.7.3.32
  podIPs:
  - ip: 10.7.3.32
  qosClass: Burstable
  startTime: "2023-02-16T01:07:08Z"

@DavidNix
Contributor Author

I'm backlogging and de-prioritizing this one. The feature feels too risky. Ideally, we'd first have a reliable way to detect data corruption.

Also, this feature was meant to solve for only one problem chain; the rest of the chains do not randomly crash on start. The problem chain has an upstream fix which will eventually make it into a release.

With the autoDataSource feature, the fix is much easier: simply delete the PVC, and the operator recreates it from a recent VolumeSnapshot. It still requires human intervention, but only takes a minute or two. Also, our redundancy gives us grace.
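The manual recovery described above might look roughly like this, assuming autoDataSource is configured on the CosmosFullNode and a recent VolumeSnapshot exists. The pod/PVC names are taken from the pod dump in this issue; adjust for the affected replica.

```shell
# Delete the crashlooping pod and its corrupted PVC.
kubectl -n strangelove delete pod agoric-mainnet-fullnode-0
kubectl -n strangelove delete pvc pvc-agoric-mainnet-fullnode-0

# The operator then recreates the PVC (with a recent VolumeSnapshot as its
# dataSource) and reschedules the pod. Inspect the new PVC's dataSource:
kubectl -n strangelove get pvc pvc-agoric-mainnet-fullnode-0 \
  -o jsonpath='{.spec.dataSource}'
```

This is a sketch of the workflow, not an exact runbook; whether the pod must be deleted explicitly depends on how the controller reconciles a missing PVC.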

@DavidNix DavidNix removed their assignment Feb 17, 2023
@DavidNix
Contributor Author

Closing per the comment above. There needs to be a reliable way to detect "data corruption". Feeding the logs to an ML classifier could be one approach.
