Setting ALWAYS_OFFLOAD_NODE_STATUS without setting nodeStatusOffLoad results in workflow errors #12563

Open · 3 of 4 tasks
abhijeetviswa opened this issue Jan 22, 2024 · 3 comments · May be fixed by #13874
Labels: area/controller (Controller issues, panics) · P3 (Low priority) · type/bug

@abhijeetviswa

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I was in the process of enabling node status offloading. As part of this, I set the ALWAYS_OFFLOAD_NODE_STATUS environment variable but missed setting the nodeStatusOffLoad property in the configmap to true.

This resulted in the workflows failing with the error: offload node status is not supported (cli) and Workflow operation error (ui).

While this is definitely a configuration issue, it would be nice to have the controller ignore the ALWAYS_OFFLOAD_NODE_STATUS flag when the nodeStatusOffLoad setting is false, so that users don't have to change both settings when disabling node status offloading. The two settings involved are sketched below.
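For reference, a sketch of the two settings (names per the Argo Workflows docs; the exact manifest layout depends on your install method):

# workflow-controller Deployment: env var on the controller container
env:
  - name: ALWAYS_OFFLOAD_NODE_STATUS
    value: "true"
---
# workflow-controller-configmap: persistence section (the part missed here);
# a postgresql or mysql connection must also be configured for offloading
persistence:
  nodeStatusOffLoad: true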

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

# This is an issue with any workflow, so pasting the default workflow here.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: fantastic-python
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
      - name: message
        value: hello argo
  entrypoint: argosay
  templates:
    - name: argosay
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      container:
        name: main
        image: 'argoproj/argosay:v2'
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

Type     Reason           Age   From                 Message
----     ------           ----  ----                 -------
Normal   WorkflowRunning  65s   workflow-controller  Workflow Running
Warning  WorkflowFailed   65s   workflow-controller  offload node status is not supported

Logs from the workflow's wait container

This was not done :(
@Joibel (Member) commented Jan 22, 2024

You have asked for something to happen and not configured it. The controller is invalidly configured.

I wouldn't support staying functional in this case. We could just refuse to start the controller instead, but that feels less helpful. The CLI error is being as useful as it can be. Blindly carrying on in the face of bad configuration might be sane for some kinds of applications, but I disagree with doing it for a controller like Argo Workflows.

I'd support an improvement to the UI error message in this case, but otherwise think the current behavior is correct.

@agilgur5 agilgur5 added area/controller Controller issues, panics P3 Low priority labels Jan 22, 2024
@agilgur5 (Contributor)

but we could just refuse to start the controller instead

Yea, I think this would be the proper route, or at the very least the Controller should log a critical error.
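A possible shape for such a startup check (a hypothetical sketch; the variable and field names are assumptions, not the actual controller code):

package main

import "log"

func main() {
	alwaysOffloadNodeStatus := true // from the ALWAYS_OFFLOAD_NODE_STATUS env var
	nodeStatusOffLoad := false      // from persistence.nodeStatusOffLoad in the configmap

	// Refuse to start on the contradictory configuration described in this issue.
	if alwaysOffloadNodeStatus && !nodeStatusOffLoad {
		log.Fatal("ALWAYS_OFFLOAD_NODE_STATUS is set but persistence.nodeStatusOffLoad is disabled; refusing to start")
	}
}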

Logs from the workflow controller

Notably, these logs are actually missing from the issue

and Workflow operation error (ui).

This is actually coming from the Controller, it's a message on the entry node

offload node status is not supported (cli)

This one is a bit more interesting, as the error is mentioned in the docs, but this scenario isn't listed there as a possible cause.

This error message comes from the DB code. I'm not sure exactly how the CLI chose to show this error specifically. Which command in the CLI did you use to get that? argo list?

@abhijeetviswa (Author)

I'm not aware of the command that was run to get the logs, since they were given to me by a team member. From what I understood, it was a kubectl describe on the controller itself.

I'll try reproducing this again with a local argo setup over the weekend and come up with a more concrete set of logs and commands to reproduce this issue.

I'm not sure exactly how the CLI chose to show this error specifically.

From my very limited understanding of the controller code, and from memory of looking at the kubectl logs of the controller, it comes up when the workflow's node status is "hydrated/dehydrated". This if condition succeeds since alwaysOffloadNodeStatus is true, while h.offLoadNodeStatusRepo has the default value of ExplosiveOffloadNodeStatusRepo since nodeStatusOffLoad is false. See the sketch below.
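A minimal reconstruction of that interaction (a sketch, not the upstream code; the repo name and error text are borrowed from the identifiers and message quoted above):

package main

import (
	"errors"
	"fmt"
)

// The error text matches the one seen in the workflow events.
var errOffloadNotSupported = errors.New("offload node status is not supported")

// Stand-in for the stub repo the controller defaults to when
// persistence.nodeStatusOffLoad is false.
type explosiveOffloadNodeStatusRepo struct{}

func (explosiveOffloadNodeStatusRepo) Save(uid, namespace string, nodes map[string]string) (string, error) {
	return "", errOffloadNotSupported
}

func main() {
	alwaysOffloadNodeStatus := true          // ALWAYS_OFFLOAD_NODE_STATUS=true
	repo := explosiveOffloadNodeStatusRepo{} // default because nodeStatusOffLoad: false

	// During dehydration the controller tries to offload because the env
	// var is set, but the stub repo can only return the error above.
	if alwaysOffloadNodeStatus {
		if _, err := repo.Save("some-uid", "argo", nil); err != nil {
			fmt.Println("workflow fails with:", err)
		}
	}
}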
