[BUG] Unable to start on Talos Linux #4526

Open
nniehoff opened this issue Sep 15, 2024 · 13 comments
Labels
bug Something isn't working

@nniehoff

Kube-OVN Version

v1.13.0

Kubernetes Version

v1.31.1

Operation-system/Kernel Version

Talos (v1.7.6)
6.6.43-talos

Description

When the ovs-ovn DaemonSet tries to start, the openvswitch container fails with this error:

/kube-ovn/start-ovs.sh: line 83: /usr/local/sbin/iptables: Operation not permitted

I have tried changing the securityContext to run this container as privileged and as root, but the error persists.

Steps To Reproduce

  1. Deploy a Talos Linux cluster
  2. Deploy the helm chart from the master branch
  3. Use these values:
---
cni_conf:
  CNI_CONFIG_PRIORITY: "10"
  MOUNT_LOCAL_BIN_DIR: false
func:
  ENABLE_NP: false
ipv4:
  POD_CIDR: "10.131.0.0/16"
  POD_GATEWAY: "10.131.0.1"
  SVC_CIDR: "10.128.0.0/20"
  JOIN_CIDR: "10.130.0.0/24"
  # PINGER_EXTERNAL_ADDRESS: "1.1.1.1"
  # PINGER_EXTERNAL_DOMAIN: "kube-ovn.io."
namespace: "kube-system"
DISABLE_MODULES_MANAGEMENT: true
OPENVSWITCH_DIR: "/var/kube-ovn/openvswitch"
OVN_DIR: "/var/kube-ovn/ovn"

I have followed the directions in the chart README. I used different paths for OPENVSWITCH_DIR and OVN_DIR, but I believe this should still work, since the primary intention is to move these paths to a writable location under /var.

Current Behavior

Pods in both ovs-ovn and kube-ovn-cni DaemonSets fail to start correctly

Expected Behavior

Pods should be running and healthy

@nniehoff nniehoff added the bug Something isn't working label Sep 15, 2024

dosubot bot commented Sep 15, 2024

Hey @nniehoff! 👋 I'm here to help you with your Kube-OVN issue. I can assist with bug fixes, answer questions, and guide you on how to contribute.

The error you're encountering is likely due to missing permissions or capabilities required by the start-ovs.sh script, particularly around the use of iptables. Here are some steps to address this issue:

  1. Ensure Required Capabilities:
    Make sure the openvswitch container has the necessary capabilities. The default securityContext settings should include the following capabilities:

    securityContext:
      runAsUser: 0
      privileged: true
      capabilities:
        add:
          - NET_ADMIN
          - NET_BIND_SERVICE
          - SYS_MODULE
          - SYS_NICE
          - SYS_ADMIN
  2. Declare the openvswitch Module:
    Ensure that the openvswitch module is declared in the Talos machine configuration:

    machine:
      kernel:
        modules:
        - name: openvswitch
  3. Use Specific Helm Chart Options:
    When installing the Helm chart, use the following options to ensure compatibility with Talos Linux:

    --set cni_conf.MOUNT_LOCAL_BIN_DIR=false
    --set OPENVSWITCH_DIR=/var/lib/openvswitch
    --set OVN_DIR=/var/lib/ovn
    --set DISABLE_MODULES_MANAGEMENT=true
  4. Verify File Paths:
    Ensure that the paths specified for OPENVSWITCH_DIR and OVN_DIR are correct and writable. The paths you provided should work, but double-check for any typos or permission issues.

By following these steps, you should be able to resolve the issue and get the openvswitch container running on Talos Linux [1][2].
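
Step 2 above only helps if the openvswitch module actually ends up loaded on the node. The following is a minimal sketch for checking that, assuming the contents of /proc/modules can be read from somewhere on the node (for example from a pod that is already running there). The helper name `module_loaded` is hypothetical and is not part of Kube-OVN or Talos.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Hypothetical helper: succeeds if NAME is listed as a loaded module in FILE
# (defaults to /proc/modules). Each line of that file starts with the module
# name followed by a space.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Example: report on the module named in the machine configuration above.
if module_loaded openvswitch 2>/dev/null; then
    echo "openvswitch: loaded"
else
    echo "openvswitch: not listed in /proc/modules"
fi
```

A module compiled directly into the kernel does not appear in /proc/modules, so a "not listed" result is a hint to look further rather than proof that Open vSwitch support is absent.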

To continue talking to Dosu, mention @dosu.

@nniehoff
Author

I have tried the exact suggestions provided by @dosu; however, they do not resolve the issue.

dosubot bot commented Sep 15, 2024

The issue might be due to missing securityContext settings in the initContainers section of the ovs-ovn DaemonSet configuration. Specifically, Talos Linux requires certain permissions for the initContainers to function correctly.

Ensure that the initContainers section includes the following securityContext settings:

initContainers:
  - name: hostpath-init
    image: "$REGISTRY/kube-ovn:$VERSION"
    command:
      - sh
      - -xec
      - |
        chown -R nobody: /var/run/ovn /var/log/ovn /etc/openvswitch /var/run/openvswitch /var/log/openvswitch
        iptables -V
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
          - ALL
      privileged: true
      runAsUser: 0
      runAsGroup: 0
    volumeMounts:
      - mountPath: /usr/local/sbin
        name: usr-local-sbin
      - mountPath: /var/log/ovn
        name: host-log-ovn
      - mountPath: /var/run/ovn
        name: host-run-ovn
      - mountPath: /etc/openvswitch
        name: host-config-openvswitch
      - mountPath: /var/run/openvswitch
        name: host-run-ovs
      - mountPath: /var/log/openvswitch
        name: host-log-ovs

These settings should allow the necessary permissions for the DaemonSets to start correctly on Talos Linux [1].

@nniehoff
Author

@dosu Running iptables in the init container produces the same "Operation not permitted" error. I have tried both containers with:

securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        allow:
          - ALL
      privileged: true
      runAsUser: 0
      runAsGroup: 0

with no luck.
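
Since changing the securityContext made no difference, it may help to look at the capability sets a process in the container really has. The following is a minimal sketch, assuming it is run inside the affected container or any pod with the same securityContext. It decodes the CapEff and CapBnd masks from /proc/self/status. The helper names `has_cap` and `report` are hypothetical; the capability numbers are the ones defined in `<linux/capability.h>`.

```shell
#!/bin/sh
# has_cap MASK BIT
# Hypothetical helper: succeeds if bit number BIT is set in the hexadecimal
# capability mask MASK (the format of the Cap* lines in /proc/<pid>/status).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report LABEL MASK
# Prints whether a few capabilities relevant to this thread are in MASK.
# Numbers: CAP_NET_ADMIN=12, CAP_NET_RAW=13, CAP_SYS_MODULE=16.
report() {
    for cap in NET_ADMIN:12 NET_RAW:13 SYS_MODULE:16; do
        if has_cap "$2" "${cap#*:}"; then state=present; else state=missing; fi
        echo "$1 ${cap%%:*}: $state"
    done
}

# CapEff is the effective set; CapBnd is the bounding set, which limits what
# a process can ever gain, whatever the securityContext requests.
for field in CapEff CapBnd; do
    value=$(awk -v f="$field:" '$1 == f {print $2}' /proc/self/status 2>/dev/null)
    if [ -n "$value" ]; then
        report "$field" "$value"
    fi
done
```

If every capability is reported as present and the binary still fails with "Operation not permitted", the cause probably lies outside the capability list in the manifest. If one is missing from CapBnd, no securityContext setting can add it back.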

dosubot bot commented Sep 15, 2024

The iptables command requires specific capabilities to run successfully. Ensure that the NET_ADMIN capability is included in the securityContext settings. Here is an updated configuration for the initContainers section:

initContainers:
  - name: hostpath-init
    image: "$REGISTRY/kube-ovn:$VERSION"
    command:
      - sh
      - -xec
      - |
        chown -R nobody: /var/run/ovn /var/log/ovn /etc/openvswitch /var/run/openvswitch /var/log/openvswitch
        iptables -V
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
          - NET_ADMIN
          - NET_BIND_SERVICE
          - NET_RAW
          - SYS_ADMIN
          - SYS_PTRACE
          - SYS_MODULE
          - SYS_NICE
      privileged: true
      runAsUser: 0
      runAsGroup: 0
    volumeMounts:
      - mountPath: /usr/local/sbin
        name: usr-local-sbin
      - mountPath: /var/log/ovn
        name: host-log-ovn
      - mountPath: /var/run/ovn
        name: host-run-ovn
      - mountPath: /etc/openvswitch
        name: host-config-openvswitch
      - mountPath: /var/run/openvswitch
        name: host-run-ovs
      - mountPath: /var/log/openvswitch
        name: host-log-ovs

Ensure that the NET_ADMIN capability is included, as it is crucial for the iptables command to function properly [1][2].

@nniehoff
Author

It occurred to me to try symlinking iptables to /bin/true, as is done for several other utilities that can't be used in Talos. However, I get the same result with ovs-vswitchd, which I know we can't shortcut. @dosu, any other ideas?

/usr/share/openvswitch/scripts/ovs-ctl: 1: ovs-vswitchd: Operation not permitted
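
For readers unfamiliar with the symlink trick mentioned above, here is a minimal sketch of what it does, run in a scratch directory rather than in a real image. The directory and variable names are illustrative only. It shows why the stub can silence a tool like iptables but cannot stand in for a daemon such as ovs-vswitchd, which has to do real work.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing on the system is replaced.
stub_dir=$(mktemp -d)

# /bin/true on most Linux images; fall back to /usr/bin/true elsewhere.
true_bin=/bin/true
[ -x "$true_bin" ] || true_bin=/usr/bin/true

# Make "iptables" a link to true. With a standalone true binary, every
# invocation then exits 0 without doing anything.
ln -sf "$true_bin" "$stub_dir/iptables"

# Caveat: if true is provided by a multi-call binary (BusyBox, for example),
# the program dispatches on the name it was invoked as and the stub may fail.
if "$stub_dir/iptables" -t nat -L; then
    echo "stub accepted the call and did nothing"
else
    echo "stub did not behave as a no-op on this system"
fi
```

The stub only hides the failing call. Any rule that iptables was supposed to install is still missing afterwards.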

dosubot bot commented Sep 17, 2024

To resolve the "Operation not permitted" error for ovs-vswitchd in the ovs-ovn DaemonSet on Talos Linux, ensure that the privileged security context is set to true for the openvswitch container. This allows the container to perform operations that require elevated privileges.

Here is the relevant part of the DaemonSet configuration:

containers:
  - name: openvswitch
    image: "{{ .Values.global.registry.address }}/{{ .Values.global.images.kubeovn.repository }}:{{ .Values.global.images.kubeovn.tag }}"
    imagePullPolicy: {{ .Values.image.pullPolicy }}
    command: ["/kube-ovn/start-ovs.sh"]
    securityContext:
      runAsUser: 0
      privileged: true
      capabilities:
        add:
          - NET_ADMIN
          - NET_BIND_SERVICE
          - SYS_MODULE
          - SYS_NICE
          - SYS_ADMIN

Ensure that the privileged field is set to true under securityContext for the openvswitch container. This configuration allows the container to execute operations that require root privileges, which is necessary for ovs-vswitchd to function correctly. Additionally, make sure the necessary capabilities are added to the container's security context [1].

@nniehoff
Author

@dosu I tried this; however, it doesn't resolve the problem.

dosubot bot commented Sep 17, 2024

To resolve the "Operation not permitted" error for ovs-vswitchd in the ovs-ovn DaemonSet on Talos Linux, ensure that the necessary capabilities are added to the security context. Specifically, you need to add the CAP_NET_ADMIN capability, which is often required for network-related operations. Additionally, you may need to add other capabilities such as NET_BIND_SERVICE, SYS_MODULE, SYS_NICE, and SYS_ADMIN.

Here is the relevant section of the DaemonSet configuration with the necessary capabilities added:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: ovs-ovn
  namespace: kube-system
  annotations:
    kubernetes.io/description: |
      This daemon set launches the openvswitch daemon.
spec:
  selector:
    matchLabels:
      app: ovs
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ovs
        component: network
        type: infra
    spec:
      tolerations:
        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          operator: Exists
        - key: CriticalAddonsOnly
          operator: Exists
      priorityClassName: system-node-critical
      serviceAccountName: ovn-ovs
      hostNetwork: true
      hostPID: true
      containers:
        - name: openvswitch
          image: "$REGISTRY/kube-ovn:$VERSION"
          imagePullPolicy: $IMAGE_PULL_POLICY
          command: ["/kube-ovn/start-ovs.sh"]
          securityContext:
            runAsUser: 0
            privileged: true
            capabilities:
              add:
                - NET_ADMIN
                - NET_BIND_SERVICE
                - SYS_MODULE
                - SYS_NICE
                - SYS_ADMIN
          env:
            - name: ENABLE_SSL
              value: "$ENABLE_SSL"
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: OVN_DB_IPS
              value: $addresses
            - name: OVN_REMOTE_PROBE_INTERVAL
              value: "10000"
            - name: OVN_REMOTE_OPENFLOW_INTERVAL
              value: "180"
          volumeMounts:
            - mountPath: /var/run/netns
              name: host-ns
              mountPropagation: HostToContainer
            - mountPath: /lib/modules
              name: host-modules
              readOnly: true
            - mountPath: /var/run/openvswitch
              name: host-run-ovs
            - mountPath: /var/run/ovn
              name: host-run-ovn

By adding the NET_ADMIN, NET_BIND_SERVICE, SYS_MODULE, SYS_NICE, and SYS_ADMIN capabilities, you grant the container the necessary permissions to perform network-related operations, which should resolve the "Operation not permitted" error [1].

@PeterFalken

I get an error on the DaemonSet that says:
cannot create regular file '/usr/local/bin/kubectl-ko': Read-only file system
The controller also fails to connect to its Kubernetes endpoint.

@tamay86

tamay86 commented Oct 23, 2024

@nniehoff
I have the same problem with an almost identical configuration: Talos 1.8.1 and kube-ovn 1.13 installed via Helm.
Relevant values:

  cni_conf:
    CNI_CONFIG_PRIORITY: "10"
  OPENVSWITCH_DIR: "/var/lib/openvswitch"
  OVN_DIR: "/var/lib/ovn"
  DISABLE_MODULES_MANAGEMENT: true

I tried quite a few things and got this workaround - I'm not content with it, but at least it works for now in my test lab.

What I did:

  • Edited the ovs-ovn DaemonSet
  • Changed the image to v1.12.28 for the openvswitch container (yes, the init container still runs v1.13.0)
  • Edited the security context and changed runAsUser to 0
kubectl -n kube-system edit ds ovs-ovn
...
    spec:
      containers:
      - command:
        - /kube-ovn/start-ovs.sh
        image: docker.io/kubeovn/kube-ovn:v1.12.28
        name: openvswitch
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_BIND_SERVICE
            - SYS_NICE
            - SYS_ADMIN
          privileged: true
          runAsUser: 0
...

EDIT: To clarify, only the ovs-ovn DaemonSet works with this. Its pods are running; however (I'm guessing because of the version mismatch), the other DaemonSets and Deployments stop working.
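
The same workaround can be applied without an interactive edit. The following is a minimal sketch, assuming the DaemonSet, container, and image names quoted in this comment. The helper name `build_patch` and the `APPLY` switch are hypothetical. The script prints a strategic merge patch and only sends it to the cluster when `APPLY=1` is set.

```shell
#!/bin/sh
# Values taken from the comment above.
NAMESPACE=kube-system
DAEMONSET=ovs-ovn
CONTAINER=openvswitch
IMAGE=docker.io/kubeovn/kube-ovn:v1.12.28

# build_patch CONTAINER IMAGE
# Hypothetical helper: prints a strategic merge patch that pins the image and
# sets runAsUser to 0 for one container, matched by its name.
build_patch() {
    printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","image":"%s","securityContext":{"runAsUser":0}}]}}}}\n' "$1" "$2"
}

patch=$(build_patch "$CONTAINER" "$IMAGE")
echo "$patch"

# Apply only when explicitly requested, since this changes a running cluster.
if [ "${APPLY:-0}" = "1" ]; then
    kubectl -n "$NAMESPACE" patch daemonset "$DAEMONSET" --type strategic -p "$patch"
fi
```

This reproduces the workaround exactly as described, including its drawback: the openvswitch container then runs a different version from the rest of the deployment.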

@PeterFalken

I'm still getting an error, but I think that the paths are correct.
failed to generate container "XXXXYYYZZZZ" spec: failed to generate spec: failed to mkdir "/etc/origin/openvswitch": mkdir /etc/origin: read-only file system

I've already added the openvswitch kernel module to the machine configuration.

machine:
  kernel:
    modules:
    - name: openvswitch

These are the values I'm feeding the Helm chart:

global:
  images:
    kubeovn:
      tag: v1.12.28

ipv4:
  POD_CIDR: "10.244.0.0/16"
  POD_GATEWAY: "10.244.0.1"

DISABLE_MODULES_MANAGEMENT: true
OPENVSWITCH_DIR: /var/lib/openvswitch
OVN_DIR: /var/lib/ovn
cni_conf:
  MOUNT_LOCAL_BIN_DIR: false
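
The mkdir error above names /etc/origin/openvswitch even though OPENVSWITCH_DIR points elsewhere, which suggests that at least one rendered volume still uses the old location. The following is a minimal sketch for finding such volumes, assuming the two DaemonSet names mentioned earlier in this thread. The helper name `paths_under` and the `CHECK_CLUSTER` switch are hypothetical.

```shell
#!/bin/sh
# paths_under PREFIX
# Hypothetical helper: reads "name<TAB>path" lines on stdin and prints the
# entries whose path is PREFIX itself or lies below it.
paths_under() {
    awk -F '\t' -v p="$1" '$2 == p || index($2, p "/") == 1 {print $1 ": " $2}'
}

# Query the cluster only when explicitly requested.
if [ "${CHECK_CLUSTER:-0}" = "1" ]; then
    for ds in ovs-ovn kube-ovn-cni; do
        echo "== $ds"
        # Emit one "volume name<TAB>hostPath" line per volume; volumes that
        # are not hostPath volumes produce an empty path and are filtered out.
        kubectl -n kube-system get daemonset "$ds" \
            -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\t"}{.hostPath.path}{"\n"}{end}' \
            | paths_under /etc/origin
    done
fi
```

Any volume printed here still points at the read-only location, so the corresponding chart value was either not applied or is not used by that template.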

@quynhlab

I got the same problem
Running Talos 1.8.3
Kube-ovn 1.13 installed by Helm

4 participants