read init-p: connection reset by peer #20212
Comments
Hi @wusikijeronii and thanks for raising this issue. Nomad v1.7.0 and above introduced improved cGroup and isolation primitives, which might be involved here. Could you confirm that cGroups are enabled and provide other information such as how the Nomad client is being run? |
Hello. Thank you for the reply. Here are the results from OL8 (which doesn't work) and Ubuntu (which works).
OL8:
[root@srv1-prod ~]# awk '{print $1 " " $4}' /proc/cgroups
#subsys_name enabled
cpuset 1
cpu 1
cpuacct 1
blkio 1
memory 1
devices 1
freezer 1
net_cls 1
perf_event 1
net_prio 1
hugetlb 1
pids 1
rdma 1
[root@srv1-prod ~]#
Ubuntu:
root@srv3-prod:~# awk '{print $1 " " $4}' /proc/cgroups
#subsys_name enabled
cpuset 1
cpu 1
cpuacct 1
blkio 1
memory 1
devices 1
freezer 1
net_cls 1
perf_event 1
net_prio 1
hugetlb 1
pids 1
rdma 1
misc 1
root@srv3-prod:~#
I don't know much about cgroups, and I didn't change any related settings manually. I didn't understand your question about how Nomad is being run. Do you mean as a binary, in Docker, etc.? I run Nomad as a plain binary directly on the host. Do you want me to share the client config? |
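For anyone else unsure which cgroup mode a host is running, here is a minimal sketch, assuming GNU `stat` is available. It relies on the filesystem type mounted at /sys/fs/cgroup: `cgroup2fs` indicates the unified v2 hierarchy, while `tmpfs` indicates v1 (or hybrid) controllers mounted underneath. The function name `cgroup_mode` is made up for illustration and is not part of Nomad.

```shell
# Classify the cgroup mode from the filesystem type found at /sys/fs/cgroup.
# cgroup_mode is a hypothetical helper name, not a Nomad or systemd command.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # v1 or hybrid: controllers mounted below a tmpfs
        *)         echo "unknown" ;;  # path missing or unexpected filesystem
    esac
}

# Usage on a Linux host; falls back to "unknown" if stat cannot read the path.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || true)
echo "cgroup mode: $(cgroup_mode "$fstype")"
```

A host whose `mount` output shows `tmpfs on /sys/fs/cgroup` with per-controller `cgroup` mounts below it would be reported as v1 by this check.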
Hi @wusikijeronii to follow up on those questions, I think we're looking for:
Digging into the logs a bit more, it looks like the relevant sections for investigation are here:
It looks like the plugin is starting and trying to launch the container, but libcontainer is returning the error you reported. |
[root@srv2 ~]# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755,inode64)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/misc type cgroup (rw,nosuid,nodev,noexec,relatime,misc)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
[root@srv2 ~]# grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
[root@srv2 ~]#
Yep.
Yes. Here is the unit file as well:
[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target
# When using Nomad with Consul it is not necessary to start Consul first. These
# lines start Consul before Nomad as an optimization to avoid Nomad logging
# that Consul is unavailable at startup.
#Wants=consul.service
#After=consul.service
[Service]
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
KillMode=process
KillSignal=SIGINT
LimitNOFILE=65536
LimitNPROC=infinity
Restart=on-failure
RestartSec=2
## Configure unit start rate limiting. Units which are started more than
## *burst* times within an *interval* time span are not permitted to start any
## more. Use `StartLimitIntervalSec` or `StartLimitInterval` (depending on
## systemd version) to configure the checking interval and `StartLimitBurst`
## to configure how many starts per interval are allowed. The values in the
## commented lines are defaults.
# StartLimitBurst = 5
## StartLimitIntervalSec is used for systemd versions >= 230
# StartLimitIntervalSec = 10s
## StartLimitInterval is used for systemd versions < 230
# StartLimitInterval = 10s
TasksMax=infinity
OOMScoreAdjust=-1000
[Install]
WantedBy=multi-user.target
No. SELinux is disabled.
Yes, but the main error I see is "rpc error: code = Unknown desc ". I took that to mean an error occurred while reading a message over the gRPC channel. I updated only Nomad, not libcontainer, so my conclusion is that version 1.7.3 changed something in how messages from the gRPC stream are read or recognized. |
The actual failure appears to be happening with runc calling libcontainer.
More complete log output:
|
Thanks @nickwales. Given the segfault, I have a strong suspicion this is a build issue where we might not be linking against the right version of glibc in the build environment. I'll follow up on this and report back. |
So here's what I'm seeing with
Nomad releases should be linked against glibc 2.31 (internal ref, which we need to turn into docs at some point), so this looks dubious. However, @wusikijeronii's report was that 1.5.8 worked fine, but I see 2.34 symbols in the 1.5.8 build as well. @wusikijeronii can you run |
Hello. Sure.
[root@srv2 bin]# ls -al /lib/libc*
-rwxr-xr-x 3 root root 1940100 Mar 6 21:24 /lib/libc-2.28.so
-rw-r--r-- 3 root root 104136 Mar 6 21:24 /lib/libc_nonshared.a
lrwxrwxrwx. 1 root root 17 Jan 12 2022 /lib/libcom_err.so.2 -> libcom_err.so.2.1
-rwxr-xr-x. 3 root root 16204 Jan 12 2022 /lib/libcom_err.so.2.1
lrwxrwxrwx 1 root root 19 Dec 18 16:40 /lib/libcrypto.so.1.1 -> libcrypto.so.1.1.1k
-rwxr-xr-x 3 root root 2977064 Dec 18 16:41 /lib/libcrypto.so.1.1.1k
lrwxrwxrwx. 1 root root 17 Oct 9 2021 /lib/libcrypt.so -> libcrypt.so.1.1.0
lrwxrwxrwx. 1 root root 17 Oct 9 2021 /lib/libcrypt.so.1 -> libcrypt.so.1.1.0
-rwxr-xr-x. 3 root root 139496 Oct 9 2021 /lib/libcrypt.so.1.1.0
-rw-r--r-- 3 root root 238 Mar 6 21:15 /lib/libc.so
lrwxrwxrwx 1 root root 12 Mar 6 21:16 /lib/libc.so.6 -> libc-2.28.so
lrwxrwxrwx 1 root root 16 Oct 18 19:02 /lib/libcurl.so.4 -> libcurl.so.4.5.0
-rwxr-xr-x 3 root root 661556 Oct 18 19:02 /lib/libcurl.so.4.5.0
[root@srv2 bin]# ldd -r -v /lib/libc.so.6
/lib/ld-linux.so.2 (0xf7f0f000)
linux-gate.so.1 (0xf7f0d000)
Version information:
/lib/libc.so.6:
ld-linux.so.2 (GLIBC_2.3) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_PRIVATE) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_2.1) => /lib/ld-linux.so.2
It looks like I'm using 2.28. By the way, that is the latest version in the OL8 Base Stream (official stable mirror).
[root@srv2 bin]# nomad --version
Nomad v1.5.3
BuildDate 2023-04-04T20:09:50Z
Revision 434f7a1745c6304d607562daa9a4a635def7153f
[root@srv2 bin]# ldd -r -v nomad
linux-vdso.so.1 (0x00007ffd76de9000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f7750804000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f77505e4000)
libc.so.6 => /lib64/libc.so.6 (0x00007f775021f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7750a1c000)
Version information:
./nomad:
libresolv.so.2 (GLIBC_2.2.5) => /lib64/libresolv.so.2
libpthread.so.0 (GLIBC_2.3.3) => /lib64/libpthread.so.0
libpthread.so.0 (GLIBC_2.3.2) => /lib64/libpthread.so.0
libpthread.so.0 (GLIBC_2.2.5) => /lib64/libpthread.so.0
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.8) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.2) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.7) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.9) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
/lib64/libresolv.so.2:
libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_PRIVATE) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
/lib64/libpthread.so.0:
ld-linux-x86-64.so.2 (GLIBC_2.2.5) => /lib64/ld-linux-x86-64.so.2
ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.2) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
libc.so.6 (GLIBC_PRIVATE) => /lib64/libc.so.6
/lib64/libc.so.6:
ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2 |
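Reading the newest required symbol version out of a long `ldd -v` listing by eye is error-prone, so here is a small sketch that does it mechanically. It assumes GNU `sort` (for `-V`) and reads the listing from stdin; `max_glibc_symbol` is a made-up helper name. For the Nomad v1.5.3 listing above, the newest symbol the `./nomad` section requires is GLIBC_2.14, well below the host's 2.28.

```shell
# Print the newest GLIBC_x.y symbol version mentioned on stdin.
# Works on `ldd -v` or `objdump -T` output. GLIBC_PRIVATE is skipped because
# the pattern requires a digit after the underscore. Requires GNU sort (-V).
# max_glibc_symbol is a hypothetical helper name used only for this sketch.
max_glibc_symbol() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sort -u -V | tail -n 1
}

# Usage (paths are examples; adjust to the local install):
#   ldd -v /usr/bin/nomad | max_glibc_symbol
#   objdump -T /usr/bin/nomad | max_glibc_symbol
#   getconf GNU_LIBC_VERSION    # glibc version the host itself provides
```

Note that `ldd -v` also lists the requirements of the shared libraries it resolves, so the result covers the whole listing and not only the binary's own section.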
@tgross So, what's the plan? A build fix? Either a glibc 2.31 requirement or building from source? I think I could install a newer version through unstable channels, but I don't think that's a good solution, not least because OL8 is supported until 2032. |
In my testing I found that the breaking change happened in Nomad version 1.6.9, in case that helps pinpoint it. |
@nickwales reported that Ubuntu 20.04 was also seeing the error, and I checked libc there:
So that at least lines up with what
Obviously there's some sort of flaw in our reasoning here. I'm going to have to dig up what happened to the build environment. I'll report back here once I know more. |
🤦 The build/linking stuff is a total red herring! From my previous comment:
The 2.34 symbol isn't in Nomad, it's in |
Ok, I've been able to reproduce on Ubuntu 20.04 and I did some
This looks like the Go runtime itself is blowing up on us somewhere in There's a PR in In the meantime, if you are using the
|
Yes, I think your mentioned PR can fix the issue in runc, but it introduces some changes for libct/nsenter users, please see opencontainers/runc#4193 (comment) |
I see why that's needed, but wouldn't that be applied by |
Yes, it would be applied by |
Hi @tgross, this also happened to us when migrating to Nomad 1.7.7 from 1.6.5. |
Just in case anyone else is still on 1.5, we saw this when we upgraded to 1.5.17 and worked around it by downgrading to 1.5.15. The release notes for 1.5.16 mention the Go upgrade, so I presume it's the same issue. |
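Since several reports here tie the breakage to the Go toolchain bump in a patch release, it can help to check which Go version a given Nomad binary was built with. A minimal sketch, assuming a Go toolchain is installed on the machine: `go version <file>` prints a line of the form `<path>: go1.x.y` for Go binaries. The helper name `go_toolchain_of` and the binary path are illustrative only.

```shell
# Extract the toolchain version from one line of `go version <binary>` output,
# which looks like "/usr/bin/nomad: go1.22.4".
# go_toolchain_of is a hypothetical helper name used only for this sketch.
go_toolchain_of() {
    sed 's/^.*: //'
}

# Usage (requires the Go toolchain; the binary path is an example):
#   go version /usr/bin/nomad | go_toolchain_of
```

Comparing the result for a working and a failing Nomad release on the same host shows whether the toolchain changed between them.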
I'm keeping an eye on the new upstream PR opencontainers/runc#4292 |
Upstream has released v1.1.13 with this fix (ref https://github.com/opencontainers/runc/releases/tag/v1.1.13), so we'll get the dependency updated and tested.
Internal ref: https://hashicorp.atlassian.net/browse/NET-10078 |
Update `runc` to 1.1.13 to pick up build support for Go 1.22.4+, in order to ensure we've resolved errors cloning processes into Linux namespaces for libcontainer (`exec` driver) with new versions of Go and older but still supported versions of glibc.
This changeset has two minor quirks:
* Testing shows that the reported issue is already resolved on `main` by upgrading to Go 1.22.4 without this dependency bump, at least for glibc 2.31. Upgrading the dependency should make sure there isn't another glibc version where the problem will still appear.
* This version of `runc` refers to fields in `cilium/ebpf` which are not present in more recent versions of that library. So in order to build, we have to downgrade `cilium/ebpf`. Fortunately, `runc` is the only consumer of that transitive dependency.
Closes: #20212
Ref: https://hashicorp.atlassian.net/browse/NET-10078
I've closed this issue via #23331. There's an interesting quirk from that PR:
That Go 1.22.4 bump was in main but didn't land before 1.8.0 GA. This fix will be released in Nomad 1.8.1, with Enterprise backports. |
Are there (or will there be) any versions of 1.7 that aren't affected by this issue available to non-enterprise users? I'm concerned that otherwise we will have to jump straight from 1.6 to 1.8, which I think the docs recommend against. |
Hi @martinmcnulty! Sorry, there will not be a CE backport. As of Nomad 1.8.0 LTS, earlier major versions of Nomad CE will no longer receive backports. |
Thanks for the quick reply, @tgross. Shame to hear about the change to backporting for CE. However, it looks like the Go upgrade went into 1.7.6, so hopefully we'll be able to go via 1.7.5 to 1.8.1. Will give that a try. |
Nomad version
Operating system and Environment details
Issue
After updating Nomad from 1.5.3 to 1.7.6, I can't run any job on two of the three nodes (the same job). I get the error:
I also tried to create a simple job that will run /bin/bash, but I still face the issue. I also tried to reboot servers and update all packages on host machines, but that didn't help.
I also tried to remove all cache data from all servers. I thought it was a file access issue at first. If you use the default user (anonymous), the same error occurs.
Job file
Nomad logs
If you need to check anything on my end, let me know. I just don't know what else to check.
Reverting to 1.5.3 fixes the issue