Systemd breaks mirrored networking #11672

Closed
1 of 2 tasks
withinboredom opened this issue Jun 9, 2024 · 12 comments

@withinboredom

Windows Version

Microsoft Windows [Version 10.0.22631.3672]

WSL Version

WSL version: 2.1.5.0

Are you using WSL 1 or WSL 2?

  • WSL 2 (checked)
  • WSL 1

Kernel Version

5.15.146.1-2

Distro Version

Ubuntu 24.04

Other Software

curl 8.5.0 (x86_64-pc-linux-gnu) libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7
Release-Date: 2023-12-06, security patched: 8.5.0-2ubuntu10.1
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM PSL SPNEGO SSL threadsafe TLS-SRP UnixSockets zstd

Repro Steps

I've tried nearly everything to get mirrored networking mode working again, but for some reason it has stopped working correctly in the last week.

At first, it was similar to other reported issues where mirrored mode would work for about 10-15 minutes and then mysteriously fail. Eventually, it just stopped working altogether. At least that is what I thought. (#11369)

I am still able to ping, and I see responses. However, UDP and TCP packets leave the interface but I never see the replies arrive in WSL (though I do see the responses and retransmissions in Wireshark on the Windows side).

I then went on an adventure to uninstall/reinstall network adapters, WSL, etc. None of these things seemed to resolve my issue. It wasn't until I stumbled upon #10842 that I got a crazy idea. My simple idea was to manually set the source port of curl and then use the iperf trick to see if that was a related issue.

To my surprise, this worked exactly once: curl -v google.com --local-port 12345 produced the expected output! When I ran it again, I got curl: (45) bind failed with errno 98: Address already in use, which is weird because no process is listening on that port any longer. Changing the source port makes it work again, but once more exactly once.

This leads me to believe that this might be a kernel issue, or some other software doing something weird. So, I go to disable systemd ... and lo-and-behold, things work again!
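
For anyone repeating this comparison: the report does not say how systemd was switched off, so the following is only a sketch that assumes the documented systemd key in the [boot] section of /etc/wsl.conf. The helper name wsl_systemd_setting is made up for illustration; it prints the configured value, if there is one.

```shell
# Hypothetical helper: print the value of "systemd" under [boot] in a wsl.conf-style file.
# Assumption: the setting lives in /etc/wsl.conf; pass another path to inspect a different file.
wsl_systemd_setting() {
  file="${1:-/etc/wsl.conf}"
  [ -r "$file" ] || return 1
  awk '
    # Normalise each line: strip blanks and carriage returns, lowercase everything.
    { line = $0; gsub(/[ \t\r]/, "", line); line = tolower(line) }
    # Track whether the current section is [boot].
    line ~ /^\[/ { in_boot = (line == "[boot]"); next }
    # Print the value of the systemd key inside [boot].
    in_boot && line ~ /^systemd=/ { sub(/^systemd=/, "", line); print line }
  ' "$file"
}
```

Called without arguments inside the distro, it prints the value when the key is set and nothing otherwise. A change to the file should only take effect after wsl --shutdown and a restart of the distro.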

I do note that specifying the source port via curl still only works exactly once and I don't see it in ss output, which is a bit unusual.
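
One way to double-check the ss observation is to read the kernel's socket table directly. This is a rough sketch, not something from the original report: the helper name tcp_sockets_on_port is hypothetical, and it only looks at IPv4 TCP sockets (IPv6 ones live in /proc/net/tcp6).

```shell
# Hypothetical helper: list IPv4 TCP sockets whose local port matches PORT.
# Reads /proc/net/tcp by default; a second argument can point at a saved copy.
tcp_sockets_on_port() {
  port="$1"
  table="${2:-/proc/net/tcp}"
  hex=$(printf '%04X' "$port")   # ports are stored as 4-digit uppercase hex
  awk -v want="$hex" '
    NR > 1 {                     # skip the header row
      split($2, addr, ":")       # $2 is local_address as HEXIP:HEXPORT
      # Append "" to force a string comparison (hex such as 1E01 must not be read as a number).
      if ((addr[2] "") == (want "")) {
        print "local=" $2, "remote=" $3, "state=" $4
        found = 1
      }
    }
    END { exit found ? 0 : 1 }
  ' "$table"
}
```

For example, tcp_sockets_on_port 12345 right after the failing curl call would show whether the kernel still holds a socket on that port; in this table, state 06 is TIME_WAIT and 0A is LISTEN. A non-zero exit status means no matching socket was found.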

I'm kinda stumped at the moment with what systemd might be doing, so any tips would be very much appreciated.

Note that #11143 appears to potentially be a duplicate.

Expected Behavior

Networking to work.

Actual Behavior

Networking does not work.

Diagnostic Logs

Steps followed:

  1. wsl --shutdown
  2. collect logs with .\collect-wsl-logs.ps1
  3. Start up WSL
  4. run curl -v 1.1.1.1 (DNS works via tunneling, but let's remove as many variables as possible)
  5. run curl -v 1.1.1.1 --local-port 12345
  6. run curl -v 1.1.1.1 --local-port 12345
  7. stop collecting logs
  8. collect dmesg logs
  9. collect journalctl logs (if applicable)
  10. add logs from 8 & 9 to log zip file
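
The Linux-side steps above (4 to 6, 8 and 9) could be scripted roughly as follows. This is a sketch under assumptions: the function names, the DRY_RUN switch, the output file names, and the --max-time 10 timeout are additions for illustration and were not part of the original procedure.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Steps 4-6: one plain request, then two requests from the same fixed source port.
repro_curl() {
  target="${1:-1.1.1.1}"
  port="${2:-12345}"
  # --max-time is an addition so a silently dropped connection cannot hang the script.
  run curl -v --max-time 10 "$target"                       # step 4
  run curl -v --max-time 10 "$target" --local-port "$port"  # step 5
  run curl -v --max-time 10 "$target" --local-port "$port"  # step 6
}

# Steps 8-9: save kernel and journal logs into a directory (file names are assumptions).
collect_linux_logs() {
  outdir="${1:-.}"
  run sh -c "dmesg > '$outdir/dmesg.log'"            # step 8 (may need sudo)
  run sh -c "journalctl -b > '$outdir/journal.log'"  # step 9 (only with systemd)
}
```

Running DRY_RUN=1 repro_curl first shows exactly which curl invocations would be made before anything touches the network.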

WSL startup with systemd:
WslLogs-2024-06-09_12-20-00 (2).zip

WSL startup without systemd:
WslLogs-2024-06-09_12-27-08 (2).zip

github-actions bot commented Jun 9, 2024

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Open similar issues:

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

Diagnostic information
Multiple log files found, using: https://github.com/user-attachments/files/15751492/WslLogs-2024-06-09_12-20-00.2.zip
appxpackage.txt not found
optional-components.txt not found
Error while parsing the logs. See action page for details

@withinboredom
Author

It's also worth pointing out that cloud-config and snapd were disabled for those logs (to save anyone else some troubleshooting). Enabling or disabling them doesn't seem to have any effect.

@chanpreetdhanjal

Hi. Can you please collect networking logs by following the instructions below?
https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

@withinboredom
Author

Here's with all of networking working correctly:
WslNetworkingLogs-2024-06-13_21-10-41.zip

and with systemd preventing networking from working:
WslNetworkingLogs-2024-06-13_21-13-10.zip

I performed the same steps as before.

Diagnostic information
Multiple log files found, using: https://github.com/user-attachments/files/15827620/WslNetworkingLogs-2024-06-13_21-10-41.zip
.wslconfig found
Detected appx version: 2.1.5.0
optional-components.txt not found

@dcasota

dcasota commented Jun 19, 2024

With the new networkingMode=mirrored I had similar issues in WSL 2.1.5 and 2.2.4, so I left it set to nat, which works flawlessly.

VMware Photon OS uses systemd as well, and it works in WSL by configuring a rootless user that is the same as the logged-in Windows user. See https://github.com/dcasota/photonos-scripts/wiki/Photon-OS-on-WSL2, step 4.

@withinboredom
Author

Bump. Anything?

@withinboredom
Author

withinboredom commented Jul 4, 2024

It appears this might be related to #11450 as running

sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"

and then doing a ton of connections will cause this situation, even without systemd (I just had to run a load test from wsl and needed more than 1024 connections). Combined with the "Address already in use" bug mentioned (but not called out explicitly), eventually, there is port starvation, and no connections can be made.

So, basically, I suspect that systemd just causes port starvation, since not enough ports are allocated to WSL in mirrored mode.
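
To put a number on the starvation hypothesis, the size of the ephemeral port range can be computed from the sysctl value. The snippet below is a small sketch; ephemeral_port_count is a hypothetical helper name, and with no argument it reads the live setting from /proc/sys/net/ipv4/ip_local_port_range.

```shell
# Hypothetical helper: number of ports in an ephemeral range given as "LOW HIGH".
# With no argument it reads the live value from the running kernel.
ephemeral_port_count() {
  range="${1:-$(cat /proc/sys/net/ipv4/ip_local_port_range)}"
  # Intentionally unquoted: split "LOW HIGH" (space or tab separated) into $1 and $2.
  set -- $range
  echo $(( $2 - $1 + 1 ))
}
```

For the override quoted above, ephemeral_port_count "1024 65535" gives 64512 ports. Comparing the no-argument result with and without systemd enabled would show whether the range changes between the two setups.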

@NyaMisty

NyaMisty commented Aug 9, 2024

> It appears this might be related to #11450 as running
>
> sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
>
> and then doing a ton of connections will cause this situation, even without systemd (I just had to run a load test from wsl and needed more than 1024 connections). Combined with the "Address already in use" bug mentioned (but not called out explicitly), eventually, there is port starvation, and no connections can be made.
>
> So, basically, I suspect that systemd just causes port starvation, since not enough ports are allocated to WSL in mirrored mode.

I believe this has ZERO relation to this issue. The ephemeral port issue is a longstanding problem that is definitely not related to this one.

@withinboredom
Author

Do you have any evidence of that @NyaMisty? I saw exactly the same behavior with tools that connected to the internet as I did with systemd enabled, after running that sysctl command. Maybe they aren't directly related, but exhibit the same symptoms.

@NyaMisty

NyaMisty commented Aug 9, 2024

> Do you have any evidence of that @NyaMisty? I saw exactly the same behavior with tools that connected to the internet as I did with systemd enabled, after running that sysctl command. Maybe they aren't directly related, but exhibit the same symptoms.

TL;DR: You are right, and removing the net.ipv4.ip_local_port_range statements from /etc/sysctl.conf, /etc/sysctl.d/*, and /usr/lib/sysctl.d/* will solve the issue.

I'm sorry I made a stupid false assertion in the reply above. I dismissed your guess because any issue that causes network packets to be dropped unexpectedly would produce the same symptoms.

In addition to @withinboredom's previous investigation, I changed systemd's startup target from multi-user.target all the way down to emergency.target (which loads no services), which makes debugging a lot easier. However, even emergency.target still kills the network in mirrored mode.

Then I opened two terminals, one running systemd under strace

sudo strace -T -tt -f /usr/bin/unshare -fp --propagation shared --mount-proc -- systemd 2>&1 | tee systemd.log

and the other running a watchdog loop

while true; do curl -s 192.168.112.1:7890; date +"%T.%N"; done    

From the log I could guess that things went wrong during systemd-sysctl.service, so I finally pinpointed that the issue comes from sysctl.conf.

It turns out that Microsoft is using some black magic to implement forwarding in WSL mirrored networking. It seems to forward only connections with specific source ports and to silently drop the rest, so once we overrode the port range, network requests failed immediately.

Removing all ip_local_port_range statements in sysctl will solve the issue.
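
To locate those statements, something along these lines may help. It is a sketch, with find_port_range_overrides as a hypothetical helper name; by default it searches the three locations named in this thread and skips lines that are commented out.

```shell
# Hypothetical helper: list uncommented ip_local_port_range settings in sysctl config files.
find_port_range_overrides() {
  # Default to the three locations mentioned in this thread.
  [ "$#" -gt 0 ] || set -- /etc/sysctl.conf /etc/sysctl.d /usr/lib/sysctl.d
  # -r recurse, -H always print the file name, -n print the line number.
  # Errors for missing paths are discarded; comment lines (# or ;) are filtered out.
  grep -rHn 'ip_local_port_range' "$@" 2>/dev/null |
    grep -vE '^[^:]*:[0-9]+:[[:space:]]*[#;]'
}
```

Each output line has the form file:line:setting, and a non-zero exit status means no override was found. After removing or commenting out the reported lines, restarting the distro with wsl --shutdown should let the kernel's own default range apply again.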

@withinboredom
Author

I will give this a go asap. Thanks for looking into it; and kinda obvious source in retrospect.

5 participants