Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud-init aborted by WSL 'watchdog' #11602

Open
1 of 2 tasks
thielj opened this issue May 20, 2024 · 13 comments
Open
1 of 2 tasks

cloud-init aborted by WSL 'watchdog' #11602

thielj opened this issue May 20, 2024 · 13 comments

Comments

@thielj
Copy link

thielj commented May 20, 2024

Windows Version

Microsoft Windows [Version 10.0.22631.3593]

WSL Version

2.2.4.0 (also tested with v2.1.5 and v2.0.5)

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.153.1-microsoft-standard-WSL2

Distro Version

Ubuntu 24.04

Other Software

I uninstalled anything that could possibly interfer

Repro Steps

I made sure .wslconfig doesn't exist.

Create a minimalist .cloud-init/default.user-data as follows:

#cloud-config

runcmd:
  - echo RUNCMD - Going to sleep 60s
  - sleep 60s
  - echo RUNCMD - Wakey wakey 

Run wsl --install Ubuntu-24.04. When prompted for the username, pause for about 30 seconds. Then continue as usual.

Expected Behavior

I would expect that cloud-init would be able to complete an operation that takes longer than just a few seconds (e.g. installing a larger number of packages, which is how I noticed this first).

Actual Behavior

Some kind of watchdog kicks in 10s after the machine is started, logs in as root and executes a shutdown about 15s later, without letting cloud-init finish and leaving the system in an undefined state.

Diagnostic Logs

When logged in, run journalctl. You'll find entries like the following (some lines ~~~ omitted):

May 20 21:46:24 NEUROMANCER kernel: Linux version 5.15.153.1-microsoft-standard-WSL2 (root@941d701f84f1) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Fri Mar 29 23:14:13 UTC 2024
~~~
May 20 21:46:24 NEUROMANCER cloud-init[157]: Cloud-init v. 24.1.3-0ubuntu3 running 'init-local' at Mon, 20 May 2024 20:46:24 +0000. Up 11.57 seconds.
~~~
May 20 21:46:25 NEUROMANCER cloud-init[173]: Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Mon, 20 May 2024 20:46:25 +0000. Up 12.02 seconds.
~~~
May 20 21:46:27 NEUROMANCER cloud-init[379]: Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Mon, 20 May 2024 20:46:27 +0000. Up 14.05 seconds.
May 20 21:46:27 NEUROMANCER cloud-init[379]: RUNCMD - Going to sleep 60s
~~~
May 20 21:46:33 NEUROMANCER unknown: WSL (2) ERROR: WaitForBootProcess:3342: /sbin/init failed to start within 10000
May 20 21:46:33 NEUROMANCER unknown: ms
May 20 21:46:33 NEUROMANCER unknown: WSL (2): Creating login session for root
~~~
May 20 21:46:34 NEUROMANCER login[518]: ROOT LOGIN  on '/dev/pts/1'
~~~
May 20 21:46:49 NEUROMANCER unknown:
May 20 21:47:00 NEUROMANCER kernel: Linux version 5.15.153.1-microsoft-standard-WSL2 (root@941d701f84f1) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Fri Mar 29 23:14:13 UTC 2024

The last two lines above are the last logged message and the first message once the machine is started again.

The shutdown (21:46:49) was most likely initiated by whatever the watchdog's root login was doing.

The start (21:48:00) was initiated interactively by entering the default user credentials.

Copy link

Logs are required for review from WSL team

If this a feature request, please reply with '/feature'. If this is a question, reply with '/question'.
Otherwise please attach logs by following the instructions below, your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs

Download and execute collect-wsl-logs.ps1 in an administrative powershell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The scipt will output the path of the log file once done.

Once completed please upload the output files to this Github issue.

Click here for more info on logging
If you choose to email these logs instead of attaching to the bug, please send them to [email protected] with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

@thielj
Copy link
Author

thielj commented May 20, 2024

The relevant log excerpts have already been included. Happy to provide additional information if necessary.

Maybe it's worth mentioning that setting vmIdleTimeout to -1 or a sufficient large value has no effect.

@thielj thielj changed the title cloud-init aborted by WSL watchdog cloud-init aborted by WSL 'watchdog' May 20, 2024
Copy link

Logs are required for review from WSL team

If this a feature request, please reply with '/feature'. If this is a question, reply with '/question'.
Otherwise please attach logs by following the instructions below, your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs

Download and execute collect-wsl-logs.ps1 in an administrative powershell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The scipt will output the path of the log file once done.

Once completed please upload the output files to this Github issue.

Click here for more info on logging
If you choose to email these logs instead of attaching to the bug, please send them to [email protected] with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

@thielj
Copy link
Author

thielj commented May 21, 2024

Copy link

Diagnostic information
Detected appx version: 2.2.4.0

@OneBlue
Copy link
Collaborator

OneBlue commented May 25, 2024

Thank you @thielj.

Based on the symptoms, I wonder if the issue is that the ubuntu distribution installer simply shuts down the WSL distribution after creating the user.

Do you see the same symptoms with let's say debian ?

@thielj
Copy link
Author

thielj commented May 25, 2024

@OneBlue Hi. Please note that the shutdown is initiated by the '10000ms watchdog' before user credentials are entered (21:46:49). It's not automatically restarting after this.

Only when you provide the default user name and hit enter the instance/distro is brought up again (21:47:00). Just wait long enough (e.g. 30 seconds) when prompted to enter the default username and you will see. The default user is created after this, and the instance/distro is then shutdown a second time before the default user is logged in.

I naturally tried Debian but it comes without cloud-init. I haven't investigated how to roll my own cloud-init enabled WSL distro but would think that with the observed '10000ms watchdog' this would be a waste of time. Even the standard 'apt upgrade' module wouldn't be safe to run.

@thielj
Copy link
Author

thielj commented May 25, 2024

@OneBlue Maybe this helps to determine if cloud-init is still running?

https://cloudinit.readthedocs.io/en/latest/howto/status.html

@OneBlue
Copy link
Collaborator

OneBlue commented May 28, 2024

@thielj: I wonder if what you're experiencing is the normal VM idle time out if there's no activity. Let's try something: Can you put:

[wsl2]
vmIdleTimeout=-1

in .wslconfig and see if that the solves the issue for you ?

@thielj
Copy link
Author

thielj commented May 28, 2024 via email

@OneBlue
Copy link
Collaborator

OneBlue commented May 28, 2024

Ok interesting. Could you capture logs with that .wslconfig value set ?

/logs

@thielj
Copy link
Author

thielj commented May 28, 2024

@OneBlue It was mentioned here: #11602 (comment)

Neither -1 or a much longer timeout made any difference at all. Neither did anything else that looked even remotely related in .wslconfig. It's also readily reproduceable on my laptop (with a 900000 timeout). I would be surprised to hear from anyone who can't reproduce this.

And it seems I just discovered a workaround: If I log into the distro from another terminal window, during the installation, as soon as this is possible, WaitForBootProcess still chimes in, the typical root login happens -- but the shutdown does not and cloud-init completes as expected. Just having an interactive shell open seems to prevent it.

Maybe the following is related. The actual shutdown in my case also happens almost exactly 15s after WaitForBootProcess comes in. #8659

The first timer is how long a WSL instance is idle before it is terminated. Think of this as a wsl --terminate that occurs when there are no more running processes in it other than init. This is non-configurable, but #6782 is a feature request for the ability to configure it. As far as I can tell, this is hard-coded at about 15 seconds.

Have you looked what is triggering your WaitForBootProcess code? It is clearly setup with a fixed 10000ms timeout and then logs in as root. I don't see this error in the logs after the first/--install start.

If it's supposed to detect a running cloud-init: maybe it would be worth looking into that bit? And if not, my guess is that a service like cloud-init doesn't count as 'running process other than init'. A simple command -v cloud-init && cloud-init status --wait before initiating the shutdown would allow cloud-init to complete and solve the issue.

@thielj
Copy link
Author

thielj commented May 30, 2024

@OneBlue And another update regarding that workaround: Looking at the debug console for cloud-init output or counting to 15 and then starting wsl ~ cloud-init status --wait in another terminal does indeed work. Once you try to do something more than just sleeping, there's another problem though: /etc/resolv.conf disappears, roughly around the time cloud-init is running apt upgrade. It's probably triggered by your 10000ms watchdog. So sometimes the upgrade runs through, sometimes it fails. Additional packages installed after that always fail (unable to resolve archive.ubuntu.com). It's definitely no problem with my network and is also independent of networkingMode or dnsTunneling.

I can workaround this by replacing the symlink with an immutable copy of /mnt/wsl/resolv.conf very early in the boot process, like so:

#cloud-config

bootcmd: 
  - chattr -i /etc/resolv.conf
  - rm -f /etc/resolv.conf
  - cp /mnt/wsl/resolv.conf /etc/resolv.conf
  - chattr +i /etc/resolv.conf

package_update: true
package_upgrade: true
packages: [ btop ]

Your ConfigCreateResolvConfSymlink isn't happy about that, which makes me think it might also be involved with its disappearance:

[   25.822086] WSL (2) ERROR: WaitForBootProcess:3342: /sbin/init failed to start within 10000
[   25.823081] WSL (2) ERROR: ConfigCreateResolvConfSymlink:1741: symlink(/mnt/wsl/resolv.conf
[   25.833860] WSL (2) ERROR: ConfigCreateResolvConfSymlink:1741: symlink(/mnt/wsl/resolv.conf

I tried generateResolvConf=false to no avail. The symlink would still disappear. I suppose /etc/wsl.conf is read before the distro is launched?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants