Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vmcompute issue with latest patch - encountered an error during hcs::System::Start: context deadline exceeded related to (Splunk Universal Forwarder) #426

Closed
brwilkinson opened this issue Sep 23, 2023 · 14 comments
Assignees
Labels
bug Something isn't working

Comments

@brwilkinson
Copy link

#------------------------------------------------------------------

older (unpatched) server - working

hostname

abc123

docker info

Server: Docker Engine - Community
Engine:
Version: 24.0.6

gi C:\windows\System32\vmcompute.exe | foreach Target
C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1906_none_9fdbb7c8de1cc2e4\vmcompute.exe

vmcompute.exe is 10.0.20348.1906

docker run jenkins/agent:jdk17-windowsservercore-ltsc2022 pwsh -c echo hello

hello

#------------------------------------------------------------------

new server (patched) - not working

hostname

abc1234new

docker info

Server: Docker Engine - Community
Engine:
Version: 24.0.6

gi C:\windows\System32\vmcompute.exe | foreach Target
C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1970_none_9fe3427ede15da7f\vmcompute.exe

vmcomute.exe is 10.0.20348.1906

docker run jenkins/agent:jdk17-windowsservercore-ltsc2022 pwsh -c echo hello

encountered an error during hcs::System::Start: context deadline exceeded. 🟧

The only difference i can see is the build version of the vmcompute.exe ?

Which was updated as part of this:

https://support.microsoft.com/en-us/topic/september-12-2023-kb5030216-os-build-20348-1970-34d4aff3-fd05-4270-b288-4ab6379c7f81

@brwilkinson brwilkinson added the bug Something isn't working label Sep 23, 2023
@brwilkinson
Copy link
Author

@ntrappe-msft ntrappe-msft added the triage New and needs attention label Sep 25, 2023
@ntrappe-msft ntrappe-msft removed the triage New and needs attention label Sep 27, 2023
@Howard-Haiyang-Hao
Copy link
Contributor

@brwilkinson

I am trying to repro the issue:

PS C:\Users\Administrator> docker run jenkins/agent:jdk17-windowsservercore-ltsc2022 powershell.exe -c echo hello
hello
PS C:\Users\Administrator> get-hotfix

Source Description HotFixID InstalledBy InstalledOn


WIN-35ILNJ... Update KB5022507 5/5/2023 12:00:00 AM
WIN-35ILNJ... Security Update KB5026370 5/5/2023 12:00:00 AM
WIN-35ILNJ... Update KB5026493 5/5/2023 12:00:00 AM

Can you call get-hotfix to see the packages that installed on your system?

Thanks,
Howard.

@brwilkinson
Copy link
Author

brwilkinson commented Sep 28, 2023

when the server gets deployed (without updates) it works and has the following.

get-hotfix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
server001  Update           KB5028956     NT AUTHORITY\SYSTEM  9/22/2023 12:00:00 AM
server001  Security Update  KB5012170     NT AUTHORITY\SYSTEM  9/21/2023 12:00:00 AM
server001  Security Update  KB5029250     NT AUTHORITY\SYSTEM  9/22/2023 12:00:00 AM
server001  Update           KB5029395     NT AUTHORITY\SYSTEM  9/21/2023 12:00:00 AM

#-----------------------------------------
Package Identity : Package_for_DotNetRollup31bf3856ad364e35amd64~~10.0.4614.6
State : Installed
Release Type : Update
Install Time : 4/5/2023 12:21 AM

Package Identity : Package_for_RollupFix31bf3856ad364e35amd64~~20348.1668.1.8
State : Installed
Release Type : Security Update
Install Time : 4/5/2023 12:33 AM 

Package Identity : Package_for_ServicingStack_166331bf3856ad364e35amd64~~20348.1663.1.1
State : Installed
Release Type : Security Update
Install Time : 4/5/2023 12:20 AM

#-----------------------------------------
Which is from a server 2022 base image: Win_Server_2022_2108.20
#-----------------------------------------

Then as part of the deployment process when we install the latest patch it breaks.

Get-HotFix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
server002  Update           KB5029928     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM
server002  Security Update  KB5012170     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM
server002  Security Update  KB5030216     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM
server002  Security Update  KB5025314                          4/5/2023 12:00:00 AM
server002  Update           KB5030369     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM

encountered an error during hcs::System::Start:context deadline exceeded.

I have tested on Server Core and with GUI and same result.

Based on the other thread it likely wasn't this latest roll up or patch that had the breaking change... however I haven't been able to test on different patch versions. Also when I uninstalled this update the server was still broken, I noticed vmcompute.exe was an even earlier version in this case (after uninstall).

I am using the exact same process to deploy all of the servers so they are identical other than the working servers did not have the updates applied at build time.

Then I noticed the vmcomute.exe did have some changes between the versions.

we should test with this since it's a smaller image and straight from mcr... not that it makes that much difference.

docker run mcr.microsoft.com/windows/nanoserver:ltsc2022 ping localhost -r 5

also docker pull works fine, it's just run that times out.

This is the patch to install top test a repro...
Security patch
September 12, 2023—KB5030216 (OS Build 20348.1970) - Microsoft Support
https://support.microsoft.com/en-us/topic/september-12-2023-kb5030216-os-build-20348-1970-34d4aff3-fd05-4270-b288-4ab6379c7f81

@brwilkinson
Copy link
Author

@Howard-Haiyang-Hao also thank you for the follow up.

If there are some lower level tests or logs that we can look into, please let us know.

@brwilkinson
Copy link
Author

Adding below log from the Application logs:

6:25:02PM - Warning
Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time. [traceID=bb85376784de4797faa7730c7d12ce40 spanID=323e301be76b14aa timeout=4m0s]

6:25:02PM - Error
failed to start container [error=container 6ac12218a7dfcf3e6507274fc678b46fd8f30f2224dcbc0bb27c05c58b543641 encountered an error during hcs::System::Start: context deadline

@brwilkinson
Copy link
Author

Hi @Howard-Haiyang-Hao were you able to get the update installed? Any luck with the repro?

@brwilkinson
Copy link
Author

Adding some further information.

  1. Expanded testing from VMware host machines to Hyper-V host machines. No change in experience.

docker run --isolation=process mcr.microsoft.com/windows/nanoserver:ltsc2022 ping localhost -r 5

ltsc2022: Pulling from windows/nanoserver
452f3ad1d32c: Pulling fs layer
452f3ad1d32c: Verifying Checksum
452f3ad1d32c: Download complete
452f3ad1d32c: Pull complete
Digest: sha256:6562c9a2580260e4f2e3a081cc2cf1d960899cbce7c4568fb851e4848ca50e07
Status: Downloaded newer image for mcr.microsoft.com/windows/nanoserver:ltsc2022
docker: Error response from daemon: container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered
an error during hcs::System::Start: context deadline exceeded.

Event logs

warning

Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time. [spanID=be1a8d4cdaa4cc79 traceID=3c3d99e94347624bac4b3aa9b63f8d8b timeout=4m0s]

errors

failed to start container [module=libcontainerd namespace=moby container=62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 error=container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered an error during hcs::System::Start: context deadline exceeded]

failed to cleanup after a failed Start [module=libcontainerd namespace=moby container=62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 error=container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered an error during hcs::System::Terminate: context deadline exceeded]

Handler for POST /v1.43/containers/62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8/start returned error: container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered an error during hcs::System::Start: context deadline exceeded

OSinfo

WindowsBuildLabEx                                       : 20348.1.amd64fre.fe_release.210507-1500
WindowsCurrentVersion                                   : 6.3
WindowsEditionId                                        : ServerStandard
WindowsInstallationType                                 : Server
WindowsInstallDateFromRegistry                          : 10/2/2023 3:38:09 PM
WindowsProductId                                        : 00454-10000-00001-AA777
WindowsProductName                                      : Windows Server 2022 Standard
OsName                                                  : Microsoft Windows Server 2022 Standard
OsType                                                  : WINNT
OsOperatingSystemSKU                                    : StandardServerEdition
OsVersion                                               : 10.0.20348


get-hotfix | ft -AutoSize

Source         Description     HotFixID  InstalledBy         InstalledOn
------         -----------     --------  -----------         -----------
BWILKINSON-VM1 Update          KB5029928 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM
BWILKINSON-VM1 Security Update KB5012170 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM
BWILKINSON-VM1 Security Update KB5030216 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM
BWILKINSON-VM1 Security Update KB5025314                     4/5/2023 12:00:00 AM
BWILKINSON-VM1 Update          KB5030369 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM

@brwilkinson
Copy link
Author

brwilkinson commented Oct 4, 2023

Today I installed the base image that had been tested as working. A fresh image, not the "corporate" image, via a standard deployment.

Then installed each monthly patch, and tested each iteration of the updates.

SW_DVD9_Win_Server_STD_CORE_2022_2108.20_64Bit_English_DC_STD_MLF_X23-42932

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1668_none_a001fbb8ddff3a5d\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1726_none_9ff35834de0abdb0\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1787_none_9ff94328de056f5b\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1850_none_9fec11d6de0f8be0\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1906_none_9fdbb7c8de1cc2e4\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1970_none_9fe3427ede15da7f\vmcompute.exe

I was not able to reproduce the error.

I am continuing to investigate.

I am leaning towards some "scanning" and "security" software as a possible cause, however will update once I have completed more testing.

@brwilkinson
Copy link
Author

I am out of cycle for deploying more nodes right now, however I expect to continue doing so in the next few days.

Once I have more data, I will provide a further update.

@brwilkinson
Copy link
Author

brwilkinson commented Oct 27, 2023

I will close this for now.

At this time, I don't have a root cause... however I will come back later and add notes in case it's useful for others.

I do have a way to make the vmcompute start working again... however I am just not definitive on exactly why or which single things breaks the service.

1 time restarting vmcompute (and then rebooting) brought the containers online
1 time uninstalling 2 enterprise monitoring apps brought the containers online
1 time uninstalling 1 enterprise monitoring app brought the containers online (also I have reproduced this several times).

I can only assume that the vmcompute is sensitive to something, however I am unsure what exactly OR why uninstalling the apps allowed the containers to start as normal. Also reinstalling the same combination of apps did not break the container services.

So I have a workaround, however not any sure solution on what exactly the issue is or was.

I will provide an update in the future.

@brwilkinson brwilkinson closed this as not planned Won't fix, can't repro, duplicate, stale Oct 27, 2023
@brwilkinson
Copy link
Author

brwilkinson commented Nov 3, 2023

In case it's helpful for others...

In our environment I have been able to reproduce the issue many times.

  1. The container timeout error occurs.

    • Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time. [timeout=4m0s traceID=6b59dd25c46e317d63303f745be2ed33 spanID=540b21032cc68875]
    • failed to cleanup after a failed Start [namespace=moby container=1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324 error=container 1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324 encountered an error during hcs::System::Terminate: context deadline exceeded module=libcontainerd]
    • Handler for POST /v1.43/containers/1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324/start returned error: container 1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324 encountered an error during hcs::System::Start: context deadline exceeded
  2. Stop the "SplunkForwarder" Service.

  3. The container starts successfully.

  4. Start the "SplunkForwarder" the containers continue to work.

I have not seen any regression on this afterwards.

I have a Guess that it maybe related to the network setup for the "Microsoft-Windows-Hyper-V-VmSwitch" etc.

  • However this is only based on reviewing logs in the Application and System Events between stopping/starting the Splunk service as well as running docker.

If I have more time I will continue to capture these states as I deploy more new resources in the future.

@brwilkinson
Copy link
Author

@Howard-Haiyang-Hao @akarshm just a ping for this issue, which I closed, however is worth follow up in:

@brwilkinson brwilkinson changed the title vmcompute issue with latest patch - encountered an error during hcs::System::Start: context deadline exceeded. vmcompute issue with latest patch - encountered an error during hcs::System::Start: context deadline exceeded related to (Splunk Universal Forwarder) Nov 30, 2023
@hopenbr
Copy link

hopenbr commented Jan 4, 2024

@brwilkinson ,
stopping the splunk forwarding svc, restarting docker service, and starting the splunk fwd service seems to have work for me. Thanks a million for that update.

Any further issues? Reoccurrences?

@brwilkinson
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants