`--isolation=process` fails sporadically on machines using SplunkForwarder #374
Comments
Hi! Thanks for posting your issue. When I usually see errors like this one, it's because the host is short on resources. However, it's interesting that you mentioned everything worked fine before the May 2023 (KB5026370) update. I'm going to create an internal ticket (#44926683) to see whether the update is causing any issues or if we can suggest any mitigation strategies. We'll keep you posted when we have an update.
Resources on this system should not be an issue; it has 12 GB of RAM and 6 vCPU cores. Space on the disk should not be an issue either; it is ~100 GB. We have still not had a chance to reproduce the issue on bare metal, i.e. without ESXi.
I agree that it doesn't seem like a resource issue. I used fewer cores and less memory and it started up for me. If you feel up to the challenge, there are two things you could try. I'll keep trying to repro your error in the meantime.
Task 1: Get a fresh Windows Server Core 2022 image.
Task 2: See what's timing out and blocking the container startup process.
Sidenote: I know the second one can be confusing. We don't expect our customers to troubleshoot their own problems, but if you CAN find the processes related to your container session and share them, we may get some insight into what's going on.
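If it helps, here's a rough sketch of where you could start for Task 2. The service/process names and the event log channel are assumptions on my part and may differ on your host:

```powershell
# State of the services involved in starting Windows containers
Get-Service -Name docker, vmcompute | Format-Table Name, Status, StartType

# Processes that commonly take part in a container session (names are a best
# guess; share whichever of these actually exist on your host)
Get-Process -Name dockerd, containerd, vmcompute, CExecSvc -ErrorAction SilentlyContinue |
    Format-Table Id, ProcessName, StartTime

# Recent errors/warnings from the Host Compute Service (channel name is an
# assumption; adjust it if your build names the log differently)
Get-WinEvent -LogName 'Microsoft-Windows-Hyper-V-Compute-Operational' -MaxEvents 50 |
    Where-Object { $_.LevelDisplayName -in 'Error', 'Warning' } |
    Format-Table TimeCreated, Id, Message -Wrap
```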
Task 1 we have attempted (including spinning up a brand new, fresh VM and installing Docker from scratch). As for Task 2, I am more than willing to troubleshoot; my issue is that I don't know where to start. This is good information that I did not know, and I will attempt it this week. We are also working in parallel to stand up our WS2022 image on some loaner hardware to see if it's something about the VM that is causing this. Unfortunately, due to internal policy requirements, this will take some time. One question I did have:
@ntrappe-msft We were able to test this on some loaner hardware, just a laptop. It was not connected to our domain, so some security software (in monitor-only mode) was not running, but it should be a 1:1 test otherwise. The test of `--isolation=process` on the laptop worked. With the release of KB5027225 and KB5027544, our test VM was updated to include these. Furthermore, a new Windows Server Core ltsc2022 image was released (a0cb8066ecaf). We removed all Docker images on the VM and re-downloaded them to match what was on the laptop. Still, process isolation did not work. The Docker version was still 24.0.2 on both. Presently, we have the following situation (assuming ltsc2022@a0cb8066ecaf on both):
So, we currently believe this to be some bug with WS2022 guests on ESXi. While we do have some evidence that this was working before KB5026370, we have not been able to reproduce a working setup by just reverting the KB patch. Any advice for further debugging would be greatly appreciated, as I think we are stuck on our end. The timeout is still happening with `--isolation=process`. Another thing to add: I am more than willing to do some deep, technical troubleshooting and provide logs or additional information about our infrastructure, but as of right now I don't really know what is useful or relevant.
Hmm, OK, this is interesting. I'll do a bit more digging with the new information you've given. Thanks for being so open to helping debug; I may take you up on that if I can't repro the error you're getting.
We're currently waiting on a VM with the VMware hypervisor to repro on our end. Thanks for your patience.
This issue has been open for 30 days with no updates.
Still working on this. I was OOF for 2 weeks, so I apologize for the delay.
I appreciate the update. We've also been working on our end to open a ticket with VMware; it has taken a lot of back-and-forth for our virtualization team to convince them this is a bug on their end.
Hi, I'm currently waiting on an Azure VM running VMware. Once I have that, we'll try to repro your bug!
This issue has been open for 30 days with no updates.
Hi @ntrappe-msft, just wanted to follow up with what we've found on our end. We have reached out to VMware, but have been told that we must get support via Docker, as VMware doesn't support individual applications, even with their production support. They said Docker would have to troubleshoot with them directly, after we confirm with their support that this is a VMware issue. In truth, while the evidence points to it, we don't know for certain whether VMware is the real issue, and we can't debug any further on that end. They also pointed us to this documentation from Docker claiming that nested virtualization is "not supported". (I know that page is for Docker Desktop, but IIRC it's still running the same container runtime underneath, perhaps with some proprietary changes?) The above is problematic for us for a couple of reasons:
Given the above, we have concluded that Docker on Windows with nested virtualization is too difficult to support for our company at this time, and have instead procured physical hardware.
I apologize for backtracking, but now that we have the physical hardware set up, it seems as if we were wrong. Process isolation still is not working. Since this has nothing to do with VMware, we're going back to potentially investigating our security/compliance software. Unfortunately there's not much I can share about our infrastructure without permission, but please see the below:
System Info
Docker info
Servercore image
System specs
Processor information
Memory: 256 GB
Storage: 2 TB HDD
OS Version: Microsoft Windows [Version 10.0.20348.1970]
I would greatly appreciate some more information on HCS, specifically what syscalls or Windows features it's calling into that could be affected by:
Thank you for doing such a thorough investigation and attempting to resolve this on your end. Our intuition is that this probably isn't an issue with the VMware hypervisor but something else. I'll try to repro this using all the specs you provided and hopefully we can see your error. Otherwise, I'll work on getting you the information we're allowed to share on HCS.
Curious update that may either a) make everything clearer on Microsoft's side, or b) make everything more confusing on Microsoft's side. Hopefully the former! I have been able to reproduce getting `--isolation=process` to work by toggling the Hyper-V role.
Reproduction steps
I have successfully tested this 3 times in a row. I'm pretty baffled, but it does work. Note that process isolation still feels slow; it takes about 30 seconds to a minute to actually start the container. I am guessing this is security-software related. EDIT: Correction, rebooting the machine between installing/removing Hyper-V is consistent. Installing/removing in the same boot only sometimes works.
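For reference, the toggle itself is roughly the sequence below. This is only a sketch: it assumes Windows Server (where the role is managed with the *-WindowsFeature cmdlets) and an elevated shell, and I'm not certain whether the add or the remove is the step that actually matters:

```powershell
# Remove the Hyper-V role and reboot...
Uninstall-WindowsFeature -Name Hyper-V
Restart-Computer

# ...then, after the reboot, add it back and reboot again
Install-WindowsFeature -Name Hyper-V -IncludeManagementTools
Restart-Computer
```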
Sorry for the delay, I'm working on using a separate VMware hypervisor to try to repro this. I'll update the Issue next week with my results.
Hi @ntrappe-msft, We have been doing some renewed debugging at our company, looking into exactly when/how Docker is failing, and we've made some progress. To start: it's confirmed not a VMware issue, or at least not only a VMware issue. Second, while toggling Hyper-V on/off does seem to work, it's not as consistent as I first implied. In fact, what really seems to matter is when the Docker engine starts relative to the SplunkForwarder service: on each boot we have to stop SplunkForwarder, start Docker and a first container, and only then start SplunkForwarder again.
Docker then works for the remainder of the boot, i.e. this process needs to be repeated once per boot. For now, we are switching debugging efforts to investigate what these services possibly have in common with one another and where they overlap. As far as we can tell, SplunkForwarder only needs a port or two open, and does not use any virtualization/containerization features. So this is perhaps a firewalling issue, but it is not obvious why this would cause Docker to completely fail. Any insight on Microsoft's side on how this could happen is appreciated. If @brwilkinson has any insights on his end, that would also be appreciated. Hopefully we can all share notes and figure this out.
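In case it helps anyone else hitting this, the per-boot sequence looks roughly like the sketch below (not our exact script; the service names 'SplunkForwarder' and 'docker' are the defaults and may differ on your hosts):

```powershell
# Per-boot workaround (sketch): get Docker's first container started while
# the SplunkForwarder service is stopped, then bring Splunk back up
Stop-Service -Name SplunkForwarder
Restart-Service -Name docker
docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022 cmd /c echo ok
Start-Service -Name SplunkForwarder
```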
Hi, thank you for all the work you're doing debugging! I'm trying to repro the race condition now. Out of curiosity, can I ask what you're using the SplunkForwarder service for?
Just accidentally spotted this. I can confirm having the same issue, now for years, with our Windows container workloads; we narrowed it down to the SplunkForwarder workaround a long time ago. We're still using Windows Server 1809 and ltsc1809 container images. We got nothing from engaging Splunk support, and seeing this discussion, it seems like we're doomed to keep using the stop-start workaround in the newer Windows versions as well.
Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if Docker is run after the SplunkForwarder service starts up?
We have significant evidence that Splunk prevents Docker from starting containers. Here are some mitigation tips while we try to investigate what's going wrong:
Tip 1:
Tip 2:
@ntrappe-msft Yes, the containers always fail to start if the Splunk forwarder is running; however, I am only speaking from Windows Server 2019 experience. The reliable way to start any container in my case is to stop the Splunk forwarder, then start containers, then start the Splunk forwarder. What you're suggesting in "Tip 1" above certainly doesn't work. In Tip 2, do you actually mean Windows event logs? (We do ingest all event logs into Splunk too; there are cases where it's useful, like searching for a certain application crash error across the fleet of hosts.)
Can we please update the title of this issue to map to the current findings? E.g. from "--isolation=process does not work on WS2022 despite equivalent kernel version" to "--isolation=process fails sporadically on machines using SplunkForwarder".
@ArtyomBaranovskiy I'm sorry to hear that even Tip #1 doesn't work for you (some people have had success with it). And yes, for Tip #2 I do mean the Windows event logs.
Thanks for everyone's patience. We need to understand more about how the SplunkForwarder could be interfering with Docker, so we've reached out to our contacts at Splunk. I'll update this when I learn more.
I didn't want to update this issue before actually testing it myself, but we've found another clue, and since @ntrappe-msft is reaching out to Splunk, I think it's relevant to bring up now. Splunk may be interfering with named pipes, see this page. Once we have the chance, I am going to try running the Docker daemon on a TCP socket and see if that works without any workarounds. I encourage anyone who is able to try this to do so. Our main machine is under heavy use during the week, but I was going to attempt it this weekend.
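If anyone else wants to try the same thing, roughly what I have in mind is sketched below (the daemon.json path is Docker's default location on Windows Server and the port is arbitrary):

```powershell
# Sketch: make dockerd listen on a local TCP socket in addition to the default
# named pipe. In C:\ProgramData\docker\config\daemon.json:
#   { "hosts": ["tcp://127.0.0.1:2375", "npipe://"] }
Restart-Service -Name docker

# Point the client at the TCP endpoint for this session only
$env:DOCKER_HOST = 'tcp://127.0.0.1:2375'
docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022 cmd /c echo ok
```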
Listening on a TCP socket did not work. I still got the "context deadline exceeded" error. So, Splunk is somehow interfering with Docker at a lower level than the host socket.
If this is an issue with the Splunk forwarder, it would be more appropriate to raise this on the appropriate channels with them.
@augusew1 Sorry this has been such a lengthy process. I just spoke to a Splunk engineer who said we have to follow the formal channels to get support. Would you mind raising a case with Splunk about the problem you're experiencing? You can add me to the case so Splunk has a Microsoft contact if they need our support.
I appreciate the update. We are going through formal channels with Splunk and we'll forward this information to them.
@augusew1 I had a case open with Microsoft for 8 months on cascading container failures and crashing containers that could never recover, with zero progress made. The only thing that worked for me was Hyper-V isolation. We had servers with 128 cores and 2 TB of RAM that could only run 20 containers, with a max of 3 SHIR jobs per instance, under process isolation; it was a huge waste of machines. But with Hyper-V isolation the issues magically disappear. Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container#7
@nickcva I'm sorry to hear that you got no support. I don't think that repo is very active. Could you raise an Issue on this GitHub repo with information about your host, repro steps, behavior, etc.? I'll make sure we give it some attention.
We got an update from Splunk support: they have investigated this issue before and concluded that it must be a Docker issue, not a Splunk one, specifically that "the conflict causing Docker to become unstable does not stem from a defect or bug within [Splunk]". Therefore they advised us to reach out to Docker support about this. Our company does not have a contract with Docker as far as I know, so that might be a dead end for us if I can't convince someone to get support. I looked more into Docker and did discover that they have a logging plugin for Splunk. However, we are not using this, as seen by the following:
Still, I wonder if the presence of Splunk along with the logging plugin could possibly be causing this?
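In case anyone else wants to check the same thing on their host, one way the Splunk log driver usage can be verified is sketched below (output formats assumed):

```powershell
# Daemon-wide default logging driver and the log drivers the daemon knows about
docker info --format '{{.LoggingDriver}}'
docker info --format '{{.Plugins.Log}}'

# Per-container logging driver (should be json-file or similar, not splunk)
docker ps -aq | ForEach-Object {
    docker inspect --format '{{.Name}}: {{.HostConfig.LogConfig.Type}}' $_
}
```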
This issue has been open for 30 days with no updates.
Hi, thank you again for being so persistent and trying to tackle this Issue. I also could not make forward progress with Splunk, but I'm so glad you at least found that Docker may be the main culprit. I'm going to ping some of the people I know at Docker to see if they're aware of this Issue and if there's a mitigation strategy they'd recommend.
@augusew1 Alrighty! Some very nice people at Docker are taking a look at this.
Hi @augusew1, I've tried to reproduce this issue on Windows 11 Pro 23H2 (build 22631.3296), with SplunkForwarder started before dockerd.exe and no specific config. Unfortunately, I can't reproduce. I think your Splunk Forwarder config might be at play here. First, what version of Splunk Forwarder are you running? Could you also tell us what you set in your inputs? You could try to disable all your inputs and then re-enable them one by one until you find the culprit. The error message you reported initially indicates a timeout during a call to vmcompute.dll. It seems this call returns a response asynchronously, but the error message gives no clue about whether the timeout happens during the request or while waiting for a response. I'll follow up with @ntrappe-msft and see if we can find some debugging command that could help pinpoint what's deadlocked within vmcompute.dll.
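If it helps, Splunk's btool can dump the effective inputs configuration, annotated with the .conf file each setting comes from, which makes the disable/re-enable loop easier to track (a sketch; the path below is the default Universal Forwarder install location and may differ on your host):

```powershell
& 'C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe' btool inputs list --debug
```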
Sorry @akerouanton, maybe I'm missing something, but the original issue is about Windows Server running Windows containers in process isolation. AFAIK, Docker on Windows 11 only supports Hyper-V isolation, so I'm not surprised you cannot reproduce the issue.
@ArtyomBaranovskiy The last section of the 1st link you posted shows that containerized processes running with process isolation appear on the host, unlike when run with Hyper-V isolation. If I launch ping in a container started with `--isolation=process`, I can see it from the host:

```
PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID

  Id ProcessName SessionId
  -- ----------- ---------
6972 PING                2
```

When ping is run in a container started without `--isolation=process`:

```
PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID
Get-Process : Cannot find a process with the name "ping". Verify the process name and call the cmdlet again.
At line:1 char:1
+ Get-Process -Name ping | Format-Table Id, ProcessName, SessionID
+ ~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (ping:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand
```

And when ping is run on the host:

```
PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID

   Id ProcessName SessionId
   -- ----------- ---------
10224 PING                0
```

Notice that the Session ID isn't the same between the first and the last example. This confirms I'm running with process isolation 😉 Unfortunately I don't have access to Windows Server right now, but I'm not entirely sure that really matters.
Docker is not configured with anything extra; the full contents of the daemon's config is:
We do not do any custom logging of individual containers; they are run as-is:
In particular, we don't have any long-running containers running any services; they are used for GitLab's CI/CD. But to be clear, this issue is reproducible outside of GitLab by just running a shell. Splunk is run as a system service, run as
However, in talking with our IT department I was given the following advice:
The output of that command is exceptionally large, and I can't tell what's a default config and what's a modified value, so I'm not sure how useful it is. The next thing I will try is working with our IT department to disable configs in Splunk to see if one particular configuration is causing issues.
This issue has been open for 30 days with no updates.
I am seeing this issue as well. We disabled all of the apps being used by the Forwarder and still the issue remained. This was seen when isolation was set to process for the Windows container being run. We are using Docker as well.
We are encountering this issue in our environment. Did this thread die because a solution or workaround was found, or is this still being looked at? Thanks!
This issue has been open for 30 days with no updates.
I am still seeing instances of this occurring. I will try to open a support case with Splunk.
This issue has been open for 30 days with no updates.
We are seeing a similar issue to the one described above on a Windows Server 2022 Docker host running Docker Community Edition (as opposed to Docker Desktop). I can't start a container when the SplunkForwarder service is running.
UPDATE
Please see this comment for the latest on this bug
Original Report:
Describe the bug
`--isolation=process` does not work on Windows Server 2022 with an equivalent image of windows-servercore.
To Reproduce
docker pull mcr.microsoft.com/windows/servercore:ltsc2022
docker run -it mcr.microsoft.com/windows/servercore:ltsc2022
The container will fail with the following error:
Expected behavior
The container should run
Configuration:
Output of `docker info`:
Additional context
`--isolation=hyperv` does work.
Currently, our best guess is some combination of the following is creating an issue:
--isolation=process
We are working on testing this on a bare metal machine to see if ESXi issues can be ruled out.
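For reference, the Hyper-V isolated variant of the repro command above, which does start successfully:

```powershell
docker run -it --isolation=hyperv mcr.microsoft.com/windows/servercore:ltsc2022
```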