Wepoll loses events #35
Comments
Your use case as you describe it should work, and I can't think of any potential gotchas in wepoll itself that could cause the behavior you're describing. The question is whether this is a bug in wepoll, in your code or in Windows. Here's what I would recommend:
|
After a bit of trial and error, I managed to create a fairly small reproducer, posted here: https://gist.github.com/djelinski/d4e8456c197576c355100b25266b9cdd
On failing systems (Windows 2016 / 2019), the output is something like:
Removing the socket from the interest set and re-adding it enables the program to make progress.
|
I created another reproducer that re-adds a socket to the interest set if no events have been reported for over 5 seconds. The event is then reported immediately, but another event on the same socket is lost soon afterwards. Events are only lost on sockets that are polled for read and write at the same time; sockets that only send or only receive do not lose events. Replacing the per-port critical sections in wepoll with a single global critical section does not change the outcome: events are still lost. On the other hand, if a socket is always used from the same thread, the problem does not reproduce. The problem is also reproducible on Windows 11 with TCP sockets; it does not reproduce there with UDP sockets because writes to UDP sockets never block on Windows 11. |
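The 5-second check described above could be sketched roughly as follows. This is not the reproducer's code: `struct watch`, `watch_arm`, `watch_event`, `watch_is_stalled` and `STALL_TIMEOUT_MS` are hypothetical names, and timestamps are assumed to come from some monotonic millisecond clock supplied by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold, taken from the "over 5 seconds" in the description. */
#define STALL_TIMEOUT_MS 5000u

/* Hypothetical per-socket bookkeeping for one epoll handle. */
struct watch {
    uint64_t armed_at_ms; /* monotonic time at which interest was registered */
    bool armed;           /* true while we are still waiting for the event */
};

/* Call right after registering interest (EPOLL_CTL_ADD). */
void watch_arm(struct watch *w, uint64_t now_ms)
{
    w->armed = true;
    w->armed_at_ms = now_ms;
}

/* Call when the one-shot event has been reported. */
void watch_event(struct watch *w)
{
    w->armed = false;
}

/* True if the socket is still armed and nothing has been reported for more
 * than STALL_TIMEOUT_MS. This only indicates a lost event if the caller
 * knows the wait condition should already have been satisfied. */
bool watch_is_stalled(const struct watch *w, uint64_t now_ms)
{
    return w->armed && (now_ms - w->armed_at_ms) > STALL_TIMEOUT_MS;
}
```

When `watch_is_stalled` returns true, the caller would remove the socket from the interest set, add it again, and call `watch_arm` once more.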
I modified the reproducer to use EPOLL_CTL_MOD instead of EPOLL_CTL_DEL + EPOLL_CTL_ADD, and the issue no longer reproduces. The updated code can be found here: @piscisaureus, let me know if this is enough information to address this issue. |
Created a TCP-based reproducer. Unlike the UDP-based one, the TCP one also fails on more modern Windows versions like Windows 11 or Windows 2022. The reproducer can be found here: We are using EPOLL_CTL_DEL + ADD because of #34. We will probably invent a different workaround for #34, but we would like to understand what is going wrong here. |
I suspect the root cause is something along these lines:
It might be possible to solve the problem by keeping a deleted socket in the socket tree (rather than in a specialized deleted-socket queue on the side) until its AFD_POLL operation completes, so that if it is deleted and re-added, the poll group association can be maintained. However, one potential problem I see with this is that if the user actually closes the socket after deleting it from the epoll set and then creates another socket, and Windows reuses the same socket handle, wepoll might not resubmit the poll operation when it should. |
@piscisaureus thanks for your reply. Will EPOLL_CTL_ADD create a new group if the original is not full? I thought it only creates a new group if all existing ones monitor 32 sockets already; here we only monitor 2 sockets at a time. (also, we have exactly one thread calling epoll_wait per epoll handle, but that probably doesn't matter here) |
Also, when I comment out this line, the bug no longer reproduces. This code path is only hit when multiple threads are polling the same epoll handle at the same time. |
Right. That line is also reached when the epoll handle is updated while another thread is waiting on it, which is exactly what happens here. We can't really comment out this line, because then the updates would only be applied on the next epoll_wait call, which can happen much later (or never). One other interesting thing I noticed is that sometimes we get uninteresting events from afd_poll. Specifically, if I add the following lines:
before these, I usually (but not always) get logs like the following:
This seems to suggest that AFD is somehow mixing up the interest sets. If we never poll the same socket handle from 2 distinct epoll handles at the same time, the problem doesn't reproduce. |
It looks like AFD sometimes mixes up different IO status blocks related to the same socket handle. I added more logging and found that sometimes, when a socket gets stuck, GetQueuedCompletionStatusEx returns an event that was intended for another poll handle. In other words, if we have 2 outstanding requests for a socket, one for EPOLLIN and the other for EPOLLOUT, sometimes the EPOLLOUT event is delivered to the request waiting for EPOLLIN, or vice versa. When that happens, the other request never completes. The requests necessarily use different completion ports and different AFD device handles.
As an experiment, I changed afd_create_device_handle to use unique names for each AFD device (
In another experiment, I changed the per-port critical sections to a global critical section to make sure that all epoll_ctl_add operations are globally serialized. That didn't fix the problem either, which suggests that the problem might be a race between epoll_ctl_add and GetQueuedCompletionStatusEx, which runs outside of the critical section.
As mentioned earlier, using EPOLL_CTL_MOD seems to fix this issue. With EPOLL_CTL_MOD, requests are only added to the poll set by the thread that calls GetQueuedCompletionStatusEx, which further supports the hypothesis.
As suggested above, commenting out
With the above in mind, I think the remaining options worth exploring are:
did I miss anything? |
Background
We use two epoll handles: one is used only for EPOLLIN events, the other only for EPOLLOUT events. Both are always used with EPOLLONESHOT.
Each epoll handle has a dedicated thread that runs epoll_wait in a loop, and schedules tasks to run in other threads in response to polled events.
Other threads register interest ops with EPOLL_CTL_ADD, and when the event is polled, deregister interest with EPOLL_CTL_DEL. Any given socket handle can be registered at most once with each of the above epoll handles, once for reading, once for writing.
The problem
Occasionally one of the events doesn't fire, even if the wait conditions are satisfied. When that happens, the socket handle is still registered in port_state->sock_tree in state SOCK__POLL_PENDING.
This happens frequently on Windows 2016 and less frequently on Windows 2019 and Windows 10; we haven't observed it yet on Windows 11 or 2022. The problem is reproducible with a piece of code that uses multiple UDP sockets talking over the loopback interface, and on Windows 2016 it usually reproduces within the first few minutes of run time. We haven't been able to produce a minimal reproducer yet.
I'm not sure if it's related, but there's some cross-talk when polling the same socket with 2 epoll handles; I added some extra logging and found that we frequently receive AFD_POLL_SEND events on the EPOLLIN handle. This happens on all systems, including Windows 11.
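The kind of check behind that logging can be sketched as a simple mask comparison. This is an illustration, not the logging code I actually added: `stray_events` is a hypothetical helper, and the two constant values are given as I recall them from wepoll's AFD definitions, so treat them as assumptions and verify them against the source.

```c
#include <stdint.h>

/* Assumed values; check them against wepoll's AFD definitions. */
#define AFD_POLL_RECEIVE 0x0001u
#define AFD_POLL_SEND    0x0004u

/* Returns the bits set in `reported` that were not part of `requested`.
 * A non-zero result means the poll completed with events this request
 * did not ask for. That is a hint worth logging, not proof of a bug. */
uint32_t stray_events(uint32_t requested, uint32_t reported)
{
    return reported & ~requested;
}
```

For a request that only asked for `AFD_POLL_RECEIVE`, a completion carrying `AFD_POLL_SEND` would yield a non-zero result and be logged.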
How can I troubleshoot it further?