Possible memory leak on new_server_connection
#499
Comments
I was more or less with you until the "If you look at" part. Can't you use a hammer? Cut all references to the TCP stack as close to the root as possible and see if the leak goes away, maybe disconnect the whole TCP stack? We're just trying to make sure that we're not chasing some red herring. The second idea is to count the number of waiters in Lwt, if your assumption is correct. Or also, just forcibly cancel the thread.
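Something along these lines would be enough for the counting experiment (a rough sketch; `tracked_async` and `live_threads` are made-up names, nothing from Lwt or mirage-tcpip):

```ocaml
(* Rough sketch: wrap Lwt.async so we can count how many background
   threads are still pending at any given time. *)
let live_threads = ref 0

let tracked_async (f : unit -> unit Lwt.t) : unit =
  Lwt.async (fun () ->
      incr live_threads;
      Lwt.finalize f (fun () -> decr live_threads; Lwt.return_unit))
```

If the counter keeps growing after every connection has been closed, the background threads are indeed never terminated, and `Lwt.cancel` on the stored thread would be the hammer.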
From what I can tell, this thread appears for all new server connections (via `new_server_connection`). So this thread is started every time I get a new connection. Again, I haven't looked into it in detail, but it seems to me that this thread is only cancelled by:
Lines 455 to 461 in 4bae549
EDIT: about Line 166 in 4bae549
I did not count. In our case, only
I don't really understand what you are asking. I only noticed the leak when I apply my stress test. If we introspect the beginning of the stress test, everything seems to work fine; it is only at a certain load level that the leak appears. Again, I have checked that, at the application level:
Bob is not doing anything other than handling incoming connections.
As I said, it is difficult for me to go any further; this issue is mainly a report of a behaviour that I can reproduce. Unfortunately, I don't really have time to look further 😕.
👍 It's nice that it's written down here, hopefully someone picks it up :)
Hi,
I am currently trying to stress-test a unikernel which uses the `mirage-tcpip` stack and to see whether we have a memory leak. I audited my application to check that new connections are properly closed, so that, if we do have a memory leak, it is probably due to the underlying layer, in my case `mirage-tcpip`.

The stress test is a simple relay (compiled for Solo5/hvt and available here: bob-hvt). You can check the source code here: https://github.com/dinosaure/bob. The situation is: I launch 512 clients and check whether they can find their peers from a shared password. On the other side, I monitor the relay.
The script (`bash` + `parallel`) to launch the stress test is here. It requires `bob.com` as the client, a random `file.txt` and the relay deployed somewhere (by default, we use the one deployed at `osau.re`, my server). If you launch the stress test, you can see that the relay lets clients solve the initial handshake and allocates a secure room to let the associated peers share something (in my example, a simple file).

If I launch the relay with 40M of memory, you can run this script only once. The next run will raise an `Out_of_memory` exception on the relay side. I tried to introspect this memory leak. `bob` is quite simple (no TLS, just a bunch of packets between peers). I used `mirage-memprof` to introspect the memory used by my relay. You must change the `config.ml` and the `unikernel.ml` accordingly.
Then, just before launching the unikernel with `solo5-hvt`, you can retrieve the trace via `netcat -l -p 5678 > bob.trace`. After that, you can open the file with `memtrace-viewer` (on OCaml 4.13.1). With `memtrace-viewer`, I tried to find what is allocated and still alive just before the `Out_of_memory`. The nice thing is: if you launch `stress.sh` once, wait a few seconds and relaunch it, you can see a "plateau" where almost everything should be garbage. The second launch of `stress.sh` will trigger the GC, but it seems clear to me that the GC did not collect everything. On this picture, we can see the GC in action; a few objects allocated earlier are collected.
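As an aside (this is not something the original report does): one cheap way to double-check that such a plateau is real retention, rather than garbage that simply has not been collected yet, is to force a full major collection and compaction between the two runs and look at the live words. A minimal sketch:

```ocaml
(* Sketch: force a full GC and report how many words are still live.
   If this number stays high once every connection has been closed,
   the memory is genuinely retained, not merely uncollected garbage. *)
let report_live_words () =
  Gc.compact ();
  let st = Gc.stat () in
  Printf.printf "live words after compaction: %d\n%!" st.Gc.live_words
```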
If we introspect further, we can see that some allocations are kept. Let's take an area where we have such an allocation:
If we look at what is allocated, `memtrace-viewer` mentions `new_server_connection`. Moreover, it seems that a "thread" is kept alive indefinitely, especially the `send_empty_ack` "thread":

mirage-tcpip/src/tcp/flow.ml
Lines 145 to 166 in 4bae549
As you can see, this thread is an infinite loop. This is where the control flow becomes complex: it seems that such a thread can be cancelled if another thread raises an exception:

mirage-tcpip/src/tcp/flow.ml
Lines 384 to 394 in 4bae549
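To make the mechanism concrete, here is a minimal, self-contained sketch of the same pattern (this is not the actual `flow.ml` code): an infinite Lwt loop only stops when another thread cancels it, which makes the pending sleep fail with `Lwt.Canceled`. `Lwt_unix.sleep` stands in for the Mirage time device.

```ocaml
open Lwt.Infix

(* Not the real send_empty_ack: a loop that never returns on its own
   and only terminates when it is cancelled from the outside. *)
let rec ack_loop () =
  Lwt_unix.sleep 0.1 >>= fun () ->
  (* ... send an empty ACK here ... *)
  ack_loop ()

let () =
  Lwt_main.run begin
    let t = ack_loop () in
    Lwt_unix.sleep 1.0 >>= fun () ->
    Lwt.cancel t;  (* without this, [t] stays pending forever *)
    Lwt.catch
      (fun () -> t)
      (function Lwt.Canceled -> Lwt.return_unit | exn -> Lwt.fail exn)
  end
```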
But it's clear that in my situation the cancellation never happens. It's really hard to follow what is going on because it's a mix of the Lwt control flow (bind & async), exceptions and cancellation... I probably missed something about the cancellation of the `send_empty_ack` thread, but the reality (`memtrace-viewer`) shows me that this thread is never cancelled. Because this thread requires the `pcb` value, it keeps multiple values alive, in particular the initial `Cstruct.t` given by `Netif`. So, of course, if this problem appears several times, we just accumulate `Cstruct.t` values and a fiber under the hood.
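Said differently, the shape of the retention is roughly the following (an illustration only, with a made-up type and buffer size, not the real `pcb`): as long as the background loop is pending, its closure keeps the whole buffer reachable, and this repeats for every connection.

```ocaml
open Lwt.Infix

(* Illustration: [fake_pcb] stands in for the real pcb record and
   [buffer] for the initial buffer handed over by Netif. *)
type fake_pcb = { id : int; buffer : Bytes.t }

let start_background_loop id =
  let pcb = { id; buffer = Bytes.create 4096 } in
  (* The loop closes over [pcb]; while it is pending, the buffer can
     never be collected. One leaked loop per connection means the
     buffers (and one fiber each) simply accumulate. *)
  Lwt.async (fun () ->
      let rec loop () =
        Lwt_unix.sleep 0.1 >>= fun () ->
        ignore pcb.buffer;
        loop ()
      in
      loop ())
```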
On the other side (the application side), I added a few logs which let me "track" connections. If you launch `stress.sh` the first time and inspect the logs, you can see that we have 512 new connections and that `Flow.close` is called 512 times (the accounting is roughly the pattern sketched below). That mostly means that, at the application level, we should not have a leak in our usage of the TCP/IP stack (however, if you look at `memtrace-viewer`, you can see a few references to what Bob does, and that's normal: we want to keep a global state of the relay).

Finally, I don't expect a fix because I know that this project has become more and more complex and it's difficult to really understand what is going on. But if anyone has time to look at this leak in more detail, it would be very nice.
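The application-side accounting mentioned above is nothing more sophisticated than this kind of wrapper (a sketch with made-up names, not Bob's actual code):

```ocaml
(* Sketch: count accepted connections and completed handlers, and log
   both numbers so they can be compared after a stress run. *)
let accepted = ref 0
let closed = ref 0

let on_connection flow handle_flow =
  incr accepted;
  Lwt.finalize
    (fun () -> handle_flow flow)
    (fun () ->
       incr closed;
       Logs.info (fun m ->
           m "connections: %d accepted / %d closed" !accepted !closed);
       Lwt.return_unit)
```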