-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory leak when using OpenSSL and threads #1942
Conversation
This is present in the 0.11.0 release, I guess? Is it critical to fix? Do we have a CI job that tests this configuration? |
For what it's worth, I'm not able to reproduce the leak myself—@ashman-p could you let me know what Linux environment you're running so I can test various configs out in Docker images? |
@SWilson4 , Thanks for checking on this issue. liboqs Build cmake -DBUILD_SHARED_LIBS=ON -DOQS_USE_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug -GNinja .. |
@SWilson4, the problem occurs when use of OQS_USE_SHA3_OPENSSL=ON and threading is active. The only significance of the use of threads is that the problem went away when threading was disabled. |
OK, I've managed to reproduce the leak, but only when building against OpenSSL >= 3.3.2. In particular, the leak does not occur when building against OpenSSL 3.3.1 with the same configuration. It seems to me that this might actually be an OpenSSL bug rather than a liboqs bug, especially as the fix proposed here should (?) be unnecessary: per the OpenSSL docs:
Looping in @baentsch @romen @beldmit @levitte: any knowledge of a regression going from 3.3.1 to 3.3.2? I will try to isolate the exact commit which introduces the leak. |
At first glance nothing suspicious |
It seems that this leak was introduced in openssl/openssl@83efcfd. When building against the parent of that commit (openssl/openssl@a13df68) the leak does not occur. |
^ fyi @beldmit |
@nhorman could you please take a look? |
FWIW, i don't see anything expressly wrong with the changes, its fine to call OPENSSL_init_crypto in your application rather than having the library do it as needed, but I'm having a hard time understanding:
I'll try reproduce here, but since you already have it set up, can you re-run valgrind with --leak-check=full to show the stack trace of where the leaked memory is being allocated? |
Sure, here's the valgrind output:
OpenSSL is built with cd ~/openssl
git checkout 83efcfdfa1
./config --debug --prefix=`pwd`/../.localopenssl_83efcfdfa1_debug && make -j 4 && make install_sw install_ssldirs liboqs is built with cd ~/liboqs
git checkout 7f4c89b2
mkdir build && cd build
cmake -GNinja -DOPENSSL_ROOT_DIR=../.localopenssl_83efcfdfa1_debug -DOQS_MINIMAL_BUILD="SIG_ml_dsa_65" -DOQS_USE_SHA3_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug ..
ninja The triggering command is valgrind --leak-check=full ~/liboqs/build/tests/test_sig ML-DSA-65 |
Thank you. Thats interesting, I just tried to recreate with the head of the master branch, and didn't encounter the leak. Given that you reported the issue was introduced with commit 83efcfd (which is a backport of the original commit to the 3.3 branch, that makes me think we fixed something in master that we didn't backport. Let me see if I can find it |
Weird—I'm also getting the leak on the latest master (openssl/openssl@c262cc0). |
hmm, what version of valgrind are you using? I'm on valgrind-3.23.0 |
also, what does your openssl.conf look like? Its possible I'm not loading any modules, and as such not triggering the allocation thats leaking |
In fact, I'm sure its my config, I just ran it through gdb and never hit a breakpoint in ossl_rcu_read_lock |
There we go, if I load the oqs provider (duh, should have thought of that), I see the leak. Now to figure out why |
I'm on valgrind-3.22.0, and I haven't made any manual changes to openssl.cnf (including to load any provider). I'm running everything in a Docker container on x86_64 / Linux; I'll attach a Dockerfile to duplicate my setup in case it's helpful to you. |
FROM openquantumsafe/ci-ubuntu-latest:latest
WORKDIR /root
RUN git clone --depth=1 --branch=0.11.0 https://github.com/open-quantum-safe/liboqs.git liboqs
RUN git clone --branch=master https://github.com/openssl/openssl.git openssl
WORKDIR /root/openssl
RUN ./config --debug --prefix=/root/.localopenssl && make -j && make install_sw install_ssldirs
WORKDIR /root/liboqs
RUN mkdir build
WORKDIR ./build
RUN cmake -GNinja -DOPENSSL_ROOT_DIR=../.localopenssl -DOQS_MINIMAL_BUILD="SIG_ml_dsa_65" -DOQS_USE_SHA3_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug ..
RUN ninja |
Ok, think I found the problem. openssl/openssl@83efcfd added a use of ossl_init_thread_start to the code base, for which there is only a handful of users. ossl_init_thread_start registers a handler function to clean up thread local data, not when the thread exits explicitly, but rather when the library context against which the thread allocates resources is deleted. I'm not 100% sure why we do this, but its how it works. Normally its not a problem. Library contexts get deleted in OSSL_LIB_CTX_free, that calls context_deinit->ossl_ctx_thread_stop->init_thread_stop, and that calls all the handlers registered via ossl_init_thread_start, and everything gets cleaned up. However, in liboqs, you implicitly call the OpenSSL_cleanup routine via the registered atexit() handler, which again, is ok, but, it means that in test_sig.c (and likely your other tests), you join and exit the thread prior to OpenSSL getting cleaned up. It appears whats happening is that, on exit, the C library is cleaning up its thread local data list (I feel like I investigated this before), and, not having an cleanup handler registered (recalling from above that we register the local data handler with libcrypto, not with libc), and so it just NULL's the pointer in the thread local data table in the c library. As a result when OpenSSL_cleanup is called (either explicitly or implicitly), we run through the libcrypto registered cleanup handlers, the handler for rcu gets called (ossl_rcu_free_local_data), and when we call CRYPTO_THREAD_get_local to fetch the local data to free (which maps to pthread_getspecific) it returns NULL, as libc already expunged that pointer. As a result the local data leaks. This API could use some reworking, as It seems to me that it relies on implementation details in libc that aren't guaranteed (I think in the past those pointers stuck around until after all the exit handlers ran, but I need to look further). The documentation here: For the time being, as any fix here would be an ABI break, I think its something we need to live with. The good news(?) is that theres an API to manage this, and theres precedent for it in our tests. OPENSSL_thread_stop is used to explicitly clean up a threads local data prior to exiting, which can be used here. You can see it used in situations similar/identical to yours in drbgtest.c and threadstest.h in which thread wrappers like thread_run() call it prior to exiting for just this reason. You will likely need to do this for your various thread enabled tests, but this patch:
which I tested here, resolves the leak for me |
@SWilson4, I can complete the code changes if you haven't already done them. But can you work out the documentation? |
With |
OK -- that might solve the issue for "internal use" (tests), but isn't the proposal to expose |
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Steen Rasmussen <[email protected]> Co-authored-by: Steen Rasmussen <[email protected]> Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Basil Hess <[email protected]> Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
…king. Signed-off-by: Norman Ashley <[email protected]>
Remove OPENSSL build-time macro from helper functions. Signed-off-by: Norman Ashley <[email protected]>
Use of OQS_DLOPEN_OPENSSL needs a new API change to expose an OQS function that calls OPENSSL_thread_stop Signed-off-by: Norman Ashley <[email protected]>
Use of OQS_DLOPEN_OPENSSL needs a new API change to expose an OQS function that calls OPENSSL_thread_stop Signed-off-by: Norman Ashley <[email protected]>
Use of OQS_DLOPEN_OPENSSL needs a new API change to expose an OQS function that calls OPENSSL_thread_stop Signed-off-by: Norman Ashley <[email protected]>
0027d7d
to
195f61e
Compare
This last commit isolates the memory leak fix when dlopen is used. A separate will be created to track the addition of a new OQS API needed to address this. I hope this is a reasonable and agreeable approach as discussed earlier in a status meeting. |
It looks like some unrelated changes to CBOM-related files snuck into the diff. Possibly a rebase gone wrong @ashman-p? |
Starting over... |
'run_tests' memory leak tests fails when build options enables OQS_USE_PTHREADS and OQS_USE_OPENSSL.
Fixes #1941.