[🐛 bug report] Nanothread queue assertion failure #190
Comments
Hi @jatentaki, thanks for reporting! Unless this rings a bell and someone finds the issue immediately, I think we'd need the smallest script that reproduces the issue on your end + the corresponding data.
How would I go about it? Since this is very likely a threading bug, it's rather unlikely to be deterministically reproducible. Would something along the lines of "this render-in-loop usually eventually breaks for me" work? Could the core dumps or any other artifacts be of use?
Yes, absolutely: even if the script only crashes 1/10 times it should already be a good starting point for debugging (as long as it doesn't take hours to reproduce the issue). If you have time, you could also try compiling Mitsuba 3 in debug mode and reproducing the crash in a debugger on your own machine.
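For what it's worth, a minimal repro script along these lines might look as follows; the ShapeNet path, resolution, and scene layout are placeholders and would need to be adapted to the actual data and variant in use:

```python
# Hypothetical repro sketch (paths, resolution, and spp are placeholders):
# render a sequence of meshes in a loop until the assertion eventually fires.
import glob
import mitsuba as mi

mi.set_variant("llvm_ad_rgb")

meshes = sorted(glob.glob("ShapeNetCore.v2/02691156/*/models/model_normalized.obj"))

for i, path in enumerate(meshes):
    scene = mi.load_dict({
        "type": "scene",
        "integrator": {"type": "path"},
        "sensor": {
            "type": "perspective",
            "film": {"type": "hdrfilm", "width": 256, "height": 256},
        },
        "emitter": {"type": "constant"},
        "mesh": {"type": "obj", "filename": path},
    })
    mi.render(scene, spp=16)
    print(f"[{i}/{len(meshes)}] rendered {path}", flush=True)
```

Running the loop like this at least narrows a crash down to a specific iteration via the last printed line.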
Hi @jatentaki, did you solve this problem? When I use:
the same error as yours occurred, and if I change it to:
@lynshwoo2022 that's good to know. I wasn't able to reduce the repro case or get a core dump (I'm running in a container, which messes with debugging), so I put this problem on the back burner.
and now
What do you mean by "now"? What has changed? I found it to be a highly intermittent issue, as is typical for threading bugs :(
Just changed the integrator type to
@Speierers recently fixed a number of sources of undefined behavior on the
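(For reference, a quick sketch of how to confirm which build and variants a script actually picks up after recompiling, assuming a standard Python install:)

```python
# Minimal check (sketch) of which Mitsuba build and variants a script sees.
import mitsuba as mi

print(mi.__version__)   # confirms whether a freshly compiled build is being picked up
print(mi.variants())    # variants available in this build
mi.set_variant("llvm_ad_rgb")
```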
I'll try, thank you.
Hi there, I compiled Mitsuba from git. The assertion errors are solved, but the NaN problems remain.
Excellent, I am happy to hear it. I will close this bug related to the assertion failure, then.
(If you continue to have the issue even with the latest master version, feel free to post and we can reopen the issue @jatentaki. In that case, we would need you to provide code/data that reproduces the issue to be able to investigate further.)
Hi team @merlinND @wjakob, I've also been having this error using Mitsuba for a few months. It's not immediately or consistently reproducible, as others have reported. I am using mitsuba3 installed from pip (3.3.0). My use case is rendering images for RL, and what I have observed is that RAM becomes increasingly occupied as training progresses (with no other significant processes on the machine). I suspect an OOM is occurring that causes the failure. I can provide code to reproduce this, although it takes a non-trivial amount of time to occur. In the meantime, are there any direct fixes for handling the memory usage? I have tried deleting the reference to the tensor object and rendering with different variants (CUDA and CPU), but the error persists.
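One pattern that may be worth trying between renders, as a sketch only and assuming the growth comes from buffers cached by Dr.Jit rather than a leak elsewhere, is to drop per-step references and flush the allocation cache explicitly; `mi.cornell_box()` below is just a stand-in scene:

```python
# Sketch (not a confirmed fix): drop per-step references and flush Dr.Jit's
# allocation cache between renders so cached buffers are returned to the OS.
import gc
import numpy as np
import drjit as dr
import mitsuba as mi

mi.set_variant("llvm_ad_rgb")
scene = mi.load_dict(mi.cornell_box())   # stand-in for the actual RL scene

for step in range(1000):
    img = np.array(mi.render(scene, spp=16, seed=step))  # copy the result out of Dr.Jit
    # ... hand `img` to the RL training loop here ...
    del img
    gc.collect()
    dr.flush_malloc_cache()              # release memory held by Dr.Jit's allocation cache
```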
Hello @merlinND @wjakob, following up from mitsuba-renderer/drjit-core#63: I am now consistently running into this issue, but at random times during execution. I have a particular job that completes successfully on my MacBook Pro. However, that environment is not computationally scalable enough for my tasks, and I am only able to achieve low-resolution results. So I am trying to run on EC2 instances and other Linux machines with 50+ CPUs, and this is where I see the issue. I have a conda environment where I installed mitsuba3 using pip; I tried versions 3.3.0 and 3.2.1. I have the LLVM v15 and v10 .so installed (via conda install) and set the path accordingly: export DRJIT_LIBLLVM_PATH=..../miniconda3/envs/mi/lib/libLLVM-15.so (I also installed LLVM using apt-get, no difference; building LLVM from source made no difference either). All of the above configurations result in this error. It occurs at random times during execution -- sometimes after a few minutes, sometimes after an hour. No stack trace or any debugging info. Any ideas where to start?
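As a side note, one way to make sure the intended libLLVM is the one that gets loaded is to set the variable from Python before anything imports Dr.Jit or Mitsuba; this is only a sketch with a placeholder path:

```python
# Sketch with a placeholder path: set DRJIT_LIBLLVM_PATH before Dr.Jit/Mitsuba
# are imported, so the intended libLLVM is the one that gets loaded.
import os
os.environ["DRJIT_LIBLLVM_PATH"] = "/path/to/miniconda3/envs/mi/lib/libLLVM-15.so"

import mitsuba as mi
mi.set_variant("llvm_ad_rgb")
print(mi.__version__)
```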
We're now tracking this issue over here: #849 (thanks @sagesimhon).
Summary
I ran a script to render 4046 objects of class 02691156 from ShapeNetCore.v2 and had the script crash on the 3887th render with:
System configuration
- Mitsuba installed via pip
- Variants: scalar_rgb, scalar_spectral, llvm_ad_rgb (used), cuda_ad_rgb
Description
Unfortunately, I am not able to reproduce this: since the assertion failure just kills Python, I am not able to dump the specific content that crashed it. I have tried rendering this specific object separately (outside of the main loop) and it worked.
Is there a way to be more useful reporting such multithreading bugs?
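One low-tech option, sketched below, is to combine Python's faulthandler with a flushed progress log so that a hard crash still identifies the offending object; `object_paths` and `make_scene_dict` are hypothetical placeholders for the actual loop:

```python
# Sketch: leave breadcrumbs so a hard crash can be attributed to a specific object.
# `object_paths` and `make_scene_dict` are hypothetical placeholders for the real loop.
import faulthandler
import mitsuba as mi

faulthandler.enable(open("crash_traceback.log", "w"))  # dump Python stacks on fatal signals
mi.set_variant("llvm_ad_rgb")

with open("progress.log", "w") as log:
    for i, path in enumerate(object_paths):
        log.write(f"{i} {path}\n")
        log.flush()                                    # entry survives even if the render dies
        scene = mi.load_dict(make_scene_dict(path))
        mi.render(scene, spp=16)
```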