[🐛 bug report] Nanothread queue assertion failure #190

Closed

jatentaki opened this issue Aug 19, 2022 · 16 comments
@jatentaki

Summary

I ran a script to render 4046 objects of class 02691156 from ShapeNetCore.v2, and it crashed on the 3887th render with:

Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
bash: line 1:   624 Aborted                 (core dumped)

System configuration

  • Platform: Ubuntu 20.04
  • Compiler: as on pip
  • Python version: 3.8.10
  • Mitsuba 3 version: 3.0.1
  • Compiled variants: (as on pip)
    • scalar_rgb
    • scalar_spectral
    • llvm_ad_rgb (used)
    • cuda_ad_rgb

Description

Unfortunately, I am not able to reproduce this: since the assertion failure just kills Python, I cannot dump the specific content that triggered it. I have tried rendering this specific object separately (outside of the main loop) and it worked.

Is there a way I can be more useful when reporting such multithreading bugs?

@merlinND
Member

Hi @jatentaki,

Thanks for reporting!

Unless this rings a bell and someone finds the issue immediately, I think we'd need the smallest script that reproduces the issue on your end, plus the corresponding data.
If the bug is related to the number of renderings, you could render each iteration of the loop at low resolution and low spp so that testing is faster, along the lines of the sketch below.
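Something like the following would already help; this is just a minimal sketch that assumes the bundled Cornell box scene and the llvm_ad_rgb variant, so you would substitute your ShapeNet loading code:

import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

# Stand-in scene; replace with the ShapeNet object that triggers the crash.
scene = mi.load_dict(mi.cornell_box())

for i in range(5000):
    # Low spp keeps each render cheap so the loop reaches high iteration counts quickly.
    img = mi.render(scene, spp=4)
    if i % 100 == 0:
        print(f'rendered {i} images')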

@jatentaki
Author

How would I go about it? Since this is very likely a threading bug, it's rather unlikely to be deterministically reproducible. Would something along the lines of "this render-in-loop usually eventually breaks for me" work? Could core dumps or any other artifacts be of use?

@merlinND
Member

merlinND commented Aug 19, 2022

Yes, absolutely: even if the script only crashes 1 in 10 times, it should already be a good starting point for debugging (as long as it doesn't take hours to reproduce the issue).
A core dump would help too, although the pip version is compiled in release mode, so it will be missing a lot of information. But who knows, it might already be enough to locate the issue.

If you have time, you could also try compiling Mitsuba 3 in debug mode and reproducing the crash in a debugger on your own machine.

@lynshwoo2022

lynshwoo2022 commented Dec 26, 2022

Hi @jatentaki, did you solve this problem? When I use:

'integrator': {
    'type': 'path',
    'max_depth': 8,
    'hide_emitters': True,
},

I get the same error as you did, and if I change it to 'type': 'prb', the error disappears. (A sketch of how I load this is below.)
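(For context, this is roughly how I load it; everything outside the integrator is omitted and the variant is simply the one I happen to use:)

import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

integrator = mi.load_dict({
    'type': 'path',
    'max_depth': 8,
    'hide_emitters': True,
})
# Switching 'type' to 'prb' made the assertion failure go away for me.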

@jatentaki
Author

@lynshwoo2022 that's good to know. I wasn't able to reduce the repro case nor get a core dump (since I'm running in a container and that messes with debugging) so I put this problem on the back burner.

@lynshwoo2022

> @lynshwoo2022 that's good to know. I wasn't able to reduce the repro case nor get a core dump (since I'm running in a container and that messes with debugging) so I put this problem on the back burner.

And now prb and prb_reparam have this problem too... confused.

@jatentaki
Author

What do you mean by "now"? What has changed? I found it to be a highly intermittent issue, as is typical for threading bugs :(

@lynshwoo2022

lynshwoo2022 commented Dec 27, 2022

I just changed the integrator type to prb_reparam (I am integrating Mitsuba into PyTorch).
And yes, sometimes it happens after 100 iterations, sometimes right at the beginning...

@wjakob
Member

wjakob commented Dec 29, 2022

@Speierers recently fixed a number of sources of undefined behavior on the master branch. If you are able to compile from Git, I suspect that this will likely fix your problem.

@lynshwoo2022

lynshwoo2022 commented Dec 30, 2022

> @Speierers recently fixed a number of sources of undefined behavior on the master branch. If you are able to compile from Git, I suspect that this will likely fix your problem.

I'll try, thank you.

@lynshwoo2022

lynshwoo2022 commented Jan 3, 2023

Hi there, I compiled Mitsuba from Git. The assertion errors are solved, but the NaN problems remain.

@wjakob
Member

wjakob commented Jan 3, 2023

Excellent, I am happy to hear it. I will close this bug related to the assertion failure then.

@wjakob wjakob closed this as completed Jan 3, 2023
@wjakob
Member

wjakob commented Jan 3, 2023

(If you continue to have the issue even with the latest master version, feel free to post and we can reopen the issue, @jatentaki. In that case, we would need you to provide code/data that reproduces the issue so we can investigate further.)

@NMontanaBrown

Hi team @merlinND @wjakob, I've also been getting this error with Mitsuba for a few months. As others have reported, it is not immediately or consistently reproducible. I am using Mitsuba 3 installed from pip (3.3.0).

My use case is rendering images for RL, and I have observed that RAM usage grows steadily as training progresses (with no other significant processes on the machine). My guess is that an out-of-memory condition is what causes the failure. I can provide code to reproduce this, although it takes a non-trivial amount of time to occur.

If that is the case, are there any direct fixes for handling the memory usage? I have tried deleting the reference to the tensor object and rendering with different variants (CUDA and CPU), but the error persists; the per-iteration cleanup I have attempted looks roughly like the sketch below.
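(A rough sketch of that cleanup; the dr.flush_malloc_cache() call reflects my understanding of the Dr.Jit API for releasing cached allocations, and the scene/variant are only placeholders:)

import gc
import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')
scene = mi.load_dict(mi.cornell_box())   # placeholder for the actual RL scene

for step in range(100_000):
    img = mi.render(scene, spp=1)
    # ... hand img to the RL training loop ...
    del img                  # drop the Python reference to the rendered tensor
    gc.collect()             # make sure the wrapper object is actually collected
    dr.flush_malloc_cache()  # ask Dr.Jit to release its cached allocations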

@sagesimhon

sagesimhon commented Aug 11, 2023

Hello @merlinND @wjakob, following up from mitsuba-renderer/drjit-core#63: I am now consistently running into this issue, but at random times during execution.


I have a particular job that completes successfully on my MacBook Pro. However, that environment is not computationally scalable enough for my tasks, and I can only achieve low-resolution results there. So I am trying to run on EC2 instances and other Linux machines with 50+ CPUs, and this is where I see the issue.

I have a conda environment where I installed Mitsuba 3 using pip. I tried these versions:

mitsuba: 3.3.0
drjit: 0.4.2

and

mitsuba: 3.2.1
drjit: 0.4.1

I have the LLVM v15 and v10 shared libraries installed (via conda install), and set the path accordingly:

export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-15.so
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-10.so

(I also installed LLVM using apt-get, no difference; I also tried building LLVM from source, no difference.)
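(For reference, the sanity check I run after setting the path simply forces the LLVM backend to initialize; the snippet assumes the stock pip packages:)

import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

x = dr.arange(mi.Float, 10)   # forces the LLVM JIT backend to initialize
print(dr.sum(x * 2))          # prints 90 if the backend works at all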

All of the above configurations result in this error:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1

It occurs at random times during execution: sometimes after a few minutes, sometimes after an hour. There is no stack trace or any other debugging info.

Any ideas where to start?

@njroussel
Member

We're now tracking this issue over here: #849

(Thanks @sagesimhon )
