[🐛 bug report] Nanothread queue assertion failure #190

Closed

jatentaki opened this issue Aug 19, 2022 · 16 comments
@jatentaki

Summary

I ran a script to render 4046 objects of class 02691156 from ShapeNetCore.v2, and it crashed on the 3887th render with:

Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
bash: line 1:   624 Aborted                 (core dumped)

System configuration

  • Platform: Ubuntu 20.04
  • Compiler: as on pip
  • Python version: 3.8.10
  • Mitsuba 3 version: 3.0.1
  • Compiled variants: (as on pip)
    • scalar_rgb
    • scalar_spectral
    • llvm_ad_rgb (used)
    • cuda_ad_rgb

Description

Unfortunately, I am not able to reproduce this: since the assertion failure just kills Python, I cannot dump the specific content that triggered it. I have tried rendering this specific object separately (outside of the main loop) and it worked.

Is there a way I can be more useful when reporting such multithreading bugs?

@merlinND
Member

Hi @jatentaki,

Thanks for reporting!

Unless this rings a bell and someone finds the issue immediately, I think we'd need the smallest script that reproduces the issue on your end, plus the corresponding data.
If the bug is related to the number of renderings, you could render each iteration of the loop at low resolution and low spp so that testing is faster, along the lines of the sketch below.
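Something like the following would already help; this is just a minimal sketch that assumes the bundled Cornell box scene and the llvm_ad_rgb variant, so you would substitute your ShapeNet loading code:

import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

# Stand-in scene; replace with the ShapeNet object that triggers the crash.
scene = mi.load_dict(mi.cornell_box())

for i in range(5000):
    # Low spp keeps each render cheap so the loop reaches high iteration counts quickly.
    img = mi.render(scene, spp=4)
    if i % 100 == 0:
        print(f'rendered {i} images')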

@jatentaki
Author

How would I go about it? Since this is very likely a threading bug, it's rather unlikely to be deterministically reproducible. Would something along the lines of "this render-in-loop usually eventually breaks for me" work? Could core dumps or any other artifacts be of use?

@merlinND
Member

merlinND commented Aug 19, 2022

Yes, absolutely: even if the script only crashes 1 in 10 times, it should already be a good starting point for debugging (as long as it doesn't take hours to reproduce the issue).
A core dump would help too, although the pip version is compiled in release mode, so it will be missing a lot of information. But who knows, it might already be enough to locate the issue.

If you have time, you could also try compiling Mitsuba 3 in debug mode and reproducing the crash in a debugger on your own machine.

@lynshwoo2022

lynshwoo2022 commented Dec 26, 2022

Hi @jatentaki, did you solve this problem? When I use:

'integrator': {
    'type': 'path',
    'max_depth': 8,
    'hide_emitters': True,
},

I get the same error as you did, and if I change it to 'type': 'prb', the error disappears. (A sketch of how I load this is below.)
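(For context, this is roughly how I load it; everything outside the integrator is omitted and the variant is simply the one I happen to use:)

import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

integrator = mi.load_dict({
    'type': 'path',
    'max_depth': 8,
    'hide_emitters': True,
})
# Switching 'type' to 'prb' made the assertion failure go away for me.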

@jatentaki
Author

@lynshwoo2022 that's good to know. I wasn't able to reduce the repro case nor get a core dump (since I'm running in a container and that messes with debugging) so I put this problem on the back burner.

@lynshwoo2022

> @lynshwoo2022 that's good to know. I wasn't able to reduce the repro case nor get a core dump (since I'm running in a container and that messes with debugging) so I put this problem on the back burner.

And now prb and prb_reparam have this problem too... confused.

@jatentaki
Author

What do you mean by "now"? What has changed? I found it to be a highly intermittent issue, as is typical for threading bugs :(

@lynshwoo2022

lynshwoo2022 commented Dec 27, 2022

I just changed the integrator type to prb_reparam (I am integrating Mitsuba into PyTorch).
And yes, sometimes it happens after 100 iterations, sometimes right at the beginning...

@wjakob
Member

wjakob commented Dec 29, 2022

@Speierers recently fixed a number of sources of undefined behavior on the master branch. If you are able to compile from Git, I suspect that this will likely fix your problem.

@lynshwoo2022

lynshwoo2022 commented Dec 30, 2022

> @Speierers recently fixed a number of sources of undefined behavior on the master branch. If you are able to compile from Git, I suspect that this will likely fix your problem.

I'll try, thank you.

@lynshwoo2022

lynshwoo2022 commented Jan 3, 2023

Hi there, I compiled Mitsuba from Git. The assertion errors are solved, but the NaN problems remain.

@wjakob
Member

wjakob commented Jan 3, 2023

Excellent, I am happy to hear it. I will close this bug related to the assertion failure then.

@wjakob wjakob closed this as completed Jan 3, 2023
@wjakob
Member

wjakob commented Jan 3, 2023

(If you continue to have the issue even with the latest master version, feel free to post and we can reopen the issue, @jatentaki. In that case, we would need you to provide code/data that reproduces the issue so we can investigate further.)

@NMontanaBrown

Hi team @merlinND @wjakob, I've also been getting this error with Mitsuba for a few months. As others have reported, it is not immediately or consistently reproducible. I am using Mitsuba 3 installed from pip (3.3.0).

My use case is rendering images for RL, and I have observed that RAM usage grows steadily as training progresses (with no other significant processes on the machine). My guess is that an out-of-memory condition is what causes the failure. I can provide code to reproduce this, although it takes a non-trivial amount of time to occur.

If that is the case, are there any direct fixes for handling the memory usage? I have tried deleting the reference to the tensor object and rendering with different variants (CUDA and CPU), but the error persists; the per-iteration cleanup I have attempted looks roughly like the sketch below.
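(A rough sketch of that cleanup; the dr.flush_malloc_cache() call reflects my understanding of the Dr.Jit API for releasing cached allocations, and the scene/variant are only placeholders:)

import gc
import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')
scene = mi.load_dict(mi.cornell_box())   # placeholder for the actual RL scene

for step in range(100_000):
    img = mi.render(scene, spp=1)
    # ... hand img to the RL training loop ...
    del img                  # drop the Python reference to the rendered tensor
    gc.collect()             # make sure the wrapper object is actually collected
    dr.flush_malloc_cache()  # ask Dr.Jit to release its cached allocations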

@sagesimhon

sagesimhon commented Aug 11, 2023

Hello @merlinND @wjakob, following up from mitsuba-renderer/drjit-core#63: I am now consistently running into this issue, but at random times during execution.


I have a particular job that completes successfully on my MacBook Pro. However, that environment is not computationally scalable enough for my tasks, and I can only achieve low-resolution results there. So I am trying to run on EC2 instances and other Linux machines with 50+ CPUs, and this is where I see the issue.

I have a conda environment where I installed Mitsuba 3 using pip. I tried these versions:

mitsuba: 3.3.0
drjit: 0.4.2

and

mitsuba: 3.2.1
drjit: 0.4.1

I have the LLVM v15 and v10 shared libraries installed (via conda install), and set the path accordingly:

export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-15.so
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-10.so

(I also installed LLVM using apt-get, no difference; I also tried building LLVM from source, no difference.)
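(For reference, the sanity check I run after setting the path simply forces the LLVM backend to initialize; the snippet assumes the stock pip packages:)

import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

x = dr.arange(mi.Float, 10)   # forces the LLVM JIT backend to initialize
print(dr.sum(x * 2))          # prints 90 if the backend works at all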

All of the above configurations result in this error:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1

It occurs at random times during execution: sometimes after a few minutes, sometimes after an hour. There is no stack trace or any other debugging info.

Any ideas where to start?

@njroussel
Member

We're now tracking this issue over here: #849

(Thanks @sagesimhon )
