Getting strange assertion error #849

Open
sagesimhon opened this issue Aug 11, 2023 · 11 comments

@sagesimhon

sagesimhon commented Aug 11, 2023

Hi, I'm reopening this issue from #190 and mitsuba-renderer/drjit-core#63.

Summary

Hello @merlinND @wjakob, following up from mitsuba-renderer/drjit-core#63: I am now running into this issue consistently, though at random times during execution.

The error is as follows:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1

More detail below.

System configuration

System information:

OS: Ubuntu 22.04.3 LTS
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
GPU: NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
Python: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
NVidia driver: 530.30.02
LLVM: 10.0.1

Dr.Jit: 0.4.1
Mitsuba: 3.2.1
Is custom build? False
Compiled with: GNU 10.2.1
Variants:
scalar_rgb
scalar_spectral
cuda_ad_rgb
llvm_ad_rgb

Description

I have a particular job that completes successfully on my MacBook Pro. However, that machine does not have enough compute for my tasks, so I can only produce low-resolution results there. I am therefore trying to run on EC2 instances and other Linux machines with 50+ CPUs. This is where I see the issue.

I have a conda environment where I installed mitsuba3 using pip. I tried these versions:

mitsuba: 3.3.0
drjit: 0.4.2

and

mitsuba-3.2.1
drjit-0.4.1

I have LLVM v15 and v10 .so installed (via conda install), and set the path accordingly:

export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-15.so
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-10.so

(I also installed LLVM using apt-get, which made no difference, and tried building LLVM from source, which also made no difference.)

All of the above configurations result in this error:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1

It occurs at random times during execution -- sometimes after a few minutes, sometimes after an hour. There is no stack trace or any other debugging info.

Any ideas where to start?

Steps to reproduce

I can share my repo (including clear instructions to reproduce in the README) with you. Let me know if you would like access.

@njroussel
Member

Hi @sagesimhon

Yes, we'd appreciate the reproducer. Tracking such an issue down is painful. As you've seen, this bug is a bit elusive. There most likely is some underlying race condition at play here.

If you want to attempt it yourself, I'd recommend compiling the project yourself in debug mode. This assertion is in our thread pool/worker management system. One of the threads should have launched some asynchronous job from drjit.

@sagesimhon
Author

sagesimhon commented Aug 14, 2023

Hi @njroussel,

I just added you to my git repo: https://github.com/sagesimhon/totem_plus

To reproduce with minimal settings, follow the instructions in the "Dependencies" section of the README, then run:
python run_generation.py --res 256 --exp_folder 'test_minimal_run_reproducer' --n 999

The code is one large for loop. The assertion error comes up at random times. In my last run it came at iteration 947, after three minutes. Here is the last thing printed before the error:

Trying iter 947, file 0000947
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Aborted (core dumped)

This error is seen on a Linux machine with many CPUs (>48), but it may also occur with lower CPU counts. I do not run into it on a Mac.

@sagesimhon
Author

UPDATE: I believe I have isolated the issue. It may be coming from the mi.load_file() method, i.e. the XML loading method. My process loads many XML files, and this function may be problematic, for example by running out of threads or file pointers if they are not closed properly. I am only guessing at the root cause.

@sagesimhon
Author

sagesimhon commented Aug 17, 2023

I have ported all my code to use 'dict' data instead of XML, and I am still getting the error :( @njroussel, is there anything else you need for the reproducer?

@njroussel
Member

I haven't been able to reproduce it yet...
You might want to turn off parallel scene loading if you think that is the issue here. (documentation)
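
For readers following along, a minimal sketch of what turning off parallel scene loading might look like. It assumes the installed Mitsuba version exposes the `parallel` argument on `mi.load_dict()` (and likewise on `mi.load_file()`). The helper name `load_scene_serially` and the one-sphere scene are hypothetical and used only for illustration; this is a possible mitigation to try, not a confirmed fix for the assertion.

```python
def load_scene_serially(scene_dict, variant="llvm_ad_rgb"):
    """Load a scene dictionary with parallel loading disabled.

    Hypothetical helper: Mitsuba is imported lazily so that this module
    can be imported even where Mitsuba is not installed.
    """
    import mitsuba as mi

    mi.set_variant(variant)
    # parallel=False asks Mitsuba to instantiate scene objects on the
    # calling thread instead of dispatching them to the thread pool.
    return mi.load_dict(scene_dict, parallel=False)


# Illustrative scene only: a scene containing a single default sphere.
scene_dict = {"type": "scene", "sphere": {"type": "sphere"}}

# Usage (requires Mitsuba to be installed):
# scene = load_scene_serially(scene_dict)
```

Serial loading may be slower for scenes with many objects, so it is probably worth measuring the loading time before and after the change.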

@sagesimhon
Author

Thanks, I will try it and hopefully it will not degrade performance too much -- the problem definitely seems to come from the load_dict() method.

@FYRichie

FYRichie commented Sep 5, 2023

Hello, in my situation it seems that mi.util.write_bitmap() also causes this error. My code calls this function many times. Is this error also due to a lack of file pointers?

@njroussel
Member

Something with the asynchronous job manager in drjit is going awry. The write_bitmap() function writes files asynchronously by default, but you can change that. It might therefore be related.
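
A minimal sketch of switching to a blocking write, assuming the `write_async` keyword of `mi.util.write_bitmap()` is available in the installed version. The wrapper name `save_image_blocking` is hypothetical, and whether this avoids the assertion is untested here.

```python
def save_image_blocking(path, image):
    """Write an image to disk and return only once the write has finished.

    Hypothetical wrapper: it assumes a Mitsuba variant has already been
    selected with mi.set_variant() by the calling code.
    """
    import mitsuba as mi  # imported lazily; requires Mitsuba to be installed

    # write_async=False makes the call block instead of scheduling the
    # write as a background job on the thread pool.
    mi.util.write_bitmap(path, image, write_async=False)


# Usage inside a rendering loop (requires Mitsuba and a loaded scene):
# image = mi.render(scene)
# save_image_blocking("out.exr", image)
```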

@kach

kach commented Oct 5, 2023

Hi all - writing to say that I have also been experiencing this issue. I can't share my reproducer (~200 lines) publicly but am happy to email with someone on the team if that would be helpful.

@member67

I have a similar issue.
I create a room with some furniture in Blender, export this room with Mitsuba to an .xml file, and load this file in Sionna with load_scene(...). Since I want the channel impulse response at many different locations in the room, I run a for loop over an array of receiver positions (rx_pos) and reset the position of the receiver (rx) in the scene in each iteration as follows:

for i_c, i_rx_pos in enumerate(rx_pos):
    rx.position = i_rx_pos  # set the position of the receiver
    paths = scene.compute_paths(max_depth=3)

After a random number of iterations (typically somewhere between 1000 and 4000), the process terminates with the following output:

Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

I use only CPUs (no GPU) on Ubuntu, "lsb_release -a" yields:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Python version: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0]
Mitsuba version: '3.4.0'
Dr. Jit version: '0.4.3'
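
Since the abort strikes at a random iteration of a long loop, one possible stopgap until the root cause is found is to run each iteration in a child process and record which ones died with SIGABRT, so they can be retried. The sketch below uses only the Python standard library. The function names and the idea of a per-iteration worker script are hypothetical, and this does not address the assertion itself. Note that a shell reports a signal death as 128 plus the signal number (134 for SIGABRT on Linux), while Python's `subprocess` reports it as the negated signal number.

```python
import signal
import subprocess
import sys


def killing_signal(status):
    """Return the signal that ended a process, or None for a normal exit.

    Accepts subprocess's convention (negative signal number) as well as
    the shell's convention (128 + signal number).
    """
    if status < 0:
        number = -status
    elif status > 128:
        number = status - 128
    else:
        return None
    try:
        return signal.Signals(number)
    except ValueError:
        return None  # not a valid signal number on this platform


def run_iterations(script, start, stop):
    """Run `python script <i>` per iteration; return the indices that aborted.

    `script` is a hypothetical worker that performs a single iteration.
    """
    aborted = []
    for i in range(start, stop):
        result = subprocess.run([sys.executable, script, str(i)])
        if killing_signal(result.returncode) == signal.SIGABRT:
            aborted.append(i)
    return aborted
```

Starting a fresh process per iteration pays the Mitsuba import and scene loading cost every time, so in practice it may make more sense to hand each child a batch of iterations.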

@wjakob
Member

wjakob commented Dec 2, 2023

Hi @sagesimhon -- I just wanted to access https://github.com/sagesimhon/totem_plus but cannot. Is it a private repository?

chaihahaha added a commit to chaihahaha/sionna that referenced this issue Dec 10, 2023
Fix strange assertion error caused by mitsuba when load_scene is called many times because of this unresolved issue: mitsuba-renderer/mitsuba3#849