Getting strange assertion error #849
Comments
Hi @sagesimhon Yes, we'd appreciate the reproducer. Tracking such an issue down is painful. As you've seen, this bug is a bit elusive. There is most likely some underlying race condition at play here. If you want to give it an attempt yourself, I'd recommend compiling the project yourself in DEBUG mode. This assertion is in our thread pool/worker management system. One of the threads should have launched some asynchronous job from |
Hi @njroussel, Just added you to my git repo https://github.com/sagesimhon/totem_plus Follow instructions in "Dependencies" section of the README, then run
This error is seen on a Linux machine with many CPUs (>48), but it may also be the case with lower CPU counts. I do not run into it on a Mac. |
UPDATE: I believe I have isolated the issue; it may be coming from the mi.load_file() method -- the XML loading method. My process loads many XML files, and this function may be problematic by potentially running out of threads or file pointers or something else if they are not closed properly -- just guessing at the root cause. |
I have ported all my code to use the 'dict' data instead of xml -- and am still getting the error : ( @njroussel Is there anything else you need for the reproducer? |
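A reproducer for this kind of report usually boils down to calling the loader in a tight loop and logging progress, so the last completed iteration is known when the process aborts. The sketch below assumes nothing about the actual project: `stress_load` is a hypothetical helper, and the loader is passed in as a callable so the same loop can wrap either `mi.load_file` or `mi.load_dict`.

```python
import sys
import time


def stress_load(loader, iterations, report_every=100):
    """Call `loader()` repeatedly and log progress to stderr.

    `loader` is any zero-argument callable (a hypothetical stand-in for the
    scene-loading call under suspicion). Returns the number of completed calls.
    """
    start = time.perf_counter()
    completed = 0
    for i in range(1, iterations + 1):
        loader()
        completed = i
        if i % report_every == 0:
            elapsed = time.perf_counter() - start
            # Flush so the last message survives a hard abort of the process.
            print(f"{i} loads OK after {elapsed:.1f}s", file=sys.stderr, flush=True)
    return completed
```

With Mitsuba installed, something like `stress_load(lambda: mi.load_dict({"type": "scene"}), 10_000)` after `mi.set_variant("llvm_ad_rgb")` would exercise the loading path; whether an empty scene is enough to trigger the assertion is untested.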
I haven't been able to reproduce it yet... |
Thanks, I will try it and hopefully it will not degrade performance too much -- the problem definitely seems to come from the load_dict() method. |
Hello, in my situation, it seems that |
Something with the asynchronous job manager in |
Hi all - writing to say that I have also been experiencing this issue. I can't share my reproducer (~200 lines) publicly but am happy to email with someone on the team if that would be helpful. |
I have a similar issue.
After a random number (typically somewhere between 1000 and 4000) of iterations, the process terminates with the following output:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
I use only CPUs (no GPU) on Ubuntu, "lsb_release -a" yields:
Python version: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0] |
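Since the failure takes down the whole process after an unpredictable number of iterations, one possible stopgap is to run each batch of work in a fresh interpreter and retry when the child exits abnormally. This is only a sketch under the assumption that the work can be split into independent batches; `run_batch_isolated` is a hypothetical helper, not part of Mitsuba or Dr.Jit, and it does not address the underlying bug.

```python
import subprocess
import sys


def run_batch_isolated(code, max_retries=3):
    """Run a Python snippet in a child interpreter, retrying on failure.

    Returns the attempt number that succeeded. Raises RuntimeError if every
    attempt exits with a non-zero status.
    """
    for attempt in range(1, max_retries + 1):
        result = subprocess.run([sys.executable, "-c", code])
        if result.returncode == 0:
            return attempt
        # On POSIX, a negative return code means the child was killed by a
        # signal (for example -6 for SIGABRT).
        print(f"attempt {attempt} exited with code {result.returncode}",
              file=sys.stderr, flush=True)
    raise RuntimeError(f"batch failed {max_retries} times")
```

The snippet passed as `code` would contain the rendering loop for one batch; results have to be written to disk by the child, since nothing is returned to the parent.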
Hi @sagesimhon -- I just wanted to access |
Fix strange assertion error caused by mitsuba when load_scene is called many times because of this unresolved issue: mitsuba-renderer/mitsuba3#849
Hi, I'm reopening this issue from #190 and mitsuba-renderer/drjit-core#63.
Summary
Hello, @merlinND @wjakob following up from: mitsuba-renderer/drjit-core#63 I am now consistently running into this issue, but at random times during execution.
The error is as follows:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
More detail below.
System configuration
System information:
OS: Ubuntu 22.04.3 LTS
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
GPU: NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
Python: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
NVidia driver: 530.30.02
LLVM: 10.0.1
Dr.Jit: 0.4.1
Mitsuba: 3.2.1
Is custom build? False
Compiled with: GNU 10.2.1
Variants:
scalar_rgb
scalar_spectral
cuda_ad_rgb
llvm_ad_rgb
Description
I have a particular job that completes successfully on my MacBook Pro. However, that environment is not computationally scalable enough for my tasks, and I am only able to achieve low-resolution results. So I am trying to run on EC2 instances and other Linux machines with 50+ CPUs. This is where I see the issue.
I have a conda environment where I installed mitsuba3 using pip. I tried these versions:
mitsuba: 3.3.0
drjit: 0.4.2
and
mitsuba-3.2.1
drjit-0.4.1
I have LLVM v15 and v10 .so installed (via conda install), and set the path accordingly:
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-15.so
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-10.so
(I also installed LLVM using apt-get, no difference; I also tried to build LLVM from source, no difference.)
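Note that if both `export` lines are run in the same shell, the second assignment overrides the first, so only one LLVM version is in effect per run. The same setting can be applied from Python, which also allows checking that the library file exists. This is a sketch, assuming the variable has to be set before `drjit` or `mitsuba` is imported; `set_llvm_path` is a hypothetical helper.

```python
import os


def set_llvm_path(lib_path):
    """Point Dr.Jit at one specific libLLVM shared library.

    Raises FileNotFoundError if the file does not exist, so a typo in the
    path is caught before the renderer starts.
    """
    if not os.path.isfile(lib_path):
        raise FileNotFoundError(f"libLLVM not found: {lib_path}")
    os.environ["DRJIT_LIBLLVM_PATH"] = lib_path
    return lib_path
```

Calling it once at the very top of the entry script, with the full path to either `libLLVM-15.so` or `libLLVM-10.so`, makes it explicit which version a given run used.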
All of the above configurations result in this error:
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
It occurs at random times during execution -- sometimes after a few minutes, sometimes after an hour. No stack trace or any debugging info.
Any ideas where to start?
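One low-effort starting point for the missing stack trace is Python's standard `faulthandler` module, which dumps the Python-level traceback of every thread when the process receives a fatal signal. The sketch below assumes the failed assertion ends in `abort()` (SIGABRT), which has not been verified for nanothread; `enable_crash_tracebacks` is a hypothetical helper, and the output shows only Python frames, not the C++ ones.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Write Python tracebacks of all threads to `log_path` on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. The returned file object must stay open for the handler to work.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling it at the start of the job and inspecting the log after a crash would at least show which Python call each thread was in at the time.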
Steps to reproduce
I can share my repo (including clear instructions to reproduce in the README) with you -- let me know if you would like that.