[BUG] Cholesky example shows "matrix is not positive definite" error #1148

Open
s769 opened this issue Jul 29, 2024 · 1 comment

s769 commented Jul 29, 2024

Software versions

Python      :  3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Platform    :  Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
Legion      :  v24.01.00.dev-38-g90944d7
Legate      :  24.01.00.dev+38.g90944d7
WARNING: Disabling control replication for interactive run
Disable Control Replication
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   c315-012
  Local device: mlx5_0
--------------------------------------------------------------------------
Cunumeric   :  24.01.00.dev+29.g503affb8
Numpy       :  2.0.0
Scipy       :  1.14.0
Numba       :  0.60.0
/work/08435/srvenkat/ls6/miniconda3/lib/python3.12/site-packages/conda_package_streaming/package_streaming.py:25: UserWarning: zstandard could not be imported. Running without .conda support.
  warnings.warn("zstandard could not be imported. Running without .conda support.")
/work/08435/srvenkat/ls6/miniconda3/lib/python3.12/site-packages/conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
  _warnings.warn("Install zstandard Python bindings for .conda support")
CTK package :  cuda-version-12.4-hbda6634_3 (pkgs/main)
GPU driver  :  535.104.12
GPU devices :
  GPU 0: NVIDIA A100-PCIE-40GB
  GPU 1: NVIDIA A100-PCIE-40GB
  GPU 2: NVIDIA A100-PCIE-40GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

I ran the cholesky.py example with -n 257 and expected to see the timing/flops output.

Observed behavior

I got an error saying the matrix is not positive definite, raised after the timing output had already been printed (see the traceback below). This was strange, since I believe the example uses an identity matrix. I do not get the error for -n 256 or smaller.

Example code or instructions

legate --gpus 1 ./cholesky.py -n 257
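
For comparison, here is a minimal sketch of the same check in plain NumPy. It assumes the example factorizes an identity matrix, as stated above; the source of cholesky.py is not shown in this thread, and the helper name `check_cholesky` is hypothetical. It tests both sides of the 256/257 boundary.

```python
import numpy as np


def check_cholesky(n: int) -> bool:
    """Factorize an n x n identity matrix and verify the result.

    Hypothetical helper for illustration; not part of cholesky.py.
    """
    a = np.eye(n, dtype=np.float64)
    # np.linalg.cholesky raises numpy.linalg.LinAlgError if `a`
    # is not positive definite.
    lower = np.linalg.cholesky(a)
    # The factor must reconstruct the input: lower @ lower.T == a.
    return bool(np.allclose(lower @ lower.T, a))


# Sizes on either side of the boundary reported in this issue.
for n in (256, 257):
    print(n, check_cholesky(n))
```

If both sizes pass here but -n 257 still fails under legate, that would suggest the problem is in the cuNumeric/Legate code path rather than in the input matrix.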

Stack traceback or browser console output

(legate-ucx) c315-012.ls6(1033)$ legate --gpus 1 ./cholesky.py -n 257
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   c315-012
  Local device: mlx5_0
--------------------------------------------------------------------------
Elapsed Time: 52.263 ms
108267.2062453361 GOP/s
[0 - 14f511066000]    1.320818 {6}{python}: python exception occurred within task:
numpy.linalg.LinAlgError: Matrix is not positive definite

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legion_top.py", line 481, in legion_python_main
    cleanup()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 2164, in _cleanup_legate_runtime
    runtime.destroy()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 1322, in destroy
    self.raise_exceptions()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 2075, in raise_exceptions
    pending.raise_exception()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/exception.py", line 50, in raise_exception
    raise exn_reraised from exn_original
numpy.linalg.LinAlgError: Matrix is not positive definite
legion_python: /work/08435/srvenkat/ls6/legate.core/_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src/runtime/realm/python/python_module.cc:1054: virtual void Realm::LocalPythonProcessor::execute_task(Realm::Processor::TaskFuncID, const Realm::ByteArrayRef&): Assertion `0' failed.
Signal 6 received by node 0, process 2983422 (thread 14f511066000) - obtaining backtrace
Signal 6 received by process 2983422 (thread 14f511066000) at: stack trace: 14 frames
  [0] = raise at unknown file:0 [000014f78aaeba9f]
  [1] = abort at unknown file:0 [000014f78aabee04]
  [2] = __assert_fail_base.cold.0 at unknown file:0 [000014f78aabecd8]
  [3] = __assert_fail at unknown file:0 [000014f78aae43f5]
  [4] = Realm::LocalPythonProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [000014f78b41463a]
  [5] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [000014f78b3aaf41]
  [6] = Realm::KernelThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [000014f78b3aafd5]
  [7] = Realm::PythonThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [000014f78b41740c]
  [8] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [000014f78b3a9325]
  [9] = Realm::PythonThreadTaskScheduler::python_scheduler_loop() at unknown file:0 [000014f78b415f1e]
  [10] = Realm::KernelThread::pthread_entry(void*) at unknown file:0 [000014f78b3aed73]
  [11] = start_thread at unknown file:0 [000014f7889581ce]
  [12] = __clone at unknown file:0 [000014f78aad6dd2]
  [13] = unknown symbol at unknown file:0 [ffffffffffffffff]

manopapad (Contributor) commented

I am not seeing the issue on my machine with the 24.06 packages (the latest available on conda). Could you please check whether those solve your issue?
