Legion hang on rank 16 on DGX100 #1808

suranap · 2024-12-18T20:47:55Z

Running FlexFlow on Stanford's Marlowe, which is a DGX H100 machine. It completes the entire training run and hangs on Legion shutdown for rank 16.

1 node * 4 procs/node = WORKS
1 node * 8 procs/node = WORKS
2 node * 4 procs/node = WORKS
2 node * 8 procs/node = FREEZE
4 node * 4 procs/node = FREEZE
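
As a quick cross-check of the matrix above, here is a minimal sketch that tallies total ranks (nodes * procs/node) per configuration. The outcomes are copied from the list above, not measured, and the variable names are only illustrative.

```shell
#!/usr/bin/env bash
# Tally total ranks (nodes * procs per node) for each reported configuration.
# Outcomes are copied from the report above; nothing here is measured.
configs=("1 4 WORKS" "1 8 WORKS" "2 4 WORKS" "2 8 FREEZE" "4 4 FREEZE")
summary=""
for c in "${configs[@]}"; do
  read -r nodes procs outcome <<< "$c"
  total=$((nodes * procs))
  echo "${nodes} node(s) x ${procs} procs/node = ${total} ranks: ${outcome}"
  summary+="${total}:${outcome} "
done
```

In the reported runs, both freezing configurations have 16 total ranks, and every working one has 8 or fewer.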

Here's my srun:
srun --output ${LOGS}/run_%t.log --ntasks-per-node=8 select_gpu_device.sh ./build-default/flexflow_python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize 70000 -ll:zsize 10000 -lg:inorder -ll:force_kthreads

Attached backtrace and logs for 2 node * 8 procs/node:
N2n16_flexflow.tgz

Let me know what else you need to debug this.

lightsighter · 2024-12-19T08:29:05Z

Report the output of -level shutdown=2.
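
For reference, a sketch of what the rerun might look like, assuming the same srun line as in the report with the flag appended at the end. The placement of the flag is an assumption. The script only prints the command so it can be checked before launching.

```shell
#!/usr/bin/env bash
# Same srun line as in the original report, with -level shutdown=2 appended.
# Assumption: the flag can go at the end with the other runtime flags.
# LOGS is expected to be set as in the original run.
cmd=(srun --output "${LOGS}/run_%t.log" --ntasks-per-node=8
     select_gpu_device.sh ./build-default/flexflow_python
     examples/python/native/mnist_mlp.py
     -ll:py 1 -ll:gpu 1 -ll:fsize 70000 -ll:zsize 10000
     -lg:inorder -ll:force_kthreads
     -level shutdown=2)
# Dry run: print the full command. To launch it, run "${cmd[@]}" instead.
printf '%s ' "${cmd[@]}"
echo
```

The shutdown messages should then appear in the per-task run_%t.log files named by --output.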
