hello, i found a problem about the execution time #4

Suetsumu · 2022-10-08T16:43:03Z

I ran this program with the single-layer algorithm (using the __merge function to merge all layers) and found that its non-overlapped time is larger than with the MG-WFBP algorithm, which is as expected. However, the time per iteration and per epoch is lower than with MG-WFBP.

shyhuai · 2022-10-08T16:56:29Z

Hi, may I know whether you have reset the alpha and beta values (https://github.com/HKBU-HPML/MG-WFBP/blob/master/distributed_optimizer.py#L166)? They should be measured on your own cluster. MG-WFBP requires this information (nccl-tests can be used to estimate it) to generate a merging solution that fits the cluster.
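For readers who want to derive alpha and beta from their own measurements, here is a minimal sketch. It assumes the usual linear cost model t = alpha + beta * size for one all-reduce, and fits it by ordinary least squares to (message size, time) pairs such as those reported by nccl-tests. The function name fit_alpha_beta is hypothetical and the sample data is synthetic, not measured; this is not the repository's own benchmarking code.

```python
def fit_alpha_beta(sizes_bytes, times_s):
    """Least-squares fit of t = alpha + beta * size.

    sizes_bytes: message sizes in bytes
    times_s:     measured all-reduce times in seconds
    Returns (alpha, beta): startup time in seconds, per-byte time in s/byte.
    """
    n = len(sizes_bytes)
    if n < 2 or n != len(times_s):
        raise ValueError("need at least two (size, time) pairs of equal length")
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(times_s) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    if sxx == 0:
        raise ValueError("all message sizes are identical; cannot fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_bytes, times_s))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Synthetic example data (NOT real measurements): times generated from
# an assumed alpha of 1e-4 s and beta of 2e-10 s/byte.
sizes = [2 ** k for k in range(10, 24)]
times = [1e-4 + 2e-10 * s for s in sizes]

alpha, beta = fit_alpha_beta(sizes, times)
print(f"alpha = {alpha:.3e} s, beta = {beta:.3e} s/byte")
```

With real data, small messages mostly determine alpha and large messages mostly determine beta, so the measurements should cover a wide range of sizes.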

Suetsumu · 2022-10-09T14:14:46Z

Thanks for your reply. I will try your suggestion.

Suetsumu · 2022-10-10T08:39:57Z

I used the benchmark function _benchmark_communication (MG-WFBP/distributed_optimizer.py, line 105 in 5b8ad54) to get the alpha and beta values, and I found that alpha is almost 0.002 s. This value is so large that all layers are merged into a single layer. I don't know why the startup is so slow. I also found that the NCCL info log shows that my lab's GPUs don't support P2P. Could this be the reason?

shyhuai · 2022-10-10T08:58:41Z

Hi, the startup time (i.e., alpha) is quite large, so MG-WFBP tends to merge all layers into a single one. You may need to check your hardware configuration. There may be different reasons (e.g., p2p support or not, # of CPU PCIe lanes, PCIe version, etc.) causing a large latency.
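To illustrate why a large alpha pushes toward a single merged layer, here is a simplified sketch under the linear cost model t = alpha + beta * size. It deliberately ignores the overlap between communication and backward computation, which MG-WFBP does take into account, so it only shows the startup-cost side of the trade-off. The alpha value is the roughly 0.002 s reported in this thread; the beta value, the layer count, and the layer sizes are hypothetical.

```python
def allreduce_time(num_bytes, alpha, beta):
    """Linear cost model for one all-reduce of num_bytes bytes."""
    return alpha + beta * num_bytes


def layerwise_time(layer_bytes, alpha, beta):
    """Total time when every layer is communicated separately."""
    return sum(allreduce_time(b, alpha, beta) for b in layer_bytes)


def merged_time(layer_bytes, alpha, beta):
    """Total time when all layers are merged into one message."""
    return allreduce_time(sum(layer_bytes), alpha, beta)


alpha = 0.002   # startup time in seconds, as reported in this thread
beta = 1e-9     # per-byte time in s/byte (hypothetical)
layers = [4 * 1_000_000] * 50  # 50 hypothetical layers, 1M float32 values each

separate = layerwise_time(layers, alpha, beta)
merged = merged_time(layers, alpha, beta)

# Merging pays alpha once instead of once per layer, so the saving
# is (number of layers - 1) * alpha regardless of beta.
print(f"layer-wise: {separate:.4f} s, merged: {merged:.4f} s, "
      f"saved: {separate - merged:.4f} s")
```

In this model the saving from merging grows linearly with alpha, so when alpha is large the startup cost dominates and a single merged layer wins unless the lost overlap with computation outweighs it.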

Suetsumu · 2022-10-10T09:00:02Z

This is the nccl-tests result, obtained with 5 NVIDIA GeForce RTX 3090 GPUs.

Suetsumu · 2022-10-10T09:54:17Z

OK, I will check my hardware configuration. Thanks for your suggestion.
