Qs #230

I trained an MoE model on 8 GPUs with 8 experts. When I ran inference in parallel, each process produced similar but different results. What could be the cause of this?
Comments
Maybe you can consider whether drop-less MoE mode can solve your issue, which is achieved by setting
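(The exact setting is cut off above.) In capacity-based MoE routers, drop-less mode means no token is ever dropped for exceeding an expert's capacity. A minimal, illustrative PyTorch sketch of top-1 routing with and without a capacity limit — `top1_route` and its arguments are hypothetical, not this repository's API:

```python
import math
import torch

def top1_route(gate_logits, capacity_factor=None):
    """Top-1 token-to-expert assignment (illustrative only).

    With a finite capacity_factor, each expert accepts at most
    capacity = ceil(capacity_factor * num_tokens / num_experts) tokens;
    overflow tokens are dropped. With capacity_factor=None the routing is
    drop-less: every token is kept no matter how unevenly the gate routes.
    """
    num_tokens, num_experts = gate_logits.shape
    expert_idx = gate_logits.argmax(dim=-1)              # [num_tokens]
    if capacity_factor is None:
        keep = torch.ones(num_tokens, dtype=torch.bool)  # drop-less: keep all
    else:
        capacity = math.ceil(capacity_factor * num_tokens / num_experts)
        keep = torch.zeros(num_tokens, dtype=torch.bool)
        for e in range(num_experts):
            slots = (expert_idx == e).nonzero(as_tuple=True)[0]
            keep[slots[:capacity]] = True                # overflow is dropped
    return expert_idx, keep

# 16 tokens routed to 8 experts; capacity = ceil(1.25 * 16 / 8) = 3 per expert.
logits = torch.randn(16, 8)
idx, kept = top1_route(logits, capacity_factor=1.25)
print(idx, kept)
```

With a finite capacity factor, whether a given token is kept depends on which other tokens land in the same batch, so the same token can be treated differently on different processes; drop-less mode removes that batch dependence.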
The results still differ across processes, and they also differ from the results obtained with capacity_factor=1.25.
Do you have more information? I didn't get what you said.

Outputs from different GPUs:
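One generic way to quantify how far the per-GPU outputs diverge is to all-gather the same forward-pass output from every rank and diff against rank 0. This is a torch.distributed sketch; `report_cross_rank_divergence` is a hypothetical helper, and it assumes the process group is already initialized and every rank feeds the identical input through the model:

```python
import torch
import torch.distributed as dist

def report_cross_rank_divergence(output):
    """Gather one forward-pass output from every rank and print the max
    absolute difference from rank 0. Assumes dist.init_process_group()
    has already run and all ranks used the same input."""
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(output) for _ in range(world_size)]
    dist.all_gather(gathered, output.contiguous())
    if dist.get_rank() == 0:
        for rank, tensor in enumerate(gathered):
            diff = (tensor - gathered[0]).abs().max().item()
            print(f"rank {rank}: max |out - out_rank0| = {diff:.3e}")
```

Differences on the order of floating-point epsilon usually point to non-deterministic kernels or reduction order; larger, token-level differences are more consistent with capacity-based token dropping during routing.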