1) Will the accuracy of MG-WFBP be the same as the accuracy of the original SGD?
I think MG-WFBP effectively uses a smaller batch size, because it merges gradients before updating the weights, which seems to change the original SGD.
2) Can the approach of MG-WFBP be used with Adam?
Adam may take longer in the backward pass, which could reduce the communication-computation conflict problem discussed in many papers.
The accuracy of MG-WFBP is the same as that of the original synchronized SGD (S-SGD). For any given mini-batch size, MG-WFBP averages the gradients in a way that is consistent with the averaging operation of S-SGD. The reason MG-WFBP runs faster than S-SGD is that it merges gradients at the "right" positions so that more communication is hidden behind computation.
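Here is a minimal sketch (not the actual MG-WFBP code; the worker count and tensor sizes are made up) showing why merging gradients into one buffer before averaging cannot change the result:

```python
# Sketch: averaging a merged (concatenated) gradient buffer is
# mathematically identical to averaging each layer's gradient
# separately, so MG-WFBP's updates match S-SGD's exactly.
import torch

P = 4  # hypothetical number of workers
# Per-layer gradients from each worker (two layers for illustration).
grads = [[torch.randn(3), torch.randn(5)] for _ in range(P)]

# S-SGD: average each layer's gradient across workers.
layerwise = [torch.stack([g[l] for g in grads]).mean(0) for l in range(2)]

# MG-WFBP-style: merge layers into one buffer per worker, average the
# merged buffer once, then split back into per-layer views.
merged = [torch.cat(g) for g in grads]
avg_merged = torch.stack(merged).mean(0)
split = torch.split(avg_merged, [3, 5])

for a, b in zip(layerwise, split):
    assert torch.allclose(a, b)  # identical gradients -> identical updates
```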
MG-WFBP can also be applied to many first-order optimizers, including Adam, as the key idea of MG-WFBP is to schedule the gradients for communication. So one can take the averaged gradients from P workers and feed them into their own optimizer.
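For example, here is a hedged sketch of feeding averaged gradients into PyTorch's Adam. `allreduce_average` is a hypothetical stand-in for whatever communication scheduler (e.g., MG-WFBP's merged all-reduce) produces the averaged gradients; with `torch.distributed` it would be an all-reduce followed by a division by the world size:

```python
import torch

def allreduce_average(t: torch.Tensor) -> torch.Tensor:
    # Placeholder for the cross-worker average; with torch.distributed:
    #   dist.all_reduce(t); t /= dist.get_world_size()
    return t

model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Replace each local gradient with the cross-worker average, then let
# Adam consume the averaged gradients exactly as it would under S-SGD.
for p in model.parameters():
    if p.grad is not None:
        p.grad = allreduce_average(p.grad)
opt.step()
opt.zero_grad()
```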
Hope the responses address your concerns. Thanks.