Add AllReduce distributed strategy design #373

Open

QiJune wants to merge 5 commits into develop
Conversation

@QiJune (Collaborator) commented Oct 27, 2020

Here it is for better review.

@QiJune changed the title [WIP] Add AllReduce distributed strategy design → Add AllReduce distributed strategy design Oct 28, 2020
codecov bot commented Oct 28, 2020

Codecov Report

Merging #373 into develop will decrease coverage by 0.06%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           develop     #373      +/-   ##
===========================================
- Coverage    87.75%   87.69%   -0.07%     
===========================================
  Files           33       33              
  Lines         1503     1503              
===========================================
- Hits          1319     1318       -1     
- Misses         121      122       +1     
  Partials        63       63              
Impacted Files Coverage Δ
vision/imageloader/imageloader.go 90.41% <0.00%> (-0.69%) ⬇️


Single-process multi-GPU is not the recommended mode,
because of the overhead of scatter/gather and GIL contention in every forward pass.
So, let's focus on DistributedDataParallel.
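
For context, a minimal sketch of the DistributedDataParallel pattern this excerpt refers to, written against PyTorch's Python API; the model, data, and launch details below are illustrative placeholders, not part of the design:

```python
# Illustrative sketch of PyTorch DistributedDataParallel (DDP):
# one process per GPU, gradients synchronized with AllReduce during backward.
# Assumes MASTER_ADDR/MASTER_PORT are set and one process is launched per
# device (e.g. via torch.multiprocessing.spawn or torchrun).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).to(rank)       # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(20, 10).to(rank)          # placeholder batch
    loss = ddp_model(inputs).sum()
    loss.backward()     # DDP all-reduces gradients across processes here
    optimizer.step()
```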
Collaborator

GoTorch does not have a GIL; does the single-process multi-GPU mode fit GoTorch?

@QiJune (Collaborator, Author) commented Oct 29, 2020

The answer is no. There are two reasons:

  • The overhead of scatter/gather is also non-negligible. We once used scatter/parallel-do/gather to support multi-GPU AllReduce in Paddle with C++, and from that experience the speedup ratio was not very good.

  • We can only use scatter/gather across the GPUs of a single node; it cannot be scaled to multi-node multi-GPU, whereas a collective AllReduce can (see the sketch below).
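
To illustrate the second point, a minimal sketch of the collective-AllReduce path that works the same across one node or many, using PyTorch's torch.distributed API; the helper name and the assumption that the process group is already initialized are illustrative:

```python
# Sketch: averaging gradients with a collective AllReduce. Unlike
# scatter/gather, the same call works whether the participating workers
# are GPUs on one node or spread over multiple nodes.
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module):
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Every worker contributes its local gradient; all_reduce sums
            # them in place, so all workers end up with the same average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```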
