[Feat] Add support for distributed training #257
base: master
Conversation
Thanks a lot for the PR @tchaton, and for the preliminary review @pietrolesci.
Some questions and minor remarks:
- Do these changes remove the strict requirement for pytorch_lightning==1.0.2? (If yes, update lightning_episodic_module.py:16 accordingly.)
- Could you list your contributions (crediting yourself) in CHANGELOG.md?
def train_dataloader(self):
-    return Epochifier(
+    return EpisodicBatcher.epochify(
Was Epochifier inadvertently removed? Or should it be replaced by TaskDataParallel?
Also, if I remember correctly, Epochifier was preventing us from using pytorch_lightning > 1.0.2. I believe the fix was just to inherit from a data class in lightning. @nightlessbaron, do you remember which class, and whether that was enough?
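For reference, a minimal sketch of what that fix could look like, assuming the "data class" in question is a plain map-style dataset (torch.utils.data.Dataset, which newer pytorch_lightning versions accept when inspecting dataloaders). The class body below is illustrative, not the actual learn2learn implementation:

```python
import torch


class Epochifier(torch.utils.data.Dataset):
    """Expose an episodic task sampler as a fixed-length, map-style dataset."""

    def __init__(self, tasks, length):
        self.tasks = tasks    # any object with a .sample() method, e.g. a task dataset
        self.length = length  # number of episodes that make up one "epoch"

    def __getitem__(self, index):
        # The index is ignored: every access draws a fresh task/episode.
        return self.tasks.sample()

    def __len__(self):
        return self.length
```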
@seba-1511, quick answer to your first question: Epochifier is replaced by TaskDataParallel.
@seba-1511, as @pietrolesci mentioned, in the benchmark we replaced Epochifier with TaskDistributedParallel. The downside is that Windows users can't make use of it, as stated in #5358.

> I believe the fix was just to inherit from a data class in lightning.

I couldn't figure it out. That is still unresolved.
Hey @seba-1511, yes, it was inadvertently removed.
However, I added a TODO. It should be replaced by TaskDataParallel / TaskDistributedDataParallel once they are tested on your side.
I will copy those files on the Flash side so as not to block the integration until the next learn2learn release.
Best,
T.C
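To make the intended replacement concrete, here is a rough sketch of how the train_dataloader hook could pick between the two wrappers named above. The import path and the constructor arguments (epoch_length, batch_size=None) are assumptions about this PR, not a settled API:

```python
import torch
import torch.distributed as dist
import pytorch_lightning as pl

# Assumed import path; this PR may expose the wrappers elsewhere.
from learn2learn.utils.lightning import TaskDataParallel, TaskDistributedDataParallel


class EpisodicModule(pl.LightningModule):
    def __init__(self, train_tasks, episodes_per_epoch=100):
        super().__init__()
        self.train_tasks = train_tasks
        self.episodes_per_epoch = episodes_per_epoch

    def train_dataloader(self):
        if dist.is_available() and dist.is_initialized():
            # Distributed run: shard episodes across processes
            # (not usable on Windows, per the issue referenced above).
            tasks = TaskDistributedDataParallel(
                self.train_tasks, epoch_length=self.episodes_per_epoch
            )
        else:
            tasks = TaskDataParallel(self.train_tasks, epoch_length=self.episodes_per_epoch)
        # batch_size=None disables automatic batching: each item is already a full episode.
        return torch.utils.data.DataLoader(tasks, batch_size=None)
```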
Description
This PR adds support for computing the support set on new data when running inference. This is only supported by LightningPrototypicalNetworks right now, as the classifier needs re-fitting.
This is used for Lightning Flash learn2learn integration: Lightning-Universe/lightning-flash#737
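In other words, the classifier's prototypes are re-fitted from a fresh support set before queries are classified. A minimal, self-contained sketch of that idea (the helper names here are illustrative, not the LightningPrototypicalNetworks API):

```python
import torch


def fit_prototypes(embed, support_x, support_y):
    """Recompute one prototype per class from a new support set."""
    z = embed(support_x)                            # (n_support, embedding_dim)
    classes = support_y.unique(sorted=True)
    prototypes = torch.stack([z[support_y == c].mean(dim=0) for c in classes])
    return prototypes, classes


def predict(embed, prototypes, classes, query_x):
    """Assign each query to the class of its nearest prototype."""
    q = embed(query_x)                              # (n_query, embedding_dim)
    distances = torch.cdist(q, prototypes)          # (n_query, n_classes)
    return classes[distances.argmin(dim=1)]
```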
Contribution Checklist
If your contribution modifies code in the core library (not docs, tests, or examples), please fill the following checklist.
Optional
If you make major changes to the core library, please run make alltests and copy-paste the content of alltests.txt below.