
add new fairseq_pretraining function for starting from checkpoint #255

Merged · 4 commits merged into rwth-i6:main from from_checkpoint_pretrain on Jan 7, 2025

Conversation

@AndreasPlt (Contributor) commented on Oct 22, 2024

To enable scheduled hard_negatives training, I added a run_fairseq_pretraining_from_checkpoint function that takes an additional checkpoint parameter and runs the pretraining job from that checkpoint.

@AndreasPlt (Contributor, Author) commented on Oct 29, 2024

I noticed that, when using the checkpoint parameter, it is not possible to resume training from the last checkpoint after an interruption.

The reason is that, when checkpoint is given, fairseq always starts training from that checkpoint rather than from checkpoint_last.pt. On the other hand, when the checkpoint parameter is removed so that fairseq trains from the last checkpoint, the hash changes and checkpoint_last.pt is in fact no longer available (because it was saved by the previous job under the other hash). Any ideas on how to solve that elegantly?

@AndreasPlt (Contributor, Author) commented:
To tackle the problem described above, I proposed changing the behavior of the FairseqHydraTrainingJob in the i6_core repo (see this PR).

@AndreasPlt changed the title from "add new fairseq_pretraining job with starting from cp" to "add new fairseq_pretraining function for starting from checkpoint" on Nov 14, 2024
@vieting self-requested a review on November 14, 2024 at 13:30
@vieting (Contributor) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function run_fairseq_pretraining_from_checkpoint that you introduce overlaps heavily with the existing run_fairseq_pretraining. Instead of copying the whole function and changing only a small part of it, please add checkpoint: Optional[tk.Path] = None as an argument to the existing function and apply the modifications you need when it is given, e.g.

if checkpoint is not None:
    fairseq_args["checkpoint"]["continue_once"] = checkpoint

@AndreasPlt (Contributor, Author) commented:
Done. Instead of tk.Path, however, I added str as a type hint, since I am unsure whether fairseq accepts a tk.Path. Feel free to correct me if I'm wrong.

@vieting (Contributor) commented on Dec 18, 2024

fairseq will never see a tk.Path object, because it is written to the config as a normal string, and fairseq only interacts with the config file, not with what you have in your sisyphus graph, right?
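As a toy illustration of that point (assuming sisyphus's tk.Path; the actual config writing is done by the i6_core job, not by this snippet), only the resolved path string ends up in the hydra config that fairseq reads:

from sisyphus import tk

# A checkpoint produced by an earlier job; in a real setup this would come from the graph.
checkpoint = tk.Path("/path/to/earlier_pretraining/checkpoints/checkpoint_best.pt")

# When the config is serialized, the Path is rendered as a plain string,
# so fairseq itself only ever sees an ordinary file path.
print(f"continue_once: {checkpoint.get_path()}")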

@vieting merged commit 4a1af01 into rwth-i6:main on Jan 7, 2025
2 checks passed
@AndreasPlt deleted the from_checkpoint_pretrain branch on January 7, 2025 at 13:43