
add new fairseq_pretraining function for starting from checkpoint #255

Merged · 4 commits merged into rwth-i6:main from from_checkpoint_pretrain on Jan 7, 2025

Conversation

@AndreasPlt (Contributor) commented on Oct 22, 2024

To enable scheduled hard_negatives training, I added a run_fairseq_pretraining_from_checkpoint function that takes an additional checkpoint parameter and runs the pretraining job from that checkpoint.

@AndreasPlt (Contributor, Author) commented on Oct 29, 2024

I noticed that, when using the checkpoint parameter, it is not possible to resume training from the last checkpoint after an interruption.

The reason is that, when checkpoint is given, fairseq always starts training from that checkpoint rather than from checkpoint_last.pt. On the other hand, when the checkpoint parameter is removed so that fairseq trains from the last checkpoint, the hash changes and checkpoint_last.pt is in fact no longer available (because it was saved by the previous job under the other hash). Any ideas on how to solve that elegantly?

@AndreasPlt (Contributor, Author) commented:
To tackle the problem described above, I proposed changing the behavior of the FairseqHydraTrainingJob in the i6_core repo (see this PR).

@AndreasPlt changed the title from "add new fairseq_pretraining job with starting from cp" to "add new fairseq_pretraining function for starting from checkpoint" on Nov 14, 2024
@vieting self-requested a review on November 14, 2024 at 13:30
@vieting (Contributor) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function run_fairseq_pretraining_from_checkpoint that you introduce overlaps heavily with the existing run_fairseq_pretraining. Instead of copying the whole function and changing only a small part of it, please add checkpoint: Optional[tk.Path] = None as an argument to the existing function and apply the modifications you need when it is given, e.g.

if checkpoint is not None:
    fairseq_args["checkpoint"]["continue_once"] = checkpoint

@AndreasPlt (Contributor, Author) commented:
Done. Instead of tk.Path, however, I added str as a type hint, since I am unsure whether fairseq accepts a tk.Path. Feel free to correct me if I'm wrong.

@vieting (Contributor) commented on Dec 18, 2024

fairseq will never see a tk.Path object, because it is written to the config as a normal string, and fairseq only interacts with the config file, not with what you have in your sisyphus graph, right?
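As a toy illustration of that point (assuming sisyphus's tk.Path; the actual config writing is done by the i6_core job, not by this snippet), only the resolved path string ends up in the hydra config that fairseq reads:

from sisyphus import tk

# A checkpoint produced by an earlier job; in a real setup this would come from the graph.
checkpoint = tk.Path("/path/to/earlier_pretraining/checkpoints/checkpoint_best.pt")

# When the config is serialized, the Path is rendered as a plain string,
# so fairseq itself only ever sees an ordinary file path.
print(f"continue_once: {checkpoint.get_path()}")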

@vieting merged commit 4a1af01 into rwth-i6:main on Jan 7, 2025
2 checks passed
@AndreasPlt deleted the from_checkpoint_pretrain branch on January 7, 2025 at 13:43