[Fix] Fix flagscale entrypoint train.py and update extra_valid for newest megatron #306

zhaoyinglia · 2025-01-06T04:48:14Z

No description provided.

aoyulong · 2025-01-06T05:30:36Z

flagscale/train/train.py

@@ -1141,7 +1134,7 @@ def training_log(loss_dict, total_loss_dict, learning_rate, decoupled_learning_r
        total_loss_dict[nan_iters_key] = 0
        print_rank_last(log_string)
        if not args.auto_tune:
-            if report_memory_flag and learning_rate > 0.:


Why does this PR remove the `learning_rate > 0.` check?

Because Megatron-LM removed it upstream, but we did not remove it in the previous merge-megatron PR.
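
To make the behavioral difference concrete, here is a minimal sketch, not the actual `training_log` code. The helper name `should_report_memory` and its `require_positive_lr` switch are hypothetical and exist only for illustration. It also assumes that, after this change, the check depends on `report_memory_flag` alone.

```python
def should_report_memory(report_memory_flag, learning_rate, require_positive_lr):
    """Decide whether the memory report should be printed.

    Hypothetical helper for illustration only; it is not part of
    FlagScale or Megatron-LM.
    """
    if require_positive_lr:
        # Old condition shown on the removed diff line:
        # report only once the learning rate is positive.
        return bool(report_memory_flag and learning_rate > 0.)
    # Assumed new condition: the flag alone decides.
    return bool(report_memory_flag)


# With a zero learning rate (for example, at the very start of warmup),
# the old condition suppresses the report and the assumed new one does not.
old_behavior = should_report_memory(True, 0.0, require_positive_lr=True)
new_behavior = should_report_memory(True, 0.0, require_positive_lr=False)
```

When the learning rate is already positive, both variants agree, so the difference is only visible on iterations logged with a learning rate of zero.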

aoyulong

LGTM

zhaoyinglia added 3 commits January 6, 2025 12:37

[Fix] fix flagscale entrypoint train.py

93af545

rm moe tracker

d3fedb8

update extra valid for newest megatron

5149bf9

zhaoyinglia requested a review from a team as a code owner January 6, 2025 04:48

aoyulong reviewed Jan 6, 2025

fix conflict

cda2753

aoyulong approved these changes Jan 6, 2025

aoyulong merged commit 0ba8435 into FlagOpen:main Jan 6, 2025
18 checks passed
