
Update the Gaudi trainer with transformers 4.45.2 #1398

Merged: 55 commits merged into huggingface:main from yafshar:trainer on Dec 9, 2024

Conversation

@yafshar (Contributor) commented on Oct 4, 2024

What does this PR do?

Update the Gaudi trainer with transformers 4.45.2

  • Add the description
  • Remove the _is_peft_model function and import it from transformers instead
  • Remove the redundant import of datasets and keep the first one
  • Update _get_train_sampler by replacing the computed num_samples variable
  • Update _inner_training_loop
    • use num_update_steps_per_epoch where possible
    • remove an unused line
    • add a new _should_compute_grad_norm variable to remove an extra conditional check (see the first sketch after this list)
  • Update _load_best_model, enabling the if self.is_deepspeed_enabled branch
  • Update _maybe_log_save_evaluate: _grad_norm.item() -> _grad_norm.detach().item()
  • Remove _save_checkpoint, which is identical to the transformers implementation
  • Update autocast_smart_context_manager, updating the old interface (see the second sketch after this list)
  • Update save_model, adding an accelerate version check
  • Update evaluation_loop
    • introduce a new _should_update_inputs variable for llama, qwen2, starcoder2, and gemma to avoid a double if condition in the loop (also covered in the first sketch below)
    • set logits_dtype to float32 before the loop
    • re-order the conditionals for losses, logits, and labels to match transformers
  • Update _inner_training_loop
    • introduce a new _should_update_inputs variable for llama, qwen2, starcoder2, and gemma to avoid a double if condition in the loop
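
A recurring change above is hoisting per-step checks out of the hot loop. The following is a minimal, self-contained sketch of that pattern only; the names here (MODEL_TYPES_NEEDING_INPUT_UPDATE, run_loop, the placeholder input tweak) are illustrative assumptions, not the actual GaudiTrainer code.

```python
from typing import Any, Dict, Iterable, Optional

# Illustrative placeholder, not the exact set or name used in GaudiTrainer.
MODEL_TYPES_NEEDING_INPUT_UPDATE = {"llama", "qwen2", "starcoder2", "gemma"}


def run_loop(
    model_type: str,
    batches: Iterable[Dict[str, Any]],
    max_grad_norm: Optional[float],
) -> None:
    # Compute the flags once, before the loop, instead of re-evaluating the
    # same conditions at every step.
    _should_update_inputs = model_type in MODEL_TYPES_NEEDING_INPUT_UPDATE
    _should_compute_grad_norm = max_grad_norm is not None and max_grad_norm > 0

    for batch in batches:
        if _should_update_inputs:
            # Placeholder for the model-specific input tweak done in the real loop.
            batch = {**batch, "extra_model_kwarg": True}
        # ... forward / backward pass would happen here ...
        if _should_compute_grad_norm:
            # ... gradient clipping / grad-norm logging would happen here ...
            pass


if __name__ == "__main__":
    run_loop("llama", [{"input_ids": [1, 2, 3]}], max_grad_norm=1.0)
```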
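For the autocast_smart_context_manager bullet, the newer torch.autocast interface replaces the older per-device context managers. The sketch below assumes an HPU-enabled PyTorch build and bf16; the exact arguments GaudiTrainer passes may differ.

```python
import torch


def autocast_context(enabled: bool = True, cache_enabled: bool = True) -> torch.autocast:
    # New-style torch.autocast call; the older per-device managers
    # (e.g. torch.cuda.amp.autocast) are deprecated in recent PyTorch.
    # device_type="hpu" assumes the Intel Gaudi PyTorch bridge is installed.
    return torch.autocast(
        device_type="hpu",
        dtype=torch.bfloat16,
        enabled=enabled,
        cache_enabled=cache_enabled,
    )


# Usage (inside a training or evaluation step):
# with autocast_context():
#     outputs = model(**inputs)
```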

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

yafshar added 12 commits October 4, 2024 09:59
@yafshar yafshar marked this pull request as ready for review October 10, 2024 11:45
@yafshar yafshar requested a review from regisss as a code owner October 10, 2024 11:45
@yafshar (Contributor, PR author) commented on Oct 11, 2024

The environment for all tests:

export RUN_SLOW=true
export GAUDI2_CI=1

Tests finished so far:

- slow_tests_fsdp -> 2 passed in 689.82s (0:11:29) | same accuracy and loss as main on both tests

- test_trainer_distributed -> 2 passed, 6 warnings in 21.19s | same as main

- test_trainer -> 75 passed, 8 skipped, 37 warnings in 55.23s | same as main

- test_trainer_seq2seq -> 2 passed, 6 warnings in 21.19s | same as main

Other tests are in progress; results will be added here.

- Introduce a new _should_update_inputs variable for llama, qwen2, starcoder2, and gemma to avoid a double if condition in the loop
@yafshar (Contributor, PR author) commented on Oct 14, 2024

The environment for all tests:

export RUN_SLOW=true
export GAUDI2_CI=1

test_examples has finished:

- test_examples -> 11 failed, 50 passed, 2 warnings in 21761.68s (6:02:41) on this PR | 11 failed, 50 passed, 2 warnings in 23254.39s (6:27:34) on main
  • main branch
=========================== short test summary info ============================
FAILED tests/test_examples.py::MultiCardQuestionAnsweringExampleTester::test_run_qa_roberta-large_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingExampleTester::test_run_clm_gpt2_multi_card
FAILED tests/test_examples.py::DeepspeedCausalLanguageModelingExampleTester::test_run_clm_CodeLlama-13b-Instruct-hf_deepspeed
FAILED tests/test_examples.py::MultiCardSummarizationExampleTester::test_run_summarization_t5-small_multi_card
FAILED tests/test_examples.py::DeepspeedSummarizationExampleTester::test_run_summarization_flan-t5-xxl_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingLORAExampleTester2::test_run_lora_clm_falcon-40b_multi_card
FAILED tests/test_examples.py::MultiCardSeq2SeqSpeechRecognitionExampleTester::test_run_speech_recognition_seq2seq_whisper-small_multi_card
FAILED tests/test_examples.py::DeepspeedSFTExampleTester::test_sft_Qwen2-72B_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingPrefixTuningExampleTester::test_run_prompt_tuning_clm_llama-7b_multi_card
FAILED tests/test_examples.py::MultiCardMultiTastPromptPeftExampleTester::test_run_multitask_prompt_tuning_t5-small_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingVeraExampleTester::test_run_lora_clm_llama-7b_multi_card
=========== 11 failed, 50 passed, 2 warnings in 23254.39s (6:27:34) ============
  • Current PR
=========================== short test summary info ============================
FAILED tests/test_examples.py::MultiCardQuestionAnsweringExampleTester::test_run_qa_roberta-large_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingExampleTester::test_run_clm_gpt2_multi_card
FAILED tests/test_examples.py::DeepspeedCausalLanguageModelingExampleTester::test_run_clm_CodeLlama-13b-Instruct-hf_deepspeed
FAILED tests/test_examples.py::MultiCardSummarizationExampleTester::test_run_summarization_t5-small_multi_card
FAILED tests/test_examples.py::DeepspeedSummarizationExampleTester::test_run_summarization_flan-t5-xxl_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingLORAExampleTester2::test_run_lora_clm_falcon-40b_multi_card
FAILED tests/test_examples.py::MultiCardSeq2SeqSpeechRecognitionExampleTester::test_run_speech_recognition_seq2seq_whisper-small_multi_card
FAILED tests/test_examples.py::DeepspeedSFTExampleTester::test_sft_Qwen2-72B_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingPrefixTuningExampleTester::test_run_prompt_tuning_clm_llama-7b_multi_card
FAILED tests/test_examples.py::MultiCardMultiTastPromptPeftExampleTester::test_run_multitask_prompt_tuning_t5-small_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingVeraExampleTester::test_run_lora_clm_llama-7b_multi_card
=========== 11 failed, 50 passed, 2 warnings in 21761.68s (6:02:41) ============

@yafshar (Contributor, PR author) commented on Oct 24, 2024

@regisss, would you please review this PR? If we want to move toward v4.46.0, this PR will help; there are Trainer updates for the new version.

@emascarenhas (Contributor) commented:
Please make the title more descriptive so it says what the PR is doing.

@yafshar changed the title from "Trainer" to "Update the Gaudi trainer with transformers 4.45.0" on Nov 5, 2024
@HuggingFaceDocBuilderDev commented:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@emascarenhas (Contributor) commented:
Nice! I'll run the nightly CI to check that there is no regression. Also, I'm not sure the _load_best_model method works with DeepSpeed and FSDP.

@regisss, do you have CI results to share? Was there any performance regression?

@yafshar (Contributor, PR author) commented on Dec 3, 2024

With the latest changes:

main branch

>>>  RUN_SLOW=true GAUDI2_CI=1 python -m pytest tests/test_examples.py -v -s --token=$HUGGING_FACE_HUB_TOKEN -k MultiCardCausalLanguageModelingLORAExampleTester

2 passed, 51 deselected in 2873.00s (0:47:52)

current PR

>>>  RUN_SLOW=true GAUDI2_CI=1 python -m pytest tests/test_examples.py -v -s --token=$HUGGING_FACE_HUB_TOKEN -k MultiCardCausalLanguageModelingLORAExampleTester

2 passed, 51 deselected in 2301.59s (0:38:21)

@jiminha jiminha requested review from ssarkar2 and libinta December 4, 2024 19:03
@regisss regisss merged commit e627a26 into huggingface:main Dec 9, 2024
4 checks passed
@yafshar yafshar deleted the trainer branch December 9, 2024 21:38
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 9, 2024
imangohari1 pushed a commit to imangohari1/optimum-habana that referenced this pull request Dec 10, 2024
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 10, 2024
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 11, 2024