
Update the Gaudi trainer with transformers 4.45.2 #1398

Merged: 55 commits merged into huggingface:main from yafshar:trainer on Dec 9, 2024

Conversation

@yafshar (Contributor) commented on Oct 4, 2024

What does this PR do?

Update the Gaudi trainer with transformers 4.45.2

  • Add the description
  • Remove the _is_peft_model function and import it from transformers instead
  • Remove the redundant import of datasets and keep the first one
  • Update _get_train_sampler by replacing the computed num_samples variable
  • Update _inner_training_loop
    • use num_update_steps_per_epoch where possible
    • remove an unused line
    • add a new _should_compute_grad_norm variable to remove an extra conditional check (see the first sketch after this list)
  • Update _load_best_model, enabling the if self.is_deepspeed_enabled branch
  • Update _maybe_log_save_evaluate: _grad_norm.item() -> _grad_norm.detach().item()
  • Remove _save_checkpoint, which is identical to the transformers implementation
  • Update autocast_smart_context_manager, updating the old interface (see the second sketch after this list)
  • Update save_model, adding an accelerate version check
  • Update evaluation_loop
    • introduce a new _should_update_inputs variable for llama, qwen2, starcoder2, and gemma to avoid a double if condition in the loop (also covered in the first sketch below)
    • set logits_dtype to float32 before the loop
    • re-order the conditionals for losses, logits, and labels to match transformers
  • Update _inner_training_loop
    • introduce a new _should_update_inputs variable for llama, qwen2, starcoder2, and gemma to avoid a double if condition in the loop
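
A recurring change above is hoisting per-step checks out of the hot loop. The following is a minimal, self-contained sketch of that pattern only; the names here (MODEL_TYPES_NEEDING_INPUT_UPDATE, run_loop, the placeholder input tweak) are illustrative assumptions, not the actual GaudiTrainer code.

```python
from typing import Any, Dict, Iterable, Optional

# Illustrative placeholder, not the exact set or name used in GaudiTrainer.
MODEL_TYPES_NEEDING_INPUT_UPDATE = {"llama", "qwen2", "starcoder2", "gemma"}


def run_loop(
    model_type: str,
    batches: Iterable[Dict[str, Any]],
    max_grad_norm: Optional[float],
) -> None:
    # Compute the flags once, before the loop, instead of re-evaluating the
    # same conditions at every step.
    _should_update_inputs = model_type in MODEL_TYPES_NEEDING_INPUT_UPDATE
    _should_compute_grad_norm = max_grad_norm is not None and max_grad_norm > 0

    for batch in batches:
        if _should_update_inputs:
            # Placeholder for the model-specific input tweak done in the real loop.
            batch = {**batch, "extra_model_kwarg": True}
        # ... forward / backward pass would happen here ...
        if _should_compute_grad_norm:
            # ... gradient clipping / grad-norm logging would happen here ...
            pass


if __name__ == "__main__":
    run_loop("llama", [{"input_ids": [1, 2, 3]}], max_grad_norm=1.0)
```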
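For the autocast_smart_context_manager bullet, the newer torch.autocast interface replaces the older per-device context managers. The sketch below assumes an HPU-enabled PyTorch build and bf16; the exact arguments GaudiTrainer passes may differ.

```python
import torch


def autocast_context(enabled: bool = True, cache_enabled: bool = True) -> torch.autocast:
    # New-style torch.autocast call; the older per-device managers
    # (e.g. torch.cuda.amp.autocast) are deprecated in recent PyTorch.
    # device_type="hpu" assumes the Intel Gaudi PyTorch bridge is installed.
    return torch.autocast(
        device_type="hpu",
        dtype=torch.bfloat16,
        enabled=enabled,
        cache_enabled=cache_enabled,
    )


# Usage (inside a training or evaluation step):
# with autocast_context():
#     outputs = model(**inputs)
```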

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

yafshar added 12 commits October 4, 2024 09:59
@yafshar yafshar marked this pull request as ready for review October 10, 2024 11:45
@yafshar yafshar requested a review from regisss as a code owner October 10, 2024 11:45
@yafshar (Contributor, PR author) commented on Oct 11, 2024

The environment for all tests:

export RUN_SLOW=true
export GAUDI2_CI=1

Tests finished so far:

- slow_tests_fsdp -> 2 passed in 689.82s (0:11:29) | same accuracy and loss as main on both tests

- test_trainer_distributed -> 2 passed, 6 warnings in 21.19s | same as main

- test_trainer -> 75 passed, 8 skipped, 37 warnings in 55.23s | same as main

- test_trainer_seq2seq -> 2 passed, 6 warnings in 21.19s | same as main

Other tests are in progress; results will be added here.

- Introduce a new _should_update_inputs variable for llama, qwen2, starcoder2, and gemma to avoid a double if condition in the loop
@yafshar (Contributor, PR author) commented on Oct 14, 2024

The environment for all tests:

export RUN_SLOW=true
export GAUDI2_CI=1

test_examples has finished:

- test_examples -> 11 failed, 50 passed, 2 warnings in 21761.68s (6:02:41) on this PR | 11 failed, 50 passed, 2 warnings in 23254.39s (6:27:34) on main
  • main branch
=========================== short test summary info ============================
FAILED tests/test_examples.py::MultiCardQuestionAnsweringExampleTester::test_run_qa_roberta-large_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingExampleTester::test_run_clm_gpt2_multi_card
FAILED tests/test_examples.py::DeepspeedCausalLanguageModelingExampleTester::test_run_clm_CodeLlama-13b-Instruct-hf_deepspeed
FAILED tests/test_examples.py::MultiCardSummarizationExampleTester::test_run_summarization_t5-small_multi_card
FAILED tests/test_examples.py::DeepspeedSummarizationExampleTester::test_run_summarization_flan-t5-xxl_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingLORAExampleTester2::test_run_lora_clm_falcon-40b_multi_card
FAILED tests/test_examples.py::MultiCardSeq2SeqSpeechRecognitionExampleTester::test_run_speech_recognition_seq2seq_whisper-small_multi_card
FAILED tests/test_examples.py::DeepspeedSFTExampleTester::test_sft_Qwen2-72B_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingPrefixTuningExampleTester::test_run_prompt_tuning_clm_llama-7b_multi_card
FAILED tests/test_examples.py::MultiCardMultiTastPromptPeftExampleTester::test_run_multitask_prompt_tuning_t5-small_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingVeraExampleTester::test_run_lora_clm_llama-7b_multi_card
=========== 11 failed, 50 passed, 2 warnings in 23254.39s (6:27:34) ============
  • Current PR
=========================== short test summary info ============================
FAILED tests/test_examples.py::MultiCardQuestionAnsweringExampleTester::test_run_qa_roberta-large_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingExampleTester::test_run_clm_gpt2_multi_card
FAILED tests/test_examples.py::DeepspeedCausalLanguageModelingExampleTester::test_run_clm_CodeLlama-13b-Instruct-hf_deepspeed
FAILED tests/test_examples.py::MultiCardSummarizationExampleTester::test_run_summarization_t5-small_multi_card
FAILED tests/test_examples.py::DeepspeedSummarizationExampleTester::test_run_summarization_flan-t5-xxl_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingLORAExampleTester2::test_run_lora_clm_falcon-40b_multi_card
FAILED tests/test_examples.py::MultiCardSeq2SeqSpeechRecognitionExampleTester::test_run_speech_recognition_seq2seq_whisper-small_multi_card
FAILED tests/test_examples.py::DeepspeedSFTExampleTester::test_sft_Qwen2-72B_deepspeed
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingPrefixTuningExampleTester::test_run_prompt_tuning_clm_llama-7b_multi_card
FAILED tests/test_examples.py::MultiCardMultiTastPromptPeftExampleTester::test_run_multitask_prompt_tuning_t5-small_multi_card
FAILED tests/test_examples.py::MultiCardCausalLanguageModelingVeraExampleTester::test_run_lora_clm_llama-7b_multi_card
=========== 11 failed, 50 passed, 2 warnings in 21761.68s (6:02:41) ============

@yafshar (Contributor, PR author) commented on Oct 24, 2024

@regisss, would you please review this PR? If we want to move toward v4.46.0, this PR will help; there are Trainer updates for the new version.

@emascarenhas (Contributor) commented:
Please make the title more descriptive so it says what the PR is doing.

@yafshar changed the title from "Trainer" to "Update the Gaudi trainer with transformers 4.45.0" on Nov 5, 2024
@HuggingFaceDocBuilderDev commented:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@emascarenhas (Contributor) commented:
Nice! I'll run the nightly CI to check that there is no regression. Also, I'm not sure the _load_best_model method works with DeepSpeed and FSDP.

@regisss, do you have CI results to share? Was there any performance regression?

@yafshar (Contributor, PR author) commented on Dec 3, 2024

With the latest changes:

main branch

>>>  RUN_SLOW=true GAUDI2_CI=1 python -m pytest tests/test_examples.py -v -s --token=$HUGGING_FACE_HUB_TOKEN -k MultiCardCausalLanguageModelingLORAExampleTester

2 passed, 51 deselected in 2873.00s (0:47:52)

current PR

>>>  RUN_SLOW=true GAUDI2_CI=1 python -m pytest tests/test_examples.py -v -s --token=$HUGGING_FACE_HUB_TOKEN -k MultiCardCausalLanguageModelingLORAExampleTester

2 passed, 51 deselected in 2301.59s (0:38:21)

@jiminha jiminha requested review from ssarkar2 and libinta December 4, 2024 19:03
@regisss regisss merged commit e627a26 into huggingface:main Dec 9, 2024
4 checks passed
@yafshar yafshar deleted the trainer branch December 9, 2024 21:38
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 9, 2024
imangohari1 pushed a commit to imangohari1/optimum-habana that referenced this pull request Dec 10, 2024
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 10, 2024
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 11, 2024