Releases: huggingface/optimum-habana
v1.10.2: Patch release
v1.10: SDXL, Textual-Inversion, TRL, SynapseAI v1.14
SynapseAI v1.14
The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.14.0.
Stable Diffusion XL
SDXL is now supported and optimized for Gaudi.
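For example, inference can be run with the Gaudi-optimized SDXL pipeline roughly as follows (a minimal sketch; the checkpoint and Gaudi configuration names are illustrative):
```python
# Sketch: SDXL inference on Gaudi; checkpoint and gaudi_config names are assumptions.
from optimum.habana.diffusers import GaudiStableDiffusionXLPipeline

pipeline = GaudiStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed SDXL checkpoint
    use_habana=True,                             # run on HPU
    use_hpu_graphs=True,                         # capture HPU graphs for lower latency
    gaudi_config="Habana/stable-diffusion",      # assumed Gaudi config repo
)
images = pipeline(prompt="An astronaut riding a green horse").images
```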
Textual inversion fine-tuning
An example of textual-inversion fine-tuning has been added.
TRL
The 🤗 TRL library is now supported on Gaudi for performing DPO and SFT; a usage sketch follows the list below.
- Add TRL DPO and SFT support on Gaudi with an example #601
- Restructure example/trl/stack_llama_2 for generic DPO #635 @libinta
- Add DPO of TRL in README.md #652 @libinta
- Add seed in DPO to reproduce the training result #646 @sywangyi
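As a rough illustration of the DPO flow on Gaudi (class and argument names below are assumptions mirroring the library's usual Gaudi* wrappers; the examples/trl folder holds the authoritative interface):
```python
# Sketch only: GaudiDPOTrainer and the names below are assumptions; see examples/trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from optimum.habana import GaudiTrainingArguments
from optimum.habana.trl import GaudiDPOTrainer  # assumed import path

model_id = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Assumed dataset; DPO expects prompt/chosen/rejected columns, so a real run would map it first.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

training_args = GaudiTrainingArguments(
    output_dir="./dpo-out",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/llama",  # assumed Gaudi config repo
    bf16=True,
    seed=42,  # fixed seed for reproducible DPO runs (#646)
)

trainer = GaudiDPOTrainer(
    model,
    ref_model=None,          # TRL builds a frozen reference copy when None
    args=training_args,
    beta=0.1,                # DPO temperature
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```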
Full bf16 evaluation
Full bf16 evaluation inside the trainer can now be performed, as in Transformers; a usage sketch follows the list below.
- Adding support for bf16_full_eval #610 @bhargaveede
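A minimal sketch of turning it on through the trainer arguments (model and dataset wiring omitted; the Gaudi config name is illustrative):
```python
# Sketch: run the whole evaluation loop in bf16 on Gaudi.
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/bert-base-uncased",  # assumed Gaudi config repo
    bf16_full_eval=True,  # standard Transformers flag, now honored on HPU
)

# trainer = GaudiTrainer(model=model, args=training_args, eval_dataset=eval_dataset)
# metrics = trainer.evaluate()
```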
Text-generation pipeline
A text-generation pipeline fully optimized for Gaudi has been added.
- Text-Generation Pipeline Example #526 @sjagtap1803
Model optimizations
- Enhances llama performance by removing the 'cast_f32_to_bf16' operation #564 @kalyanjk
- Refactoring Llama attention and MLP layers #589 @bgoldberg-habana
- Support for FlashAttention in Llama2 #584 @wszczurekhabana
- Integrate Habana flash attention to Llama2-70B finetune #596 @mandy-li
- Enabling T5ForConditionalGeneration Inference using static shapes #425 @bhargaveede
- Avoid falcon perf drop from PR#607 when BS=1 @schoi-habana
- Enable fused rmsnorm in bf16 for llama #621 @puneeshkhanna
- Flash attention enhancement of repeatKV #626 @puneeshkhanna
- Update repeat KV llama logic for better TP-4 performance #639 @puneeshkhanna
- Falcon changes for v1.14.0 release #654 @schoi-habana
TGI
TGI on Gaudi has been moved to a dedicated repo: https://github.com/huggingface/tgi-gaudi
- Update tokenizer for tgi #572 @hsubramony
- Remove redundant requirements #575 @hsubramony
- Change next_token_chooser to HeterogeneousNextTokenChooser for TGI #574 @yeonsily
- Remove TGI folder from Optimum Habana #597 @regisss
Various fixes
- Fix messed up README for llama2-70b #571 @mandy-li
- Fix Diffusers tests #570 @ssarkar2
- Fix fp8 command in text-generation README #586 @regisss
- Fix wav2vec inference bug #588 @skaulintel
- Fix hash_with_views error #587 @bgoldberg-habana
- Add dataset disposal of b-mc2/sql-create-context for codegen and fix zero3 lora save issue #552 @sywangyi
- Fix gptj training issue #594 @BaihuiJin
- Fix DataLoaderDispatcher issue in Gaudi #600 @sywangyi
- Fix for Falcon error from PR #587 #608 @schoi-habana
- Falcon graph compilation error fix for when bs>1 #607 @regisss
- Fix crash if gaudi_config is not passed to GaudiTrainer #613 @sywangyi
- Fix flash attention output for llama for padded batched inputs #623 @puneeshkhanna
- Fix backward error in DDP when running reward model finetune in RLHF #507 @sywangyi
- Fix dpo graph compile error in evaluation #630 @sywangyi
- Fix error in run_image_classification.py #631 @regisss
- Fix RLHF llama rewarding modeling backward issue #612 @sywangyi
- Fix SD example so that custom bf16 ops can be used #642 @regisss
- Fix SD2 test #647 @regisss
- Fix typo in README #656 @yeonsily
- Fix error in PR#654 #661 @schoi-habana
- Fix compile error for torch.compile for llama #662 @jiminha
- Fix SDXL test #666 @regisss
Others
- Remove red crosses in model table #577 @regisss
- Misc changes for transformers tests #581 @ankurneog
- Remove delete_doc_comment workflows #582 @regisss
- Pin PEFT for the language-modeling example #591 @regisss
- Remove workarounds to have causal_mask in uint8 for GPT2, GPT-J and CodeGen #592 @regisss
- Change Synapse validated version in README #603 @regisss
- Dynamic prompt after refactor #543 @ssarkar2
- In peft, only the trainable parameters need to be saved #576 @sywangyi
- Add inheritance in Diffusers pipelines #611 @regisss
- Update generation config to enable flash attention for inference #609 @puneeshkhanna
- Remove setting of PT_HPU_LAZY_MODE=2 in training_args.py #625 @vivekgoe
- Remove hpu:X notation until fully supported by bridge #637 @hsubramony
- Add use_flash_attention to Llama2-70B finetuning command in README #640 @mandy-li
- Enable master_port selecting for DeepSpeed and MPI #641 @yangulei
- Enabling Graphs in Wav2Vec AC training #622 @bhargaveede
- Add changes to support FSDP #598 @vivekgoe
- Run Llama2 with torch.compile on Gaudi2 #616 @kausikmaiti
- Hqt #648 @bgoldberg-habana
v1.9: Llama2-70B, Falcon-180B, Mistral, fp8, SynapseAI v1.13
SynapseAI v1.13
The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.13.
Fine-tuning Llama2-70B, Falcon-180B and BLOOM-7B
Added examples for fine-tuning Llama2-70B and Falcon-180B on Gaudi2 and BLOOM-7B on first-gen Gaudi.
- Enable llama2-70b LoRA finetuning #527 @mandy-li
- Add Deepspeed zero3 configuration to run bloom-7b on Gaudi1 #487
- Enable Falcon 180B #537 @hlahkar
Llama2 fp8 inference
- Add llamav2 fp8 inference #542 @bgoldberg-habana
Mistral
Optimizations
- Remove GPTJ dma before mha #468 @BaihuiJin
- Enable llama attention softmax in bf16 #521 @schoi-habana
- Add load_meta_device option to reduce host RAM #529 @jiminha
- Improve llama performance and reduce memory consumption by updating sin/cos cache when inferring more than max position embeddings (4096) #532 @puneeshkhanna
- Add hash_with_views arg for Falcon inference perf #534 @schoi-habana
- Automate skip_hash_with_views for text generation with Falcon #544 @regisss
Improved text generation
- Allow multi prompts #479 @ssarkar2
- Growing bucket for beam #450 @ssarkar2
- Some models have extra inputs, pad them too #488 @ssarkar2
- Refactor run generation #523 @bgoldberg-habana
- Fix setting of reuse cache #553 @puneeshkhanna
- No need to unsqueeze input_id in prepare_inputs_for_generation #559 @sywangyi
- Adding lm eval script #541 @bgoldberg-habana
Support for Transformers v4.34 and Diffusers v0.23
This version has been validated for Transformers v4.34 and Diffusers v0.23.
- Upgrade to Transformers 4.34 #475 @regisss
- Upgrade to Diffusers 0.23 #516 @regisss
- Pin Diffusers #565 @regisss
TGI
Dynamic shape support
Habana Mixed Precision was removed in favor of Torch Autocast
- Remove HMP from optimum-habana #349 @jwieczorekhabana
Various fixes
- Fix for SegFault during FT #483 @MohitIntel
- Enable/disable gradient_checkpointing as per training_args.gradient_checkpointing value #484 @vivekgoe
- Fix split validation dataset problem #489 @mandy-li
- Fix validate dataset problem for openassistant-guanaco #498 @mandy-li
- Fix for Accelerate #500 @regisss
- Fix deepspeed init issue when using external launcher #497 @yuanwu2017
- Update Transformers dependency in setup.py #504 @regisss
- Fix token transmission in text-generation example #509 @regisss
- Merge LoRA model before initializing DS inference in text-generation example #515 @regisss
- Fix for Falcon-40b inference with deepspeed #502 @schoi-habana
- Fixing FusedSDPA recompute bug #512 @skaulintel
- Fixing update method - avoid copying idx to CPU, which splits the graph #524 @bgoldberg-habana
- Fix missing max_position_embeddings in model config in run_clm.py #530 @regisss
- Fix for attn_softmax_bf16 when generation_config is None #531 @schoi-habana
- Fix loading on meta device for PEFT models with DS-inference #528 @regisss
- Fix split by whitespaces not a single space #540 @oelayan7
- Fix stable diffusion pipelines #548 @regisss
- Update trainer.py #549 @skaulintel
- Add fallback for PEFT when the base model doesn't exist #557 @regisss
Others
- Update GaudiNIC multi-node-training dockerfile and setup #477 @yeonsily
- Adding ignore_eos flag to use in generation #469 @bhargaveede
- Add maximum hpugraphs and disable_tensor_cache arguments to GaudiTrainer #493 @skaulintel
- Update BridgeTower example #561 @regisss
- Remove mention of eager in README; set use_lazy_mode to true by default #486 @skaulintel
- Add another tokenizer to multilingual list #550 @ssarkar2
- Specify problem type for classification #551 @ssarkar2
The regression tests associated to this release are here: https://github.com/huggingface/optimum-habana/actions/runs/7085551714
v1.8.1: Patch release
Add a constraint on the Transformers dependency to make sure future versions are not installed.
Full Changelog: v1.8.0...v1.8.1
v1.8: BART, bucketing for text generation, SD upscaler, SynapseAI v1.12 and many model optimizations
BART for inference
- Enable BartForConditionalGeneration inference with greedy search #274 @bhargaveede
Bucketing for text generation
Stable Diffusion x4 upscaler
- StableDiffusion x4 upscaler #387 @estelleafl
SynapseAI v1.12
Various model optimizations
- Fix graph compilation error from Falcon when batch size>1 #356 @schoi-habana
- Add mpt optimization for gaudi #363 @sywangyi
- Improve MPT inference performance #377 @schoi-habana
- Allocate KV cache in contiguous memory for HPU performance #394 @puneeshkhanna
- Add support for attention softmax in BF16 such as for llama #396 @puneeshkhanna
- Add trim logit logic to reduce maximum memory usage for Llama inference #395 @BaihuiJin
- Skip hpugraph usage for first token to save memory #397 @polisettyvarma
- Llama inference: add reuse_cache to save memory #409 @regisss
- GPT2 contiguous fix #421 @ZhaiFeiyue
- Improve perf and memory usage with reuse cache by slicing inputs till token idx for 1st token generation #422 @puneeshkhanna
- GPT-J/NeoX contiguous #454 @BaihuiJin
TGI
- Fix gptj incorrect output issue in TGI #340 @sywangyi
- Enable hpu graph #330 @htang2012
- Upgrade to TGI v1.0.3 #373 @regisss
- Accelerate the inference when input prompt length changes in TGI #386 @sywangyi
- Support static shape in concatenate and filter in TGI #389 @sywangyi
- Fix bloom concatenate and filter issue #401 @sywangyi
- Fix error in logits process in hpu graph #404 @sywangyi
- Fix first token #408 @regisss
- Temporary fix in TGI for max total tokens #443 @hsubramony
Check min version in examples
A utility method was added to ensure that the version of Optimum Habana installed is recent enough to run the examples.
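For instance, an example script can fail fast on an outdated install (the helper name and version string below are assumptions; the examples themselves show the exact call):
```python
# Sketch: minimum-version guard at the top of an example script.
# The helper name is an assumption based on the library's utils module.
from optimum.habana.utils import check_optimum_habana_min_version

# Raises an error if the installed optimum-habana is older than required.
check_optimum_habana_min_version("1.8.0")
```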
Others
- Add support for autocast custom ops in GaudiTrainer #308 @regisss
- Add warmup arg and move stats printing to the end #390 @polisettyvarma
- Add a configurable max input tokens parameter #426 @puneeshkhanna
- Add transformers model tests for gaudi #427 @ankurneog
- Modify loraft llama falcon #415 @libinta
- Option to not crop in dataset run #444 @ssarkar2
- Enable auto tensor parallelism for Falcon #451 @mandy-li
Various fixes
- Fixes for streaming dataset mode #324 @MohitIntel
- Fix beam search output #360 @puneeshkhanna
- Fix DDP for LoRA #368 @sywangyi
- Load llama ckpt to meta to work around OOM issue on CPU #359 @mandy-li
- Fix gradient checkpointing in LoRA example #398 @regisss
- No need to wrap DDP when using Fast DDP #430 @ikurtchen
- Fix falcon-40b error when DeepSpeed enabled #434 @schoi-habana
- Revert "Fix T5 DeepSpeed ZeRO-3 (#393)" #466 @sywangyi
Regression tests for this release are available here: https://github.com/huggingface/optimum-habana/actions/runs/6580186897
v1.7.5: Patch release
Fix a bug due to a changed import in Diffusers.
Full Changelog: v1.7.4...v1.7.5
v1.7.4: Patch release
Fix a bug where DeepSpeed ZeRO-3 was not working.
Full Changelog: v1.7.3...v1.7.4
v1.7.2: Patch release
Upgrade to Accelerate v0.22.0 to fix a bug with distributed runs.
Full Changelog: v1.7.1...v1.7.2
v1.7.1: Patch release
Upgrade to Transformers v4.32.0 to fix a bug with Llama.
Full Changelog: v1.7.0...v1.7.1
v1.7: Llama 2, Falcon, LoRA, Transformers v4.31, SynapseAI v1.11
Transformers v4.31
Transformers v4.31 (latest stable release) is fully supported.
SynapseAI v1.11
SynapseAI v1.11 (latest stable release) is fully supported.
Optimizations for Llama 2, Falcon, StarCoder, OPT, GPT-NeoX, CodeGen
- Added support for OPT-66B #285 @ZhaiFeiyue
- Llama #296 @yeonsily
- Improve Llama2 and gpt_neox performance with Habana fused RoPE and RMSNorm #321 @mandy-li
- Enable Falcon-7b #326 @schoi-habana
- Fix inference with Llama-2-70B #342 @regisss
- Add model optimizations for codegen and gpt_bigcode #322 @PhillipHoward
Torch Autocast
Torch Autocast is becoming the default for managing mixed-precision runs; a usage sketch follows the list below.
- Fix autocast for BERT-like models #287 @ANSHUMAN87
- Add support for autocast in gradient checkpointing #307 @regisss
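A minimal sketch of what this looks like outside the trainer, assuming the Habana PyTorch bridge is installed so the "hpu" device and its autocast backend are available:
```python
# Sketch: bf16 mixed precision on HPU via Torch Autocast (replacing Habana Mixed Precision).
import torch
import habana_frameworks.torch.core as htcore  # noqa: F401  loads the HPU backend (assumed installed)

model = torch.nn.Linear(16, 16).to("hpu")
x = torch.randn(4, 16).to("hpu")

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    y = model(x)  # matmuls run in bf16 under autocast

print(y.dtype)  # expected: torch.bfloat16
```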
Improved text-generation example
- Added constrained beam search #281 @vivekgoe
- Fix padding error #282 @sywangyi
- Various improvements for faster checkpoint downloading #284 #286 #294 @regisss
- Add deepspeed TP policy for llama #303 @sywangyi
- Add token and model_revision args for the text-generation example #331 @regisss
LoRA examples
Two new LoRA examples have been added for fine-tuning and inference.
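The fine-tuning side boils down to wrapping the base model with PEFT before handing it to the trainer; a rough sketch (checkpoint, target modules, and hyper-parameters are illustrative, not the example's exact settings):
```python
# Sketch: LoRA adapter setup with PEFT; values below are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model is then trained with GaudiTrainer as in the language-modeling examples.
```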
LDM3D
A new Stable Diffusion pipeline that enables generating both images and depth maps; see the sketch after the list below.
- Support for Ldm3d #304 @estelleafl
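A rough usage sketch, assuming the pipeline follows the library's Gaudi* naming and the upstream Diffusers LDM3D output (rgb and depth fields):
```python
# Sketch: LDM3D on Gaudi; class, checkpoint and gaudi_config names are assumptions.
from optimum.habana.diffusers import GaudiStableDiffusionLDM3DPipeline

pipeline = GaudiStableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-4c",                        # assumed LDM3D checkpoint
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",  # assumed Gaudi config repo
)
output = pipeline(prompt="A castle surrounded by a moat at sunset")
rgb_image, depth_map = output.rgb[0], output.depth[0]  # image and depth map for the prompt
```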
Added support for Text Generation Inference (TGI)
TGI is now supported on Gaudi.
GaudiGenerationConfig
Transformers' GenerationConfig has been extended to be fully compatible with Gaudi. It adds two fields to better control generation with static shapes.
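A sketch of how the extended config is meant to be used (the two field names below, static_shapes and ignore_eos, are assumptions; the class docstring is authoritative):
```python
# Sketch: Gaudi-specific generation config; import path and field names are assumptions.
from optimum.habana.transformers.generation import GaudiGenerationConfig

generation_config = GaudiGenerationConfig(
    max_new_tokens=128,
    static_shapes=True,  # pad inputs so graphs are compiled once and reused
    ignore_eos=True,     # generate up to max_new_tokens for stable latency with static shapes
)

# outputs = model.generate(**inputs, generation_config=generation_config)
```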
Various fixes and improvements
- Fix generation sampling when using repetition_penalty #301 @sywangyi
- Remove KV cache workaround #302 @ZhaiFeiyue
- Fix T5 inference performance regression #310 @libinta
- Fix gptj HCCL issue occurring in DDP #318 @sywangyi
- Revert partially Enable/Optimize flan t5 xxl on deepspeed z3 #320 @hsubramony
- Modify flan-t5 deepspeed configuration #328 @yeonsily
- Add commands for gptj and gptneox #325 @ankurhabana
- Disable FusedRMSNorm for training #343 @hsubramony
- Enable hpu rms fused kernel for t5 #344 @ZhaiFeiyue
- Remove two workarounds on esmfold #334 @bzhu-habana