Releases: huggingface/optimum-habana
v1.10.2: Patch release
v1.10: SDXL, Textual-Inversion, TRL, SynapseAI v1.14
SynapseAI v1.14
The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.14.0.
Stable Diffusion XL
SDXL is now supported and optimized for Gaudi.
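For example, inference can be run with the Gaudi-optimized SDXL pipeline roughly as follows (a minimal sketch; the checkpoint and Gaudi configuration names are illustrative):
```python
# Sketch: SDXL inference on Gaudi; checkpoint and gaudi_config names are assumptions.
from optimum.habana.diffusers import GaudiStableDiffusionXLPipeline

pipeline = GaudiStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed SDXL checkpoint
    use_habana=True,                             # run on HPU
    use_hpu_graphs=True,                         # capture HPU graphs for lower latency
    gaudi_config="Habana/stable-diffusion",      # assumed Gaudi config repo
)
images = pipeline(prompt="An astronaut riding a green horse").images
```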
Textual inversion fine-tuning
An example of textual-inversion fine-tuning has been added.
TRL
The 🤗 TRL library is now supported on Gaudi for performing DPO and SFT; a usage sketch follows the list below.
- Add TRL DPO and SFT support on Gaudi with an example #601
- Restructure example/trl/stack_llama_2 for generic DPO #635 @libinta
- Add DPO of TRL in README.md #652 @libinta
- Add seed in DPO to reproduce the training result #646 @sywangyi
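As a rough illustration of the DPO flow on Gaudi (class and argument names below are assumptions mirroring the library's usual Gaudi* wrappers; the examples/trl folder holds the authoritative interface):
```python
# Sketch only: GaudiDPOTrainer and the names below are assumptions; see examples/trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from optimum.habana import GaudiTrainingArguments
from optimum.habana.trl import GaudiDPOTrainer  # assumed import path

model_id = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Assumed dataset; DPO expects prompt/chosen/rejected columns, so a real run would map it first.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

training_args = GaudiTrainingArguments(
    output_dir="./dpo-out",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/llama",  # assumed Gaudi config repo
    bf16=True,
    seed=42,  # fixed seed for reproducible DPO runs (#646)
)

trainer = GaudiDPOTrainer(
    model,
    ref_model=None,          # TRL builds a frozen reference copy when None
    args=training_args,
    beta=0.1,                # DPO temperature
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```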
Full bf16 evaluation
Full bf16 evaluation inside the trainer can now be performed, as in Transformers; a usage sketch follows the list below.
- Adding support for bf16_full_eval #610 @bhargaveede
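A minimal sketch of turning it on through the trainer arguments (model and dataset wiring omitted; the Gaudi config name is illustrative):
```python
# Sketch: run the whole evaluation loop in bf16 on Gaudi.
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/bert-base-uncased",  # assumed Gaudi config repo
    bf16_full_eval=True,  # standard Transformers flag, now honored on HPU
)

# trainer = GaudiTrainer(model=model, args=training_args, eval_dataset=eval_dataset)
# metrics = trainer.evaluate()
```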
Text-generation pipeline
A text-generation pipeline fully optimized for Gaudi has been added.
- Text-Generation Pipeline Example #526 @sjagtap1803
Model optimizations
- Enhances llama performance by removing the 'cast_f32_to_bf16' operation #564 @kalyanjk
- Refactoring Llama attention and MLP layers #589 @bgoldberg-habana
- Support for FlashAttention in Llama2 #584 @wszczurekhabana
- Integrate Habana flash attention to Llama2-70B finetune #596 @mandy-li
- Enabling T5ForConditionalGeneration Inference using static shapes #425 @bhargaveede
- Avoid falcon perf drop from PR#607 when BS=1 @schoi-habana
- Enable fused rmsnorm in bf16 for llama #621 @puneeshkhanna
- Flash attention enhancement of repeatKV #626 @puneeshkhanna
- Update repeat KV llama logic for better TP-4 performance #639 @puneeshkhanna
- Falcon changes for v1.14.0 release #654 @schoi-habana
TGI
TGI on Gaudi has been moved to a dedicated repo: https://github.com/huggingface/tgi-gaudi
- Update tokenizer for tgi #572 @hsubramony
- Remove redundant requirements #575 @hsubramony
- Change next_token_chooser to HeterogeneousNextTokenChooser for TGI #574 @yeonsily
- Remove TGI folder from Optimum Habana #597 @regisss
Various fixes
- Fix messed up README for llama2-70b #571 @mandy-li
- Fix Diffusers tests #570 @ssarkar2
- Fix fp8 command in text-generation README #586 @regisss
- Fix wav2vec inference bug #588 @skaulintel
- Fix hash_with_views error #587 @bgoldberg-habana
- Add dataset disposal of b-mc2/sql-create-context for codegen and fix zero3 lora save issue #552 @sywangyi
- Fix gptj training issue #594 @BaihuiJin
- Fix DataLoaderDispatcher issue in Gaudi #600 @sywangyi
- Fix for Falcon error from PR #587 #608 @schoi-habana
- Falcon graph compilation error fix for when bs>1 #607 @regisss
- Fix crash if gaudi_config is not passed to GaudiTrainer #613 @sywangyi
- Fix flash attention output for llama for padded batched inputs #623 @puneeshkhanna
- Fix backward error in DDP when running reward model finetune in RLHF #507 @sywangyi
- Fix dpo graph compile error in evaluation #630 @sywangyi
- Fix error in run_image_classification.py #631 @regisss
- Fix RLHF llama rewarding modeling backward issue #612 @sywangyi
- Fix SD example so that custom bf16 ops can be used #642 @regisss
- Fix SD2 test #647 @regisss
- Fix typo in README #656 @yeonsily
- Fix error in PR#654 #661 @schoi-habana
- Fix compile error for torch.compile for llama #662 @jiminha
- Fix SDXL test #666 @regisss
Others
- Remove red crosses in model table #577 @regisss
- Misc changes for transformers tests #581 @ankurneog
- Remove delete_doc_comment workflows #582 @regisss
- Pin PEFT for the language-modeling example #591 @regisss
- Remove workarounds to have causal_mask in uint8 for GPT2, GPT-J and CodeGen #592 @regisss
- Change Synapse validated version in README #603 @regisss
- Dynamic prompt after refactor #543 @ssarkar2
- In peft, only the trainable parameters need to be saved #576 @sywangyi
- Add inheritance in Diffusers pipelines #611 @regisss
- Update generation config to enable flash attention for inference #609 @puneeshkhanna
- Remove setting of PT_HPU_LAZY_MODE=2 in training_args.py #625 @vivekgoe
- Remove hpu:X notation until fully supported by bridge #637 @hsubramony
- Add use_flash_attention to Llama2-70B finetuning command in README #640 @mandy-li
- Enable master_port selecting for DeepSpeed and MPI #641 @yangulei
- Enabling Graphs in Wav2Vec AC training #622 @bhargaveede
- Add changes to support FSDP #598 @vivekgoe
- Run Llama2 with torch.compile on Gaudi2 #616 @kausikmaiti
- Hqt #648 @bgoldberg-habana
v1.9: Llama2-70B, Falcon-180B, Mistral, fp8, SynapseAI v1.13
SynapseAI v1.13
The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.13.
Fine-tuning Llama2-70B, Falcon-180B and BLOOM-7B
Added examples for fine-tuning Llama2-70B and Falcon-180B on Gaudi2 and BLOOM-7B on first-gen Gaudi.
- Enable llama2-70b LoRA finetuning #527 @mandy-li
- Add Deepspeed zero3 configuration to run bloom-7b on Gaudi1 #487
- Enable Falcon 180B #537 @hlahkar
Llama2 fp8 inference
- Add llamav2 fp8 inference #542 @bgoldberg-habana
Mistral
Optimizations
- Remove GPTJ dma before mha #468 @BaihuiJin
- Enable llama attention softmax in bf16 #521 @schoi-habana
- Add load_meta_device option to reduce host RAM #529 @jiminha
- Improve llama performance and reduce memory consumption by updating sin/cos cache when inferring more than max position embeddings (4096) #532 @puneeshkhanna
- Add hash_with_views arg for Falcon inference perf #534 @schoi-habana
- Automate skip_hash_with_views for text generation with Falcon #544 @regisss
Improved text generation
- Allow multi prompts #479 @ssarkar2
- Growing bucket for beam #450 @ssarkar2
- Some models have extra inputs, pad them too #488 @ssarkar2
- Refactor run generation #523 @bgoldberg-habana
- Fix setting of reuse cache #553 @puneeshkhanna
- No need to unsqueeze input_id in prepare_inputs_for_generation #559 @sywangyi
- Adding lm eval script #541 @bgoldberg-habana
Support for Transformers v4.34 and Diffusers v0.23
This version has been validated for Transformers v4.34 and Diffusers v0.23.
- Upgrade to Transformers 4.34 #475 @regisss
- Upgrade to Diffusers 0.23 #516 @regisss
- Pin Diffusers #565 @regisss
TGI
Dynamic shape support
Habana Mixed Precision was removed in favor of Torch Autocast
- Remove HMP from optimum-habana #349 @jwieczorekhabana
Various fixes
- Fix for SegFault during FT #483 @MohitIntel
- Enable/disable gradient_checkpointing as per training_args.gradient_checkpointing value #484 @vivekgoe
- Fix split validation dataset problem #489 @mandy-li
- Fix validate dataset problem for openassistant-guanaco #498 @mandy-li
- Fix for Accelerate #500 @regisss
- Fix deepspeed init issue when using external launcher #497 @yuanwu2017
- Update Transformers dependency in setup.py #504 @regisss
- Fix token transmission in text-generation example #509 @regisss
- Merge LoRA model before initializing DS inference in text-generation example #515 @regisss
- Fix for Falcon-40b inference with deepspeed #502 @schoi-habana
- Fixing FusedSDPA recompute bug #512 @skaulintel
- Fixing update method - avoid copying idx to CPU, which splits the graph #524 @bgoldberg-habana
- Fix missing max_position_embeddings in model config in run_clm.py #530 @regisss
- Fix for attn_softmax_bf16 when generation_config is None #531 @schoi-habana
- Fix loading on meta device for PEFT models with DS-inference #528 @regisss
- Fix split by whitespaces not a single space #540 @oelayan7
- Fix stable diffusion pipelines #548 @regisss
- Update trainer.py #549 @skaulintel
- Add fallback for PEFT when the base model doesn't exist #557 @regisss
Others
- Update GaudiNIC multi-node-training dockerfile and setup #477 @yeonsily
- Adding ignore_eos flag to use in generation #469 @bhargaveede
- Add maximum hpugraphs and disable_tensor_cache arguments to GaudiTrainer #493 @skaulintel
- Update BridgeTower example #561 @regisss
- Remove mention of eager in README; set use_lazy_mode to true by default #486 @skaulintel
- Add another tokenizer to multilingual list #550 @ssarkar2
- Specify problem type for classification #551 @ssarkar2
The regression tests associated to this release are here: https://github.com/huggingface/optimum-habana/actions/runs/7085551714
v1.8.1: Patch release
Add a constraint on the Transformers dependency to make sure future versions are not installed.
Full Changelog: v1.8.0...v1.8.1
v1.8: BART, bucketing for text generation, SD upscaler, SynapseAI v1.12 and many model optimizations
BART for inference
- Enable BartForConditionalGeneration inference with greedy search #274 @bhargaveede
Bucketing for text generation
Stable Diffusion x4 upscaler
- StableDiffusion x4 upscaler #387 @estelleafl
SynapseAI v1.12
Various model optimizations
- Fix graph compilation error from Falcon when batch size>1 #356 @schoi-habana
- Add mpt optimization for gaudi #363 @sywangyi
- Improve MPT inference performance #377 @schoi-habana
- Allocate KV cache in contiguous memory for HPU performance #394 @puneeshkhanna
- Add support for attention softmax in BF16 such as for llama #396 @puneeshkhanna
- Add trim logit logic to reduce maximum memory usage for Llama inference #395 @BaihuiJin
- Skip hpugraph usage for first token to save memory #397 @polisettyvarma
- Llama inference: add reuse_cache to save memory #409 @regisss
- GPT2 contiguous fix #421 @ZhaiFeiyue
- Improve perf and memory usage with reuse cache by slicing inputs till token idx for 1st token generation #422 @puneeshkhanna
- GPT-J/NeoX contiguous #454 @BaihuiJin
TGI
- Fix gptj incorrect output issue in TGI #340 @sywangyi
- Enable hpu graph #330 @htang2012
- Upgrade to TGI v1.0.3 #373 @regisss
- Accelerate the inference when input prompt length changes in TGI #386 @sywangyi
- Support static shape in concatenate and filter in TGI #389 @sywangyi
- Fix bloom concatenate and filter issue #401 @sywangyi
- Fix error in logits process in hpu graph #404 @sywangyi
- Fix first token #408 @regisss
- Temporary fix in TGI for max total tokens #443 @hsubramony
Check min version in examples
A utility method was added to ensure that the version of Optimum Habana installed is recent enough to run the examples.
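For instance, an example script can fail fast on an outdated install (the helper name and version string below are assumptions; the examples themselves show the exact call):
```python
# Sketch: minimum-version guard at the top of an example script.
# The helper name is an assumption based on the library's utils module.
from optimum.habana.utils import check_optimum_habana_min_version

# Raises an error if the installed optimum-habana is older than required.
check_optimum_habana_min_version("1.8.0")
```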
Others
- Add support for autocast custom ops in GaudiTrainer #308 @regisss
- Add warmup arg and move stats printing to the end #390 @polisettyvarma
- Add a configurable max input tokens parameter #426 @puneeshkhanna
- Add transformers model tests for gaudi #427 @ankurneog
- Modify loraft llama falcon #415 @libinta
- Option to not crop in dataset run #444 @ssarkar2
- Enable auto tensor parallelism for Falcon #451 @mandy-li
Various fixes
- Fixes for streaming dataset mode #324 @MohitIntel
- Fix beam search output #360 @puneeshkhanna
- Fix DDP for LoRA #368 @sywangyi
- Load llama ckpt to meta to work around OOM issue on CPU #359 @mandy-li
- Fix gradient checkpointing in LoRA example #398 @regisss
- No need to wrap DDP when using Fast DDP #430 @ikurtchen
- Fix falcon-40b error when DeepSpeed enabled #434 @schoi-habana
- Revert "Fix T5 DeepSpeed ZeRO-3 (#393)" #466 @sywangyi
Regression tests for this release are available here: https://github.com/huggingface/optimum-habana/actions/runs/6580186897
v1.7.5: Patch release
Fix a bug due to a changed import in Diffusers.
Full Changelog: v1.7.4...v1.7.5
v1.7.4: Patch release
Fix a bug where DeepSpeed ZeRO-3 was not working.
Full Changelog: v1.7.3...v1.7.4
v1.7.2: Patch release
Upgrade to Accelerate v0.22.0 to fix a bug with distributed runs.
Full Changelog: v1.7.1...v1.7.2
v1.7.1: Patch release
Upgrade to Transformers v4.32.0 to fix a bug with Llama.
Full Changelog: v1.7.0...v1.7.1
v1.7: Llama 2, Falcon, LoRA, Transformers v4.31, SynapseAI v1.11
Transformers v4.31
Transformers v4.31 (latest stable release) is fully supported.
SynapseAI v1.11
SynapseAI v1.11 (latest stable release) is fully supported.
Optimizations for Llama 2, Falcon, StarCoder, OPT, GPT-NeoX, CodeGen
- Added support for OPT-66B #285 @ZhaiFeiyue
- Llama #296 @yeonsily
- Improve Llama2 and gpt_neox performance with Habana fused RoPE and RMSNorm #321 @mandy-li
- Enable Falcon-7b #326 @schoi-habana
- Fix inference with Llama-2-70B #342 @regisss
- Add model optimizations for codegen and gpt_bigcode #322 @PhillipHoward
Torch Autocast
Torch Autocast is becoming the default for managing mixed-precision runs; a usage sketch follows the list below.
- Fix autocast for BERT-like models #287 @ANSHUMAN87
- Add support for autocast in gradient checkpointing #307 @regisss
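A minimal sketch of what this looks like outside the trainer, assuming the Habana PyTorch bridge is installed so the "hpu" device and its autocast backend are available:
```python
# Sketch: bf16 mixed precision on HPU via Torch Autocast (replacing Habana Mixed Precision).
import torch
import habana_frameworks.torch.core as htcore  # noqa: F401  loads the HPU backend (assumed installed)

model = torch.nn.Linear(16, 16).to("hpu")
x = torch.randn(4, 16).to("hpu")

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    y = model(x)  # matmuls run in bf16 under autocast

print(y.dtype)  # expected: torch.bfloat16
```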
Improved text-generation example
- Added constrained beam search #281 @vivekgoe
- Fix padding error #282 @sywangyi
- Various improvements for faster checkpoint downloading #284 #286 #294 @regisss
- Add deepspeed TP policy for llama #303 @sywangyi
- Add token and model_revision args for the text-generation example #331 @regisss
LoRA examples
Two new LoRA examples have been added for fine-tuning and inference.
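The fine-tuning side boils down to wrapping the base model with PEFT before handing it to the trainer; a rough sketch (checkpoint, target modules, and hyper-parameters are illustrative, not the example's exact settings):
```python
# Sketch: LoRA adapter setup with PEFT; values below are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model is then trained with GaudiTrainer as in the language-modeling examples.
```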
LDM3D
A new Stable Diffusion pipeline that enables generating both images and depth maps; see the sketch after the list below.
- Support for Ldm3d #304 @estelleafl
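A rough usage sketch, assuming the pipeline follows the library's Gaudi* naming and the upstream Diffusers LDM3D output (rgb and depth fields):
```python
# Sketch: LDM3D on Gaudi; class, checkpoint and gaudi_config names are assumptions.
from optimum.habana.diffusers import GaudiStableDiffusionLDM3DPipeline

pipeline = GaudiStableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-4c",                        # assumed LDM3D checkpoint
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",  # assumed Gaudi config repo
)
output = pipeline(prompt="A castle surrounded by a moat at sunset")
rgb_image, depth_map = output.rgb[0], output.depth[0]  # image and depth map for the prompt
```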
Added support for Text Generation Inference (TGI)
TGI is now supported on Gaudi.
GaudiGenerationConfig
Transformers' GenerationConfig has been extended to be fully compatible with Gaudi. It adds two fields to better control generation with static shapes.
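A sketch of how the extended config is meant to be used (the two field names below, static_shapes and ignore_eos, are assumptions; the class docstring is authoritative):
```python
# Sketch: Gaudi-specific generation config; import path and field names are assumptions.
from optimum.habana.transformers.generation import GaudiGenerationConfig

generation_config = GaudiGenerationConfig(
    max_new_tokens=128,
    static_shapes=True,  # pad inputs so graphs are compiled once and reused
    ignore_eos=True,     # generate up to max_new_tokens for stable latency with static shapes
)

# outputs = model.generate(**inputs, generation_config=generation_config)
```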
Various fixes and improvements
- Fix generation sampling when using repetition_penalty #301 @sywangyi
- Remove KV cache workaround #302 @ZhaiFeiyue
- Fix T5 inference performance regression #310 @libinta
- Fix gptj HCCL issue occurring in DDP #318 @sywangyi
- Revert partially Enable/Optimize flan t5 xxl on deepspeed z3 #320 @hsubramony
- Modify flan-t5 deepspeed configuration #328 @yeonsily
- Add commands for gptj and gptneox #325 @ankurhabana
- Disable FusedRMSNorm for training #343 @hsubramony
- Enable hpu rms fused kernel for t5 #344 @ZhaiFeiyue
- Remove two workarounds on esmfold #334 @bzhu-habana