v1.8: BART, bucketing for text generation, SD upscaler, SynapseAI v1.12 and many model optimizations
BART for inference
- Enable BartForConditionalGeneration inference with greedy search #274 @bhargaveede
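Greedy search simply picks the highest-scoring token at each decoding step. A minimal, model-free sketch of that loop (the `next_token_logits` callable and the toy vocabulary are hypothetical stand-ins for a real BART forward pass, not the library's API):

```python
def greedy_decode(next_token_logits, bos_id, eos_id, max_len=10):
    """Greedy search: at each step, append the argmax token.

    next_token_logits: callable mapping the current sequence to a list of
    per-token scores (a stand-in for a model forward pass).
    """
    seq = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(seq)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_id)
        if next_id == eos_id:
            break
    return seq

# Toy "model": always favors token (last_token + 1), capped at EOS (id 3).
toy = lambda seq: [1.0 if i == min(seq[-1] + 1, 3) else 0.0 for i in range(4)]
print(greedy_decode(toy, bos_id=0, eos_id=3))  # [0, 1, 2, 3]
```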
Bucketing for text generation
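Bucketing pads variable-length inputs up to a small set of fixed lengths, so a graph is compiled once per bucket instead of once per distinct input length. A minimal sketch of the idea (the bucket sizes and padding token here are illustrative, not the library's actual defaults):

```python
def bucket_size(length, buckets=(32, 64, 128, 256)):
    """Return the smallest bucket that fits `length`."""
    for b in buckets:
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")

def pad_to_bucket(token_ids, pad_id=0, buckets=(32, 64, 128, 256)):
    """Right-pad a sequence of token ids up to its bucket size."""
    target = bucket_size(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))

padded = pad_to_bucket(list(range(40)))
print(len(padded))  # 64: a 40-token prompt falls into the 64-token bucket
```

With bucketing, any prompt between 33 and 64 tokens reuses the same compiled shape, which avoids expensive recompilations during generation.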
Stable Diffusion x4 upscaler
- StableDiffusion x4 upscaler #387 @estelleafl
SynapseAI v1.12
Various model optimizations
- Fix graph compilation error from Falcon when batch size > 1 #356 @schoi-habana
- Add MPT optimization for Gaudi #363 @sywangyi
- Improve MPT inference performance #377 @schoi-habana
- Allocate KV cache in contiguous memory for HPU performance #394 @puneeshkhanna
- Add support for attention softmax in BF16 such as for llama #396 @puneeshkhanna
- Add trim logit logic to reduce maximum memory usage for Llama inference #395 @BaihuiJin
- Skip HPU graph usage for first token to save memory #397 @polisettyvarma
- Llama inference: add reuse_cache to save memory #409 @regisss
- GPT2 contiguous fix #421 @ZhaiFeiyue
- Improve performance and memory usage with reuse_cache by slicing inputs up to the token index for first-token generation #422 @puneeshkhanna
- GPT-J/NeoX contiguous #454 @BaihuiJin
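Several of the items above (#394, #395, #422) share a pattern: preallocate the KV cache as one contiguous buffer and write each step's keys/values into a slice in place, rather than concatenating tensors every step. A rough NumPy sketch of that pattern (shapes and names are illustrative, not the actual implementation):

```python
import numpy as np

batch, heads, max_len, head_dim = 2, 4, 16, 8

# Preallocate one contiguous buffer covering the whole generation length.
k_cache = np.zeros((batch, heads, max_len, head_dim), dtype=np.float32)

def write_step(cache, step, new_k):
    """Write one step's keys into the preallocated cache in place."""
    cache[:, :, step, :] = new_k
    return cache

for step in range(3):
    write_step(k_cache, step, np.ones((batch, heads, head_dim)))

print(k_cache.flags['C_CONTIGUOUS'])  # True: no per-step reallocation
```

Growing the cache with a concatenation per token would instead allocate and copy a new buffer every step, which hurts both memory usage and performance on the HPU.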
TGI
- Fix GPT-J incorrect output issue in TGI #340 @sywangyi
- Enable HPU graph #330 @htang2012
- Upgrade to TGI v1.0.3 #373 @regisss
- Accelerate the inference when input prompt length changes in TGI #386 @sywangyi
- Support static shape in concatenate and filter in TGI #389 @sywangyi
- Fix BLOOM concatenate and filter issue #401 @sywangyi
- Fix error in logits processing in HPU graph #404 @sywangyi
- Fix first token #408 @regisss
- Temporary fix in TGI for max total tokens #443 @hsubramony
Check min version in examples
A utility method was added to ensure that the latest version of Optimum Habana is installed before running the examples.
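A sketch of what such a check can look like (the function name and version strings are illustrative, not the actual utility added in this release):

```python
def check_min_version(installed, required):
    """Raise if `installed` is older than `required` (dotted numeric versions)."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    if parse(installed) < parse(required):
        raise ImportError(
            f"This example requires version {required} or newer, "
            f"but version {installed} is installed."
        )

check_min_version("1.8.0", "1.8.0")  # passes silently
```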
Others
- Add support for autocast custom ops in GaudiTrainer #308 @regisss
- Add warmup arg and move stats printing to the end #390 @polisettyvarma
- Add a configurable max input tokens parameter #426 @puneeshkhanna
- Add transformers model tests for gaudi #427 @ankurneog
- Modify LoRA fine-tuning for Llama and Falcon #415 @libinta
- Option to not crop in dataset run #444 @ssarkar2
- Enable auto tensor parallelism for Falcon #451 @mandy-li
Various fixes
- Fixes for streaming dataset mode #324 @MohitIntel
- Fix beam search output #360 @puneeshkhanna
- Fix DDP for LoRA #368 @sywangyi
- Load Llama checkpoint to meta device to work around CPU OOM issue #359 @mandy-li
- Fix gradient checkpointing in LoRA example #398 @regisss
- No need to wrap DDP when using Fast DDP #430 @ikurtchen
- Fix falcon-40b error when DeepSpeed enabled #434 @schoi-habana
- Revert "Fix T5 DeepSpeed ZeRO-3 (#393)" #466 @sywangyi
Regression tests for this release are available here: https://github.com/huggingface/optimum-habana/actions/runs/6580186897