v1.8: BART, bucketing for text generation, SD upscaler, SynapseAI v1.12 and many model optimizations
BART for inference
- Enable BartForConditionalGeneration inference with greedy search #274 @bhargaveede
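Greedy search simply picks the highest-scoring token at each decoding step. A minimal, model-free sketch of that loop (the `next_token_logits` callable and the toy vocabulary are hypothetical stand-ins for a real BART forward pass, not the library's API):

```python
def greedy_decode(next_token_logits, bos_id, eos_id, max_len=10):
    """Greedy search: at each step, append the argmax token.

    next_token_logits: callable mapping the current sequence to a list of
    per-token scores (a stand-in for a model forward pass).
    """
    seq = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(seq)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_id)
        if next_id == eos_id:
            break
    return seq

# Toy "model": always favors token (last_token + 1), capped at EOS (id 3).
toy = lambda seq: [1.0 if i == min(seq[-1] + 1, 3) else 0.0 for i in range(4)]
print(greedy_decode(toy, bos_id=0, eos_id=3))  # [0, 1, 2, 3]
```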
Bucketing for text generation
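Bucketing pads variable-length inputs up to a small set of fixed lengths, so a graph is compiled once per bucket instead of once per distinct input length. A minimal sketch of the idea (the bucket sizes and padding token here are illustrative, not the library's actual defaults):

```python
def bucket_size(length, buckets=(32, 64, 128, 256)):
    """Return the smallest bucket that fits `length`."""
    for b in buckets:
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")

def pad_to_bucket(token_ids, pad_id=0, buckets=(32, 64, 128, 256)):
    """Right-pad a sequence of token ids up to its bucket size."""
    target = bucket_size(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))

padded = pad_to_bucket(list(range(40)))
print(len(padded))  # 64: a 40-token prompt falls into the 64-token bucket
```

With bucketing, any prompt between 33 and 64 tokens reuses the same compiled shape, which avoids expensive recompilations during generation.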
Stable Diffusion x4 upscaler
- StableDiffusion x4 upscaler #387 @estelleafl
SynapseAI v1.12
Various model optimizations
- Fix graph compilation error from Falcon when batch size > 1 #356 @schoi-habana
- Add MPT optimization for Gaudi #363 @sywangyi
- Improve MPT inference performance #377 @schoi-habana
- Allocate KV cache in contiguous memory for HPU performance #394 @puneeshkhanna
- Add support for attention softmax in BF16 such as for llama #396 @puneeshkhanna
- Add trim logit logic to reduce maximum memory usage for Llama inference #395 @BaihuiJin
- Skip HPU graph usage for first token to save memory #397 @polisettyvarma
- Llama inference: add reuse_cache to save memory #409 @regisss
- GPT2 contiguous fix #421 @ZhaiFeiyue
- Improve performance and memory usage with reuse_cache by slicing inputs up to the token index for first-token generation #422 @puneeshkhanna
- GPT-J/NeoX contiguous #454 @BaihuiJin
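Several of the items above (#394, #395, #422) share a pattern: preallocate the KV cache as one contiguous buffer and write each step's keys/values into a slice in place, rather than concatenating tensors every step. A rough NumPy sketch of that pattern (shapes and names are illustrative, not the actual implementation):

```python
import numpy as np

batch, heads, max_len, head_dim = 2, 4, 16, 8

# Preallocate one contiguous buffer covering the whole generation length.
k_cache = np.zeros((batch, heads, max_len, head_dim), dtype=np.float32)

def write_step(cache, step, new_k):
    """Write one step's keys into the preallocated cache in place."""
    cache[:, :, step, :] = new_k
    return cache

for step in range(3):
    write_step(k_cache, step, np.ones((batch, heads, head_dim)))

print(k_cache.flags['C_CONTIGUOUS'])  # True: no per-step reallocation
```

Growing the cache with a concatenation per token would instead allocate and copy a new buffer every step, which hurts both memory usage and performance on the HPU.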
TGI
- Fix GPT-J incorrect output issue in TGI #340 @sywangyi
- Enable HPU graph #330 @htang2012
- Upgrade to TGI v1.0.3 #373 @regisss
- Accelerate the inference when input prompt length changes in TGI #386 @sywangyi
- Support static shape in concatenate and filter in TGI #389 @sywangyi
- Fix BLOOM concatenate and filter issue #401 @sywangyi
- Fix error in logits processing in HPU graph #404 @sywangyi
- Fix first token #408 @regisss
- Temporary fix in TGI for max total tokens #443 @hsubramony
Check min version in examples
A utility method was added to ensure that the latest version of Optimum Habana is installed before running the examples.
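A sketch of what such a check can look like (the function name and version strings are illustrative, not the actual utility added in this release):

```python
def check_min_version(installed, required):
    """Raise if `installed` is older than `required` (dotted numeric versions)."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    if parse(installed) < parse(required):
        raise ImportError(
            f"This example requires version {required} or newer, "
            f"but version {installed} is installed."
        )

check_min_version("1.8.0", "1.8.0")  # passes silently
```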
Others
- Add support for autocast custom ops in GaudiTrainer #308 @regisss
- Add warmup arg and move stats printing to the end #390 @polisettyvarma
- Add a configurable max input tokens parameter #426 @puneeshkhanna
- Add transformers model tests for gaudi #427 @ankurneog
- Modify LoRA fine-tuning for Llama and Falcon #415 @libinta
- Option to not crop in dataset run #444 @ssarkar2
- Enable auto tensor parallelism for Falcon #451 @mandy-li
Various fixes
- Fixes for streaming dataset mode #324 @MohitIntel
- Fix beam search output #360 @puneeshkhanna
- Fix DDP for LoRA #368 @sywangyi
- Load Llama checkpoint to meta device to work around CPU OOM issue #359 @mandy-li
- Fix gradient checkpointing in LoRA example #398 @regisss
- No need to wrap DDP when using Fast DDP #430 @ikurtchen
- Fix falcon-40b error when DeepSpeed enabled #434 @schoi-habana
- Revert "Fix T5 DeepSpeed ZeRO-3 (#393)" #466 @sywangyi
Regression tests for this release are available here: https://github.com/huggingface/optimum-habana/actions/runs/6580186897