v1.6: Fast DDP, Torch Autocast, SynapseAI v1.10 and various model optimizations
Fast DDP
A new distribution strategy is introduced. It is lighter, simpler and usually faster than Torch DDP. You can enable it in your runs with `--distribution_strategy fast_ddp`.
- Improve performance and scalability of BERT FT training #200 @mlapinski-habana
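As an illustration, a multi-card run could enable Fast DDP as follows. The `run_glue.py` script, the model and the dataset arguments below are placeholders, not part of this release:

```shell
# Hypothetical 8-card launch; script, model and dataset are illustrative.
python gaudi_spawn.py --world_size 8 --use_mpi run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name mrpc \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --distribution_strategy fast_ddp \
    --output_dir /tmp/mrpc_fast_ddp
```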
Torch Autocast
It is now possible to use Torch Autocast as the mixed-precision backend. You can easily enable it in your runs with `--bf16` (i.e. exactly as in Transformers).
- Enable usage of PyTorch autocast on Gaudi during training #226 @jwieczorekhabana
- Add Torch autocast and full bf16 to GaudiStableDiffusionPipeline #278 @regisss
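For example, Torch Autocast mixed precision could be switched on by adding the flag to an otherwise unchanged command (the script and its other arguments below are hypothetical):

```shell
# Hypothetical single-card run; --bf16 is the only flag of interest here.
python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name mrpc \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --bf16 \
    --output_dir /tmp/mrpc_bf16
```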
SynapseAI v1.10
This release is fully compatible with SynapseAI v1.10.0.
HPU graphs for training
You can now use HPU graphs for training your models.
- Improve performance and scalability of BERT FT training #200 @mlapinski-habana
Check out the documentation for more information.
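As a sketch, HPU graphs could be enabled for training with a flag such as `--use_hpu_graphs_for_training`. The exact flag name is an assumption here, as are the script and its arguments; check the documentation for what your version exposes:

```shell
# Hypothetical run; verify the exact HPU-graphs flag against the documentation.
python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name mrpc \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_training \
    --output_dir /tmp/mrpc_hpu_graphs
```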
Various model optimizations
- Update BLOOM modeling for SynapseAI 1.10 #277
- Optimize conv1d forward #231 @ZhaiFeiyue
- Add static key-value cache for OPT, GPT-J, GPT-NeoX #246 #248 #249 @ZhaiFeiyue
- Optimizations for running FLAN T5 with DeepSpeed ZeRO-3 #257 @libinta
Asynchronous data copy
You can now enable asynchronous data copy between the host and devices during training with `--non_blocking_data_copy`.
- Enable asynchronous data copy to get a better performance #211 @jychen-habana
Check out the documentation for more information.
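Enabling it is just a matter of appending the flag to a training command (the script and other arguments below are illustrative):

```shell
# Hypothetical run; host-to-device copies become non-blocking.
python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name mrpc \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --non_blocking_data_copy \
    --output_dir /tmp/mrpc_async_copy
```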
Profiling
It is now possible to profile your training runs with the `GaudiTrainer`. You will need to pass `--profiling_steps N` and `--profiling_warmup_steps K`.
- Enable profiling #250 @ZhaiFeiyue
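For instance, to skip the first 5 steps and then profile 10 steps, the flags could be combined like this (the script and the concrete values for N and K are illustrative):

```shell
# Hypothetical run: warm up for 5 steps, then record 10 profiled steps.
python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name mrpc \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --profiling_warmup_steps 5 \
    --profiling_steps 10 \
    --output_dir /tmp/mrpc_profiling
```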
Adjusted throughput calculation
You can now let the `GaudiTrainer` compute the real throughput of your run (i.e. not counting the time spent logging, evaluating and saving the model) with `--adjust_throughput`.
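A run with the adjusted throughput calculation could look like this (script and other arguments are placeholders):

```shell
# Hypothetical run; reported throughput excludes logging/evaluation/saving time.
python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name mrpc \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --adjust_throughput \
    --output_dir /tmp/mrpc_throughput
```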
Check SynapseAI version at import
A check is performed when importing `optimum.habana` to let you know whether you are running a version of SynapseAI that Optimum Habana has been tested with.
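Since the check runs at import time, simply importing the package is enough to surface any version warning:

```shell
# Importing the package triggers the SynapseAI version check.
python -c "import optimum.habana"
```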
Enhanced examples
Several examples have been added or improved. You can find them here.