Releases: huggingface/optimum-habana
v1.6: Fast DDP, Torch Autocast, SynapseAI v1.10 and various model optimizations
Fast DDP
A new distribution strategy is introduced. It is lighter, simpler and usually faster than Torch DDP. You can enable it in your runs with `--distribution_strategy fast_ddp`, as sketched below.
- Improve performance and scalability of BERT FT training #200 @mlapinski-habana
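Since command-line flags in these examples map to `GaudiTrainingArguments` fields via `HfArgumentParser`, the Python equivalent should look like the following minimal sketch (`output_dir` is a placeholder):

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",                # placeholder output directory
    use_habana=True,
    use_lazy_mode=True,
    distribution_strategy="fast_ddp",  # lighter and usually faster than Torch DDP
)
```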
Torch Autocast
It is now possible to use Torch Autocast as the mixed precision backend. You can easily enable it in your runs with `--bf16` (i.e. exactly like in Transformers).
- Enable usage of PyTorch autocast on Gaudi during training #226 @jwieczorekhabana
- Add Torch autocast and full bf16 to GaudiStableDiffusionPipeline #278 @regisss
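Since `--bf16` is the standard Transformers flag, enabling it from Python is a one-liner in the training arguments (a minimal sketch, `output_dir` being a placeholder):

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",  # placeholder output directory
    use_habana=True,
    bf16=True,           # mixed precision through Torch autocast, as in Transformers
)
```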
SynapseAI v1.10
This release is fully compatible with SynapseAI v1.10.0.
HPU graphs for training
You can now use HPU graphs for training your models.
- Improve performance and scalability of BERT FT training #200 @mlapinski-habana
Check out the documentation for more information.
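A hedged sketch of what enabling this may look like; the argument name `use_hpu_graphs_for_training` below is an assumption, so check the documentation for the exact flag exposed by your version:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",                # placeholder output directory
    use_habana=True,
    use_lazy_mode=True,
    use_hpu_graphs_for_training=True,  # assumed flag name, see the documentation
)
```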
Various model optimizations
- Update BLOOM modeling for SynapseAI 1.10 #277
- Optimize conv1d forward #231 @ZhaiFeiyue
- Add static key-value cache for OPT, GPT-J, GPT-NeoX #246 #248 #249 @ZhaiFeiyue
- Optimizations for running FLAN T5 with DeepSpeed ZeRO-3 #257 @libinta
Asynchronous data copy
You can now enable asynchronous data copy between the host and devices during training using `--non_blocking_data_copy`.
- Enable asynchronous data copy to get a better performance #211 @jychen-habana
Check out the documentation for more information.
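Assuming the usual mapping from the `--non_blocking_data_copy` flag to a training-argument field of the same name, a minimal sketch:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",           # placeholder output directory
    use_habana=True,
    non_blocking_data_copy=True,  # overlap host-to-device data copies with computation
)
```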
Profiling
It is now possible to profile your training with the `GaudiTrainer`. You will need to pass `--profiling_steps N` and `--profiling_warmup_steps K`.
- Enable profiling #250 @ZhaiFeiyue
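For instance, the sketch below profiles 5 steps after a 2-step warmup (both values are arbitrary and should be tuned to your run):

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",        # placeholder output directory
    use_habana=True,
    profiling_warmup_steps=2,  # K: steps skipped before the profiler starts recording
    profiling_steps=5,         # N: steps actually captured in the profile
)
```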
Adjusted throughput calculation
You can now let the `GaudiTrainer` compute the real throughput of your run (i.e. not counting the time spent logging, evaluating and saving the model) with `--adjust_throughput`.
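Assuming the flag maps to a field of the same name, a minimal sketch:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",      # placeholder output directory
    use_habana=True,
    adjust_throughput=True,  # exclude logging/evaluation/saving time from the throughput
)
```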
Check SynapseAI version at import
A check is performed when importing `optimum.habana` to let you know whether you are running the version of SynapseAI for which Optimum Habana has been tested.
Enhanced examples
Several examples have been added or improved. You can find them here.
v1.5: BLOOM(Z), SynapseAI v1.9.0 and various speedups
BLOOM(Z)
This release introduces BLOOM with HPU-optimized tweaks to perform fast inference using DeepSpeed. A text-generation example is provided here so that you can easily try it.
Check out the blog post we recently released for a benchmark comparing BLOOMZ performance on Gaudi2 and A100.
SynapseAI v1.9.0
This release is fully compatible with SynapseAI v1.9.0.
Transformers v4.28 and Diffusers v0.15
This release is fully compatible with the recently released Transformers v4.28 and Diffusers v0.15.
Improved data sampling for training in lazy mode
This release ensures that all batches have the same size in lazy mode, preventing extra graph compilations.
HPU graphs for distributed runs and generation
This release enables HPU graphs for distributed runs and text generation.
Recommend dataloader_num_workers for CV model training
ViT and Swin examples have been updated to add the `dataloader_num_workers` argument, which speeds up training.
- Adding dataloader_num_workers into example command for better performance #188 @ZhaiFeiyue
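`dataloader_num_workers` is the standard Transformers argument, so in Python this is simply (worker count illustrative):

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",        # placeholder output directory
    use_habana=True,
    dataloader_num_workers=1,  # illustrative value: load data in a separate worker process
)
```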
Enable to pipeline forward and backward passes
The argument `pipelining_fwd_bwd` triggers the HPU computation of the forward pass while the CPU interprets the backward pass. This speeds up CV models.
- Add mark_step between fwd and bwd for better performance #189 @ZhaiFeiyue
More information in the documentation.
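A minimal sketch, again assuming the usual flag-to-field mapping:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",       # placeholder output directory
    use_habana=True,
    use_lazy_mode=True,
    pipelining_fwd_bwd=True,  # start the forward pass on HPU while the CPU interprets the backward pass
)
```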
v1.4: multi-node training and inference mode
Multi-node training
This release adds support for multi-node training through DeepSpeed, enabling you to scale up to thousands of nodes to speed up your training even more!
- Add support for multi-node training #116
Check out the documentation to get started.
Inference through HPU graphs
You can now perform inference faster on Gaudi with HPU graphs.
- Add support for inference through HPU graphs in GaudiTrainer #151
HPU graphs are currently only supported for single-device runs. Check out the documentation for more information.
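A hedged sketch; `use_hpu_graphs` is the training-argument name assumed here, so double-check it against the documentation:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",   # placeholder output directory
    use_habana=True,
    use_lazy_mode=True,
    use_hpu_graphs=True,  # assumed flag name; single-device runs only for now
)
```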
SynapseAI 1.8
This release is fully compatible with SynapseAI 1.8.0, which is the latest version. Check out Habana's documentation for more information about the new features.
DeepSpeed's gradient checkpointing
DeepSpeed's gradient checkpointing is now automatically used when setting `gradient_checkpointing=True` in a DeepSpeed run.
- Enable DeepSpeed activation checkpointing #142
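Both arguments below are standard Transformers ones; the DeepSpeed configuration path is a placeholder:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",           # placeholder output directory
    use_habana=True,
    deepspeed="ds_config.json",   # placeholder path to your DeepSpeed configuration
    gradient_checkpointing=True,  # now mapped to DeepSpeed's activation checkpointing
)
```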
v1.3: Stable Diffusion and Wav2Vec2
Stable Diffusion
This release adds a new interface to the 🤗 Diffusers library that supports the Stable Diffusion pipeline for inference. You can now generate images from text on Gaudi while relying on the user-friendliness of 🤗 Diffusers.
- Add support for Stable Diffusion #131
Check out the documentation and this example for more information.
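A minimal sketch of the new interface, with illustrative model and Gaudi configuration names:

```python
from optimum.habana.diffusers import GaudiStableDiffusionPipeline

# Model and Gaudi configuration identifiers are illustrative; swap in your own.
pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_habana=True,
    gaudi_config="Habana/stable-diffusion",
)
images = pipeline(["An astronaut riding a green horse"]).images
```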
Wav2Vec2
After text and image models, a third modality is now supported with the addition of Wav2Vec2.
- Add support for Wav2Vec2 #120
Check out the audio classification and speech recognition examples to see how to use it.
SynapseAI 1.7
This release is fully compatible with SynapseAI 1.7.0, which is the latest version. Check out Habana's documentation for more information about the new features.
Memory stats
Memory stats are now logged every `logging_steps` steps to give more information about the memory consumption of HPUs.
- Memory stats #89
DeepSpeed demo notebook with GPT2-XL
This repository now has a notebook showing how to use DeepSpeed to pre-train/fine-tune GPT2-XL on Gaudi. You can find it here.
- Add DeepSpeed demo notebook #112
Fix gradient checkpointing for BERT/RoBERTa/ALBERT
An error used to be raised by PyTorch when running BERT-like models with gradient checkpointing. This has been fixed.
- Fix gradient checkpointing for BERT/RoBERTa/ALBERT #118
v1.2: DeepSpeed and CV Models
DeepSpeed
This release brings support for DeepSpeed. It is now possible to train bigger models on Gaudi with Optimum Habana!
- Add support for DeepSpeed #93
Check out the documentation here to learn how to use it.
Computer Vision Models
Two computer-vision models have been validated for image classification in both single- and multi-card configurations:
- ViT #80
- Swin
You can see how to use them in this example.
SynapseAI 1.6.0
This release is fully compatible with SynapseAI 1.6.0.
- Update to SynapseAI 1.6.0 #91
It is recommended to use SynapseAI 1.6.0 for optimal performance.
Documentation
Optimum Habana now has dedicated documentation. You can find it here.
It shows how to quickly make a Transformers-based script work with the library. It also contains guides explaining how to do distributed training, how to use DeepSpeed, and how to make the most of HPUs to accelerate training.
Masked Language Modeling
A new example script has been added to perform masked language modeling. This is especially useful if you want to pretrain models such as BERT or RoBERTa.
- Add run_mlm.py in language-modeling examples #83
v1.1.2: Patch Release
This patch release fixes a bug where processes could be initialized multiple times in distributed mode, leading to an error.
v1.1.1: Patch Release
This patch release fixes a bug where the loss is equal to NaN from the first training iteration with Transformers 4.21.0.
v1.1.0: GPT2, T5 and SynapseAI 1.5.0
GPT2
You can now train or fine-tune GPT2 for causal language modeling on up to 8 HPUs. An example of fine-tuning on WikiText-2 is provided here.
- Add support for language modeling (GPT2) #52
You can also use GPT2 for text generation in lazy mode.
- Accelerate generation #61
T5
Encoder-decoder architectures are now supported. In particular, examples relying on T5 for the following tasks are available:
- summarization, with an example of fine-tuning T5 on the CNN/DailyMail dataset,
- translation, with an example of fine-tuning T5 on the WMT16 dataset for translating English to Romanian.
You can also use T5 for text generation in lazy mode.
- Accelerate generation #61
Support for SynapseAI 1.5.0
The newly released SynapseAI 1.5.0 is now supported. You can find more information about it here.
- Add support for SynapseAI 1.5.0 #65
This is a breaking change: you should update your version of SynapseAI as described here in order to use this new release.
GaudiConfig instantiation is not mandatory anymore
If the name of your Gaudi configuration is given in the training arguments, you no longer have to instantiate it and provide it to the trainer: this is automatically taken care of. You can still instantiate a Gaudi configuration and provide it to the trainer yourself if you prefer.
- Enable GaudiConfig instantiation from inside the trainer #55
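A minimal sketch of the lighter path (model and configuration names are illustrative):

```python
from transformers import AutoModelForSequenceClassification
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

training_args = GaudiTrainingArguments(
    output_dir="./out",                            # placeholder output directory
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/bert-base-uncased",  # Hub name or local path
)

# No gaudi_config=... argument needed anymore: it is resolved from the name above.
trainer = GaudiTrainer(model=model, args=training_args)
```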
Refined throughput computation in lazy mode
In lazy mode, the first two steps are warmup steps used for graph compilation. In order to discard them from the throughput computation, you can just add the following training argument: `--throughput_warmup_steps 2`.
- Add a new argument for taking warmup steps into account in throughput computation #48
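In a script, that flag corresponds to a training-argument field of the same name (a minimal sketch):

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",         # placeholder output directory
    use_habana=True,
    use_lazy_mode=True,
    throughput_warmup_steps=2,  # discard the two graph-compilation steps from the throughput
)
```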
Optimum Habana v1
With this release, we enable easy and fast deployment of models from the Transformers library on Habana Gaudi Processors (HPU).
- The class `GaudiTrainer` is built on top of the original `Trainer` class and enables training and evaluating models from the Transformers library on HPUs.
- The class `GaudiTrainingArguments` is built on top of the original `TrainingArguments` class and adds 3 new arguments:
  - `use_habana` to deploy on HPU
  - `use_lazy_mode` to use lazy mode instead of eager mode
  - `gaudi_config_name` to specify the name of or the path to the Gaudi configuration file
- The class `GaudiConfig` enables specifying a configuration for deployment on HPU, such as the use of Habana Mixed Precision or the use of custom ops.
- Multi-card deployment is enabled.
- Examples are provided for question answering and text classification in both single- and multi-card settings.
- The following models have been validated:
- BERT base/large
- RoBERTa base/large
- ALBERT large/XXL
- DistilBERT
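Putting the three classes together, here is a minimal hedged sketch of a fine-tuning run; the model, configuration name and datasets are placeholders:

```python
from transformers import AutoModelForSequenceClassification
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")  # illustrative name

training_args = GaudiTrainingArguments(
    output_dir="./out",  # placeholder output directory
    use_habana=True,     # deploy on HPU
    use_lazy_mode=True,  # lazy mode instead of eager mode
)

train_dataset = eval_dataset = ...  # placeholders: supply your tokenized datasets

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```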