The base model is wav2vec 2.0, which learns speech representations from unlabeled data as described in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020).
The paper shows that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable.
The guide will walk you through the process of setting up your system to run the model on Gaudi.
To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform guide.
In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version.
You can run the hl-smi utility to determine the Intel Gaudi software version.
git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/fairseq
Note: If the repository is not in the PYTHONPATH, add it by running the following command:
export PYTHONPATH=/path/to/fairseq:$PYTHONPATH
In the docker container, go to the fairseq directory and install fairseq along with the required packages using pip:
cd fairseq
pip install --editable .
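After the installation (or the PYTHONPATH update), a quick check that Python can actually locate the package can save a failed training launch. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of fairseq.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("fairseq"):
        print("fairseq is importable")
    else:
        print("fairseq not found; check PYTHONPATH or rerun 'pip install --editable .'")
```

Run it with the same interpreter that $PYTHON points to, since a different interpreter may have a different search path.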
Follow the steps below to set up the Wav2vec dataset:
- Download the dataset from http://www.openslr.org/12.
- Create the train-960 directory, comprising the untarred train-clean-100, train-clean-360, and train-other-500 subsets (totaling 960 hours of speech).
- Run the following command to create the manifest file:
$PYTHON wav2vec_manifest.py /path/to/dataset/train-960/ --dest /path/to/dataset/train-960/manifest --valid-percent 0.05
You can obtain the wav2vec_manifest.py file from /path/to/fairseq/examples/wav2vec.
An example layout of the dataset looks like this:
100/
1001/
1006/
101/
1012/
...
manifest/
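The layout above places the speaker directories of all three subsets directly under train-960. One way to build it is sketched below, assuming the archives were extracted so that each subset directory contains one directory per speaker. The function name `build_train_960` is hypothetical, and the sketch moves directories rather than copying them, so the original subset directories end up empty.

```python
import shutil
from pathlib import Path

SUBSETS = ("train-clean-100", "train-clean-360", "train-other-500")


def build_train_960(librispeech_root, dest_name="train-960", subsets=SUBSETS):
    """Move every speaker directory of each subset into one flat directory.

    Returns the number of speaker directories moved.
    """
    root = Path(librispeech_root)
    dest = root / dest_name
    dest.mkdir(exist_ok=True)
    moved = 0
    for subset in subsets:
        for speaker_dir in sorted((root / subset).iterdir()):
            if not speaker_dir.is_dir():
                continue  # skip stray files such as notes or checksums
            target = dest / speaker_dir.name
            if target.exists():
                # Refuse to merge silently if a speaker ID appears twice.
                raise FileExistsError(f"{target} already exists")
            shutil.move(str(speaker_dir), str(target))
            moved += 1
    return moved
```

For example, `build_train_960("/path/to/dataset")` would populate /path/to/dataset/train-960, after which the manifest command can be run against that directory.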
Note:
- Please make sure the first line in /path/to/dataset/train-960/manifest/train.tsv and /path/to/dataset/train-960/manifest/valid.tsv points to the correct directory, e.g. /data/pytorch/wav2vec/data/LibriSpeech/train-960.
- Going forward, we assume the above Wav2vec dataset is available at /data/pytorch/wav2vec/data/LibriSpeech/train-960.
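The first-line check on train.tsv and valid.tsv can be automated. The sketch below assumes the manifest format written by wav2vec_manifest.py is a root directory on the first line followed by tab-separated lines of relative path and sample count, and that the audio is sampled at 16 kHz. The function name `inspect_manifest` is hypothetical.

```python
from pathlib import Path

SAMPLE_RATE_HZ = 16_000  # assumption: LibriSpeech audio at 16 kHz


def inspect_manifest(tsv_path, expected_root):
    """Return (root_matches, num_files, total_hours) for one manifest file.

    Assumed format: line 1 is the dataset root; each later line is
    a relative audio path and its number of samples, separated by a tab.
    """
    lines = Path(tsv_path).read_text().splitlines()
    root_matches = Path(lines[0].strip()) == Path(expected_root)
    total_samples = 0
    num_files = 0
    for line in lines[1:]:
        if not line.strip():
            continue  # tolerate blank trailing lines
        _rel_path, samples = line.split("\t")
        total_samples += int(samples)
        num_files += 1
    hours = total_samples / SAMPLE_RATE_HZ / 3600
    return root_matches, num_files, hours
```

Summing the hours of train.tsv and valid.tsv should land close to 960 if all three subsets were included.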
Run training on 1 HPU:
- Run training on 1 HPU, Gradient accumulation=64, mixed precision (BF16):
fairseq-hydra-train task.data=/data/pytorch/wav2vec/data/LibriSpeech/train-960/manifest/ --config-dir examples/wav2vec/config/pretraining --config-name wav2vec2_base_librispeech_hpu
To run the multi-card demo, the following is required:
- The host machine has 512 GB of RAM installed.
- Make sure to follow the Gaudi Setup and Installation Guide to install and set up the docker, so that it has access to all 8 cards required for the multi-card demo.
- Before executing the multi-card demo scripts, make sure all server network interfaces are up. You can change the state of each network interface managed by the habanalabs driver using the following command:
sudo ip link set <interface_name> up
To identify if a specific network interface is managed by the habanalabs driver type, run:
sudo ethtool -i <interface_name>
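When a server has many interfaces, checking them one by one is tedious. The sketch below wraps the ethtool call and extracts the `driver:` field from its output. The function names are hypothetical, and the exact driver string reported on a given system should be confirmed manually before relying on it.

```python
import subprocess


def parse_driver(ethtool_output: str):
    """Extract the driver name from `ethtool -i` output, or None if absent."""
    for line in ethtool_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "driver":
            return value.strip()
    return None


def interface_driver(interface_name: str):
    """Run `sudo ethtool -i <interface_name>` and return the reported driver."""
    result = subprocess.run(
        ["sudo", "ethtool", "-i", interface_name],
        capture_output=True,
        text=True,
        check=True,  # raise if ethtool fails, e.g. for an unknown interface
    )
    return parse_driver(result.stdout)
```

A caller could then compare `interface_driver(name)` against `"habanalabs"` for each interface and bring up only the matching ones.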
Run training on 8 HPUs:
Note: The number of cards can be configured using the --world_size option in the demo script as shown below.
- Modify the wav2vec2_base_librispeech_hpu.yaml under /path/to/fairseq/examples/wav2vec/config/pretraining/.
- Set distributed_world_size to 8:
distributed_training:
  distributed_world_size: 8
- Set update_freq to 8:
optimization:
  max_update: 400000
  lr: [0.0005]
  update_freq: [8]
- Run the following command (first-gen Gaudi):
fairseq-hydra-train task.data=/data/pytorch/wav2vec/data/LibriSpeech/train-960/manifest/ --config-dir examples/wav2vec/config/pretraining --config-name wav2vec2_base_librispeech_hpu
- Run the following command (Gaudi 2):
PT_HPU_RECIPE_CACHE_CONFIG="./cache_dir/" common.log_interval=111 common.hpu_graphs=true fairseq-hydra-train task.data=/data/pytorch/wav2vec/data/LibriSpeech/train-960/manifest/ --config-dir examples/wav2vec/config/pretraining --config-name wav2vec2_base_librispeech_hpu
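The single-card run uses a gradient accumulation of 64, and the 8-card configuration sets update_freq to 8, so both keep the product of card count and accumulation steps at 64. The sketch below computes update_freq for other card counts under the assumption that this product should stay constant. Only the 1-card and 8-card settings are stated in this document, and the helper name is hypothetical.

```python
SIMULATED_WORKERS = 64  # 1 HPU x 64 accumulation steps, or 8 HPUs x 8


def update_freq_for(world_size: int, simulated_workers: int = SIMULATED_WORKERS) -> int:
    """Return the update_freq that keeps world_size * update_freq constant."""
    if world_size <= 0 or simulated_workers % world_size != 0:
        raise ValueError(
            f"world_size must be a positive divisor of {simulated_workers}"
        )
    return simulated_workers // world_size


if __name__ == "__main__":
    for cards in (1, 8):
        print(f"distributed_world_size={cards} -> update_freq=[{update_freq_for(cards)}]")
```

Keeping the product fixed keeps the effective batch per optimizer step the same, so the learning rate in the config does not need to be retuned for this reason alone.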
Device | Intel Gaudi Software Version | PyTorch Version | Mode |
---|---|---|---|
Gaudi | 1.13.0 | 2.1.0 | Training |
Gaudi 2 | 1.13.0 | 2.1.0 | Training |
- Added HPU graph support to model script. Enabled HPU graph flags for Gaudi 2 only.
- Marked the copy of inputs to the device as async.
- Added async allreduce for sample_size.
- Removed host barrier in Wav2vec.
- Replaced isnonzero with where op to unblock the host.
- Only fetch the log statistics to CPU when needed.
- Replaced broadcast+sum with equal algorithm to save memory in Quantizer module.
- Created a customized version of cos_similarity by removing the broadcast operations.
- Moved negative indices generation to HPU.
- Changed the data type of randint to int16 to reduce the memory copied from host to device when generating negative indices.
- Replaced conv1d with equivalent conv2d.
- Replaced group norm with equivalent instance norm.
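The list above does not say why instance norm can stand in for group norm. A likely reason, stated here as an assumption rather than taken from the model code, is that the group norm in question uses one group per channel, in which case each channel is normalized on its own, exactly as instance norm does. The pure-Python sketch below illustrates that equivalence for a single sample shaped as channels by time steps, with the learnable affine parameters omitted.

```python
import math


def _normalize(values, eps=1e-5):
    """Scale a flat list to zero mean and unit (biased) variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def instance_norm(x, eps=1e-5):
    """Normalize each channel of x (list of channels) independently."""
    return [_normalize(channel, eps) for channel in x]


def group_norm(x, num_groups, eps=1e-5):
    """Normalize jointly over the channels of each group and over time."""
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        length = len(group[0])
        flat = [v for channel in group for v in channel]
        normed = _normalize(flat, eps)
        # Split the normalized values back into per-channel rows.
        out.extend(normed[i * length:(i + 1) * length] for i in range(per_group))
    return out
```

With `x = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]`, `group_norm(x, num_groups=2)` matches `instance_norm(x)`, while `group_norm(x, num_groups=1)` does not, because it mixes statistics across channels.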
The following are the changes made to the training scripts:
- Added support for Gaudi devices:
- Defined certain environment variables for Gaudi devices.
- Added support to run training in lazy mode.
- mark_step() is performed to trigger execution.
- Added support of bucketing, padding, and precompute loss for HPU.
- Added support to use HPU accelerator plugin, DDP plugin (for multi-HPU training) and mixed precision plugin.
- Added support of fairseq_hydra_train for multi-node training.
- Disabled auto dynamic shape support.
- Only the configurations mentioned above are supported and verified.
- Training on 1 HPU with the FP32 data type runs out of memory (OOM).