Papers, Resources, and Statistics for Self-Supervised Learning and Pre-Training on Speech.
π represents important papers.
- π CPC: Representation Learning with Contrastive Predictive Coding - A Oord et al,
arXiv 2018
- APC: An Unsupervised Autoregressive Model for Speech Representation Learning - YA Chung et al,
INTERSPEECH 2019
- π wav2vec: Unsupervised Pre-training for Speech Recognition - S Schneider et al,
INTERSPEECH 2019
- π vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations - A Baevski et al,
arXiv 2019, ICLR 2020
- MPC: Improving Transformer-based Speech Recognition Using Unsupervised Pre-training - D Jiang et al,
arXiv 2019
- PASE: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - S Pascual et al,
INTERSPEECH 2019
- Bidir CPC: Learning robust and multilingual speech representations - K Kawakami et al,
EMNLP 2020
- Multi-target APC: Improved speech representations with multi-target autoregressive predictive coding - YA Chung et al,
ACL 2020
- Modified CPC: Unsupervised pretraining transfers well across languages - M Riviere et al,
ICASSP 2020
- Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders - AT Liu et al,
ICASSP 2020
- vq-wav2vec-FT: Effectiveness of self-supervised pre-training for asr - A Baevski et al,
ICASSP 2020
- DeCoAR: Deep contextualized acoustic representations for semi-supervised speech recognition - S Ling et al,
ICASSP 2020
- Improved noisy student training for automatic speech recognition - DS Park et al,
INTERSPEECH 2020
- π wav2vec 2.0: A framework for self-supervised learning of speech representations - A Baevski et al,
NeurIPS 2020
- Multi-lingual wav2vec 2.0 (XLSR): Unsupervised cross-lingual representation learning for speech recognition - A Conneau et al,
arXiv 2020, INTERSPEECH 2021
- Self-Training wav2vec 2.0: Self-training and Pre-training are Complementary for Speech Recognition - Q Xu et al,
arXiv 2020, ICASSP 2021
- DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization - S Ling et al,
arXiv 2020, ICASSP 2021
- Pushing the limits of semi-supervised learning for automatic speech recognition - Y Zhang et al,
arXiv 2020, NeurIPS Workshop 2020
- UniSpeech: Unified speech representation learning with labeled and unlabeled data - C Wang et al,
ACL 2021
- Tera: Self-supervised learning of transformer encoder representation for speech - AT Liu et al,
TASLP 2021
- Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training - WN Hsu et al,
INTERSPEECH 2021
- Zero-shot wav2vec 2.0: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition - Q Xu et al,
arXiv 2021
- π wav2vec-U: Unsupervised Speech Recognition - A Baevski et al,
NeurIPS 2021
- π HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - WN Hsu et al,
TASLP 2021
- π SUPERB: Speech processing Universal PERformance Benchmark - S Yang et al,
INTERSPEECH 2021
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition - G Zheng et al,
EMNLP 2021
- ILS-SSL: Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision - C Wang et al,
ICASSP 2022
- Wavlm: Large-scale self-supervised pre-training for full stack speech processing - S Chen et al,
arXiv 2021, JSTSP 2022
- Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition - Y Zhang et al,
arXiv 2021, JSTSP 2022
- Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing - J Ao et al,
arXiv 2021, ACL 2022
- π Data2vec: A general framework for self-supervised learning in speech, vision and language - A Baevski et al,
ICML 2022
- BEST-RQ: Self-supervised Learning with Random-projection Quantizer for Speech Recognition - CC Chiu et al,
ICML 2022
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities - HS Tsai et al,
ACL 2022
- π wav2vec-U 2.0: Towards End-to-end Unsupervised Speech Recognition - AH Liu et al,
SLT 2022
- c-siam: Contrastive Siamese Network for Semi-Supervised Speech Recognition - S Khorram et al,
ICASSP 2022
- Speech2C: Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data - J Ao et al,
INTERSPEECH 2022
- SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training - W Huang et al,
ICLR 2022
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages - F Wu et al,
arXiv 2022, ICASSP 2023
- HuBERT-AP: Speech Pre-training with Acoustic Piece - S Ren et al,
INTERSPEECH 2022
- PBERT: Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training - C Wang et al,
INTERSPEECH 2022
- data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - A Baevski et al,
arXiv 2022
- CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning - C Meng et al,
arXiv 2022, INTERSPEECH 2023
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets - Z Ma et al,
arXiv 2022, INTERSPEECH 2023
- CTCBERT: Advancing Hidden-unit BERT with CTC Objectives - R Fan et al,
ICASSP 2023
- data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup - VS Lodagala et al,
ICASSP 2023
- MonoBERT & PolyBERT: Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation - Z Ma et al,
INTERSPEECH 2023
- MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization - JW Yoon et al,
INTERSPEECH 2023
- A general multi-task learning framework to leverage text data for speech to text tasks - Y Tang et al,
ICASSP 2021
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training - A Bapna et al,
arXiv 2021
- mSLAM: Massively multilingual joint pre-training for speech and text - A Bapna et al,
arXiv 2022
- Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding - W Wang et al,
INTERSPEECH 2022
- STPT: Unified Speech-Text Pre-training for Speech Translation and Recognition - Y Tang et al,
ACL 2022
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data - Y Kang et al,
AAAI 2022
- Distill-L2S: Distilling a Pretrained Language Model to a Multilingual ASR Model - K Choi et al,
INTERSPEECH 2022
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training - Z Zhang et al,
EMNLP 2022
- TESSP: Text-Enhanced Self-Supervised Speech Pre-training - Z Yao et al,
arXiv 2022
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data - Z Zhang et al,
arXiv 2022
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text - X Yue et al,
ICASSP 2023
- BYOL-A: BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation - D Niizumi et al,
IJCNN 2021
- Audio-MAE: Masked Autoencoders that Listen - H Xu et al,
NeurIPS 2022
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - A Baade et al,
INTERSPEECH 2022
- BEATs: Audio Pre-Training with Acoustic Tokenizers - S Chen et al,
ICML 2023
- ATST: Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks - X Li et al,
arXiv 2023
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer - W Chen et al,
arXiv 2024
- Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks - R Eloff et al,
INTERSPEECH 2019
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages - H Zhang et al,
INTERSPEECH 2020
- Towards Unsupervised Speech Synthesis - AH Liu et al,
NAACL 2022
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT - H Chang et al,
ICASSP 2022
- FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning - Y Lee et al,
INTERSPEECH 2022
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT - R Wang et al,
INTERSPEECH 2022
- Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models - T Ashihara et al,
INTERSPEECH 2022
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition - Y Wang et al,
arXiv 2022
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning - G Yang et al,
ASRU 2023
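Several of the entries above (CPC, wav2vec, wav2vec 2.0) build on the InfoNCE contrastive objective: a context vector is scored against the true future (or masked) representation and a set of distractors, and the loss is the negative log-probability of picking the true one. A toy NumPy sketch, illustrative only — the variable names and the cosine-similarity scoring are my assumptions, not any paper's exact implementation:

```python
import numpy as np

def info_nce(context, positive, negatives, temperature=0.1):
    """InfoNCE loss for one time step: score the context vector against
    the true target (index 0) and K distractors (toy version)."""
    candidates = np.vstack([positive[None, :], negatives])   # (1+K, D)
    # cosine similarity between the context and each candidate
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context) + 1e-8
    )
    logits = sims / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive sits at index 0

rng = np.random.default_rng(0)
ctx = rng.normal(size=64)                        # context representation
pos = ctx + 0.1 * rng.normal(size=64)            # correlated "future" frame
negs = rng.normal(size=(10, 64))                 # 10 random distractors
loss = info_nce(ctx, pos, negs)                  # small, since pos ≈ ctx
```

Because the positive is strongly correlated with the context while the distractors are random, the loss comes out well below the chance level of log(1+K).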
- Speech processing Universal PERformance Benchmark (SUPERB)
- Self-Supervised Speech Pre-training and Representation Learning (S3PRL)
Statistics on speech pre-training.
**wav2vec 2.0**

Size | Transformer | Samples per GPU | Total Batch Size | Train Time
---|---|---|---|---
BASE | 12 blocks, model dim. 768, FFN 3072, 8 heads | 1.4M (cropped) | 1.6 h | 400k updates, 64 V100 × 1.6 d
LARGE | 24 blocks, model dim. 1024, FFN 4096, 16 heads | 1.2M (cropped) | 2.7 h | 250k updates, 128 V100 × 2.3 d (LibriSpeech); 600k updates, 128 V100 × 5.2 d (LibriVox)
**wav2vec-U**

Method | Feature Extractor | Batch Size | Train Time
---|---|---|---
wav2vec-U | wav2vec 2.0 LARGE | 160 unlabeled audio + 160 text samples | 150k steps, single V100 × 12 h
wav2vec-U + self-training | wav2vec 2.0 LARGE | / | 80k updates, 8 V100 (LibriSpeech); 13k updates, 4 V100 (TIMIT)
**HuBERT**

Size | Architecture | Batch Size per GPU | Stages | Train Time
---|---|---|---|---
BASE | wav2vec 2.0 BASE (95M) | 87.5 s | 1: MFCC targets, 250k steps; 2: 6th transformer layer, 400k steps | 9.5 h/100k steps, 32 GPUs (LibriSpeech-960)
LARGE | wav2vec 2.0 LARGE (317M) | 56.25 s | 3: 9th transformer layer of BASE HuBERT, 400k steps | 9.5 h/100k steps, 128 GPUs (Libri-light 60k)
X-LARGE | Conformer XXL (964M) | 22.5 s | 3: 9th transformer layer of BASE HuBERT, 400k steps | 9.5 h/100k steps, 256 GPUs (Libri-light 60k)
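As a sanity check on the units in the tables above: assuming 16 kHz audio (the LibriSpeech/Libri-light sample rate), the per-GPU sample counts, the per-GPU batch lengths in seconds, and the total batch sizes in hours are mutually consistent:

```python
# Assumes 16 kHz audio (LibriSpeech / Libri-light sample rate).
SAMPLE_RATE = 16_000

# wav2vec 2.0 BASE: 1.4M cropped samples per GPU on 64 GPUs.
base_per_gpu_s = 1_400_000 / SAMPLE_RATE                 # 87.5 s per GPU
base_batch_h = base_per_gpu_s * 64 / 3600                # ~1.6 h total

# wav2vec 2.0 LARGE: 1.2M samples per GPU on 128 GPUs.
large_batch_h = (1_200_000 / SAMPLE_RATE) * 128 / 3600   # ~2.7 h total

# HuBERT BASE lists its per-GPU batch directly as 87.5 s,
# i.e. the same 1.4M samples per GPU as wav2vec 2.0 BASE.
hubert_samples = 87.5 * SAMPLE_RATE                      # 1.4M samples
```

So the "1.6 h" and "2.7 h" batch sizes in the wav2vec 2.0 table are simply the per-GPU crop length multiplied by the GPU count, and HuBERT's 87.5 s per-GPU batch matches wav2vec 2.0 BASE exactly.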