Papers, Resources, and Statistics for Self-Supervised Learning and Pre-Training on Speech.
π represents important papers.
- π CPC: Representation Learning with Contrastive Predictive Coding - A Oord et al,
arXiv 2018
- APC: An Unsupervised Autoregressive Model for Speech Representation Learning - YA Chung et al,
INTERSPEECH 2019
- π wav2vec: Unsupervised Pre-training for Speech Recognition - S Schneider et al,
INTERSPEECH 2019
- π vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations - A Baevski et al,
arXiv 2019, ICLR 2020
- MPC: Improving Transformer-based Speech Recognition Using Unsupervised Pre-training - D Jiang et al,
arXiv 2019
- PASE: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - S Pascual et al,
INTERSPEECH 2019
- Bidir CPC: Learning robust and multilingual speech representations - K Kawakami et al,
EMNLP 2020
- Multi-target APC: Improved speech representations with multi-target autoregressive predictive coding - YA Chung et al,
ACL 2020
- Modified CPC: Unsupervised pretraining transfers well across languages - M Riviere et al,
ICASSP 2020
- Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders - AT Liu et al,
ICASSP 2020
- vq-wav2vec-FT: Effectiveness of self-supervised pre-training for asr - A Baevski et al,
ICASSP 2020
- DeCoAR: Deep contextualized acoustic representations for semi-supervised speech recognition - S Ling et al,
ICASSP 2020
- Improved noisy student training for automatic speech recognition - DS Park et al,
INTERSPEECH 2020
- π wav2vec 2.0: A framework for self-supervised learning of speech representations - A Baevski et al,
NeurIPS 2020
- Multi-lingual wav2vec 2.0 (XLSR): Unsupervised cross-lingual representation learning for speech recognition - A Conneau et al,
arXiv 2020, INTERSPEECH 2021
- Self-Training wav2vec 2.0: Self-training and Pre-training are Complementary for Speech Recognition - Q Xu et al,
arXiv 2020, ICASSP 2021
- DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization - S Ling et al,
arXiv 2020, ICASSP 2021
- Pushing the limits of semi-supervised learning for automatic speech recognition - Y Zhang et al,
arXiv 2020, NeurIPS Workshop 2020
- UniSpeech: Unified speech representation learning with labeled and unlabeled data - C Wang et al,
ACL 2021
- Tera: Self-supervised learning of transformer encoder representation for speech - AT Liu et al,
TASLP 2021
- Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training - WN Hsu et al,
INTERSPEECH 2021
- Zero-shot wav2vec 2.0: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition - Q Xu et al,
arXiv 2021
- π wav2vec-U: Unsupervised Speech Recognition - A Baevski et al,
NeurIPS 2021
- π HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - WN Hsu et al,
TASLP 2021
- π SUPERB: Speech processing Universal PERformance Benchmark - S Yang et al,
INTERSPEECH 2021
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition - G Zheng et al,
EMNLP 2021
- ILS-SSL: Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision - C Wang et al,
ICASSP 2022
- Wavlm: Large-scale self-supervised pre-training for full stack speech processing - S Chen et al,
arXiv 2021, JSTSP 2022
- Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition - Y Zhang et al,
arXiv 2021, JSTSP 2022
- Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing - J Ao et al,
arXiv 2021, ACL 2022
- π Data2vec: A general framework for self-supervised learning in speech, vision and language - A Baevski et al,
ICML 2022
- BEST-RQ: Self-supervised Learning with Random-projection Quantizer for Speech Recognition - CC Chiu et al,
ICML 2022
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities - HS Tsai et al,
ACL 2022
- π wav2vec-U 2.0: Towards End-to-end Unsupervised Speech Recognition - AH Liu et al,
SLT 2022
- c-siam: Contrastive Siamese Network for Semi-Supervised Speech Recognition - S Khorram et al,
ICASSP 2022
- Speech2C: Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data - J Ao et al,
INTERSPEECH 2022
- SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training - W Huang et al,
ICLR 2022
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages - F Wu et al,
arXiv 2022, ICASSP 2023
- HuBERT-AP: Speech Pre-training with Acoustic Piece - S Ren et al,
INTERSPEECH 2022
- PBERT: Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training - C Wang et al,
INTERSPEECH 2022
- data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - A Baevski et al,
arXiv 2022
- CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning - C Meng et al,
arXiv 2022, INTERSPEECH 2023
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets - Z Ma et al,
arXiv 2022, INTERSPEECH 2023
- CTCBERT: Advancing Hidden-unit BERT with CTC Objectives - R Fan et al,
ICASSP 2023
- data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup - VS Lodagala et al,
ICASSP 2023
- MonoBERT & PolyBERT: Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation - Z Ma et al,
INTERSPEECH 2023
- MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization - JW Yoon et al,
INTERSPEECH 2023
- A general multi-task learning framework to leverage text data for speech to text tasks - Y Tang et al,
ICASSP 2021
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training - A Bapna et al,
arXiv 2021
- mSLAM: Massively multilingual joint pre-training for speech and text - A Bapna et al,
arXiv 2022
- Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding - W Wang et al,
INTERSPEECH 2022
- STPT: Unified Speech-Text Pre-training for Speech Translation and Recognition - Y Tang et al,
ACL 2022
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data - Y Kang et al,
AAAI 2022
- Distill-L2S: Distilling a Pretrained Language Model to a Multilingual ASR Model - K Choi et al,
INTERSPEECH 2022
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training - Z Zhang et al,
EMNLP 2022
- TESSP: Text-Enhanced Self-Supervised Speech Pre-training - Z Yao et al,
arXiv 2022
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data - Z Zhang et al,
arXiv 2022
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text - X Yue et al,
ICASSP 2023
- BYOL-A: BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation - D Niizumi et al,
IJCNN 2021
- Audio-MAE: Masked Autoencoders that Listen - H Xu et al,
NeurIPS 2022
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - A Baade et al,
INTERSPEECH 2022
- BEATs: Audio Pre-Training with Acoustic Tokenizers - S Chen et al,
ICML 2023
- ATST: Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks - X Li et al,
arXiv 2023
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer - W Chen et al,
arXiv 2024
- Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks - R Eloff et al,
INTERSPEECH 2019
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages - H Zhang et al,
INTERSPEECH 2020
- Towards Unsupervised Speech Synthesis - AH Liu et al,
NAACL 2022
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT - H Chang et al,
ICASSP 2022
- FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning - Y Lee et al,
INTERSPEECH 2022
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT - R Wang et al,
INTERSPEECH 2022
- Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models - T Ashihara et al,
INTERSPEECH 2022
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition - Y Wang et al,
arXiv 2022
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning - G Yang et al,
ASRU 2023
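Several of the entries above (CPC, wav2vec, wav2vec 2.0) build on the InfoNCE contrastive objective: a context vector is scored against the true future (or masked) representation and a set of distractors, and the loss is the negative log-probability of picking the true one. A toy NumPy sketch, illustrative only — the variable names and the cosine-similarity scoring are my assumptions, not any paper's exact implementation:

```python
import numpy as np

def info_nce(context, positive, negatives, temperature=0.1):
    """InfoNCE loss for one time step: score the context vector against
    the true target (index 0) and K distractors (toy version)."""
    candidates = np.vstack([positive[None, :], negatives])   # (1+K, D)
    # cosine similarity between the context and each candidate
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context) + 1e-8
    )
    logits = sims / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive sits at index 0

rng = np.random.default_rng(0)
ctx = rng.normal(size=64)                        # context representation
pos = ctx + 0.1 * rng.normal(size=64)            # correlated "future" frame
negs = rng.normal(size=(10, 64))                 # 10 random distractors
loss = info_nce(ctx, pos, negs)                  # small, since pos ≈ ctx
```

Because the positive is strongly correlated with the context while the distractors are random, the loss comes out well below the chance level of log(1+K).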
- Speech processing Universal PERformance Benchmark (SUPERB)
- Self-Supervised Speech Pre-training and Representation Learning (S3PRL)
Statistics on speech pre-training.
**wav2vec 2.0**

Size | Transformer | Samples per GPU | Total Batch Size | Train Time
---|---|---|---|---
BASE | 12 blocks, model dim. 768, FFN 3072, 8 heads | 1.4M (cropped) | 1.6 h | 400k updates, 64 V100 × 1.6 d
LARGE | 24 blocks, model dim. 1024, FFN 4096, 16 heads | 1.2M (cropped) | 2.7 h | 250k updates, 128 V100 × 2.3 d (LibriSpeech); 600k updates, 128 V100 × 5.2 d (LibriVox)
**wav2vec-U**

Method | Feature Extractor | Batch Size | Train Time
---|---|---|---
wav2vec-U | wav2vec 2.0 LARGE | 160 unlabeled audio + 160 text samples | 150k steps, single V100 × 12 h
wav2vec-U + self-training | wav2vec 2.0 LARGE | / | 80k updates, 8 V100 (LibriSpeech); 13k updates, 4 V100 (TIMIT)
**HuBERT**

Size | Architecture | Batch Size per GPU | Stages | Train Time
---|---|---|---|---
BASE | wav2vec 2.0 BASE (95M) | 87.5 s | 1: MFCC targets, 250k steps; 2: 6th transformer layer, 400k steps | 9.5 h/100k steps, 32 GPUs (LibriSpeech-960)
LARGE | wav2vec 2.0 LARGE (317M) | 56.25 s | 3: 9th transformer layer of BASE HuBERT, 400k steps | 9.5 h/100k steps, 128 GPUs (Libri-light 60k)
X-LARGE | Conformer XXL (964M) | 22.5 s | 3: 9th transformer layer of BASE HuBERT, 400k steps | 9.5 h/100k steps, 256 GPUs (Libri-light 60k)
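As a sanity check on the units in the tables above: assuming 16 kHz audio (the LibriSpeech/Libri-light sample rate), the per-GPU sample counts, the per-GPU batch lengths in seconds, and the total batch sizes in hours are mutually consistent:

```python
# Assumes 16 kHz audio (LibriSpeech / Libri-light sample rate).
SAMPLE_RATE = 16_000

# wav2vec 2.0 BASE: 1.4M cropped samples per GPU on 64 GPUs.
base_per_gpu_s = 1_400_000 / SAMPLE_RATE                 # 87.5 s per GPU
base_batch_h = base_per_gpu_s * 64 / 3600                # ~1.6 h total

# wav2vec 2.0 LARGE: 1.2M samples per GPU on 128 GPUs.
large_batch_h = (1_200_000 / SAMPLE_RATE) * 128 / 3600   # ~2.7 h total

# HuBERT BASE lists its per-GPU batch directly as 87.5 s,
# i.e. the same 1.4M samples per GPU as wav2vec 2.0 BASE.
hubert_samples = 87.5 * SAMPLE_RATE                      # 1.4M samples
```

So the "1.6 h" and "2.7 h" batch sizes in the wav2vec 2.0 table are simply the per-GPU crop length multiplied by the GPU count, and HuBERT's 87.5 s per-GPU batch matches wav2vec 2.0 BASE exactly.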