Hi there! This is a summary of expressive speech synthesis papers. It may also include some papers on song and audio generation.

If you are interested in this project, feel free to star⭐ it or offer suggestions👏 (via pull request or email📧)!

Latest update: 16 January, 2025

Title | Date | Venue |
---|---|---|
Speech Synthesis along Perceptual Voice Quality Dimensions | 15 January, 2025 | ICASSP 2025 |
Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | 11 January, 2025 | Information Fusion 2025 |
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control | 10 January, 2025 | ARXIV |
DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions | 7 January, 2025 | ICASSP 2025 |
FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | 1 January, 2025 | ARXIV |
Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis | 24 December, 2024 | ICASSP 2025 |
Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis | 24 December, 2024 | ARXIV |
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation | 22 December, 2024 | ARXIV |
Hierarchical Control of Emotion Rendering in Speech Synthesis | 16 December, 2024 | Submitted to IEEE Transactions |
AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | 13 December, 2024 | Under review at IEEE Transactions on Affective Computing |
CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | 13 December, 2024 | AAAI 2025 |
EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations | 12 December, 2024 | ARXIV |
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis | 9 November, 2024 | ARXIV |
Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | 24 October, 2024 | ARXIV |
Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement | 22 October, 2024 | ARXIV |
Continuous Speech Synthesis using per-token Latent Diffusion | 21 October, 2024 | ARXIV |
DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | 17 October, 2024 | ARXIV |
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | 17 October, 2024 | ICASSP 2024 |
SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model | 16 October, 2024 | ICASSP 2024 |
MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | 4 October, 2024 | EMNLP 2024 Findings |
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | 30 September, 2024 | EMNLP 2024 Main |
EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis | 27 September, 2024 | ARXIV |
Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | 24 September, 2024 | ECCV Workshop ABAW (Affective Behavior Analysis in-the-wild) 7 (to appear) |
ViolinDiff: Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning | 19 September, 2024 | ARXIV |
What happens to diffusion model likelihood when your model is conditional? | 10 September, 2024 | ARXIV |
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling | 28 August, 2024 | ACM Multimedia 2024 |
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description | 24 August, 2024 | ACM Multimedia 2024 |
Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music | 26 August, 2024 | International Society for Music Information Retrieval (ISMIR) 2024 |
Generative Expressive Conversational Speech Synthesis | 31 July, 2024 | ACM Multimedia 2024 |
Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings | 19 July, 2024 | INTERSPEECH 2024 |
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis | 18 July, 2024 | INTERSPEECH 2024 |
DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability | 27 June, 2024 | Preprint |
A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons | 26 June, 2024 | ARXIV |
GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis | 15 June, 2024 | ARXIV |
VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation | 12 June, 2024 | SLT 2024 |
TokSing: Singing Voice Synthesis based on Discrete Tokens | 12 June, 2024 | INTERSPEECH 2024 |
Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation | 12 June, 2024 | ARXIV |
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling | 9 June, 2024 | INTERSPEECH 2024 |
Text-aware and Context-aware Expressive Audiobook Speech Synthesis | 9 June, 2024 | INTERSPEECH 2024 |
Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study | 7 June, 2024 | ARXIV |
Style Mixture of Experts for Expressive Text-To-Speech Synthesis | 5 June, 2024 | NeurIPS 2024 Workshop |
RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis | 27 May, 2024 | 8th APWeb-WAIM International Joint Conference on Web and Big Data |
Expressivity and Speech Synthesis | 30 April, 2024 | ARXIV |
MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis | 28 April, 2024 | ARXIV |
Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness | 17 April, 2024 | LREC-COLING 2024 |
Fine-Grained Quantitative Emotion Editing for Speech Generation | 4 March, 2024 | IEEE APSIPA ASC 2024 |
Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting | 24 January, 2024 | ICASSP 2024 |
StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis | 19 December, 2023 | ICASSP 2024 |
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling | 19 December, 2023 | AAAI 2024 |
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis | 17 December, 2023 | AAAI 2024 |
SECap: Speech Emotion Captioning with Large Language Model | 23 December, 2023 | AAAI 2024 |
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models | 13 December, 2023 | CVPR 2024 |