
Speech Synthesis and Automatic Speech Recognition (ASR) conducted for SBB using Azure Services and Conformer Model


fabian-gubler/sbb-speech-synthesis


SBB – Speech Detection with Data Augmentation

📚 Final Project Report:

Leveraging Synthetic Data for Speech Recognition: Improving Shunting Commandos with Conformer-CTC at Swiss Federal Railways – Click the link for a detailed report on our ASR project with SBB.



Overview and Motivation

This project, conducted with SBB (Schweizerische Bundesbahnen), aimed to improve Automatic Speech Recognition (ASR) systems through the use of synthetic data. By generating additional training data using Text-to-Speech (TTS) technology, we addressed common challenges in ASR such as accent diversity and limited datasets.

We used the Conformer-CTC model, a state-of-the-art ASR architecture, and demonstrated that augmenting real-world data with synthetic samples significantly enhances the model's performance, especially in multilingual settings with high accent variation such as those found at SBB.

This approach showcases how synthetic data can be an effective solution for overcoming data limitations and improving ASR accuracy in real-world applications.

Contents of this Repository

This repository contains the following files and directories, which are crucial for understanding and replicating the project:

  • /

    • generate_available_voice.py: Fetches a list of available voices from the Azure voice pool for synthetic data generation.
    • generate_commandos.py: Performs the rule-based generation of shunting commandos (the spoken command phrases used in SBB shunting operations).
    • generate_input_combinations.py: Creates various combinations of text, voices, and styles for speech generation.
    • generate_speech.py: A script that does the following:
      1. Creates SSML (Speech Synthesis Markup Language) strings based on input.
      2. Feeds the SSML strings into the speech synthesizer.
      3. Calculates the duration of each synthesized speech.
      4. Adds an entry to the manifest file for each synthesized speech.
  • /conformer/

    • experiments.py: The main batch script responsible for training the Conformer-CTC model. This script manages the entire training process, including loading the dataset, initializing the model, setting up the optimizer, and handling the training loop.
    • generate_manifest.py: Accepts all the samples and generates several manifest files for training, validation, and testing. These manifest files serve as the primary data source for the experiments.py script.
    • eval.py: Loads .nemo model files for evaluation. It is used mainly for debugging; extensive logging is handled automatically through the Weights & Biases integration, which tracks the model's performance over epochs.
  • /data/: Holds all necessary output files needed for speech generation. This includes combinations of text, voices, and styles (generated by generate_input_combinations.py), commandos (generated by generate_commandos.py), and available voices (fetched by generate_available_voice.py).

Installation & Usage

This repository contains scripts for synthetic speech generation and a Conformer-CTC model training routine. Follow the steps below to set up and run these scripts.

Generating Speech Samples

Prerequisites

  • An API key from Azure Speech Services. This key needs to be added to the settings.json file.
  • Vagrant and the libvirtd client packages.

Steps

  1. Copy settings-example.json and rename the copy to settings.json. Modify this file to include your Azure Speech Services API key.

  2. Install Vagrant and the libvirtd client packages on your local machine.

  3. Set up Vagrant by following the standard process (i.e., installing, running vagrant up, etc.). The provision.sh script should run automatically and install all necessary packages on the virtual machine.

  4. Run main.py for a guided step-by-step process through the synthetic speech generation.
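As an illustration of step 1, a script would load the key from settings.json roughly like this. The field name subscription_key is an assumption for this sketch; check settings-example.json for the actual keys.

```python
import json


def load_azure_key(path: str = "settings.json") -> str:
    """Read the Azure Speech Services API key from the settings file.

    The field name "subscription_key" is an assumption; the real key names
    are defined in settings-example.json.
    """
    with open(path) as f:
        settings = json.load(f)
    return settings["subscription_key"]
```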

Conformer

Prerequisites

The Conformer model training scripts require several Python packages. Install them via pip:

pip install "nemo-toolkit[all]"
pip install pytorch_lightning
pip install torch
pip install wandb
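ASR quality in this project is measured by word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. NeMo computes WER internally during training and evaluation; the following standalone sketch only illustrates the underlying edit-distance computation over whitespace-tokenized words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[-1][-1] / max(len(ref), 1)
```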

Note

The Conformer model requires substantial compute, ideally a machine with a dedicated GPU. You may need to adjust the batch size and GPU settings to match your machine's specifications.

Acknowledgements

We would like to extend our sincere gratitude to our esteemed professors for their continuous support throughout this project. Their guidance was invaluable in shaping our understanding and approach to this complex field of study. Moreover, their assistance in providing us with powerful hardware significantly eased the computational challenges we faced, and for this, we are deeply grateful.

We also wish to express our profound appreciation to our supervisor at SBB. His keen insights, constructive feedback, and consistent help were instrumental to the success of this project. He not only offered us a platform to explore and contribute to the field of Automatic Speech Recognition but also nurtured our learning process every step of the way.

This opportunity has truly been a rewarding experience, and we are profoundly thankful to everyone who contributed to making it possible.
