Leveraging Synthetic Data for Speech Recognition: Improving Shunting Command Recognition with Conformer-CTC at Swiss Federal Railways (SBB)
This project, conducted with SBB (Schweizerische Bundesbahnen), aimed to improve Automatic Speech Recognition (ASR) systems through the use of synthetic data. By generating additional training data using Text-to-Speech (TTS) technology, we addressed common challenges in ASR such as accent diversity and limited datasets.
We used the Conformer-CTC model, a state-of-the-art ASR architecture, and demonstrated that augmenting real-world data with synthetic samples significantly improves the model's performance, especially in multilingual settings with strong accent variation such as those found at SBB.
This approach showcases how synthetic data can be an effective solution for overcoming data limitations and improving ASR accuracy in real-world applications.
This repository contains the following files and directories, which are crucial for understanding and replicating the project:
- `/`
  - `generate_available_voice.py`: Fetches the list of available voices from the Azure voice pool for synthetic data generation.
  - `generate_commandos.py`: Used for the rule-based generation of shunting commands.
  - `generate_input_combinations.py`: Creates the combinations of text, voices, and styles used for speech generation.
  - `generate_speech.py`: A script that does the following (see the sketch after this list):
    - Creates SSML (Speech Synthesis Markup Language) strings based on the input.
    - Feeds the SSML strings into the speech synthesizer.
    - Calculates the duration of each synthesized speech sample.
    - Adds an entry to the manifest file for each synthesized sample.
- `/conformer/`
  - `experiments.py`: The main batch script responsible for training the Conformer-CTC model. It manages the entire training process, including loading the dataset, initializing the model, setting up the optimizer, and handling the training loop.
  - `generate_manifest.py`: Takes all samples and generates separate manifest files for training, validation, and testing (see the splitting sketch after this list). These manifest files serve as the primary data source for the `experiments.py` script.
  - `eval.py`: Loads `.nemo` model files for evaluation. It is used mainly for debugging; extensive logging happens automatically through the Weights & Biases integration, which tracks the model's performance over epochs.
- `/data/`: Holds all output files needed for speech generation. This includes the combinations of text, voices, and styles (generated by `generate_input_combinations.py`), the shunting commands (generated by `generate_commandos.py`), and the available voices (fetched by `generate_available_voice.py`).
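For orientation, the following is a minimal sketch of the `generate_speech.py` flow, assuming the Azure Speech SDK for Python, a key and region taken from `settings.json`, and hypothetical texts, voices, and file paths. The real script iterates over all combinations produced by `generate_input_combinations.py`.

```python
import json
import wave

import azure.cognitiveservices.speech as speechsdk

# Hypothetical values; the real ones come from settings.json and the
# combinations produced by generate_input_combinations.py.
speech_key, region = "YOUR_AZURE_SPEECH_KEY", "switzerlandnorth"
text, voice, style = "Wagen drei, langsam vorziehen", "de-CH-JanNeural", "neutral"
out_path = "data/audio/sample_0001.wav"

# 1. Build an SSML string for one text/voice/style combination.
ssml = f"""
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="de-CH">
  <voice name="{voice}">
    <mstts:express-as style="{style}">{text}</mstts:express-as>
  </voice>
</speak>
"""

# 2. Feed the SSML into the Azure speech synthesizer and write a WAV file.
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=region)
audio_config = speechsdk.audio.AudioOutputConfig(filename=out_path)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
synthesizer.speak_ssml_async(ssml).get()

# 3. Calculate the duration of the synthesized audio from the WAV header.
with wave.open(out_path, "rb") as wav:
    duration = wav.getnframes() / wav.getframerate()

# 4. Append one manifest entry (NeMo-style JSON lines) for the new sample.
entry = {"audio_filepath": out_path, "duration": round(duration, 2), "text": text.lower()}
with open("data/manifest.json", "a", encoding="utf-8") as manifest:
    manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The manifest follows NeMo's JSON-lines convention (`audio_filepath`, `duration`, `text`, one entry per line), which is the format consumed by the training scripts under `/conformer/`.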
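`generate_manifest.py` itself is not reproduced here, but splitting the collected samples into the train/validation/test manifests consumed by `experiments.py` could look roughly like the sketch below; the file names and the 80/10/10 split are assumptions.

```python
import json
import random

# Read every synthesized (and real) sample from the combined manifest.
with open("data/manifest.json", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

# Shuffle and split into train, validation, and test manifests.
random.seed(42)
random.shuffle(samples)
n = len(samples)
splits = {
    "conformer/train_manifest.json": samples[: int(0.8 * n)],
    "conformer/val_manifest.json": samples[int(0.8 * n): int(0.9 * n)],
    "conformer/test_manifest.json": samples[int(0.9 * n):],
}

for path, subset in splits.items():
    with open(path, "w", encoding="utf-8") as out:
        for entry in subset:  # each line: {"audio_filepath": ..., "duration": ..., "text": ...}
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```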
This repository contains scripts for synthetic speech generation and a Conformer-CTC model training routine. Follow the steps below to set up and run these scripts.

You will need:

- An API key from Azure Speech Services. This key needs to be added to the `settings.json` file.
- Vagrant and the libvirtd client packages.

To set everything up:

- Copy `settings-example.json` and rename the copy to `settings.json`. Modify this file to include your Azure Speech Services API key (see the reading sketch after this list).
- Install Vagrant and the libvirtd client packages on your local machine.
- Set up Vagrant by following the standard process (i.e., installing it and running `vagrant up`). The `provision.sh` script should run automatically and install all necessary packages on the virtual machine.
- Run `main.py` for a guided, step-by-step synthetic speech generation process.
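The authoritative key names live in `settings-example.json`; the snippet below only illustrates how such a file might be read, with purely hypothetical key names.

```python
import json

# Hypothetical key names; check settings-example.json for the actual schema.
with open("settings.json", encoding="utf-8") as f:
    settings = json.load(f)

speech_key = settings["azure_speech_key"]     # your Azure Speech Services API key
region = settings["azure_speech_region"]      # e.g. "switzerlandnorth"
```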
The Conformer model training scripts require several Python packages. Install them via pip:

```
pip install nemo-toolkit[all]
pip install pytorch_lightning
pip install torch
pip install wandb
```
The Conformer model requires a powerful machine to run. You may need to adjust the batch size and GPU settings according to your machine's specifications.
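As a rough orientation, fine-tuning a pretrained Conformer-CTC checkpoint with NeMo and PyTorch Lightning could look like the sketch below. The checkpoint name, manifest paths, batch size, and epoch count are assumptions; the actual `experiments.py` additionally handles optimizer setup and the full training configuration.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Start from a pretrained Conformer-CTC checkpoint (example checkpoint name from NGC).
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_de_conformer_ctc_large")

# Point the model at the manifests produced by generate_manifest.py (paths assumed).
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "conformer/train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,   # lower this if your GPU runs out of memory
    "shuffle": True,
}))
model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "conformer/val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

# Train on a single GPU and log metrics to Weights & Biases.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=50,
    logger=WandbLogger(project="sbb-conformer-ctc"),
)
trainer.fit(model)

# Save the fine-tuned model so it can be evaluated later (e.g., by eval.py).
model.save_to("conformer_ctc_finetuned.nemo")
```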
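Similarly, a minimal version of what `eval.py` does, loading a `.nemo` file, transcribing held-out audio, and computing the word error rate, might look like this; the file paths and reference texts are placeholders.

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate

# Restore a trained checkpoint from disk (eval.py works on .nemo files like this one).
model = nemo_asr.models.EncDecCTCModelBPE.restore_from("conformer_ctc_finetuned.nemo")

# Transcribe a few held-out recordings and compare them against their reference texts.
audio_files = ["data/audio/test_0001.wav", "data/audio/test_0002.wav"]   # placeholder paths
references = ["wagen drei anhalten", "langsam zurückdrücken"]            # placeholder transcripts

hypotheses = model.transcribe(audio_files)
print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
```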
We would like to extend our sincere gratitude to our esteemed professors for their continuous support throughout this project. Their guidance was invaluable in shaping our understanding and approach to this complex field of study. Moreover, their assistance in providing us with powerful hardware significantly eased the computational challenges we faced, and for this, we are deeply grateful.
We also wish to express our profound appreciation to our supervisor at SBB. His keen insights, constructive feedback, and consistent help were instrumental to the success of this project. He not only offered us a platform to explore and contribute to the field of Automatic Speech Recognition but also nurtured our learning process every step of the way.
This opportunity has truly been a rewarding experience, and we are profoundly thankful to everyone who contributed to making it possible.