The transcription service is the API for requesting transcriptions.
The service allows you to:
- Request asynchronous transcriptions from a variety of audio or video file formats.
- Specify transcription subtasks such as diarization and punctuation.
- Follow transcription task state and progress.
- Automatically store transcription results in a database.
- Fetch transcription results with different formats and options.
To use the transcription service you must have at least:
- One or multiple instances of linto-stt running and configured with the same SERVICE_NAME.
- A Redis broker running at SERVICES_BROKER.
- A MongoDB instance running at MONGO_HOST:MONGO_PORT.
Optionally, for diarization or punctuation, the following are needed:
- One or multiple instances of linto-diarization-worker > 1.2.0 for speaker diarization configured on the same service broker (LANGUAGE must be compatible).
- One or multiple instances of linto-punctuation-worker > 1.2.0 for text punctuation configured on the same service broker (LANGUAGE must be compatible).
To share audio files across the different services, they must be configured with the same shared volume RESSOURCE_FOLDER.
1- First build the image:
cd linto-transcription-service &&
docker build . -t transcription_service
2- Create and fill the .env
cp .envdefault .env
Fill the .env with the values described below in Environment Variables.
3- Launch a container:
docker run --rm -it -p $SERVING_PORT:80 \
-v $YOUR_SHARED_FOLDER:/opt/audio \
--env-file .env \
--name my_transcription_api \
transcription_service \
/bin/bash
Fill SERVING_PORT and YOUR_SHARED_FOLDER with your values.
1- Create and fill the .env
cp .envdefault .env
Fill the .env with the values described below in Environment Variables.
2- Compose
docker-compose up
Env variable | Description | Example |
---|---|---|
SERVICE_NAME | STT service name, used to connect to the proper redis channel and mongo collection | my_stt_service |
LANGUAGE | Language code as a BCP-47 code | fr-FR |
KEEP_AUDIO | Whether audio files are kept after the request | 1 (true) / 0 (false) |
CONCURRENCY | Number of workers (default 10) | 10 |
SERVICES_BROKER | Message broker address | redis://broker_address:6379 |
BROKER_PASS | Broker Password | Password |
MONGO_HOST | MongoDB results host | my-mongo-service |
MONGO_PORT | MongoDB results port | 27017 |
RESOLVE_POLICY | Subservice resolve policy (default ANY) * | ANY / DEFAULT / STRICT |
<SERVICE_TYPE>_DEFAULT | Default serviceName for subtask <SERVICE_TYPE> * | punctuation-1 |
*: See Subservice Resolution
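As an illustration of how these variables fit together, the following is a minimal sketch of a pre-launch check. The function name check_env and the choice of which variables count as mandatory are assumptions made for this example (taken from the prerequisites listed earlier); the service itself may apply different rules.

```python
# Assumption: these four are treated as mandatory because the prerequisites
# name them; the service itself may behave differently.
REQUIRED = ("SERVICE_NAME", "SERVICES_BROKER", "MONGO_HOST", "MONGO_PORT")
POLICIES = ("ANY", "DEFAULT", "STRICT")


def check_env(env):
    """Return a list of human-readable problems found in `env` (a mapping)."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    policy = env.get("RESOLVE_POLICY", "ANY")  # documented default is ANY
    if policy not in POLICIES:
        problems.append(f"RESOLVE_POLICY must be one of {POLICIES}, got {policy!r}")
    port = env.get("MONGO_PORT", "")
    if port and not port.isdigit():
        problems.append(f"MONGO_PORT must be an integer, got {port!r}")
    return problems
```

Calling check_env(os.environ) before starting the container would list anything missing; an empty list means the checks passed.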
The transcription service offers a REST API to submit transcription requests.
The transcription service revolves around 2 concepts:
- Asynchronous jobs identified with job_id: A job_id represents an ongoing transcription task.
- Transcription results identified by result_id.
A typical transcription process follows these steps:
- Submit your file and the transcription configuration on /transcribe. The route returns a 201 with the job_id.
- Use the /job/{job_id} route to follow the job's progress. When the job is finished, the route returns a 201 alongside a result_id.
- Fetch the transcription result using the /results/{result_id} route, specifying your desired format and options.
The /list-services GET route fetches the sub-services available for transcription.
It returns a json object containing a list of deployed services indexed by service type. Listed services are filtered using the configured LANGUAGE parameter.
{
"diarization": [ # Service type
{
"service_name": "diarization-1", # Service name. Used as parameter in transcription config to call this specific service.
"service_type": "diarization", # Service type
"service_language": "*", # Supported language
"queue_name": "diarization-queue", # Celery queue used by this service
"info": "A diarization service", # Information about the service.
"instances": [ # Instances of this specific service.
{
"host_name": "feb42aacd8ad", # Instance unique id
"last_alive": 1665996709, # Last heartbeat
"version": "1.2.0", # Service version
"concurrency": 1 # Concurrency of the instance
}
]
}
],
"punctuation": [
{
"service_name": "punctuation-1",
"service_type": "punctuation",
"service_language": "fr-FR",
"queue_name": "punctuation-queue",
"info": "A punctuation service",
"instances": [
{
"host_name": "b0e9e24349a9",
"last_alive": 1665996709,
"version": "1.2.0",
"concurrency": 1
}
]
}
]
}
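A response shaped like the one above can be reduced to the service names usable in a transcription config. The sketch below is only illustrative: the helper name service_names is hypothetical, and skipping services with no listed instance is an assumption of this example, not documented behaviour.

```python
def service_names(listing, service_type):
    """Return the names of services of `service_type` with at least one instance.

    `listing` is the decoded JSON object returned by /list-services.
    """
    return [
        service["service_name"]
        for service in listing.get(service_type, [])
        if service.get("instances")  # assumption: ignore services with no instance
    ]
```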
Subservice resolution is the mechanism that lets the transcription service select the proper optional subservice, such as diarization or punctuation prediction. Resolution is applied when no serviceName is passed in the subtask configs.
There are 3 policies to resolve service names:
- ANY: Use any compatible subservice.
- DEFAULT: Use the service default subservice (must be declared)
- STRICT: If the service is not specified, raise an error.
The resolve policy is declared at launch using the RESOLVE_POLICY environment variable: ANY | DEFAULT | STRICT (default ANY).
Default service names must be declared at launch using <SERVICE_TYPE>_DEFAULT. For example, if the default punctuation subservice is "punctuation-1", set PUNCTUATION_DEFAULT=punctuation-1.
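The three policies can be sketched as a small decision function. This is not the service's implementation: the function name, its arguments, and the choice of the first candidate under ANY are assumptions made for illustration, and `compatible` is assumed to already hold the names of compatible subservices.

```python
def resolve_service_name(policy, requested, compatible, default=None):
    """Pick a subservice name following the ANY / DEFAULT / STRICT policies.

    requested:  serviceName from the subtask config, or None
    compatible: names of compatible subservices (assumed precomputed)
    default:    value of <SERVICE_TYPE>_DEFAULT, or None if undeclared
    """
    if requested is not None:
        return requested  # resolution only applies when no name is given
    if policy == "STRICT":
        raise ValueError("serviceName must be specified under the STRICT policy")
    if policy == "DEFAULT":
        if default is None:
            raise ValueError("no default subservice declared")
        return default
    if policy == "ANY":
        if not compatible:
            raise LookupError("no compatible subservice available")
        return compatible[0]  # assumption: any candidate will do, take the first
    raise ValueError(f"unknown policy: {policy!r}")
```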
Language compatibility
A subservice is compatible if its language, or one of its languages, is compatible with the transcription-service language:
transcription-service language <-> subservice language.
- Same BCP-47 code: fr_Fr <-> fr-FR => OK
- Language contained: fr-FR <-> fr-FR|it_IT|en_US => OK
- Star token (all_language): fr-FR <-> * => OK
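The three rules above can be expressed as a short predicate. This is a sketch under assumptions: the normalisation (case-insensitive, with `_` and `-` treated alike) is inferred from the fr_Fr <-> fr-FR example, and the function names are hypothetical.

```python
def normalize(code):
    """Normalise a language code so that fr_Fr and fr-FR compare equal."""
    return code.strip().replace("_", "-").lower()


def is_compatible(service_language, subservice_language):
    """Return True if the subservice supports the transcription-service language."""
    if subservice_language.strip() == "*":  # star token: all languages
        return True
    supported = {normalize(code) for code in subservice_language.split("|")}
    return normalize(service_language) in supported
```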
The /transcribe route accepts POST requests containing an audio file.
The route accepts multipart/form-data requests.
Response format can be application/json or text/plain as specified in the accept field of the header.
Form Parameter | Description | Required |
---|---|---|
transcriptionConfig | (object, optional) A transcriptionConfig object describing transcription parameters | See Transcription config |
force_sync | (boolean, optional) If true, do a synchronous request | [true / false / null] |
If the request is accepted, the answer should be a 201 with a json or text response containing the jobid.
With accept: application/json
{"jobid" : "the-job-id"}
With accept: text/plain
the-job-id
If the force_sync flag is set to true, the request returns a 200 with the transcription (see Transcription Results) using the same accept options as the /results/{result_id} route.
Using force_sync for big files is not recommended, as it blocks a worker for the duration of the transcription.
Additionally, a timestamps file can be uploaded alongside the audio file, containing the timestamps of the segments to transcribe. Timestamps files are text files containing one segment per line with an optional speakerid, such as:
# start stop [speakerid]
0.0 7.05 1
7.05 13.0
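To make the format concrete, here is a hedged sketch of a parser for such a file. Treating lines that start with `#` as comments and rejecting segments whose stop is not after their start are assumptions of this example; the function name parse_timestamps is hypothetical.

```python
def parse_timestamps(text):
    """Parse 'start stop [speakerid]' lines into (start, stop, speaker) tuples."""
    segments = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"line {number}: expected 'start stop [speakerid]'")
        start, stop = float(fields[0]), float(fields[1])
        if stop <= start:
            raise ValueError(f"line {number}: stop must be greater than start")
        speaker = fields[2] if len(fields) == 3 else None
        segments.append((start, stop, speaker))
    return segments
```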
The transcriptionConfig object describes the transcription parameters and flags of the request. It is structured as follows:
{
"punctuationConfig": {
"enablePunctuation": false, # Applies punctuation
"serviceName": null # Force serviceName (See SubService resolution)
},
"enablePunctuation": false, # Applies punctuation (Do not use, kept for backward compatibility)
"diarizationConfig": {
"enableDiarization": false, # Enables speaker diarization
"numberOfSpeaker": null, # If set, forces the number of speakers
"maxNumberOfSpeaker": null, # If set and numberOfSpeaker is not, limits the maximum number of speakers
"serviceName": null # Force serviceName (See SubService resolution)
}
}
ServiceNames can be filled in to use a specific subservice version. Available services are listed by /list-services.
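A client can build this object programmatically and serialise it for the transcriptionConfig form field. The sketch below is illustrative: the helper name and its keyword arguments are hypothetical, and only the keys documented above are emitted. Note that the `#` annotations in the listing are explanatory; the serialised JSON contains none.

```python
import json


def build_transcription_config(punctuation=False, diarization=False,
                               number_of_speaker=None, max_number_of_speaker=None,
                               punctuation_service=None, diarization_service=None):
    """Return a transcriptionConfig dict using the documented keys."""
    return {
        "punctuationConfig": {
            "enablePunctuation": punctuation,
            "serviceName": punctuation_service,  # None lets resolution decide
        },
        "diarizationConfig": {
            "enableDiarization": diarization,
            "numberOfSpeaker": number_of_speaker,
            "maxNumberOfSpeaker": max_number_of_speaker,
            "serviceName": diarization_service,
        },
    }


# Value to send as the transcriptionConfig form field.
payload = json.dumps(
    build_transcription_config(punctuation=True, diarization=True,
                               max_number_of_speaker=3)
)
```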
The /transcribe-multi route accepts POST requests containing multiple audio files. It is assumed that each file contains a speaker or a group of speakers and that the files taken together form a conversation.
The route accepts multipart/form-data requests.
Response format can be application/json or text/plain as specified in the accept field of the header.
Form Parameter | Description | Required |
---|---|---|
transcriptionConfigMulti | (object, optional) A transcriptionConfig object describing transcription parameters | See MultiTranscription config |
If the request is accepted, the answer should be a 201 with a json or text response containing the jobid.
With accept: application/json
{"jobid" : "the-job-id"}
With accept: text/plain
the-job-id
The transcriptionConfig object describes the transcription parameters and flags of the request. It is structured as follows:
{
"punctuationConfig": {
"enablePunctuation": false, # Applies punctuation
"serviceName": null # Force serviceName (See SubService resolution)
}
}
The /job/{jobid} GET route allows you to get the state of the given transcription job.
Response format is application/json.
- If the job state is started, it returns a code 102 with information on the progress.
- If the job state is done, it returns a code 201 with the result_id.
- If the job state is pending, it returns a code 404. Pending can mean 2 things: a transcription worker is not yet available or the jobid does not exist.
- If the job state is failed, it returns a code 400.
{
#Task pending or wrong jobid: 404
{"state": "pending"}
#Task started: 102
{"state": "started", "progress": {"current": 1, "total": 3, "step": "Transcription (75%)"}}
#Task completed: 201
{"state": "done", "result_id" : "result_id"}
#Task failed: 400
{"state": "failed", "reason": "Something went wrong"}
}
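The status codes above lend themselves to a simple polling loop. The sketch below is a hedged illustration, not part of the service: fetch_job is a hypothetical callable that performs the GET on /job/{jobid} and returns a (status_code, decoded_body) pair, and the interval and timeout values are arbitrary.

```python
import time


def wait_for_result(fetch_job, job_id, poll_interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll the job route until the job is done; return its result_id.

    fetch_job(job_id) -> (status_code, body) is supplied by the caller.
    """
    deadline = clock() + timeout
    while True:
        status, body = fetch_job(job_id)
        if status == 201:  # done
            return body["result_id"]
        if status == 400:  # failed
            raise RuntimeError(body.get("reason", "transcription failed"))
        if status not in (102, 404):  # neither started nor pending
            raise RuntimeError(f"unexpected status code {status}")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout} s")
        sleep(poll_interval)
```

Because a 404 covers both a job waiting for a worker and an unknown jobid, the timeout is what prevents polling a mistyped id forever.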
The /results/{result_id} GET route allows you to fetch the transcription result associated with a result_id.
The accept header specifies the format of the result:
- application/json returns the complete result as a json object;
{
"confidence": 0.9, # Overall transcription confidence
"raw_transcription": "this is a transcription diarization and punctuation are set", # Raw transcription
"segments": [ # Speech segments representing continuous speech by a single speaker
{
"duration": 3.12991, # Segment duration
"end": 3.12991, # Segment stop time
"raw_segment": "this is a transcription", # Raw transcription of the speech segment
"segment": "This is a transcription", # Processed transcription of the segment (punctuation, normalisation, ...)
"spk_id": "spk1", # Speaker id
"start": 0, # Segment start time
"words": [ # Segment's word information
...
]
},
{
"duration": 4.59,
"end": 7.71991,
"raw_segment": "diarization and punctuation are set",
"segment": "Diarization and punctuation are set",
"spk_id": "spk2",
"start": 3.12991,
"words": [
{
"conf": 0.89654,
"end": 4.1382,
"start": 3.12991,
"word": "diarization"
}
...
]
}
],
"transcription_result": "spk1: This is a transcription\nspk2: Diarization and punctuation are set" # Final transcription
}
- text/plain returns the final transcription as text
spk1: This is a transcription
spk2: Diarization and punctuation are set
- text/vtt returns the transcription formatted as WEBVTT captions.
WEBVTT Kind: captions; Language: en_US
00:00.000 --> 00:03.129
This is a transcription
00:03.129 --> 00:07.719
Diarization and punctuation are set
- text/srt returns the transcription formatted as SubRip subtitles.
1
00:00:00,000 --> 00:00:03,129
This is a transcription
2
00:00:03,129 --> 00:00:07,719
Diarization and punctuation are set
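The caption timestamps in the examples above derive from the segment start and end values in seconds. As a sketch of that conversion, the helpers below reproduce the sample values; truncating (rather than rounding) to milliseconds is an assumption inferred from 7.71991 appearing as 07.719, and the function names are hypothetical.

```python
def _split(seconds):
    """Split a duration in seconds into (hours, minutes, seconds, milliseconds)."""
    total_ms = int(seconds * 1000)  # assumption: truncate to milliseconds
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return hours, minutes, secs, millis


def vtt_timestamp(seconds):
    """Format as WEBVTT: mm:ss.mmm, with an hours field only when needed."""
    hours, minutes, secs, millis = _split(seconds)
    prefix = f"{hours:02d}:" if hours else ""
    return f"{prefix}{minutes:02d}:{secs:02d}.{millis:03d}"


def srt_timestamp(seconds):
    """Format as SubRip: hh:mm:ss,mmm."""
    hours, minutes, secs, millis = _split(seconds)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
```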
Additionally, you can specify options using the query string:
- return_raw: if set to true, returns the raw transcription (no punctuation and no post-processing).
- convert_number: if set to true, converts numbers written as words to digits.
- wordsub: accepts multiple values formatted as originalWord:substituteWord. Substitutes words in the final transcription.
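These options can be combined in a single query string, with wordsub repeated once per substitution. The following sketch shows one way to build such a URL; the helper name and its arguments are hypothetical, and passing the flags as the literal string true is an assumption based on the wording above.

```python
from urllib.parse import urlencode


def results_url(base_url, result_id, return_raw=False, convert_number=False,
                wordsub=None):
    """Build a /results/{result_id} URL with the documented query options.

    wordsub: optional mapping of originalWord -> substituteWord.
    """
    params = []
    if return_raw:
        params.append(("return_raw", "true"))
    if convert_number:
        params.append(("convert_number", "true"))
    for original, substitute in (wordsub or {}).items():
        params.append(("wordsub", f"{original}:{substitute}"))  # one per pair
    url = f"{base_url.rstrip('/')}/results/{result_id}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url
```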
The /job-log/{jobid} GET route is used to retrieve job details for debugging. It returns logs as raw text.
The /docs route offers access to a swagger-ui interface with the API specifications (OAS3).
It also allows you to test requests directly using pre-filled, modifiable parameters.
Request example:
Initial request
curl -X POST "http://MY_HOST:MY_PORT/transcribe" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F 'transcriptionConfig={
"punctuationConfig": {
"enablePunctuation": true,
"serviceName": null
},
"diarizationConfig": {
"enableDiarization": true,
"numberOfSpeaker": null,
"maxNumberOfSpeaker": null,
"serviceName": null
}
}' -F "force_sync=" -F "file=@MY_AUDIO.wav;type=audio/x-wav"
> {"jobid": "de37224e-fd9d-464d-9004-dcbf3c5b4300"}
Request job status
curl -X GET "http://MY_HOST:MY_PORT/job/de37224e-fd9d-464d-9004-dcbf3c5b4300" -H "accept: application/json"
> {"result_id": "769d9c20-ad8c-4957-9581-437172434ec0", "state": "done"}
Fetch result
curl -X GET "http://MY_HOST:MY_PORT/results/769d9c20-ad8c-4957-9581-437172434ec0" -H "accept: text/vtt"
> WEBVTT Kind: captions; Language: en_US
00:00.000 --> 00:03.129
This is a transcription
00:03.129 --> 00:07.719
Diarization and punctuation are set
This project is licensed under the AGPLv3 license. Please refer to the LICENSE file for a full description of the license.
- celery: Distributed Task Queue.
- pymongo: A MongoDB python client.
- text2num: A text to number conversion library.
- Supervisor: A Process Control System.