Transcriber & translator for audio files. Like Otter.ai but free and built with Gemini 1.5 Pro and Flash.

vasiliadi/transcriber

The Transcriber

Python

Transcriber & translator for audio files. Like Otter.ai but open-source and almost free.

Screenshot

Otter.ai

An Otter.ai monthly subscription costs $16.99 per user.
For that you get:

1200 monthly transcription minutes; 90 minutes per conversation

The Transcriber app

Transcription: the AI models are hosted in the cloud on Replicate. With the current prices and the models used, 1200 minutes will cost approximately $1.60 to $5.50.
That is at least three times cheaper, with the same or even better transcription quality, in my opinion.
And you pay as you go.

Translation and summarization: Gemini 1.5 Pro/Flash is free if you use the Gemini API from a project that has billing disabled, without the benefits available in the paid plan.

Hosting: Free tiers or trials of Render, Google Cloud, Oracle Cloud, AWS, Azure, IBM Cloud, or low-cost DigitalOcean, or any provider you like.

Total [1]:
Pay-as-you-go cost for 10 hours of audio.
Replicate with whisper-diarization + free Gemini API + DigitalOcean = $2.00 + $0.00 + $0.10 = $2.10
Replicate with incredibly-fast-whisper + free Gemini API + DigitalOcean = $0.70 + $0.00 + $0.10 = $0.80
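
The totals above are simple sums, so they can be reproduced with a small sketch. The figures are copied from the two lines above; the function name `total_cost` and the linear scaling to other durations are assumptions made for illustration, not part of the app.

```python
# Cost per 10 hours of audio, copied from the totals above (August 2024 prices).
TRANSCRIPTION_COST = {
    "whisper-diarization": 2.00,
    "incredibly-fast-whisper": 0.70,
}
GEMINI_COST = 0.00   # free Gemini API tier
HOSTING_COST = 0.10  # DigitalOcean


def total_cost(model: str, hours: float = 10.0) -> float:
    """Return the estimated total in dollars.

    Scaling the 10-hour figures linearly to other durations is an assumption.
    """
    per_10_hours = TRANSCRIPTION_COST[model] + GEMINI_COST + HOSTING_COST
    return round(per_10_hours * hours / 10.0, 2)


for name in TRANSCRIPTION_COST:
    print(f"{name}: ${total_cost(name):.2f} per 10 hours")
```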

Note

Prices are subject to change without notice

Technical details

Running the Whisper model on Replicate is much cheaper than using the OpenAI API for Whisper.

I use three models:

vaibhavs10/incredibly-fast-whisper: best for speed
thomasmol/whisper-diarization: best for dialogs
openai/whisper: best for accuracy

Figure: comparison of processing times by model for the same 45-minute audio (6 speakers).

Limitations

OpenAI Whisper model

OpenAI Speech to text Whisper model

File uploads are currently limited to 25 MB.

To work around this limit, I compress the audio before uploading. (The models I use also apply compression, but in practice I still hit the limit when relying on the model's own compression.) A 45-minute file is 63 MB uncompressed and about 4 MB after compression. With compression, there is no need to split the audio into chunks, and the limit rises to approximately 3 hours and 45 minutes of audio without losing transcription quality.
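
As a rough check on these numbers, the sketch below estimates how much audio fits under the 25 MB limit. It assumes file size grows linearly with duration, which is only an approximation for real codecs, and the function name `max_minutes` is hypothetical.

```python
UPLOAD_LIMIT_MB = 25.0   # upload limit quoted above
SAMPLE_MINUTES = 45.0    # duration of the sample file
UNCOMPRESSED_MB = 63.0   # sample size before compression
COMPRESSED_MB = 4.0      # sample size after compression


def max_minutes(limit_mb: float, sample_mb: float, sample_minutes: float) -> float:
    """Longest audio that fits under limit_mb, assuming size is linear in duration."""
    mb_per_minute = sample_mb / sample_minutes
    return limit_mb / mb_per_minute


uncompressed = max_minutes(UPLOAD_LIMIT_MB, UNCOMPRESSED_MB, SAMPLE_MINUTES)
compressed = max_minutes(UPLOAD_LIMIT_MB, COMPRESSED_MB, SAMPLE_MINUTES)
print(f"uncompressed: ~{uncompressed:.0f} min, compressed: ~{compressed:.0f} min")
```

With the rounded sizes quoted above, the compressed ceiling comes out somewhat higher than 3 hours and 45 minutes, so that figure is the more conservative of the two.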

If you still need to transcribe more, you can split the file using pydub's silence.split_on_silence(), silence.detect_silence(), or silence.detect_nonsilent(). The speed of these functions is hardware-dependent, but they run about 10 times faster than listening to the entire file.

In my tests, I faced three main problems:

  1. These functions do not work as I expected.
  2. If you split by time alone, you can cut in the middle of a word.
  3. Post-processing becomes a challenge: it is hard to identify speakers smoothly, and timestamps are lost.

All of this applies to very long audio only.
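
One way to avoid cutting in the middle of a word is to snap each time-based boundary to the nearest detected silence. The sketch below is not the app's implementation; it is a pure-Python illustration that takes a list of `[start_ms, end_ms]` pairs (the shape that pydub's silence.detect_silence() returns) and picks cut points. The function name `choose_cut_points` and its parameters are hypothetical.

```python
def choose_cut_points(silences_ms, total_ms, target_chunk_ms):
    """Pick cut points in the middle of silent ranges, near each target boundary.

    silences_ms: list of [start_ms, end_ms] pairs for the silent ranges.
    total_ms: length of the whole recording in milliseconds.
    target_chunk_ms: desired chunk length; real chunks will be close to it.
    """
    midpoints = [(start + end) // 2 for start, end in silences_ms]
    cuts = []
    last_cut = 0
    boundary = target_chunk_ms
    while boundary < total_ms:
        # Only silences after the previous cut are usable.
        candidates = [m for m in midpoints if m > last_cut]
        if not candidates:
            break  # no silence left; the tail stays as one chunk
        cut = min(candidates, key=lambda m: abs(m - boundary))
        cuts.append(cut)
        last_cut = cut
        boundary = cut + target_chunk_ms
    return cuts


silences = [[9_000, 10_000], [19_500, 20_500], [31_000, 32_000]]
print(choose_cut_points(silences, total_ms=40_000, target_chunk_ms=10_000))
```

The resulting offsets can then be used to slice the audio, since a pydub AudioSegment is sliced in milliseconds. Speaker labels and timestamps still have to be reconciled across chunks afterwards.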

Gemini 1.5 Pro/Flash

Gemini 1.5 Pro/Flash model names and properties

Max output tokens: 8,192

At about 0.75 words per token, that is ~6,144 words, or about 35 minutes of speech. For non-English languages, however, most words are counted as two or more tokens.
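
The estimate can be written out as a short calculation. The speaking rate of 175 words per minute is an assumption chosen because it reproduces the "about 35 minutes" figure; the function name is illustrative.

```python
MAX_OUTPUT_TOKENS = 8192
WORDS_PER_TOKEN = 0.75   # English rule of thumb quoted above
WORDS_PER_MINUTE = 175   # assumed speaking rate


def max_spoken_minutes(tokens: int = MAX_OUTPUT_TOKENS,
                       words_per_token: float = WORDS_PER_TOKEN,
                       words_per_minute: float = WORDS_PER_MINUTE) -> float:
    """Minutes of speech whose transcript fits in `tokens` output tokens."""
    words = tokens * words_per_token
    return words / words_per_minute


print(f"{MAX_OUTPUT_TOKENS * WORDS_PER_TOKEN:.0f} words, "
      f"~{max_spoken_minutes():.0f} minutes of speech")
```

For languages where a word costs two or more tokens, passing a smaller `words_per_token` (for example 0.5 or less) shrinks the estimate accordingly.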

Because of this output limit, audio post-processing, which includes correction and translation, can be done in a single pass only for files that are approximately 35 minutes long. Other models have a maximum output of 4,096 tokens or less. If you need to process more than 8,192 tokens, you may need to do it in batches, but this will significantly increase the processing time.

Translation by chunks still works, but the quality is a little lower.
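
A minimal sketch of batching, assuming the transcript is already split into paragraphs and that the translated output is roughly as long as the input. The token estimate uses the 0.75 words-per-token rule of thumb rather than a real tokenizer, and the names `estimate_tokens` and `batch_paragraphs` are hypothetical.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token count from the word count (not a real tokenizer)."""
    return math.ceil(len(text.split()) / words_per_token)


def batch_paragraphs(paragraphs, max_tokens: int = 8192):
    """Group whole paragraphs into batches that stay under max_tokens.

    A single paragraph larger than max_tokens still gets its own batch.
    """
    batches, current, current_tokens = [], [], 0
    for paragraph in paragraphs:
        tokens = estimate_tokens(paragraph)
        if current and current_tokens + tokens > max_tokens:
            batches.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += tokens
    if current:
        batches.append("\n\n".join(current))
    return batches


print(len(batch_paragraphs(["a b c"] * 4, max_tokens=8)))
```

Keeping paragraphs whole gives the model more context per batch, which may help with the quality drop mentioned above.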

Max audio length: approximately 8.4 hours

It still works well for summarization.

Rate limits: 2 queries per minute and 1000 per day for Gemini-1.5-pro; 15 per minute and 1500 per day for Gemini-1.5-flash.
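
When processing in batches, it is easy to exceed the per-minute limit. The sketch below spaces calls evenly; it is an illustration, not the app's code, and the class name `MinIntervalLimiter` is hypothetical. It does not track the daily quota.

```python
import time


class MinIntervalLimiter:
    """Keep calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.interval


pro_limiter = MinIntervalLimiter(per_minute=2)     # Gemini 1.5 Pro
flash_limiter = MinIntervalLimiter(per_minute=15)  # Gemini 1.5 Flash
```

Call `pro_limiter.wait()` right before each request to the model; the first call returns immediately and later calls sleep only when needed.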

Supported languages for translation.

Optional settings

HuggingFace.co

For diarization, all models rely on pyannote.audio solutions. As a developer, you must agree to the user conditions for accessing the models offered by pyannote. Therefore, it is necessary to accept the user conditions for pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1 and obtain the HuggingFace API token.

The thomasmol/whisper-diarization model also uses the same models for diarization, but the developer uses his own HuggingFace API token. This means that an additional token is not required.

Text to Speech

By default, I use the ElevenLabs eleven_turbo_v2_5 model to generate high-quality audio for summaries in various languages. It's very fast and 50% cheaper than the eleven_multilingual_v2 model. You get 10,000 credits per month for free, which is about 15 generated audios. If you need more, you'll need to purchase a plan or use OpenAI TTS.

OpenAI TTS is a pay-as-you-go service that costs $0.015 per 1K characters.
OpenAI's input is limited to a maximum of 4096 characters. To overcome this limitation, I split the text into chunks using semantic_text_splitter and join the generated audio with pydub.
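
The sketch below illustrates the chunking idea and the cost arithmetic with the standard library only. It uses a simple greedy sentence packer as a stand-in for semantic_text_splitter, so it is not the app's actual splitting logic; the function names are hypothetical.

```python
import re

OPENAI_TTS_MAX_CHARS = 4096   # input limit quoted above
PRICE_PER_1K_CHARS = 0.015    # dollars, quoted above


def split_for_tts(text: str, max_chars: int = OPENAI_TTS_MAX_CHARS):
    """Pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut hard.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def tts_cost(text: str) -> float:
    """Estimated OpenAI TTS cost in dollars for the given text."""
    return len(text) / 1000 * PRICE_PER_1K_CHARS


print(split_for_tts("One. Two. Three.", max_chars=9))
```

Each chunk would then be sent to the TTS API separately and the resulting audio segments concatenated in order.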

Additionally, xtts-v2 is another high-quality multilingual model, but Coqui, the developer of this model, is shutting down. As a result, I use ElevenLabs or OpenAI.

Config

Example of .env file:

GEMINI_API_KEY = "your_api_key"
REPLICATE_API_TOKEN = "your_api_key"
HF_ACCESS_TOKEN = "your_api_key" # only for incredibly-fast-whisper model with enabled diarization
ELEVENLABS_API_KEY = "your_api_key" # only if you want to use ElevenLabs TTS
OPENAI_API_KEY = "your_api_key" # only if you want to use OpenAI TTS
PROXY = "" # only if you need to use proxy

All keys are mandatory, but you can fill the ones you don't need with a placeholder value so that the app starts. Using a function that requires a specific key while that key holds an incorrect value will result in an error.
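
A quick way to see which variables are still unset before starting the app is a small check like the one below. It is a sketch, not part of the project: the key names are taken from the example .env file above, PROXY is left out because it may be empty, and the function name `missing_keys` is hypothetical.

```python
import os

# Key names from the example .env file; PROXY is optional and may be empty.
REQUIRED_KEYS = [
    "GEMINI_API_KEY",
    "REPLICATE_API_TOKEN",
    "HF_ACCESS_TOKEN",
    "ELEVENLABS_API_KEY",
    "OPENAI_API_KEY",
]


def missing_keys(env=None):
    """Return the required keys that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing or empty keys:", ", ".join(missing))
```

This only checks that a value is present; it cannot tell a real key from a placeholder.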

You need to replace the path to the env_file in compose.yaml

Get Gemini API key
Get Replicate API token
Get HF API tokens and don't forget to accept pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1 user conditions. Needed only for incredibly-fast-whisper model with enabled diarization.
Get ElevenLabs API key
Get OpenAI API key

Streamlit Secrets management

PS

Your transcription combined with Google NotebookLM is a very powerful tool.
Using context caching, you can ask a ton of questions about the topic.

Docs

Libraries: streamlit, replicate, google-generativeai, pytube, yt-dlp, elevenlabs, bs4, openai, pydub, semantic_text_splitter

Docker: Docker Best Practices; Docker; Dockerfile reference; Dockerfile Linter; .dockerignore; .dockerignore validator; Docker Compose; Syntax for environment files in Docker Compose; Ways to set environment variables with Compose; Compose file version 3 reference

GitHub Actions: Workflow syntax for GitHub Actions; Publishing images to Docker Hub and GitHub Packages

Dev Containers: An open specification for enriching containers with development specific content and settings; Developing inside a Container

Deploy

Render: Deploy from GitHub / GitLab / Bitbucket
Google Cloud: Quickstart: Deploy to Cloud Run; Tutorial: Deploy your dockerized application on Google Cloud
Oracle Cloud: Container Instances
IBM Cloud: IBM Cloud® Code Engine
AWS: AWS App Runner
Azure: Web App for Containers; Deploy a containerized app to Azure
Digital Ocean: How to Deploy from Container Images

Footnotes

  1. As of August 2024
