Poc deploy Whisper using FastAPI #52

Closed
wants to merge 5 commits from the poc-fastapi branch
Conversation

@lebaudantoine (Collaborator) commented Jul 20, 2024

Purpose

(this PR is not intended to be merged)

Serve a Whisper model using FastAPI.

Proposal

Inspired by the insanely-fast-whisper instructions (cf. here).

I tried to containerize a FastAPI server that serves a Whisper model.
It runs thanks to PyTorch and the Hugging Face Transformers library.


You can run the FastAPI server locally:

(base) $ cd ./src/data/transcribe
(base) $ python -m venv poc-whisper && source poc-whisper/bin/activate
(poc-whisper) $ pip install .

Run the server in watch mode:

$ (poc-whisper) fastapi dev transcribe/main.py

The server should start on port 8000. You can call one of the two health-check endpoints to make sure it's running.
Then you can POST a file to get a transcription, e.g.:

curl -X POST "http://127.0.0.1:8000/api/v1/transcribe/" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@<your-filename>"
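Equivalently, from Python (a minimal sketch using the requests library; the URL and multipart field name follow the curl call above, and sample.mp3 is just a placeholder file name):

```python
import requests

# POST an audio file to the transcription endpoint (same URL and
# multipart field name as the curl example above).
with open("sample.mp3", "rb") as audio:  # placeholder file name
    response = requests.post(
        "http://127.0.0.1:8000/api/v1/transcribe/",
        files={"file": audio},
    )

response.raise_for_status()
print(response.json())
```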

You can build and run the docker image from the root folder:

$ docker build -f ./src/data/dockerfile . -t transcribe:latest
$ docker run -p 8000:8000 transcribe:latest

If you are on a Mac, please take a look at the main.py file; you might need to edit it to adapt the code to your hardware.
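For reference, here is a minimal sketch of what such a main.py could look like. This is not the exact code from the PR: the model name, the health-check route, and the device-selection logic are assumptions built on the standard transformers pipeline API; only the /api/v1/transcribe/ path is taken from the curl example above.

```python
import tempfile

import torch
from fastapi import FastAPI, UploadFile
from transformers import pipeline

app = FastAPI()

# Pick the best available device: CUDA on NVIDIA GPUs, MPS on Apple
# Silicon (e.g. an M2), plain CPU otherwise.
if torch.cuda.is_available():
    device = "cuda:0"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Load the Whisper ASR pipeline once at startup; the model name is an
# example, and the weights land in the Hugging Face cache on first run.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device == "cuda:0" else torch.float32,
    device=device,
)


@app.get("/__heartbeat__")  # hypothetical health-check route
async def heartbeat():
    return {"status": "ok"}


@app.post("/api/v1/transcribe/")
async def transcribe(file: UploadFile):
    # Write the upload to a temporary file so the pipeline (which relies
    # on ffmpeg) can read it from disk.
    suffix = "." + (file.filename or "audio.wav").rsplit(".", 1)[-1]
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = asr(tmp.name)
    return {"text": result["text"]}
```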

Basic FastAPI app with only two health endpoints.
Sources are copied/pasted from the warren API.

I'll add the Transformers library and the business logic in upcoming commits.
To build the image, I used the following command:

docker build -f ./src/data/dockerfile . -t transcribe:latest

and to run it, this one:

docker run -p 8000:8000 transcribe:latest

Todo:
- check that the required system libs are installed
- rename meet occurrences
- configure the production logger using YAML
- add a target for development with watch mode
- integrate it into the compose stack
torch and transformers (from HF) are the basis for running the model.

optimum (from HF) and accelerate are recommended in the insanely-fast-whisper README to make the Whisper model run faster.
This is a work-in-progress to expose Whisper through a '/transcribe' endpoint.

I am currently working on running the model on my Mac M2, which lacks CUDA support and uses MPS (the Mac equivalent). Many operators are not implemented for MPS, causing PyTorch to fall back to CPU. This results in extremely slow processing and poor output quality for a 2-minute audio clip.
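Side note, in case it helps reproduce this: PyTorch can be told to fall back to CPU for operators missing on MPS by setting an environment variable before torch is imported (a small sketch; it avoids NotImplementedError but keeps the slow CPU path described above):

```python
import os

# Must be set before torch is imported: operators without an MPS kernel
# then silently fall back to CPU instead of raising NotImplementedError.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

print(torch.backends.mps.is_built())      # PyTorch build includes MPS support
print(torch.backends.mps.is_available())  # MPS device usable on this machine
```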

To address the permissions issue, we need to configure a writable folder for Hugging Face so it can download models and write to the cache. This could be improved (cc @rouja).
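A minimal sketch of one way to do this, assuming the standard Hugging Face cache environment variable (the /data/hf-cache path is just an example of a directory the container user can write to):

```python
import os

# Point the whole Hugging Face cache to a writable directory; must be set
# before transformers / huggingface_hub are imported. The path is an example.
os.environ["HF_HOME"] = "/data/hf-cache"
os.makedirs("/data/hf-cache", exist_ok=True)

# Alternatively, individual downloads accept an explicit cache_dir, e.g.:
# pipeline("automatic-speech-recognition", model="openai/whisper-large-v3",
#          model_kwargs={"cache_dir": "/data/hf-cache"})
```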

This setup is a starting point for experimenting with the model. We should manually publish the image
to DockerHub and deploy it on Kubernetes using @rouja's generic chart or on Scalingo.

To improve Whisper's performance, finding GPU resources is essential.
@lebaudantoine self-assigned this Jul 20, 2024
@lebaudantoine commented:

I created an HF Space, following the Docker tutorial. Some dependencies in the tutorial were outdated. I pushed my code to another repository, which is managed by HF. Each time I push an update, the image is rebuilt and a new container is deployed. That's so smooth.

@lebaudantoine commented:

Static files were added to interact with the FastAPI /transcribe endpoint, in commit 761e7c8.

@lebaudantoine commented Jul 22, 2024

HF documentation on Flash Attention https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention

From my (limited) understanding, Flash Attention optimizes memory access for specific hardware and is not compatible with all GPUs. Flash Attention 2 does not support Turing GPUs yet (the T4 is a Turing GPU), but it does support Ampere ones (e.g. the A100).
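For reference, a hedged sketch of how Flash Attention 2 could be enabled only when a compatible GPU is present (Turing is compute capability 7.5, Ampere is 8.0; attn_implementation is a standard transformers from_pretrained argument, and the flash-attn package has to be installed separately):

```python
import torch
from transformers import pipeline

# Flash Attention 2 requires Ampere (compute capability 8.0) or newer;
# a T4 is Turing (7.5) and does not qualify.
use_flash_attn = (
    torch.cuda.is_available()
    and torch.cuda.get_device_capability(0) >= (8, 0)
)

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # example model name
    torch_dtype=torch.float16,
    device="cuda:0",
    # On unsupported GPUs we simply keep the default attention implementation.
    model_kwargs=(
        {"attn_implementation": "flash_attention_2"} if use_flash_attn else {}
    ),
)
```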

@lebaudantoine commented:

The server is quite slow to start: HF downloads the model weights to the cache. I should try to improve this part, especially once we scale the API to several pods.
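One option (a sketch, assuming the huggingface_hub client; the model name is an example) is to pre-download the weights while building the image, so a freshly scheduled pod finds them in the cache instead of pulling them at startup:

```python
# download_model.py — intended to run at image build time
# (e.g. a `RUN python download_model.py` step), so the weights are
# already in the Hugging Face cache when the container starts.
from huggingface_hub import snapshot_download

snapshot_download("openai/whisper-large-v3")
```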

@lebaudantoine commented:

I've created a lot of layers in my image, which is bad practice, to avoid losing time re-installing the HF dependencies.

@lebaudantoine deleted the poc-fastapi branch on October 9, 2024