Poc deploy Whisper using FastAPI #52

Closed
wants to merge 5 commits from the poc-fastapi branch
Conversation

@lebaudantoine (Collaborator) commented Jul 20, 2024

Purpose

(this PR is not intended to be merged)

Serve a Whisper model using FastAPI.

Proposal

Inspired by the insanely-fast-whisper instructions (cf. here).

I tried to containerize a FastAPI server that serves a Whisper model.
It runs thanks to PyTorch and the Hugging Face Transformers library.


You can run the FastAPI server locally:

(base) $ cd ./src/data/transcribe
(base) $ python -m venv poc-whisper && source poc-whisper/bin/activate
(poc-whisper) $ pip install .

Run the server in watch mode:

$ (poc-whisper) fastapi dev transcribe/main.py

The server should start on port 8000. You can call one of the two health-check endpoints to make sure it's running.
Then you can POST a file to get a transcription, e.g.:

curl -X POST "http://127.0.0.1:8000/api/v1/transcribe/" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@<your-filename>"
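Equivalently, from Python (a minimal sketch using the requests library; the URL and multipart field name follow the curl call above, and sample.mp3 is just a placeholder file name):

```python
import requests

# POST an audio file to the transcription endpoint (same URL and
# multipart field name as the curl example above).
with open("sample.mp3", "rb") as audio:  # placeholder file name
    response = requests.post(
        "http://127.0.0.1:8000/api/v1/transcribe/",
        files={"file": audio},
    )

response.raise_for_status()
print(response.json())
```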

You can build and run the docker image from the root folder:

$ docker build -f ./src/data/dockerfile . -t transcribe:latest
$ docker run -p 8000:8000 transcribe:latest

If you are on a Mac, please take a look at the main.py file; you might need to edit it to adapt the code to your hardware.
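For reference, here is a minimal sketch of what such a main.py could look like. This is not the exact code from the PR: the model name, the health-check route, and the device-selection logic are assumptions built on the standard transformers pipeline API; only the /api/v1/transcribe/ path is taken from the curl example above.

```python
import tempfile

import torch
from fastapi import FastAPI, UploadFile
from transformers import pipeline

app = FastAPI()

# Pick the best available device: CUDA on NVIDIA GPUs, MPS on Apple
# Silicon (e.g. an M2), plain CPU otherwise.
if torch.cuda.is_available():
    device = "cuda:0"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Load the Whisper ASR pipeline once at startup; the model name is an
# example, and the weights land in the Hugging Face cache on first run.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device == "cuda:0" else torch.float32,
    device=device,
)


@app.get("/__heartbeat__")  # hypothetical health-check route
async def heartbeat():
    return {"status": "ok"}


@app.post("/api/v1/transcribe/")
async def transcribe(file: UploadFile):
    # Write the upload to a temporary file so the pipeline (which relies
    # on ffmpeg) can read it from disk.
    suffix = "." + (file.filename or "audio.wav").rsplit(".", 1)[-1]
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = asr(tmp.name)
    return {"text": result["text"]}
```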

Basic FastAPI app with only two health endpoints.
Sources are copied/pasted from the warren API.

I'll add the Transformers library and the business logic in upcoming commits.
To build the image, I used the following command:

docker build -f ./src/data/dockerfile . -t transcribe:latest

and to run it, this one:

docker run -p 8000:8000 transcribe:latest

Todo:
- check that the required system libs are installed
- rename meet occurrences
- configure the production logger using YAML
- add a target for development with watch mode
- integrate it into the compose stack
torch and transformers (from HF) are the basis for running the model.

optimum (from HF) and accelerate are recommended in the insanely-fast-whisper README to make the Whisper model run faster.
This is a work-in-progress to expose Whisper through a '/transcribe' endpoint.

I am currently working on running the model on my Mac M2, which lacks CUDA support and uses MPS (the Mac equivalent). Many operators are not implemented for MPS, causing PyTorch to fall back to CPU. This results in extremely slow processing and poor output quality for a 2-minute audio clip.
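Side note, in case it helps reproduce this: PyTorch can be told to fall back to CPU for operators missing on MPS by setting an environment variable before torch is imported (a small sketch; it avoids NotImplementedError but keeps the slow CPU path described above):

```python
import os

# Must be set before torch is imported: operators without an MPS kernel
# then silently fall back to CPU instead of raising NotImplementedError.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

print(torch.backends.mps.is_built())      # PyTorch build includes MPS support
print(torch.backends.mps.is_available())  # MPS device usable on this machine
```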

To address the permissions issue, we need to configure a writable folder for Hugging Face so it can download models and write to the cache. This could be improved (cc @rouja).
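A minimal sketch of one way to do this, assuming the standard Hugging Face cache environment variable (the /data/hf-cache path is just an example of a directory the container user can write to):

```python
import os

# Point the whole Hugging Face cache to a writable directory; must be set
# before transformers / huggingface_hub are imported. The path is an example.
os.environ["HF_HOME"] = "/data/hf-cache"
os.makedirs("/data/hf-cache", exist_ok=True)

# Alternatively, individual downloads accept an explicit cache_dir, e.g.:
# pipeline("automatic-speech-recognition", model="openai/whisper-large-v3",
#          model_kwargs={"cache_dir": "/data/hf-cache"})
```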

This setup is a starting point for experimenting with the model. We should manually publish the image
to DockerHub and deploy it on Kubernetes using @rouja's generic chart or on Scalingo.

To improve Whisper's performance, finding GPU resources is essential.
@lebaudantoine self-assigned this Jul 20, 2024
@lebaudantoine commented:

I created an HF Space, following the Docker tutorial. Some dependencies in the tutorial were outdated. I pushed my code to another repository, which is managed by HF. Each time I push an update, the image is rebuilt and a new container is deployed. That's so smooth.

@lebaudantoine commented:

Static files were added to interact with the FastAPI /transcribe endpoint, in commit 761e7c8.

@lebaudantoine commented Jul 22, 2024

HF documentation on Flash Attention https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention

From my (limited) understanding, Flash Attention optimizes memory access for specific hardware and is not compatible with all GPUs. Flash Attention 2 does not support Turing GPUs yet (the T4 is a Turing GPU), but it does support Ampere ones (e.g. the A100).
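For reference, a hedged sketch of how Flash Attention 2 could be enabled only when a compatible GPU is present (Turing is compute capability 7.5, Ampere is 8.0; attn_implementation is a standard transformers from_pretrained argument, and the flash-attn package has to be installed separately):

```python
import torch
from transformers import pipeline

# Flash Attention 2 requires Ampere (compute capability 8.0) or newer;
# a T4 is Turing (7.5) and does not qualify.
use_flash_attn = (
    torch.cuda.is_available()
    and torch.cuda.get_device_capability(0) >= (8, 0)
)

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # example model name
    torch_dtype=torch.float16,
    device="cuda:0",
    # On unsupported GPUs we simply keep the default attention implementation.
    model_kwargs=(
        {"attn_implementation": "flash_attention_2"} if use_flash_attn else {}
    ),
)
```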

@lebaudantoine commented:

The server is quite slow to start: HF downloads the model weights to the cache. I should try to improve this part, especially once we scale the API to several pods.
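One option (a sketch, assuming the huggingface_hub client; the model name is an example) is to pre-download the weights while building the image, so a freshly scheduled pod finds them in the cache instead of pulling them at startup:

```python
# download_model.py — intended to run at image build time
# (e.g. a `RUN python download_model.py` step), so the weights are
# already in the Hugging Face cache when the container starts.
from huggingface_hub import snapshot_download

snapshot_download("openai/whisper-large-v3")
```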

@lebaudantoine commented:

I've created a lot of layers in my image, which is bad practice, to avoid losing time re-installing the HF dependencies.

@lebaudantoine deleted the poc-fastapi branch on October 9, 2024