Poc deploy Whisper using FastAPI #52
Conversation
Basic FastAPI app with only two health endpoints. Sources are copied/pasted from the warren API. I'll add the Transformers lib and the business logic in upcoming commits.
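For context, a minimal sketch of what such an app looks like (the endpoint paths are placeholders; the PR's actual routes may differ):

```python
# Hypothetical starting point: a FastAPI app exposing only two health endpoints.
from fastapi import FastAPI

app = FastAPI(title="transcribe")


@app.get("/__lbheartbeat__")
async def lb_heartbeat():
    # Liveness probe: the process is up and able to answer.
    return {"status": "ok"}


@app.get("/__heartbeat__")
async def heartbeat():
    # Readiness probe: dependencies (e.g. the model) are available.
    return {"status": "ok"}
```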
I used the following command to build the image: `docker build -f ./src/data/dockerfile . -t transcribe:latest`, and this one to run it: `docker run -p 8000:8000 transcribe:latest`.

Todo:
- check installed required system libs
- rename meet occurrences
- configure the production logger using YAML (see the sketch after this list)
- add a target for development with watch mode
- integrate it into the compose stack
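A minimal sketch for the "production logger using YAML" item (the `logging.yaml` file name and its contents are assumptions, not part of this PR):

```python
# Load a logging dictConfig from a YAML file at startup. Illustrative only.
import logging.config

import yaml

with open("logging.yaml") as config_file:
    logging.config.dictConfig(yaml.safe_load(config_file))

logger = logging.getLogger("transcribe")
logger.info("Logger configured from logging.yaml")
```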
torch and transformers (from HF) are the basis to make a model run. optimum (from HF) and accelerate are recommended in the insanely-fast-whisper README to make the Whisper model run faster.
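A sketch of the kind of pipeline that README recommends (model id and parameters are illustrative, not necessarily what this PR uses):

```python
# Illustrative Whisper pipeline built on torch + transformers, in the spirit of
# the insanely-fast-whisper README; optimum/accelerate are optional speed-ups.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
    device=device,
)

result = pipe("sample.mp3", chunk_length_s=30, batch_size=8, return_timestamps=True)
print(result["text"])
```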
This is a work-in-progress to expose Whisper through a '/transcribe' endpoint. I am currently working on running the model on my Mac M2, which lacks CUDA support and uses MPS (the Mac equivalent). Many operators are not implemented for MPS, causing PyTorch to fall back to the CPU. This results in extremely slow processing and poor output quality for a 2-minute audio file.

To address the permissions issue, we need to configure a writable folder for Hugging Face so it can download models and write to the cache. This could be improved (cc @rouja).

This setup is a starting point for experimenting with the model. We should manually publish the image to DockerHub and deploy it on Kubernetes using @rouja's generic chart, or on Scalingo. To improve Whisper's performance, finding GPU resources is essential.
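An illustrative sketch of the two concerns mentioned above (the cache path and fallback behaviour are assumptions, not the PR's exact code):

```python
import os

# Writable cache location for model downloads (HF_HOME is the standard env var;
# the /app/.cache path is an assumption about the container layout).
os.environ.setdefault("HF_HOME", "/app/.cache/huggingface")
# Let PyTorch fall back to CPU for operators not implemented on MPS (Mac M1/M2).
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch  # imported after the env vars so the MPS fallback flag is picked up

# Pick the best available device.
if torch.cuda.is_available():
    device = "cuda:0"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```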
I may need to adapt my code based on these two articles:
I have created an HF Space, following the Docker tutorial. Some dependencies in the tutorial were outdated. I've pushed my code to another repository, which is managed by HF. Each time I push an update, the image is rebuilt and a new container is deployed. That's so smooth.
Static files were added to interact with the FastAPI app.
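A minimal sketch of serving static assets alongside the API (the `static` directory name and mount path are assumptions, not the PR's actual layout):

```python
# Mount a directory of static assets next to the API routes.
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()
app.mount("/static", StaticFiles(directory="static"), name="static")
```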
HF documentation on Flash Attention: https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention. From my (poor) understanding, Flash Attention optimizes memory access for some specific hardware and is not compatible with all GPUs. Flash Attention 2 does not support Turing GPUs yet (the T4 is a Turing GPU), but it does support Ampere ones (e.g. the A100).
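Illustrative only: on GPUs that support it, Flash Attention 2 can be requested when loading the model (this needs the flash-attn package and a recent transformers version; the model id is an example):

```python
# Request the Flash Attention 2 implementation at model load time.
import torch
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```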
The server is quite slow to start: HF is downloading the model weights to the cache. I should try to improve this part, especially once we scale the API to several pods.
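One possible mitigation (an assumption, not what the PR does): download the weights at image build time so the Hugging Face cache is already warm when a pod starts, instead of fetching them on first startup.

```python
# Pre-fetch the model weights into the local HF cache (model id is illustrative).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="openai/whisper-large-v3")
```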
To investigate: https://github.com/Vaibhavs10/optimise-my-whisper, found via this subreddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1d1xzpi/optimise_whisper_for_blazingly_fast_inference/. Same author as insanely-fast-whisper.
I've created a lot of layers in my image, which is a bad practice, to avoid losing time with HF re-installing deps.
Purpose
(this PR is not intended to be merged)
Serve a Whisper model using FastAPI.
Proposal
Inspired by the insanely-fast-whisper instructions (cf. here).
I tried to containerize a FastAPI server that serves a Whisper model.
It runs thanks to PyTorch and the Hugging Face Transformers library.
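As a rough sketch of the overall shape (not the PR's exact code; the model id, upload handling, and temp-file suffix are assumptions):

```python
# A /transcribe endpoint that feeds an uploaded audio file to a transformers
# ASR pipeline. Illustrative only.
import tempfile

import torch
from fastapi import FastAPI, UploadFile
from transformers import pipeline

app = FastAPI()
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)


@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Write the upload to a temporary file so the pipeline can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = pipe(tmp.name, return_timestamps=True)
    return {"text": result["text"]}
```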
You can run the FastAPI server locally:
Run the server in watch mode:
The server should start on port `8000`. You can call one of the two health checks to make sure it's running. Then, you can post a file to get a transcription, e.g.:
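For instance, assuming the endpoint accepts a multipart upload at `/transcribe` (the field name and file are placeholders):

```python
# Hypothetical client call posting an audio file to the running server.
import requests

with open("sample.mp3", "rb") as audio:
    response = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("sample.mp3", audio, "audio/mpeg")},
    )
print(response.json())
```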
You can build and run the docker image from the root folder:
If you are on a Mac, please take a look at the `main.py` file; you might need to edit it to adapt the code to your hardware.