Tools for Modal. Intended for personal use and experiments.
My Llamas is a Modal app that downloads models to a Modal volume. You can specify Ollama model names to have Ollama pull them into the `.ollama` directory in the volume, or specify Hugging Face paths to files, including multipart files.
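As a rough illustration, fetching a single GGUF file from Hugging Face into a mounted volume could look like the sketch below. This is not the repo's actual code: the `/models` mount point and the helper name are assumptions, and multipart files are not handled; only `hf_hub_download` is a real `huggingface_hub` call.

```python
# Hypothetical sketch: fetch one GGUF file from Hugging Face into a Modal volume.
# Assumes the volume is mounted at /models; names and paths are illustrative only.
from pathlib import Path

from huggingface_hub import hf_hub_download


def download_gguf(hf_path: str, volume_root: str = "/models") -> Path:
    # Split "org/repo/file.gguf" into the repo id and the filename inside the repo.
    org, repo, filename = hf_path.split("/", 2)
    local_path = hf_hub_download(
        repo_id=f"{org}/{repo}",
        filename=filename,
        local_dir=volume_root,
    )
    return Path(local_path)
```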
The app has a scale-to-zero Ollama inference server with token authentication through FastAPI. The FastAPI app just proxies requests to the Ollama REST server running in the container.
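The real proxy lives in the app code; the following is only a minimal sketch of the idea (check a shared bearer token, then forward to Ollama's local REST port). The Ollama port, the catch-all route, and reading the token from the `LLAMA_FOOD` env var are assumptions.

```python
# Minimal sketch of a bearer-token proxy in front of a local Ollama server.
# Assumptions: Ollama listens on localhost:11434 and the token is in $LLAMA_FOOD.
import os

import httpx
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import Response

app = FastAPI()
OLLAMA = "http://localhost:11434"


@app.api_route("/{path:path}", methods=["GET", "POST"])
async def proxy(path: str, request: Request):
    # Reject requests that don't carry the shared bearer token.
    if request.headers.get("authorization") != f"Bearer {os.environ['LLAMA_FOOD']}":
        raise HTTPException(status_code=401, detail="bad token")
    # Forward the request body to the local Ollama REST server and relay the reply.
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.request(
            request.method, f"{OLLAMA}/{path}", content=await request.body()
        )
    return Response(
        content=upstream.content,
        media_type=upstream.headers.get("content-type"),
    )
```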
Connect from a chat client of your choice, like open-webui. You can also use `client.py` for quick testing, which I pulled from the Modal examples.
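If your client speaks the OpenAI-compatible API that Ollama exposes, connecting might look roughly like this. The deployment URL is a placeholder and the `/v1` path is an assumption about how the proxy forwards requests; substitute your own Modal URL.

```python
# Hypothetical usage sketch: talk to the deployed proxy with the openai client.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-modal-app>.modal.run/v1",  # placeholder URL
    api_key=os.environ["LLAMA_FOOD"],  # bearer token from the llama-food secret
)
resp = client.chat.completions.create(
    model="qwq",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=50,
)
print(resp.choices[0].message.content)
```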
You get $30 of free credits per month from Modal.
All of the models you want to store, whether through Ollama or Hugging Face, are declared in `args.py`. Create this file in the project root and fill it with the following example. The nice part is that you can create as many configs as you like, and all of the models can be downloaded and built into the volume.
```python
from typing import Literal

# Models downloaded directly from Hugging Face ("org/repo/filename" GGUF paths).
DOWNLOAD = {
    "qwen": {
        "hf_path": "Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-fp16.gguf",
        "pet_name": "Qwen",
        "modelfile": "qwen-test",
        "gpu": "t4:1",  # Modal GPU spec: type:count
    },
}
DOWNLOAD_DEFAULT = "qwen"

# Models pulled by Ollama itself (keys are Ollama model names).
PULL = {"qwq": {"gpu": "l4:1"}}
PULL_DEFAULT = "qwq"

# Which of the two sources the deploy commands should use.
CHOSEN_SOURCE: Literal["download"] | Literal["pull"] = "download"
```
Set `CHOSEN_SOURCE` to `pull` or `download` depending on whether you want to deploy a model you pulled or one you downloaded. The Modal app is created/used with the GPU size from the respective config.
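For intuition, the selection presumably amounts to something like the sketch below; the helper name is made up and not part of the repo.

```python
# Hypothetical helper showing how CHOSEN_SOURCE could pick the active config.
from args import CHOSEN_SOURCE, DOWNLOAD, DOWNLOAD_DEFAULT, PULL, PULL_DEFAULT


def active_config() -> dict:
    # "download" -> DOWNLOAD[DOWNLOAD_DEFAULT]; "pull" -> PULL[PULL_DEFAULT]
    if CHOSEN_SOURCE == "download":
        return DOWNLOAD[DOWNLOAD_DEFAULT]
    return PULL[PULL_DEFAULT]


print(active_config()["gpu"])  # e.g. "t4:1" with the example config above
```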
```bash
modal run tame_llama::pull
```

This will use the `PULL_DEFAULT` config.
```bash
modal run --detach tame_llama::download
modal run --detach tame_llama::compile
```

These will use the `DOWNLOAD_DEFAULT` config.
Test with `qwen` to make sure everything is working.
```bash
# first, create and activate a Python virtual environment
pip install modal
modal setup
# See the Config section to populate this file
touch args.py
modal secret create huggingface-secret HF_TOKEN=<secret>
modal secret create llama-food LLAMA_FOOD=<secret>  # Bearer auth for the FastAPI proxy
modal run --detach tame_llama::pull
```
See the Modal CLI for `app`, `shell`, `deploy`, `secret`, `volume` commands, etc.
To check the Ollama server logs (e.g. from a `modal shell` session in the container):

```bash
journalctl -u ollama --no-pager
```
A convenient alias for chatting with the deployed `qwq` model via `client.py`:

```bash
alias chat='python client.py \
  --app-name=myllamas-gpu-l4-1-myllamas \
  --function-name=serve \
  --model=qwq \
  --max-tokens 1000 \
  --api-key "$LLAMA_FOOD" \
  --temperature 0.9 \
  --frequency-penalty 1.03 \
  --chat'
```