runpod-worker-exllamav2

Note: GPTQ models with fast ExLlamaV2 inference on RunPod Serverless.

This is based on Yushi Homma's exllama-runpod-serverless repo, but has been rewritten for exllamav2.

Summary

This Docker image runs a GPTQ model on a serverless RunPod worker using turboderp's optimized exllamav2 library.

It also builds on a repo from ashleyk (see ashleyk's GitHub).

Set Up

  1. Create a RunPod account and navigate to the Serverless Console.
  2. (Optional) Create a Network Volume to cache your model and speed up cold starts (this incurs a small hourly storage cost).
    • Note: Only certain Network Volume regions are compatible with certain instance types on RunPod. If your Network Volume makes your desired instance type show as Unavailable, try a different region for the Network Volume.

Mistral 7b Instruct v0.2 Config

  1. Navigate to Templates and click on the New Template button.

  2. Fill in the following fields and click on the Save Template button:

    Template Field    Value
    Template Name     exllama2-mistral7b-instructv0.2-gptq-V2
    Container Image   jrcastropy/runpod-exllamav2:v0.3
    Container Disk    20 GB (large enough to store the libraries plus your desired model in 4-bit)
    • Container Disk Size Guide:

      Model Parameters    Storage & VRAM
      7B                  6 GB
      13B                 9 GB
      33B                 19 GB
      65B                 35 GB
      70B                 38 GB
    • Environment Variables (a sketch of how the worker might consume these appears after this table):

      Environment Variable                               Example Value
      (Required) MODEL_REPO                              TheBloke/Mistral-7B-Instruct-v0.2-GPTQ, or any other GPTQ Mistral repo; see https://huggingface.co/models?other=llama&sort=trending&search=thebloke+gptq for other models. The repo must contain .safetensors file(s).
      (Optional) MAX_SEQ_LEN                             4096
      (Optional) ALPHA_VALUE                             1
      (If using Network Volumes) HUGGINGFACE_HUB_CACHE   /runpod-volume/hub
      (If using Network Volumes) TRANSFORMERS_CACHE      /runpod-volume/hub
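For reference, here is a minimal sketch of how a worker could map MODEL_REPO, MAX_SEQ_LEN and ALPHA_VALUE onto exllamav2's Python API. This is an illustration under assumptions, not the handler actually shipped in this image; the code in this repo may load and serve the model differently.

# Minimal sketch (assumption: the real handler in this image may differ).
# Downloads the GPTQ weights from the Hugging Face Hub (honouring
# HUGGINGFACE_HUB_CACHE if set) and loads them with exllamav2.
import os
from huggingface_hub import snapshot_download
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = snapshot_download(os.environ["MODEL_REPO"])  # needs .safetensors file(s)

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
config.max_seq_len = int(os.environ.get("MAX_SEQ_LEN", 4096))
config.scale_alpha_value = float(os.environ.get("ALPHA_VALUE", 1))

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.3

print(generator.generate_simple("USER: Hello! ASSISTANT: ", settings, 200))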

Mistral 7b Instruct v0.2 Endpoint

  1. Now click on Serverless and click on the New Endpoint button.
  2. Fill in the following fields and click on the Create button:
    Endpoint Field              Value
    Endpoint Name               exllamav2-mistral7b-instruct0.2-gptq
    Select Template             exllama2-mistral7b-instructv0.2-gptq-V2
    Min Provisioned Workers     0
    Max Workers                 3
    Idle Timeout                20 seconds
    FlashBoot                   Checked/Enabled
    GPU Type(s)                 Use the Container Disk Size Guide above to determine the smallest GPU that can load the entire 4-bit model; in this example, a 16 GB GPU. You can go smaller if using a Network Volume instead. (A back-of-the-envelope estimate follows this table.)
    (Optional) Network Volume   container-mistral7b
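For a rough sense of the numbers in the Container Disk Size Guide and the 16 GB GPU choice above, the estimate below (an illustration, not part of this repo) approximates 4-bit weight size plus a fixed allowance for the KV cache and activations.

# Back-of-the-envelope VRAM/storage estimate for a 4-bit (GPTQ) model.
# Illustrative only; real usage depends on group size, context length, etc.
def estimate_gb(params_billion: float, bits: int = 4, overhead_gb: float = 2.5) -> float:
    weights_gb = params_billion * bits / 8  # e.g. 7B at 4-bit ~= 3.5 GB of weights
    return weights_gb + overhead_gb         # headroom for KV cache and activations

print(estimate_gb(7))   # ~6 GB, matching the size guide
print(estimate_gb(70))  # ~37.5 GB, close to the 38 GB listed for 70B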

Inference Usage

See the predict.py file for an example.

Run predict.py from a terminal with the following command, using the RunPod endpoint ID assigned to the endpoint you just created.

RUNPOD_AI_API_KEY='**************' RUNPOD_ENDPOINT_ID='*******' python predict.py

To run with streaming enabled, use the --stream option. To set generation parameters, use the --params_json option to pass a JSON string of parameters:

RUNPOD_AI_API_KEY='**************' RUNPOD_ENDPOINT_ID='*******' python predict.py --params_json '{"temperature": 0.3, "max_tokens": 1000, "prompt_prefix": "USER: ", "prompt_suffix": "ASSISTANT: "}'

You can generate the API key under your RunPod Settings under API Keys.
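If you would rather not use predict.py, the endpoint can also be called directly over RunPod's serverless HTTP API. The sketch below is an illustration: it assumes the worker accepts a prompt plus the same generation parameters shown in the --params_json example above; check predict.py for the exact input schema.

# Minimal sketch of calling the endpoint over RunPod's serverless HTTP API.
# The "input" fields mirror the --params_json example; the exact schema is
# defined by the worker's handler (see predict.py).
import os
import requests

endpoint_id = os.environ["RUNPOD_ENDPOINT_ID"]
api_key = os.environ["RUNPOD_AI_API_KEY"]

payload = {
    "input": {
        "prompt": "USER: Explain GPTQ in one sentence. ASSISTANT: ",
        "temperature": 0.3,
        "max_tokens": 200,
    }
}

# /runsync blocks until the job completes; use /run plus /status (or /stream)
# for asynchronous or streaming jobs.
response = requests.post(
    f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
    timeout=300,
)
response.raise_for_status()
print(response.json())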

Build your own docker image

If you want to build the image yourself, see tutorial.txt.

Appreciate Ashleyk's work?

Buy Me A Coffee
