Merge pull request #15 from VectorInstitute/develop
Develop
XkunW authored Sep 3, 2024
2 parents 3641ef2 + 59e7622 commit d10758d
Showing 9 changed files with 20 additions and 27 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -61,7 +61,7 @@ vec-inf list Meta-Llama-3.1-70B-Instruct

## Send inference requests
Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
> {"id":"cmpl-bdf43763adf242588af07af88b070b62","object":"text_completion","created":2983960,"model":"/model-weights/Llama-2-7b-hf","choices":[{"index":0,"text":"\nCanada is close to the actual continent of North America. Aside from the Arctic islands","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
> {"id":"cmpl-c08d8946224747af9cce9f4d9f36ceb3","object":"text_completion","created":1725394970,"model":"Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" is a question that many people may wonder. The answer is, of course, Ottawa. But if","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.

2 changes: 1 addition & 1 deletion examples/inference/llm/chat_completions.py
@@ -5,7 +5,7 @@

# Update the model path accordingly
completion = client.chat.completions.create(
model="/model-weights/Meta-Llama-3-8B-Instruct",
model="Meta-Llama-3.1-8B-Instruct",
messages=[
{
"role": "system",
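The rest of this example is collapsed in the diff above. For reference, a minimal end-to-end version of the updated call might look like the following sketch; the `gpuXXX:XXXX` placeholder and the message contents are illustrative, not part of the commit:

```python
from openai import OpenAI

# Placeholder base URL: substitute your server's host and port
client = OpenAI(base_url="http://gpuXXX:XXXX/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is the capital of Canada?"},
    ],
)

# The generated reply is on the first choice's message
print(completion.choices[0].message.content)
```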
2 changes: 1 addition & 1 deletion examples/inference/llm/completions.py
@@ -5,7 +5,7 @@

# Update the model path accordingly
completion = client.completions.create(
model="/model-weights/Meta-Llama-3-8B",
model="Meta-Llama-3.1-8B-Instruct",
prompt="Where is the capital of Canada?",
max_tokens=20,
)
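Assuming a client configured as in the snippet above, the generated text and token usage can be read off the returned object. This is a sketch of typical field access in the OpenAI Python client, not code from the commit:

```python
# Legacy completions put the generated text on choices[0].text
print(completion.choices[0].text)

# Token accounting mirrors the JSON response shown in the README
print(completion.usage.prompt_tokens, completion.usage.completion_tokens)
```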
2 changes: 1 addition & 1 deletion examples/inference/llm/completions.sh
@@ -5,7 +5,7 @@ export API_BASE_URL=http://gpuXXX:XXXX/v1
curl ${API_BASE_URL}/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/model-weights/Meta-Llama-3-8B",
"model": "Meta-Llama-3.1-8B-Instruct",
"prompt": "What is the capital of Canada?",
"max_tokens": 20
}'
2 changes: 1 addition & 1 deletion examples/inference/vlm/vision_completions.py
@@ -5,7 +5,7 @@

# Update the model path accordingly
completion = client.chat.completions.create(
model="/model-weights/llava-1.5-13b-hf",
model="llava-1.5-13b-hf",
messages=[
{
"role": "user",
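The message payload of this vision example is collapsed above. As a hedged sketch of a single-image request in the OpenAI chat format (the image URL and host are placeholders; per the README note, only one image per prompt is supported):

```python
from openai import OpenAI

client = OpenAI(base_url="http://gpuXXX:XXXX/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="llava-1.5-13b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Only one image per prompt is currently supported
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```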
2 changes: 1 addition & 1 deletion examples/logits/logits.py
@@ -4,7 +4,7 @@
client = OpenAI(base_url="http://gpuXXX:XXXXX/v1", api_key="EMPTY")

completion = client.completions.create(
model="/model-weights/Meta-Llama-3-8B",
model="Meta-Llama-3.1-8B-Instruct",
prompt="Where is the capital of Canada?",
max_tokens=1,
logprobs=32000, # Set to model vocab size to get logits
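Two notes on this example. First, 32000 is the Llama 2 vocabulary size; Llama 3.1 models use a larger vocabulary (128,256 tokens), so the hard-coded value likely needs a matching update. Second, here is a sketch of how the returned distribution can be inspected; the field names follow the OpenAI legacy completions response, and this is not code from the commit:

```python
# top_logprobs is a list with one dict per generated token,
# mapping token strings to their log-probabilities
first_token = completion.choices[0].logprobs.top_logprobs[0]

# Show the five most likely tokens for the first position
for token, logprob in sorted(first_token.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{token!r}: {logprob:.3f}")
```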
23 changes: 9 additions & 14 deletions poetry.lock

Some generated files are not rendered by default.

9 changes: 3 additions & 6 deletions profile/gen.py
@@ -1,6 +1,7 @@
-import requests
import time

+import requests
+
# Change the ENDPOINT and MODEL_PATH to match your setup
ENDPOINT = "http://gpuXXX:XXXX/v1"
MODEL_PATH = "Meta-Llama-3-70B"
@@ -71,11 +72,7 @@


def send_request(prompt):
-    data = {
-        "model": f"/model-weights/{MODEL_PATH}",
-        "prompt": prompt,
-        "max_tokens": 100,
-    }
+    data = {"model": f"{MODEL_PATH}", "prompt": prompt, "max_tokens": 100}
start_time = time.time()
response = requests.post(f"{ENDPOINT}/completions", headers=HEADERS, json=data)
duration = time.time() - start_time
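The surrounding benchmark loop is collapsed in the diff. As an illustrative sketch only (this driver, and the assumption that `send_request` returns the duration it measures, are not part of the file):

```python
# Hypothetical driver: time a handful of requests and report mean latency,
# assuming send_request returns the duration it measures
durations = [send_request(f"Prompt {i}: where is the capital of Canada?") for i in range(10)]
print(f"mean latency over {len(durations)} requests: {sum(durations) / len(durations):.2f}s")
```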
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "vec-inf"
version = "0.3.2"
version = "0.3.3"
description = "Efficient LLM inference on Slurm clusters using vLLM."
authors = ["Marshall Wang <[email protected]>"]
license = "MIT license"
@@ -11,6 +11,7 @@ python = "^3.10"
requests = "^2.31.0"
click = "^8.1.0"
rich = "^13.7.0"
pandas = "^2.2.2"
vllm = { version = "^0.5.0", optional = true }
vllm-nccl-cu12 = { version = ">=2.18,<2.19", optional = true }
ray = { version = "^2.9.3", optional = true }
