Replies: 4 comments
-
Hey @xKwan! I'm here to help with your issue. I can assist with bug fixes, answer questions, and guide you to become a contributor. Let's tackle this challenge together!

To resolve the KeyError: 'text', here is a potential fix for the complete method:

```python
def complete(
    self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
    kwargs = kwargs if kwargs else {}
    params = {**self._model_kwargs, **kwargs}

    # build sampling parameters
    sampling_params = dict(**params)
    sampling_params["prompt"] = prompt

    response = post_http_request(self.api_url, sampling_params, stream=False)
    output = get_response(response)

    # Ensure the response contains the 'text' key
    if 'text' not in output[0]:
        raise KeyError("The response does not contain the 'text' key")

    return CompletionResponse(text=output[0]['text'])
```

This code checks whether the 'text' key is present in the parsed response and raises a clearer error when it is missing. Additionally, ensure that your VllmServer is set up correctly:

```python
from llama_index.llms.vllm import VllmServer


def messages_to_prompt(messages):
    prompt = "\n".join([str(x) for x in messages])
    return f"<s>[INST] {prompt} [/INST] </s>\n"


def completion_to_prompt(completion):
    return f"<s>[INST] {completion} [/INST] </s>\n"


llm = VllmServer(
    api_url="http://localhost:8000/v1/completions",
    max_new_tokens=256,
    temperature=0.1,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

llm.complete("What is a black hole?")
```

Ensure that the api_url points at your running vLLM server.
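If the error persists, a quick way to see what the server actually returns is to post to the completions endpoint directly and print the raw JSON. This is a minimal sketch, assuming the OpenAI-compatible server from the issue below is running on localhost:8000; the model value is a placeholder for whatever was passed to --model:

```python
import json

import requests

# Hit the completions endpoint directly and dump the raw JSON, so you can see
# whether the payload carries a top-level "text" key or an OpenAI-style "choices" list.
payload = {
    "model": "/path/to/my/local/model",  # placeholder: the value passed to --model
    "prompt": "What is a black hole?",
    "max_tokens": 16,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```

If the JSON shows a "choices" list rather than a top-level "text" key, that mismatch is what the KeyError is complaining about.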
-
Did you find a solution? I have the same issue.
-
I followed your suggestion and get the same error; the traceback still ends at line 9 in get_response. My script imports VllmServer, overrides complete, and defines messages_to_prompt, completion_to_prompt, and a main() entry point guarded by if __name__ == '__main__':, but the KeyError: 'text' persists.
-
I tried to rewrite the function in the VllmServer class to solve this problem:
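A hypothetical rewrite along these lines, assuming the OpenAI-compatible /v1/completions endpoint from the issue below, would read the completion from choices[0]["text"] instead of a top-level "text" key. The class name, the direct requests call, and the default model path are illustrative assumptions, not the actual code used here:

```python
from typing import Any

import requests
from llama_index.core.llms import CompletionResponse
from llama_index.llms.vllm import VllmServer


class OpenAICompatVllmServer(VllmServer):
    """Illustrative subclass for vLLM's OpenAI-compatible /v1/completions route,
    which nests the generated text under choices[0]["text"]."""

    def complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponse:
        payload = {
            "model": "/path/to/my/local/model",  # placeholder: the value passed to --model
            "prompt": prompt,
            "max_tokens": self.max_new_tokens,
            "temperature": self.temperature,
            **kwargs,
        }
        response = requests.post(self.api_url, json=payload, timeout=60)
        response.raise_for_status()
        data = response.json()
        # OpenAI-style payload: the completion text lives under choices[0]["text"]
        return CompletionResponse(text=data["choices"][0]["text"])
```

A subclass like this can be instantiated with the same api_url, messages_to_prompt, and completion_to_prompt arguments shown in the first reply.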
Output:
-
Issue:
I want to serve an LLM application in production, so I am hosting the LLM with vLLM and connecting my documents to it with LlamaIndex. When I tried a sample inference, I got KeyError: 'text'.
Library versions used:
vllm: 0.4.0.post1
llama_index: 0.10.42
llama_index.llms.vllm: 0.1.7
Server Setup:
I installed vLLM and started a vLLM server with the following command in the terminal:

```shell
python3 -m vllm.entrypoints.openai.api_server --model=/path/to/my/local/model --dtype=float16 --tensor-parallel-size=8 --quantization=awq --gpu-memory-utilization=0.7
```
It is hosted on localhost:8000.
I did a sanity check with a curl command:
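A rough Python equivalent of that check, assuming the OpenAI-compatible server exposes the standard /v1/models listing (a sketch, not the exact curl call used):

```python
import requests

# Confirm the OpenAI-compatible server is up by listing the served models;
# the response should include the path passed to --model.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```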
Application Setup:
I followed the reference guide here:
https://docs.llamaindex.ai/en/stable/api_reference/llms/vllm/#llama_index.llms.vllm.VllmServer
KeyError: 'text'