LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs #2218

Open
cphoward opened this issue Dec 20, 2023 · 2 comments
Labels
bug Something isn't working

Comments

cphoward commented Dec 20, 2023

Describe the bug

I am attempting to run the LLaMA2 demo at https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md. When I run:

python client.py --url localhost:9000 --question "Write python function to sum 3 numbers." --seed 1332 --actor python-programmer

I get

raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Invalid number of inputs - Expected: 67; Actual: 66"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Invalid number of inputs - Expected: 67; Actual: 66", grpc_status:3, created_time:"2023-12-20T17:45:16.689999237+00:00"}"

To Reproduce
Steps to reproduce the behavior:

  1. Follow the demo steps.

Expected behavior

I expected results similar to those shown in the demo documentation.

Configuration

docker run -d --rm -p 9000:9000 -v $(pwd)/models/llama-2-7b-hf:/model:ro openvino/model_server \
    --port 9000 \
    --model_name llama \
    --model_path /model \
    --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'

Additional context
I installed nncf for int8 compression. Is there a way to configure the example to use int4 compression?
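
For reference, if the OpenVINO IR has already been exported, NNCF's weight-compression API can in principle be pointed at it directly. A rough sketch (the model paths, group_size and ratio are guesses, and the INT4 modes need a fairly recent NNCF release):

import nncf
from openvino.runtime import Core, serialize

core = Core()
# Hypothetical path to the exported IR used by the demo
model = core.read_model("models/llama-2-7b-hf/openvino_model.xml")

# Compress most weight matrices to INT4, keeping a fraction in INT8 for accuracy.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
)

serialize(compressed, "models/llama-2-7b-int4/openvino_model.xml")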

Update:
It seems the missing input is position_ids.

@cphoward cphoward added the bug Something isn't working label Dec 20, 2023
@cphoward cphoward changed the title LLaMA2 Model Serving Chat Demo Errors on Invalid number of arguments LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs Dec 20, 2023
@cphoward (Author)

After playing around with the model, I've found that something like

import numpy as np

# `tokenizer` and `client` are assumed to be already set up as in the demo's client.py.
def prepare_preprompt_kv_cache(preprompt):
    inputs = tokenizer(preprompt, return_tensors="np", add_special_tokens=False)
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    }

    # Generate position ids based on the length of the input
    seq_length = inputs["input_ids"].shape[1]
    model_inputs["position_ids"] = np.arange(seq_length)[None, :]

    # Initialize empty past key values for each of the 32 layers:
    # shape is (batch=1, num_heads=32, past_seq_len=0, head_dim=128) for llama-2-7b
    for i in range(32):
        model_inputs[f"past_key_values.{i}.key"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
        model_inputs[f"past_key_values.{i}.value"] = np.zeros((1, 32, 0, 128), dtype=np.float32)

    return client.predict(inputs=model_inputs, model_name='llama')

won't crash if I also change the PREPROMPT to something relatively short. It crashes when attempting to run with the default PREPROMPT:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Internal inference error"
	debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:9000 {created_time:"2023-12-21T22:42:05.540244646+00:00", grpc_status:13, grpc_message:"Internal inference error"}"

The server logs give:

[2023-12-21 22:25:15.297][62][serving][error][modelinstance.cpp:1168] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
Exception from src/inference/src/dev/converter_utils.cpp:707:
[ GENERAL_ERROR ] Shape inference of Multiply node with name __module.model.layers.0.self_attn/aten::mul/Multiply failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:47:

I have confirmed with a custom script that the model can run inference, but the output is mostly gibberish and only a few characters long.
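
For what it's worth, the gibberish may just mean the follow-up requests are not wired up correctly. Below is a rough sketch of a greedy decode loop; it assumes the exported IR exposes a logits output plus present.{i}.key/value cache outputs (which is what optimum-intel exports of llama-2-7b typically look like) and reuses the tokenizer and client from the demo script:

import numpy as np

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128  # llama-2-7b dimensions

def greedy_generate(prompt, max_new_tokens=32):
    enc = tokenizer(prompt, return_tensors="np", add_special_tokens=False)
    input_ids = enc["input_ids"]
    attention_mask = enc["attention_mask"]
    # Empty KV cache on the first request
    past = {f"past_key_values.{i}.{kv}": np.zeros((1, NUM_HEADS, 0, HEAD_DIM), dtype=np.float32)
            for i in range(NUM_LAYERS) for kv in ("key", "value")}
    generated = []
    position = 0
    for _ in range(max_new_tokens):
        seq_len = input_ids.shape[1]
        request = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "position_ids": np.arange(position, position + seq_len, dtype=np.int64)[None, :],
            **past,
        }
        outputs = client.predict(inputs=request, model_name="llama")
        next_token = int(np.argmax(outputs["logits"][0, -1]))
        generated.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break
        # Assumption: the cache outputs are named present.{i}.key / present.{i}.value
        past = {f"past_key_values.{i}.{kv}": outputs[f"present.{i}.{kv}"]
                for i in range(NUM_LAYERS) for kv in ("key", "value")}
        # On subsequent steps, send only the new token and grow the attention mask
        position += seq_len
        input_ids = np.array([[next_token]], dtype=np.int64)
        attention_mask = np.concatenate(
            [attention_mask, np.ones((1, 1), dtype=attention_mask.dtype)], axis=1)
    return tokenizer.decode(generated)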

@dkalinowski (Collaborator)

Hello @cphoward

We have recently removed the demo you are referring to.
However, please check the new version, which uses the new MediaPipe Python calculator feature that makes it easier to serve LLaMA: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation

Please let us know if you have any feedback.
