LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs #2218

Open
cphoward opened this issue Dec 20, 2023 · 2 comments
Labels
bug Something isn't working

Comments

cphoward commented Dec 20, 2023

Describe the bug

I am attempting to run the LLaMA2 demo at https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md. When I run:

python client.py --url localhost:9000 --question "Write python function to sum 3 numbers." --seed 1332 --actor python-programmer

I get

raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Invalid number of inputs - Expected: 67; Actual: 66"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Invalid number of inputs - Expected: 67; Actual: 66", grpc_status:3, created_time:"2023-12-20T17:45:16.689999237+00:00"}"

To Reproduce
Steps to reproduce the behavior:

  1. Follow the demo steps.

Expected behavior

I expected results similar to those shown in the demo documentation.

Configuration

docker run -d --rm -p 9000:9000 -v $(pwd)/models/llama-2-7b-hf:/model:ro openvino/model_server \
    --port 9000 \
    --model_name llama \
    --model_path /model \
    --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'

Additional context
I installed nncf for int8 compression. Is there a way to configure the example to use int4 compression?
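
For reference, if the OpenVINO IR has already been exported, NNCF's weight-compression API can in principle be pointed at it directly. A rough sketch (the model paths, group_size and ratio are guesses, and the INT4 modes need a fairly recent NNCF release):

import nncf
from openvino.runtime import Core, serialize

core = Core()
# Hypothetical path to the exported IR used by the demo
model = core.read_model("models/llama-2-7b-hf/openvino_model.xml")

# Compress most weight matrices to INT4, keeping a fraction in INT8 for accuracy.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
)

serialize(compressed, "models/llama-2-7b-int4/openvino_model.xml")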

Update:
It seems the missing input is position_ids.

@cphoward cphoward added the bug Something isn't working label Dec 20, 2023
@cphoward cphoward changed the title LLaMA2 Model Serving Chat Demo Errors on Invalid number of arguments LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs Dec 20, 2023
@cphoward (Author)

After playing around with the model, I've found that something like

import numpy as np

# `tokenizer` and `client` are assumed to be already set up as in the demo's client.py.
def prepare_preprompt_kv_cache(preprompt):
    inputs = tokenizer(preprompt, return_tensors="np", add_special_tokens=False)
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    }

    # Generate position ids based on the length of the input
    seq_length = inputs["input_ids"].shape[1]
    model_inputs["position_ids"] = np.arange(seq_length)[None, :]

    # Initialize empty past key values for each of the 32 layers:
    # shape is (batch=1, num_heads=32, past_seq_len=0, head_dim=128) for llama-2-7b
    for i in range(32):
        model_inputs[f"past_key_values.{i}.key"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
        model_inputs[f"past_key_values.{i}.value"] = np.zeros((1, 32, 0, 128), dtype=np.float32)

    return client.predict(inputs=model_inputs, model_name='llama')

won't crash if I also change the PREPROMPT to something relatively short. It crashes when attempting to run with the default PREPROMPT:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Internal inference error"
	debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:9000 {created_time:"2023-12-21T22:42:05.540244646+00:00", grpc_status:13, grpc_message:"Internal inference error"}"

The server logs give:

[2023-12-21 22:25:15.297][62][serving][error][modelinstance.cpp:1168] Async caught an exception Internal inference error: Exception from src/inference/src/infer_request.cpp:256:
Exception from src/inference/src/dev/converter_utils.cpp:707:
[ GENERAL_ERROR ] Shape inference of Multiply node with name __module.model.layers.0.self_attn/aten::mul/Multiply failed: Exception from src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:47:

I have confirmed with a custom script that the model can run inference, but the output is mostly gibberish and only a few characters long.
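
For what it's worth, the gibberish may just mean the follow-up requests are not wired up correctly. Below is a rough sketch of a greedy decode loop; it assumes the exported IR exposes a logits output plus present.{i}.key/value cache outputs (which is what optimum-intel exports of llama-2-7b typically look like) and reuses the tokenizer and client from the demo script:

import numpy as np

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128  # llama-2-7b dimensions

def greedy_generate(prompt, max_new_tokens=32):
    enc = tokenizer(prompt, return_tensors="np", add_special_tokens=False)
    input_ids = enc["input_ids"]
    attention_mask = enc["attention_mask"]
    # Empty KV cache on the first request
    past = {f"past_key_values.{i}.{kv}": np.zeros((1, NUM_HEADS, 0, HEAD_DIM), dtype=np.float32)
            for i in range(NUM_LAYERS) for kv in ("key", "value")}
    generated = []
    position = 0
    for _ in range(max_new_tokens):
        seq_len = input_ids.shape[1]
        request = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "position_ids": np.arange(position, position + seq_len, dtype=np.int64)[None, :],
            **past,
        }
        outputs = client.predict(inputs=request, model_name="llama")
        next_token = int(np.argmax(outputs["logits"][0, -1]))
        generated.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break
        # Assumption: the cache outputs are named present.{i}.key / present.{i}.value
        past = {f"past_key_values.{i}.{kv}": outputs[f"present.{i}.{kv}"]
                for i in range(NUM_LAYERS) for kv in ("key", "value")}
        # On subsequent steps, send only the new token and grow the attention mask
        position += seq_len
        input_ids = np.array([[next_token]], dtype=np.int64)
        attention_mask = np.concatenate(
            [attention_mask, np.ones((1, 1), dtype=attention_mask.dtype)], axis=1)
    return tokenizer.decode(generated)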

@dkalinowski (Collaborator)

Hello @cphoward

We have recently removed the demo you are referring to.
However, please check the new version, which uses the new MediaPipe Python calculator feature that makes it easier to serve LLaMA: https://github.com/openvinotoolkit/model_server/tree/main/demos/python_demos/llm_text_generation

Please let us know if you have any feedback.
