LLaMA2 Model Serving Chat Demo Errors on Invalid number of inputs #2218
Comments
After playing around with models, I've found something like

```python
def prepare_preprompt_kv_cache(preprompt):
    inputs = tokenizer(preprompt, return_tensors="np", add_special_tokens=False)
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    }
    # Generate position ids based on the length of the input
    seq_length = inputs["input_ids"].shape[1]
    model_inputs["position_ids"] = np.arange(seq_length)[None, :]
    # Initialize past key values for each layer
    for i in range(32):
        model_inputs[f"past_key_values.{i}.key"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
        model_inputs[f"past_key_values.{i}.value"] = np.zeros((1, 32, 0, 128), dtype=np.float32)
    return client.predict(inputs=model_inputs, model_name='llama')
```

won't crash if I also change the
The server logs give:
I have confirmed with a custom script I wrote that the model can do inference, but the output is mostly gibberish and only a few characters long.
Hello @cphoward, we have recently removed the demo you refer to. Please let us know if you have any feedback.
Describe the bug
I am attempting to run the LLaMA2 demo at https://github.com/openvinotoolkit/model_server/blob/main/demos/llama_chat/python/README.md. When I run:
python client.py --url localhost:9000 --question "Write python function to sum 3 numbers." --seed 1332 --actor python-programmer
I get
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected results similar to the demo documentation.
Configuration
Additional context
I did install `nncf` for `int8` compression. Is there a way to configure the example to use `int4` compression?

**Update**

It seems the missing argument is `position_ids`.
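Once `position_ids` is supplied, later decoding steps must continue the positions from the cached length rather than restart at zero, or the model attends with wrong positions and produces gibberish. A minimal NumPy sketch of that bookkeeping (the function name is illustrative, not part of the demo's API):

```python
import numpy as np

def next_position_ids(past_length, num_new_tokens):
    """Position ids for tokens appended after `past_length` cached tokens."""
    return np.arange(past_length, past_length + num_new_tokens)[None, :]

print(next_position_ids(0, 4))  # first step, 4 prompt tokens: [[0 1 2 3]]
print(next_position_ids(4, 1))  # next generated token:        [[4]]
```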