failed to decode the batch #5

Open
hoverflow opened this issue Oct 11, 2023 · 1 comment
@hoverflow

Hi, when I use server-parallel I get this error: updateSlots : failed to decode the batch, n_batch = 1, ret = 1

This is the complete log before the error:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 107.54 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 800.00 MB
llama_new_context_with_model: kv self size = 800.00 MB
llama_new_context_with_model: compute buffer total size = 118.13 MB
llama_new_context_with_model: VRAM scratch buffer: 112.00 MB
llama_new_context_with_model: total VRAM used: 9606.21 MB (model: 8694.21 MB, context: 912.00 MB)
Available slots:

  • slot 0
  • slot 1

llama server listening at http://0.0.0.0:8080

system prompt updated
slot 0 is processing
slot 0 released
slot 0 is processing
slot 0 released
slot 0 is processing
slot 0 released
slot 0 is processing
slot 0 released
slot 0 is processing
slot 1 is processing
updateSlots : failed to decode the batch, n_batch = 1, ret = 1

I run server-parallel with the following command:
./server-parallel -m models/xyz.gguf --ctx_size 2048 -t 4 -ngl 40 --host 0.0.0.0 --batch-size 512 --parallel 2

Of course, this only happens when both slots are performing inference at the same time. Could you please help me resolve this issue?

@FSSRepo
Owner

FSSRepo commented Oct 11, 2023

That can only happen if the input prompt is too long. Could you provide a video demonstrating the behavior? Please remember that the 2048-token context is shared between the two sequences.
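
For reference, here is a minimal sketch of how the mainline llama.cpp server divides a shared context among parallel slots; this fork's server-parallel may differ in the details, so treat it as an assumption rather than this repository's exact code. With --ctx_size 2048 and --parallel 2, each slot only gets about 1024 tokens, and a prompt (plus generated tokens) longer than that per-slot budget cannot be decoded.

// Sketch, not this repository's exact code: splitting a shared context
// of n_ctx tokens evenly among n_parallel slots.
#include <cstdio>

int main() {
    const int n_ctx      = 2048; // --ctx_size
    const int n_parallel = 2;    // --parallel

    const int n_ctx_slot = n_ctx / n_parallel; // per-slot context budget

    for (int i = 0; i < n_parallel; ++i) {
        std::printf("slot %d: %d tokens of context\n", i, n_ctx_slot);
    }
    return 0;
}

In practice this means either shortening the prompts sent to each slot or raising --ctx_size so that the per-slot share covers the longest expected prompt.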

Edit:

Use the fixes branch to apply the latest changes; master is outdated.
