That error can only happen if the input prompt is too long. If you can, provide a video demonstrating the behavior, and please remember that the 2048 context is shared between the two sequences.

Edit: use the fixes branch to apply the latest changes; master is outdated.
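A minimal sketch of switching to that branch (this assumes a branch named fixes exists on the remote of the repository the server was built from; rebuild afterwards with whatever build command you normally use):

git fetch origin
git checkout fixes
# then rebuild server-parallel as usual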
Hi, when I use server-parallel I get the following error: updateSlots : failed to decode the batch, n_batch = 1, ret = 1
This is the complete log before the error:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 107.54 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 8694.21 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 800.00 MB
llama_new_context_with_model: kv self size = 800.00 MB
llama_new_context_with_model: compute buffer total size = 118.13 MB
llama_new_context_with_model: VRAM scratch buffer: 112.00 MB
llama_new_context_with_model: total VRAM used: 9606.21 MB (model: 8694.21 MB, context: 912.00 MB)
Available slots:
llama server listening at http://0.0.0.0:8080
system prompt updated
slot 0 is processing
slot 0 released
slot 0 is processing
slot 0 released
slot 0 is processing
slot 0 released
slot 0 is processing
slot 0 released
slot 0 is processing
slot 1 is processing
updateSlots : failed to decode the batch, n_batch = 1, ret = 1
I run server-parallel with the following command:
./server-parallel -m models/xyz.gguf --ctx_size 2048 -t 4 -ngl 40 --host 0.0.0.0 --batch-size 512 --parallel 2
Note that this only happens when both slots are performing inference at the same time. Could you please help me resolve this issue?
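To make the point in the reply above concrete (an illustration only, assuming the context is split evenly between the active slots, which the thread does not state explicitly): with --ctx_size 2048 and --parallel 2, each concurrent request has roughly 2048 / 2 = 1024 tokens for its prompt plus generated output, so a long prompt can stop fitting as soon as the second slot becomes active. If the prompts really need that much room, one option is the same command with a larger total context; the 4096 value below is an arbitrary example and assumes enough free VRAM for the larger KV cache:

./server-parallel -m models/xyz.gguf --ctx_size 4096 -t 4 -ngl 40 --host 0.0.0.0 --batch-size 512 --parallel 2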