Enhancement: Speculative decoding – load 2 models at the same time! #1207

Closed · aleksusklim opened this issue Nov 9, 2024 · 13 comments
Labels: enhancement (New feature or request)

@aleksusklim

ggerganov#2926
ggerganov#3624
ggerganov#5625

Feature request

  1. Implement speculative sampling in koboldcpp.
  2. Implement optional loading of 2 models, which is required for speculative decoding.
  3. Also allow switching the main model during normal generation: the client can choose whether the current request should be:
  • Generated speculatively, using both the main and draft models?
  • Generated using only the main large model, without drafting?
  • Generated using only the small model, as if it were a separate one?
  4. Keep the context cache independent for the two models when speculative sampling is disabled in Lite, and sync them on the next speculative request.

Background

I used the MIQU model (https://huggingface.co/miqudev/miqu-1-70b) for quite a long time, starting many months ago. It is a 70b, and of course it won't fit in my 3060 GPU with 12 GB of VRAM.

That model was better than anything else I had ever tried! I didn't want to run it heavily quantized to 2 bits and sacrifice its quality, especially since I have 128 GB of DDR4 RAM.

I could get about 1 token/second (or slightly more while the context is short) by activating CuBLAS with 0 offloaded layers.
Later, llama.cpp was updated with a new quantization algorithm that hurts performance for older models (which would have to be requantized, not an option for this leaked/unofficial miqu model). Still, it was not that bad, and I kept playing with miqu.

But recently Mistral Large 2 came out (https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF), with 123b parameters!

For me it is superior to miqu at every possible task, and it is even less censored.
Unfortunately, such a huge model runs at 0.3 tokens/second from an empty context, and it gets even slower over time…

I tried different ways to speed it up, but CuBLAS with 0 offloaded layers is still the best (and because of GGUF format changes I cannot roll back to an older koboldcpp version to check whether the previous CUDA kernels might be faster).
A Q4 quant instead of Q5 gives a slight improvement: 0.4 tokens/second (+0.1 compared to Q5).

After searching for information about which model could be used as a draft model for speculative sampling with Mistral Large 2, I decided to try Mistral 7B Instruct v0.3 (https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF).

Strangely enough, llama.cpp has some redundant vocabulary checks (https://github.com/ggerganov/llama.cpp/blob/f018acba22095b8995bf6c5ef815b16a3ce4cf1b/examples/speculative/speculative.cpp#L119-L136). I had to recompile from source with those asserts commented out to make it accept Mistral 7B Instruct v0.3 (as Q5_K_M) as a draft model for Mistral Large Instruct 2407 (as Q4_K_S).
I also had to build with full CUDA support for a fair comparison.
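
(To make "vocabulary checks" concrete: the linked guard is essentially a compatibility gate between the two models' vocabularies. Below is only a loose illustration of that kind of check, not the actual llama.cpp code behind the link; the function name and the threshold of 100 are my own placeholders.)

// Loose illustration of a draft/main vocabulary compatibility gate.
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

static bool vocabs_look_compatible(const std::vector<std::string> &main_pieces,
                                   const std::vector<std::string> &draft_pieces,
                                   int max_size_diff = 100)
{
    // Reject when the vocab sizes are too far apart.
    if (std::abs((int)main_pieces.size() - (int)draft_pieces.size()) > max_size_diff)
        return false;

    // Reject when a shared token id maps to different text in the two models.
    const size_t n = std::min(main_pieces.size(), draft_pieces.size());
    for (size_t i = 0; i < n; ++i)
        if (main_pieces[i] != draft_pieces[i])
            return false;

    return true;
}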

The final speedup was huge! With 3-5 drafted tokens I got double the speed: 0.85-1.0 tokens/second!

I think it is worth having in koboldcpp as well.

What exactly I propose

  1. You need a separate model loader on a dedicated tab in the config GUI, where a user can set the device and layer offload strategy. (I don't know the technical details, for example how much of the configuration has to be identical between the two models for speculative sampling to work; but I imagine you'd want free control over anything that may legitimately differ for the draft model.)
  2. All critical parameters should be read from the main model, like the context size (at worst the drafting degrades, but it will not break the result, provided speculative sampling is implemented correctly).
  3. Add a new dropdown to the Lite sampling tab asking which model to use: default/speculative, large/main, or small/draft. Default means "allow drafting if enabled", while the other options tell koboldcpp not to use speculative decoding.
  4. When the client asks not to use speculation, koboldcpp proceeds with normal generation, but with the chosen model (whether main or draft).
  5. It looks like you need to keep two independent context caches and synchronize them only when speculative drafting is requested. Meaning, if one "user" generates with only the main model, another user can later generate with the draft model without destroying the first user's context cache.
  6. Default to speculative sampling for unknown clients that do not pass the new field over the API, so that they benefit from the improved speed anyway.
  7. Think of some heuristics for the number of drafted tokens. For example, if the draft model heavily agrees with the main model over several steps, koboldcpp can increment the draft amount (up to the value specified in the server config, defaulting to e.g. 8); and decrement it if the last couple of drafted tokens keep being discarded (down to the other specified value, defaulting to the minimal possible 2). A rough sketch follows after this list.
  8. Try to accept any model as a draft, even if it is not quite compatible. If current llama.cpp cannot also use a declared draft model as a normal model, then some upstream changes would be necessary too.
  9. ContextShift would probably not work, and maybe some other features wouldn't either. Since this mode is completely optional, it won't hurt anyone who doesn't need it.
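
(For point 7, here is a rough sketch of such an adaptive draft-amount heuristic. Illustrative C++ only: the struct, field names and thresholds are made up, and the rule is a simplified per-step version of what I described; nothing like this exists in koboldcpp.)

// Illustrative only: adaptive draft amount, grown when the draft model keeps
// agreeing with the main model and shrunk when drafted tokens keep being wasted.
#include <algorithm>

struct DraftTuner {
    int n_draft = 4;   // tokens drafted per step right now
    int n_min   = 2;   // smallest useful draft
    int n_max   = 8;   // upper bound from server config

    void update(int accepted, int drafted)
    {
        if (accepted == drafted)                  // fully agreed: draft more next time
            n_draft = std::min(n_draft + 1, n_max);
        else if (accepted + 2 <= drafted)         // the tail keeps being discarded: draft fewer
            n_draft = std::max(n_draft - 1, n_min);
    }
};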

Sampling

Only greedy sampling (temp=0 or top_k=1) is straightforward to implement for speculative decoding. Still, algorithms exist that allow stochastic sampling from the drafted token probabilities (I'm not quite sure how it is implemented in llama.cpp: do they generate a most-probable token tree recursively? Do they just estimate output probabilities, sacrificing fidelity to the main model's actual logits?)
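
(For reference, a minimal sketch of what greedy speculative verification amounts to. This is my illustration, not koboldcpp's or llama.cpp's actual code; the two std::function parameters stand in for the draft model's greedy sampler and for the main model's greedy pick at each drafted position after one batched evaluation.)

// Greedy speculative verification, illustrative sketch only.
#include <functional>
#include <vector>

int speculate_step(std::vector<int> &output, int n_draft,
                   const std::function<int()>    &draft_next_token,  // small model, greedy
                   const std::function<int(int)> &main_pick_at)      // main model's token at drafted position i
{
    std::vector<int> drafted;
    for (int i = 0; i < n_draft; ++i)
        drafted.push_back(draft_next_token());   // small model proposes n_draft tokens

    int accepted = 0;
    for (int i = 0; i < n_draft; ++i) {
        const int verified = main_pick_at(i);    // the main model's choice is always what gets kept
        output.push_back(verified);
        if (verified != drafted[i])              // first mismatch ends the step;
            break;                               // the remaining drafted tokens are discarded
        ++accepted;                              // a match means this token came almost for free
    }
    return accepted;
}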

Here I suggest living with whatever is implemented in llama.cpp. Even if only greedy sampling worked correctly, it would still be a huge improvement, because:

  1. Large models are very confident in their tokens: the latest update with the logits list shows that Mistral Large (and miqu) returns 100% probabilities most of the time, even with "sane" sampling parameters (top_p<=0.9, temp<=0.9), so de facto it already behaves as if we were sampling greedily. This is easy to confirm with multiple retries at any point: the model basically says the same thing each time.
  2. Beyond roleplay, a user might want to "have a local ChatGPT" and ask it questions (rather than "come up with a story"). In those cases it is pretty normal to sample with zero temperature, for tasks like text translation, summarization, source code generation and answering factual questions.
  3. With a runtime setting to switch to the draft model, the user can just switch and retry many times whenever the story becomes boring! Since the draft model is both fast and smart (it has to be, to be actually useful as a drafter for the large one), its genuine answers would not be that bad.

Use cases!

  1. Speculative decoding to improve speed without compromising output quality, provided the user has extra VRAM/RAM. Basically, you add a suitable draft model and transparently get a noticeable speedup!
  2. Comparison of two unrelated models, with the ability to switch between them as if they were loaded in two instances of koboldcpp, but more conveniently. You write a story and then change the model to see how differently it would continue (especially useful for testing finetuned versions by comparing their logits in the middle of a good story).
  3. Preserving the context cache when running two stories simultaneously in two Lite tabs, again as if they were loaded on two separate servers.
  4. Performing small tasks like memory summarization or image prompt expansion with the draft model while running a large story (with or without speculation), because the small model should be quick enough to reprocess the whole context, which is not the case for the large model.

I see another improvement that technically becomes possible once all of this is implemented: the ability to use two context caches while still running a single model (a rough sketch follows after the list):

  • The user selects something like "use the draft slot only as a separate context cache" when starting the server
  • The speculative mechanism is not instantiated
  • Instead, its separate context cache is assigned to the same main model
  • In the default drafting mode, koboldcpp behaves as if the user chose to generate on the main model without speculation
  • When Lite asks specifically to use the draft model – only its context cache slot is used, without destroying the main one
  • This would allow running two stories together with one model in one instance of koboldcpp, which is cheaper than running two instances on different ports
  • If this idea becomes popular, you could move to an arbitrary number of contexts instead, totally separating this logic from the speculative stuff
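
(A hypothetical sketch of the "one model, two context slots" idea from the bullets above. None of these types exist in koboldcpp; this only illustrates the bookkeeping.)

// Hypothetical: one model, two independent context slots.
#include <string>
#include <vector>

struct ContextSlot {
    std::vector<int> tokens;   // tokens currently held in this slot's KV cache
    int n_past = 0;            // how far this slot's cache is already evaluated
};

struct DualContextState {
    ContextSlot main_slot;     // the "main" story
    ContextSlot draft_slot;    // reused as a second, fully independent story

    // "draft" requests touch only the second slot, so the main story's cache
    // survives untouched, and vice versa.
    ContextSlot &pick(const std::string &requested_model)
    {
        return requested_model == "draft" ? draft_slot : main_slot;
    }
};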

There are two things that need to be done: 2 (or more) models and contexts at the same time, and speculative decoding using 2 models.
If you implement several models, you can then rather easily add speculative decoding too.
Conversely, if you want speculative decoding, you have to implement loading of several models anyway.
Then, with two models in memory, you can imagine something like "model offloading" or "switching on demand", where a model may be unloaded and replaced with another at runtime.

But those are possible future improvements, while speculative decoding is useful by itself!

@LostRuins LostRuins added the enhancement New feature or request label Nov 9, 2024
@stepfunction83

stepfunction83 commented Nov 25, 2024

Support for this was added to llama.cpp today:

ggerganov#10455

It shows 40-60% speedups in practice.

@LostRuins
Owner

Support is now added to KoboldCpp. Do give it a try!

@aleksusklim
Author

How do I confirm whether it is working or not?

It prints Attempting to load draft model for speculative decoding. It will be fully offloaded if possible. Vocab must match the main model.

While I know that the Mistral tokenizer is partially different between Large and 7B, no errors followed.
It says it offloaded 33/33 layers (while I had 0 chosen for CuBLAS). Are you sure we don't need a cap configured?

During generation I see 0% load on the GPU, periodically spiking to 40-60%. Is that the draft model running?
Also, "Generating (X / Y tokens)" is incremented by more than 1 at a time.

But where are the statistics for speculative sampling? For example, "drafted 40 tokens, accepted 29 tokens, on average 3.4 tokens per step",
so that I can judge how well the drafting is going.

Still, can you add a flag to switch to the draft model at runtime? (At least that much: not actually using the two models independently, but simply forcing generation on the small model instead of running both speculatively.)
This way it would be possible to quickly compare what the draft model "thinks" about your story in general, and how fast it runs. For example, if its output looks like nonsense, then something is wrong (context too long, unknown input language, wrong template format).

@LostRuins
Owner

You can run it in --debugmode. The draft model output will be displayed along with the validated result and whether it matches.

To verify drafting, you can compare two kinds of instructions:

"Please give me the first 100 positive integers" - the draft model will do very well on this at low temperature. You will see large chunks of correctly guessed tokens being output.

"Write a funny story about zebras" - the draft model will do poorly on this, as it's hard to speculate on creative output.

@aleksusklim
Author

aleksusklim commented Dec 1, 2024

Oh, Debug Mode, should have thought of that!

It's working, but it renders UTF-8 characters in the system locale, I think? For me that's win-1251.
When I generate Russian text, the console prints two-character sequences instead of Cyrillic letters.

I even tried executing chcp 65001 before launching koboldcpp, but it had no effect.
The Windows 10 console is capable of rendering Unicode characters (for example, chr(1044) prints 'Д' and not 'Р”' in Python 3.10) as long as the corresponding glyph is present in the chosen font.

Or are you afraid of decoding Unicode at that point?

UPD: when the full generated text is printed to the console at the end as one string, it renders correctly.

@LostRuins
Owner

Unicode characters are often made of multiple tokens. For example, け is made of 0xe3 0x81 0x91. In a case where the AI only generates the token for 0xe3, it will not render as what you expect.

You should cross reference the token IDs with the byte representations in the vocab. It's more complicated than you think.

For example, しけ is made of the bytes 0xe3 0x81 0x97 0xe3 0x81 0x91. When using Qwen2.5, this tokenizes to [0xe3 0x81 0x97 0xe3 0x81 , 0x91]. None of the subsequences make sense on their own.
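
(To make the problem concrete: a drafted sequence can end in the middle of a multi-byte character, and such a tail simply cannot be rendered yet. A small standalone check like the one below, assuming standard UTF-8 byte layout and nothing koboldcpp-specific, tells whether a byte string is printable as complete characters.)

// Returns true if the byte string does not end in the middle of a multi-byte
// UTF-8 character. Illustrative helper, not koboldcpp code.
#include <string>

static bool utf8_is_complete(const std::string &s)
{
    size_t i = s.size();
    size_t trailing = 0;
    while (i > 0 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;            // walk backwards over continuation bytes (10xxxxxx)
        ++trailing;
    }
    if (i == 0) return trailing == 0;                  // nothing but continuation bytes
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    if ((lead & 0x80) == 0x00) return trailing == 0;   // plain ASCII byte
    if ((lead & 0xE0) == 0xC0) return trailing == 1;   // 2-byte sequence
    if ((lead & 0xF0) == 0xE0) return trailing == 2;   // 3-byte sequence, e.g. 0xe3 0x81 0x91
    if ((lead & 0xF8) == 0xF0) return trailing == 3;   // 4-byte sequence
    return false;                                      // invalid lead byte
}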

@aleksusklim
Author

Oh, I get it. You print individual tokens only up to the first failed guess, but no further.
Yet you actually have all N drafted tokens that you got from the small model!

To be able to compare the texts visually, it is enough to just utf8-decode everything we have from the draft model so far, without truncating at the very first wrong token. (Yes, if the failed token was the very last one, we get exactly the same thing: nothing is known after it.)

If you additionally printed the full drafted string, it would be trivial to compare what was generated against "what the smaller model was trying to say"!

@LostRuins
Owner

LostRuins commented Dec 2, 2024

Sure. You can print the full drafted tokens by printing out the token IDs here:
https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L694

I have a helper function print_tok_vec_str which you can use; just pass in the draft ID array.

@aleksusklim
Author

@LostRuins, I was mistaken.

Running chcp 65001 in the console before launching koboldcpp.exe actually does fix the gibberish rendering!
(I have no idea why my previous test didn't work. Now I rebuilt from source inside the w64devkit shell and noticed that there I saw a different kind of gibberish, not cp1251; then I ran python koboldcpp.py in CMD to compare, tried executing chcp 65001 first, and it suddenly rendered Unicode correctly! Surprised, I tried the same with the official .exe, and it worked too, printing the tokens as utf8.)

This is what I added to the code, as you suggested:

Before the line
// if we have somehow skipped ahead (e.g drafting), ensure that all tokens after npast are purged
(after the big loop that drafts tokens)

if(debugmode==1 && draft_used){
    // Print the whole drafted sequence and how many of the drafted tokens were accepted
    printf("\nSpeculation: [%s] (correct=%d/%d)\n", get_tok_vec_str_concat(draft_results.draftids).c_str(), logits_sampled, logits_to_sample);
}

Where get_tok_vec_str_concat is like yours, but concatenates the pieces into one string:

// Like get_tok_vec_str, but glues the token pieces together into readable text
static std::string get_tok_vec_str_concat(std::vector<int> &embd)
{
    std::string tmp = "";
    for (auto id : embd)
    {
        tmp += FileFormatTokenizeID(id, file_format, true);
    }
    ::utreplace(tmp, "\n", "\\n");   // keep the debug output on a single console line
    return tmp;
}

Here are full logs of a run (Mistral-Large-Instruct-2407 + Mistral-7B-Instruct-v0.3):

CMD shell
C:\tmp\koboldcpp>chcp 65001
Active code page: 65001

C:\tmp\koboldcpp>python koboldcpp.py
***
Welcome to KoboldCpp - Version 1.79.1
For command line arguments, please refer to --help
***
Auto Selected Vulkan Backend...

Attempting to use CPU library.
Initializing dynamic library: koboldcpp_default.dll
==========
Namespace(model='', model_param='I:/Mistral-Large-Instruct-2407.Q5_K_S-00001-of-00004.gguf', port=5001, port_param=5001, host='127.0.0.1', launch=True, config=None, threads=8, usecublas=None, usevulkan=None, useclblast=None, usecpu=True, contextsize=1024, gpulayers=0, tensor_split=None, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=8, lora=None, noshift=True, nofastforward=False, nommap=False, usemlock=True, noavx2=False, debugmode=1, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, multiplayer=False, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, draftmodel='I:/Mistral-7B-Instruct-v0.3-Q5_K_M.gguf', draftamount=16, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=True, quantkv=2, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=7, sdclamped=0, sdt5xxl='', sdclipl='', sdclipg='', sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)
==========
Loading model: I:\Mistral-Large-Instruct-2407.Q5_K_S-00001-of-00004.gguf

The reported GGUF Arch is: llama
Arch Category: 0

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 37 key-value pairs and 795 tensors from I:\Mistral-Large-Instruct-2407.Q5_K_S-00001-of-00004.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 12288
llm_load_print_meta: n_layer          = 88
llm_load_print_meta: n_head           = 96
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 12
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 122.61 B
llm_load_print_meta: model size       = 78.56 GiB (5.50 BPW)
llm_load_print_meta: general.name     = Mistral Large Instruct 2407
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors:   CPU_Mapped model buffer size = 22722.80 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 22813.64 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 22797.14 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 12113.72 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 1024
llama_new_context_with_model: n_ctx_per_seq = 1024
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =    99.00 MiB
llama_new_context_with_model: KV self size  =   99.00 MiB, K (q4_0):   49.50 MiB, V (q4_0):   49.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   187.01 MiB
llama_new_context_with_model: graph nodes  = 2471
llama_new_context_with_model: graph splits = 1

Attempting to load draft model for speculative decoding. It will be fully offloaded if possible. Vocab must match the main model.
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from I:/Mistral-7B-Instruct-v0.3-Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 7.25 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors:   CPU_Mapped model buffer size =  4897.52 MiB
...................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 1024
llama_new_context_with_model: n_ctx_per_seq = 1024
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (1024) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =    36.00 MiB
llama_new_context_with_model: KV self size  =   36.00 MiB, K (q4_0):   18.00 MiB, V (q4_0):   18.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    83.01 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 1
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://127.0.0.1:5001/api/
Starting OpenAI Compatible API on port 5001 at http://127.0.0.1:5001/v1/
======
Please connect to custom endpoint at http://127.0.0.1:5001
IPv6 Socket Failed to Bind. IPv6 will be unavailable.
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /api/v1/config/max_context_length HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /manifest.json HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /api/extra/version HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /api/extra/true_max_context_length HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /sdapi/v1/sd-models HTTP/1.1" 200 -
127.0.0.1 - - [05/Dec/2024 03:38:09] "GET /api/extra/preloadstory HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 16384, "max_length": 128, "rep_pen": 1, "temperature": 1, "top_p": 1, "top_k": 1, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "trim_stop": true, "genkey": "KCPP8066", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "banned_tokens": [], "render_special": true, "logprobs": false, "dry_multiplier": 1, "dry_base": 1.75, "dry_allowed_length": 3, "dry_penalty_last_n": 360, "dry_sequence_breakers": ["\n", ":", "\"", "*"], "presence_penalty": 0, "logit_bias": {}, "prompt": "[INST]\u041d\u0430\u043f\u0438\u0448\u0438 \u0432\u0441\u0435 \u0431\u0443\u043a\u0432\u044b \u0440\u0443\u0441\u0441\u043a\u043e\u0433\u043e \u0430\u043b\u0444\u0430\u0432\u0438\u0442\u0430 \u0447\u0435\u0440\u0435\u0437 \u0437\u0430\u043f\u044f\u0442\u0443\u044e.[/INST]\n\n\u0410, \u0411", "quiet": true, "stop_sequence": ["<|user|>", "<|model|>", "\n", "<"], "use_default_badwordsids": false, "bypass_eos": false}

(Warning! Request max_context_length=16384 exceeds allocated context size of 1024. It will be reduced to fit. Consider launching with increased --contextsize to avoid errors. This message will only show once per session.)127.0.0.1 - - [05/Dec/2024 03:38:37] "POST /api/extra/generate/stream HTTP/1.1" 200 -


Processing 4 dry break strings...
Found a total of 285 restart heads, 285 trivial, 0 non-trivial.

Using Seed: 351917

[Debug: Dump Raw Input Tokens, format: 6]
'<s> (1)', '[INST] (3)', 'На (24960)', 'пи (3517)', 'ши (3875)', ' все (11548)', ' бу (5981)', 'к (29563)', 'вы (5857)', ' рус (25147)', 'ского (5577)', ' ал (29119)', 'фа (14228)', 'ви (2219)', 'та (1714)', ' через (20853)', ' за (2354)', 'п (29575)', 'я (29579)', 'ту (3277)', 'ю (29610)', '. (29491)', '[/INST] (4)', '\n (781)', '\n (781)', 'А (29626)', ', (29493)', ' Б (2984)',


[Debug: Dump Forwarded Input Tokens, format: 6]
'<s> (1)', '[INST] (3)', 'На (24960)', 'пи (3517)', 'ши (3875)', ' все (11548)', ' бу (5981)', 'к (29563)', 'вы (5857)', ' рус (25147)', 'ского (5577)', ' ал (29119)', 'фа (14228)', 'ви (2219)', 'та (1714)', ' через (20853)', ' за (2354)', 'п (29575)', 'я (29579)', 'ту (3277)', 'ю (29610)', '. (29491)', '[/INST] (4)', '\n (781)', '\n (781)', 'А (29626)', ', (29493)', ' Б (2984)',

[Debug: n_past=0 Context Size = 0]


Processing Prompt (28 / 28 tokens)
Generating (1 / 128 tokens) [(, 100.00%)]
(Draft 1/16): Predicted=2287 ( В), Actual=2287 ( В) [PASS]
Generating (2 / 128 tokens) [( В 100.00%)]
(Draft 2/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (3 / 128 tokens) [(, 100.00%)]
(Draft 3/16): Predicted=3360 ( Г), Actual=3360 ( Г) [PASS]
Generating (4 / 128 tokens) [( Г 100.00%)]
(Draft 4/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (5 / 128 tokens) [(, 100.00%)]
(Draft 5/16): Predicted=3143 ( Д), Actual=3143 ( Д) [PASS]
Generating (6 / 128 tokens) [( Д 100.00%)]
(Draft 6/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (7 / 128 tokens) [(, 100.00%)]
(Draft 7/16): Predicted=6380 ( Е), Actual=6380 ( Е) [PASS]
Generating (8 / 128 tokens) [( Е 100.00%)]
(Draft 8/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (9 / 128 tokens) [(, 100.00%)]
(Draft 9/16): Predicted=11768 ( Ж), Actual=29473 ( ) [FAIL]
Generating (10 / 128 tokens) [(  100.00%)]

Speculation: [ В, Г, Д, Е, Ж, З, И, Й,] (correct=9/16)
(Draft 1/16): Predicted=31024 (Ё), Actual=31024 (Ё) [PASS]
Generating (11 / 128 tokens) [(Ё 100.00%)]
(Draft 2/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (12 / 128 tokens) [(, 100.00%)]
(Draft 3/16): Predicted=11768 ( Ж), Actual=11768 ( Ж) [PASS]
Generating (13 / 128 tokens) [( Ж 100.00%)]
(Draft 4/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (14 / 128 tokens) [(, 100.00%)]
(Draft 5/16): Predicted=4708 ( З), Actual=4708 ( З) [PASS]
Generating (15 / 128 tokens) [( З 100.00%)]
(Draft 6/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (16 / 128 tokens) [(, 100.00%)]
(Draft 7/16): Predicted=4383 ( И), Actual=4383 ( И) [PASS]
Generating (17 / 128 tokens) [( И 100.00%)]
(Draft 8/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (18 / 128 tokens) [(, 100.00%)]
(Draft 9/16): Predicted=20868 ( Й), Actual=20868 ( Й) [PASS]
Generating (19 / 128 tokens) [( Й 100.00%)]
(Draft 10/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (20 / 128 tokens) [(, 100.00%)]
(Draft 11/16): Predicted=2456 ( К), Actual=2456 ( К) [PASS]
Generating (21 / 128 tokens) [( К 100.00%)]
(Draft 12/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (22 / 128 tokens) [(, 100.00%)]
(Draft 13/16): Predicted=3986 ( Л), Actual=3986 ( Л) [PASS]
Generating (23 / 128 tokens) [( Л 100.00%)]
(Draft 14/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (24 / 128 tokens) [(, 100.00%)]
(Draft 15/16): Predicted=2662 ( М), Actual=2662 ( М) [PASS]
Generating (25 / 128 tokens) [( М 100.00%)]
(Draft 16/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (26 / 128 tokens) [(, 100.00%)]

Speculation: [Ё, Ж, З, И, Й, К, Л, М,] (correct=16/16)
(Draft 1/16): Predicted=2965 ( Н), Actual=2965 ( Н) [PASS]
Generating (27 / 128 tokens) [( Н 100.00%)]
(Draft 2/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (28 / 128 tokens) [(, 100.00%)]
(Draft 3/16): Predicted=3575 ( О), Actual=3575 ( О) [PASS]
Generating (29 / 128 tokens) [( О 100.00%)]
(Draft 4/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (30 / 128 tokens) [(, 100.00%)]
(Draft 5/16): Predicted=2299 ( П), Actual=2299 ( П) [PASS]
Generating (31 / 128 tokens) [( П 100.00%)]
(Draft 6/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (32 / 128 tokens) [(, 100.00%)]
(Draft 7/16): Predicted=3167 ( Р), Actual=3167 ( Р) [PASS]
Generating (33 / 128 tokens) [( Р 100.00%)]
(Draft 8/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (34 / 128 tokens) [(, 100.00%)]
(Draft 9/16): Predicted=2174 ( С), Actual=2174 ( С) [PASS]
Generating (35 / 128 tokens) [( С 100.00%)]
(Draft 10/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (36 / 128 tokens) [(, 100.00%)]
(Draft 11/16): Predicted=3543 ( Т), Actual=3543 ( Т) [PASS]
Generating (37 / 128 tokens) [( Т 100.00%)]
(Draft 12/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (38 / 128 tokens) [(, 100.00%)]
(Draft 13/16): Predicted=4143 ( У), Actual=4143 ( У) [PASS]
Generating (39 / 128 tokens) [( У 100.00%)]
(Draft 14/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (40 / 128 tokens) [(, 100.00%)]
(Draft 15/16): Predicted=4879 ( Ф), Actual=4879 ( Ф) [PASS]
Generating (41 / 128 tokens) [( Ф 100.00%)]
(Draft 16/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (42 / 128 tokens) [(, 100.00%)]

Speculation: [ Н, О, П, Р, С, Т, У, Ф,] (correct=16/16)
(Draft 1/16): Predicted=5492 ( Х), Actual=5492 ( Х) [PASS]
Generating (43 / 128 tokens) [( Х 100.00%)]
(Draft 2/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (44 / 128 tokens) [(, 100.00%)]
(Draft 3/16): Predicted=9679 ( Ц), Actual=9679 ( Ц) [PASS]
Generating (45 / 128 tokens) [( Ц 100.00%)]
(Draft 4/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (46 / 128 tokens) [(, 100.00%)]
(Draft 5/16): Predicted=6245 ( Ч), Actual=6245 ( Ч) [PASS]
Generating (47 / 128 tokens) [( Ч 100.00%)]
(Draft 6/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (48 / 128 tokens) [(, 100.00%)]
(Draft 7/16): Predicted=7374 ( Ш), Actual=7374 ( Ш) [PASS]
Generating (49 / 128 tokens) [( Ш 100.00%)]
(Draft 8/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (50 / 128 tokens) [(, 100.00%)]
(Draft 9/16): Predicted=29473 ( ), Actual=29473 ( ) [PASS]
Generating (51 / 128 tokens) [(  100.00%)]
(Draft 10/16): Predicted=29892 (Щ), Actual=29892 (Щ) [PASS]
Generating (52 / 128 tokens) [(Щ 100.00%)]
(Draft 11/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (53 / 128 tokens) [(, 100.00%)]
(Draft 12/16): Predicted=29473 ( ), Actual=29473 ( ) [PASS]
Generating (54 / 128 tokens) [(  100.00%)]
(Draft 13/16): Predicted=31813 (Ъ), Actual=31813 (Ъ) [PASS]
Generating (55 / 128 tokens) [(Ъ 100.00%)]
(Draft 14/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (56 / 128 tokens) [(, 100.00%)]
(Draft 15/16): Predicted=29473 ( ), Actual=29473 ( ) [PASS]
Generating (57 / 128 tokens) [(  100.00%)]
(Draft 16/16): Predicted=31352 (Ы), Actual=31352 (Ы) [PASS]
Generating (58 / 128 tokens) [(Ы 100.00%)]

Speculation: [ Х, Ц, Ч, Ш, Щ, Ъ, Ы] (correct=16/16)
(Draft 1/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (59 / 128 tokens) [(, 100.00%)]
(Draft 2/16): Predicted=29473 ( ), Actual=29473 ( ) [PASS]
Generating (60 / 128 tokens) [(  100.00%)]
(Draft 3/16): Predicted=31451 (Ь), Actual=31451 (Ь) [PASS]
Generating (61 / 128 tokens) [(Ь 100.00%)]
(Draft 4/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (62 / 128 tokens) [(, 100.00%)]
(Draft 5/16): Predicted=9179 ( Э), Actual=9179 ( Э) [PASS]
Generating (63 / 128 tokens) [( Э 100.00%)]
(Draft 6/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (64 / 128 tokens) [(, 100.00%)]
(Draft 7/16): Predicted=12164 ( Ю), Actual=12164 ( Ю) [PASS]
Generating (65 / 128 tokens) [( Ю 100.00%)]
(Draft 8/16): Predicted=29493 (,), Actual=29493 (,) [PASS]
Generating (66 / 128 tokens) [(, 100.00%)]
(Draft 9/16): Predicted=10787 ( Я), Actual=10787 ( Я) [PASS]
Generating (67 / 128 tokens) [( Я 100.00%)]
(Draft 10/16): Predicted=2 (</s>), Actual=2 (</s>) [PASS]
Generating (68 / 128 tokens) [(</s> 100.00%)]

(EOS token triggered! ID:2)
Speculation: [, Ь, Э, Ю, Я</s>столько я не знаю] (correct=10/16)

llama_perf_context_print:        load time =   17372.31 ms
llama_perf_context_print: prompt eval time =       0.00 ms /   112 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   80457.07 ms /   113 tokens

[03:39:57] CtxLimit:96/1024, Amt:68/128, Init:0.02s, Process:17.35s (619.8ms/T = 1.61T/s), Generate:63.09s (927.7ms/T = 1.08T/s), Total:80.44s (0.85T/s)
Output: , В, Г, Д, Е, Ё, Ж, З, И, Й, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х, Ц, Ч, Ш, Щ, Ъ, Ы, Ь, Э, Ю, Я

The drafted part, </s>столько я не знаю, looks very funny: since I asked it to spell out the Russian alphabet, after the EOS it tried to say "I don't know that many", as if the model felt it was being forced to keep printing letters!

So, two final questions:

  1. Don't you think you should always force the UTF-8 codepage for the koboldcpp process yourself, so that the C++ core can correctly print Unicode to the console, at least on Windows and at least in Debug Mode? (Implementation example: https://stackoverflow.com/a/55171823; a sketch of the approach follows below.)
  2. Would you take my implementation of printing the draft result? (I'm not sure about the initial \n in the string though; it's probably not actually needed.)
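
(For reference, the linked answer essentially amounts to switching the attached console to the UTF-8 codepage at startup via the Win32 API. A minimal Windows-only sketch of that approach, not code that exists in koboldcpp:)

// Force the attached console to UTF-8 so printf of UTF-8 byte strings renders
// correctly; the programmatic equivalent of running "chcp 65001" first.
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

static void force_utf8_console()
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);
#endif
}

int main()
{
    force_utf8_console();
    std::printf("Д\n");   // now renders as Cyrillic 'Д' instead of byte gibberish
    return 0;
}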

@LostRuins
Owner

LostRuins commented Dec 5, 2024

  1. The configuration of a user's command prompt terminal is not up to me, and I wouldn't want to modify it from what they've set. Either way, it's easy enough for a user to change it.
  2. I guess I can print it out; to me the draft doesn't matter much, but I see no harm in displaying it.

5106816

@aleksusklim
Author

For 1: most of the time your .exe is run via Explorer and not from a terminal, so it is your process that creates and ultimately owns the console. But checking for that on Windows is even more inconvenient, so I agree it is easy enough to just create a .bat file that runs chcp and then koboldcpp.
For 2: get_tok_vec_str prints a list of tokens and ids in parentheses, not the "text"! The whole point was to make it easy to read…

@aleksusklim
Author

Weird, now chcp 65001 is not working anymore… neither with v1.80 nor with older versions.
But… how!? I am 100% sure it just worked as-is for me last time. Now I've tried running from a .bat file, from plain CMD, from an elevated admin prompt, under w64devkit, from PowerShell, and even with
$OutputEncoding = [System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
(https://stackoverflow.com/q/51933189)
I see the same utf8 byte-gibberish in the console again.

But then I clicked "Extra → Unpack KoboldCpp To Folder" and then tried chcp 65001 + python koboldcpp.py there.
Now it works, printing correct characters!

@LostRuins
Owner

Like I said before, it's likely a setting from the terminal console itself and not within koboldcpp. Different terminals can have different settings for text encoding.
