-
Thanks for pinging me, it's interesting to learn about your past attempts with SVD. In the LQER paper they don't seem to use it on top of SOTA quantization methods (they seem to use it on top of MXINT), so I'm simply curious to see if it's viable to apply it on top of k-quants and i-quants. It might not be worth it, though, as you say.

But there's also something else which they did not try in the paper: subtracting a low-rank decomposition of the weights and then quantizing only what remains, while the LoRA adapter of the quantization error should be able to recover it. I did not yet experiment with different ranks for the two low-rank approximations, but in my preliminary tests this does help. It's possible that a specialized quantization type for the non-low-rank part of the weights could be useful, but I did not yet study how the distribution changes when subtracting a low-rank approximation. My hypothesis is that non-linear asymmetric quant types have an advantage for this.

I did not yet implement L²QER, so I don't know how it would perform yet. You're likely very right that it won't be good, but I want to try, because it will enable other experiments like different error-minimization objectives for the quantized dense tensor and the low-rank adapter. Also, I have not yet implemented Numpy dequantization for most of the quant types.
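To make the idea concrete, here is a rough NumPy sketch of that pipeline (my own illustration, not the actual implementation; `truncated_svd` and `fake_round_trip` are hypothetical stand-ins, the latter for whichever k-/i-quant type would actually be used):

```python
import numpy as np

def truncated_svd(a: np.ndarray, rank: int):
    """Best rank-`rank` approximation of `a`, returned as factors (A, B) with a ~ A @ B."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

def fake_round_trip(a: np.ndarray, step: float = 0.05) -> np.ndarray:
    """Placeholder for quantize+dequantize of a real quant type."""
    return np.round(a / step) * step

W = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)

# 1) subtract a low-rank part of the weights, quantize only the remainder
A1, B1 = truncated_svd(W, rank=16)
residual_q = fake_round_trip(W - A1 @ B1)

# 2) a second low-rank adapter tries to recover the remaining quantization error
err = W - (residual_q + A1 @ B1)
A2, B2 = truncated_svd(err, rank=16)

W_hat = residual_q + A1 @ B1 + A2 @ B2
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Since the two low-rank corrections have the same shape, their factors can be concatenated and stored or applied as a single LoRA-style adapter alongside the quantized residual.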
-
Perhaps you should ask Georgi? But more seriously: the short answer is 'no'. To generate these tables, I quantized a bunch of models using the full E8 or D4 lattice, and collected statistics on how often each lattice point is being used. This data is already orders of magnitude larger than the final tables.
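For illustration only (this is not the actual tooling, and the per-block scaling here is made up), a minimal sketch of that kind of statistics collection using the simpler D4 lattice: quantize 4-element blocks to the full lattice, count how often each point occurs, and keep the most frequently used points as a candidate codebook.

```python
import numpy as np
from collections import Counter

def nearest_d4(x: np.ndarray) -> tuple:
    """Nearest point of the D4 lattice (integer 4-vectors with even coordinate sum)."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # flip the coordinate with the largest rounding error to its second-nearest integer
        i = int(np.argmax(np.abs(x - r)))
        r[i] += 1.0 if x[i] > r[i] else -1.0
    return tuple(int(v) for v in r)

def collect_usage(weights: np.ndarray) -> Counter:
    """Count how often each lattice point is hit when quantizing scaled 4-element blocks."""
    counts = Counter()
    for block in weights.reshape(-1, 4):
        scaled = 2.5 * block / (np.abs(block).max() + 1e-8)  # made-up per-block scaling
        counts[nearest_d4(scaled)] += 1
    return counts

w = np.random.default_rng(0).normal(size=(512, 512)).astype(np.float32)
usage = collect_usage(w)
codebook = [p for p, _ in usage.most_common(512)]  # keep the most frequently used points
print(f"{len(usage)} distinct lattice points seen, keeping {len(codebook)}")
```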
If you use enough principal components you will eventually get an improvement, of course. But the question is, given the extra bits spent, is the improvement better than what is achievable by using a different quant, using quantization mixes, etc., with the same extra bits spent?
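To make "extra bits spent" concrete, a quick back-of-the-envelope helper (my own illustration; the rank and the precision of the adapter are free parameters):

```python
def adapter_bpw(rows: int, cols: int, rank: int, factor_bits: float = 16.0) -> float:
    """Extra bits per weight for storing rank-`rank` factors A (rows x rank) and B (rank x cols)."""
    return factor_bits * rank * (rows + cols) / (rows * cols)

# A 4096 x 4096 projection with a rank-32 f16 adapter costs an extra 0.25 bpw,
# roughly the gap between IQ4_XS (4.25 bpw) and Q4_K_S (4.5 bpw).
print(adapter_bpw(4096, 4096, 32))          # -> 0.25
print(adapter_bpw(4096, 11008, 32))         # a typical LLaMA FFN projection -> ~0.17
```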
This is the first thing I tried. If that had been successful, we would have gotten not just model compression, but a massive increase in performance too, as matrix multiplications with a low-rank decomposition are much faster than using the full matrix. I did have some moderate success. But then again, I'm one of those people suffering from the NIH syndrome, so I used my own hand-rolled tools for this investigation. Perhaps you will be luckier just using standard tooling.
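The performance point is easy to quantify: if W (rows × cols) is replaced by rank-r factors A·B, then y = W·x costs rows·cols multiply-adds per token, while y = A·(B·x) costs r·(rows + cols). A quick sketch (illustration only):

```python
import numpy as np

def matmul_cost(rows: int, cols: int, rank: int | None = None) -> int:
    """Multiply-adds per token for y = W x, or for y = A (B x) with a rank-`rank` factorization."""
    return rows * cols if rank is None else rank * (rows + cols)

rows, cols, rank = 4096, 4096, 256
print("full matrix     :", matmul_cost(rows, cols))        # ~16.8M multiply-adds
print("rank-256 factors:", matmul_cost(rows, cols, rank))  # ~2.1M, 8x fewer

# numerically the same shape of computation:
rng = np.random.default_rng(0)
A, B, x = rng.normal(size=(rows, rank)), rng.normal(size=(rank, cols)), rng.normal(size=cols)
y = A @ (B @ x)   # never materializes the full rows x cols matrix
```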
-
Btw, on this branch there is some exploration of using SVD before or after the quantization.
-
@compilade With your PR-9400 in `llama.cpp`... Oh well, I'll need to keep my own copy.
-
@compilade Thank you for responding to my concerns.
I must admit I don't understand the concerns. The issue is that one cannot (correctly) combine imatrices computed with different context lengths.
Here is what I do: my imatrix files always carry the context length that was used in their name. Worth noting that

a) the context length has a surprisingly small influence on the quantization results
b) one may want to combine imatrices computed with a different context length to see what happens (what context length are you going to record for the combined imatrix file?)
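As a rough sketch of point b) (my own illustration; I'm assuming the usual picture of an imatrix entry as per-column sums of squared activations plus a count of contributing chunks), combining is mechanically trivial and nothing about it depends on the context length:

```python
import numpy as np

# Hypothetical in-memory form of one imatrix entry for a single tensor:
# per-column sums of squared activations plus the number of chunks that contributed.
def combine(a: dict, b: dict) -> dict:
    """Merge two accumulations for the same tensor; the result is again a valid entry."""
    return {"sums": a["sums"] + b["sums"], "ncall": a["ncall"] + b["ncall"]}

a = {"sums": np.random.rand(4096), "ncall": 200}  # e.g. collected with a context of 512
b = {"sums": np.random.rand(4096), "ncall": 50}   # e.g. collected with a context of 2048
merged = combine(a, b)
importance = merged["sums"] / merged["ncall"]     # per-column importance used for quantization
```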
The imatrix is one and only one thing. I wouldn't know how one wants to "extend" it without it no longer being an imatrix. But suppose we really wanted to extend it. Here is what I would do
Voila, all existing imatrices continue to work, and you can add whatever extensions you like (anywhere you like, not just at the end).
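Purely as an illustration of a scheme with those properties (this is my own sketch, not necessarily what is being proposed; the entry layout and the `__ext__` prefix are hypothetical): if the file is just a list of named entries that readers look up by name, extensions can live in extra entries under reserved names that no tensor will ever have, and old loaders simply never look at them.

```python
import struct

def write_entry(f, name: str, values: list[float], ncall: int = 0) -> None:
    # one named entry: name length, name bytes, call count, value count, values
    data = name.encode("utf-8")
    f.write(struct.pack("<i", len(data))); f.write(data)
    f.write(struct.pack("<ii", ncall, len(values)))
    f.write(struct.pack(f"<{len(values)}f", *values))

with open("example.imatrix", "wb") as f:
    entries = [
        ("blk.0.attn_q.weight", [0.1, 0.2, 0.3, 0.4]),  # a normal imatrix entry
        ("__ext__.n_ctx",       [512.0]),               # an "extension" entry old readers ignore
    ]
    f.write(struct.pack("<i", len(entries)))
    for name, vals in entries:
        write_entry(f, name, vals)
```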
-
LQER/L²QER is the latest hype about LLM quantization. Promptly, there is an issue in `llama.cpp` to use that to improve the existing quantization methods because, you know, the grass is always greener on the other side of the road. But, unlike many earlier calls to improve quantization with the latest "SOTA" quantization advertisement, err, scientific paper, on arXiv, there are already efforts underway to actually implement this. E.g., this PR adds Numpy dequantization so one can use Numpy to do the SVD of the difference between the full model and a quantized model.

People are of course free to spend their energy any way they see fit, and I should rather mind my own business, but I couldn't help myself but put this prediction on the record:

LQER/L²QER will not help to improve any of the k- or i-quants in `llama.cpp`.

Why do I think so?
Having spent so much time on developing all k- and i-quants in `llama.cpp`, I basically remember perplexity (PPL) values for a lot of models, especially the early ones such as LLaMA-v1 and LLaMA-v2. And these are exactly the models the LQER authors compare their quantization against in Table 3 of the paper. So, for me, just a quick look was sufficient to see that the results of the paper are nowhere near being the SOTA they are advertised to be. But let's do the comparison. I reproduce Table 3 here for convenience:

Activation quantization is not quite there yet in `llama.cpp`, so we will focus on the upper part of the table, which shows results when only the model weights are quantized. Let us do some comparisons. I'll use `Q4_K_S`, `IQ4_XS`, and the newly added `IQ4_K` and `IQ3_K`. The L²QER quantization is 4.3 bpw, so it is in the same range as `IQ4_XS` (4.25 bpw) and `Q4_K_S`/`IQ4_K` (4.5 bpw). `IQ3_K` (3.4 bpw) is there to put things into perspective.

I have archived my LLaMA-v1 models and didn't feel like restoring (or re-downloading) the 33B and 65B models, so we will look at 7B and 13B. The PPL results in the paper are computed with standard Python tooling, and it is known that perplexities computed with `llama.cpp` can be quite different from what people get in the Python universe. But the ratio of the quantized PPL to the PPL of the `f16` model is nearly independent of the way PPL has been computed. The authors of the LQER paper have chosen to use the difference `PPL(Q) - PPL(f16)` (the ∆PPL column in Table 3), which is basically the same thing. Nevertheless, let's put some effort into making `llama.cpp` PPL more comparable to Python tooling.
As far as I can tell, there are two main differences in how PPL is computed:

1. `llama.cpp` PPL is evaluated by sequentially going over the provided evaluation text, while in Python samples of the given context length are selected at random. This should not result in a different outcome, at least not beyond the statistical uncertainty of the PPL estimate, so I did not change `llama.cpp` here.
2. In `llama.cpp` the mean log probability is evaluated over the second half of the context window `n_ctx`, while in Python the whole context window is used. Both are approximations to PPL for a context `n_ctx`. The `llama.cpp` approximation is better (to first order, it reports PPL for `3/4 n_ctx`, while the Python estimate is for `1/2 n_ctx`). Nevertheless, let's just change it in `llama.cpp` by adjusting this line. But instead of just using `first = 1`, I adjusted a bit around and ended up using `first = std::max(1, n_ctx/128)`, which gave the closest match between `llama.cpp` and the values reported in Table 3 of the LQER paper (which are for a context of 2048; I know this based on other quantization papers, which quote the same `f16` PPL values and explicitly state the context window used).
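To make the windowing difference tangible, here is a small toy sketch (illustration only, with made-up per-token log-probabilities; this is not the actual `llama.cpp` code):

```python
import numpy as np

def window_ppl(logprobs: np.ndarray, first: int) -> float:
    """PPL over positions first..n_ctx-1 of a single evaluation window."""
    return float(np.exp(-np.mean(logprobs[first:])))

n_ctx = 2048
logprobs = -np.random.default_rng(0).gamma(shape=2.0, scale=0.9, size=n_ctx)  # fake log-probs

print("second half  :", window_ppl(logprobs, n_ctx // 2))          # llama.cpp default
print("whole window :", window_ppl(logprobs, 1))                    # Python-style estimate
print("adjusted     :", window_ppl(logprobs, max(1, n_ctx // 128))) # the change described above
```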
The following table shows the `llama.cpp` `f16` perplexities for the full models computed with this modification:

OK, we can now do the comparison. The table shows ∆PPL for the 4 LLaMA models and the 4 different quantization types. For more convenient comparison I have also added the L²QER result.
I think the difference in performance is clear, and no further discussion is required.
I made this comment back in April of 2023. I had just gotten involved with `llama.cpp` and had started thinking about the quantization of LLMs. With SVD being a standard tool in the toolbox of an ML practitioner, it was one of the first things that came to mind. Did I try? Of course I did - with disappointing results: one needed way too many terms to be competitive with block-wise quantization (I had already started working on k-quants). It is of course possible that my SVD attempts weren't good, and the LQER authors were able to get something out of SVD. But my guess is it is a matter of the quality of the quantization to begin with: if the quality is low, then perhaps one can improve with just the first few components of the singular value decomposition. But if one still has a 2X - 5X larger quantization error after having done that, it is extremely unlikely that one can improve the much better quants by using just a few SVD terms. So, based on this, I reach the above conclusion.
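The "way too many terms" part is easy to sanity-check with a few lines (a sketch of my own, not the original hand-rolled tools): pick the largest rank whose f16 factors fit a roughly 4.5 bpw budget and look at the relative error of the best rank-r approximation.

```python
import numpy as np

def rank_for_budget(rows: int, cols: int, bpw: float, factor_bits: float = 16.0) -> int:
    """Largest rank whose f16 factors fit in a `bpw` bits-per-weight budget."""
    return int(bpw * rows * cols / (factor_bits * (rows + cols)))

def relative_error_at_rank(w: np.ndarray, rank: int) -> float:
    """Relative Frobenius error of the best rank-`rank` approximation of w (Eckart-Young)."""
    s = np.linalg.svd(w, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2) / np.sum(s ** 2)))

w = np.random.default_rng(0).normal(size=(1024, 1024)).astype(np.float32)  # toy matrix
r = rank_for_budget(*w.shape, bpw=4.5)   # rank affordable at ~Q4_K_S-level bits
print(r, relative_error_at_rank(w, r))
```

The spectrum of typical weight tensors decays slowly enough that the error at such ranks stays far above what a good 4-bit block quant achieves, which is exactly the point above.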
Pinging @compilade who seems to be the main driving force behind implementing LQER in `llama.cpp`, just in case this is somehow useful.