Selective node splitting #120
The reason I'm opening this here is: gloryknight suggested adding 0 as the default separator char (meaning the actual value zero, which is an unused char code, not "0", which has char code 48), and a function to allow people to change this. I think that's one possible way to add this. Another approach would be to add an optional second argument to the compression function. Do you guys think this is a good addition, and what do you think is the right API for this?
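To make the two API shapes concrete, here is a rough sketch. Neither the setter nor the options argument exists in lz-string today; the names are hypothetical, and char code 0 as the default simply means "never split", since it cannot occur in real input:

```js
// Option 1: a module-level default plus a setter (names are made up).
let separatorCharCode = 0; // char code 0 never appears in real strings,
                           // so the default leaves behaviour unchanged
function setSeparator(charCode) {
  separatorCharCode = charCode; // e.g. ",".charCodeAt(0) for JSON payloads
}

// Option 2: an optional second argument on the compress call itself.
function compressWithSeparator(input, options = {}) {
  const separator = options.separator ?? String.fromCharCode(0);
  // ... run the usual LZ loop, forcing a node split whenever `separator`
  // is encountered (except when the current substring starts with it) ...
}
```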
I can see that being far better for the data, though I'd think that the double quote character may give slightly better compression; there's likely going to be one such token anyway. Saying that, I think it should be behind a flag to use. It might be non-breaking, but it's also very much meant for JSON data, and might make non-JSON a little bit larger (though a quote character may be less subject to that).
Second argument for the compression function: if we use a zero value for the split character, then the value itself might be the flag. Implementing a separate flag may result in a performance penalty.

Sure, let us try different characters, but we must be careful not to overdo it. Having too many splitting characters may lead to wrong string splitting again (but this depends on the payload and should be tested in real life). Consider this example: in JSON, after a comma usually comes a property name followed by a colon.
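A minimal sketch of the "zero value doubles as the flag" point above, with made-up names:

```js
// If the configured separator is char code 0 (which never occurs in real
// input), selective splitting is effectively disabled: the "is this char the
// separator?" test simply never fires, so no extra flag variable is needed.
const SEPARATOR_DISABLED = 0;

let separatorCharCode = SEPARATOR_DISABLED;

function shouldForceSplit(charCode) {
  // A single comparison doubles as both the feature flag and the split test.
  return charCode === separatorCharCode;
}
```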
Good points, I had not thought of situations like that.
So I think I came up with an optimization which works in a way that is almost guaranteed to be better in all circumstances except randomly generated characters.

Current heuristic: force a split on the separator, unless the current prefix substring leads with a separator.

New heuristic: split if the previous char was a separator and this char is not, unless the last split was also forced by a separator (see the sketch below).

My prediction is that the results will be even better, and that it will work with more types of payloads and with multiple separators. The reason is that the separator is not really part of the words we are splitting, so forcing a prefix to start with a separator is sub-optimal. However, putting the separator at the end of a prefix is not a problem, as the prefix can easily grow into a different char from the char before it. Furthermore, sometimes you can encounter a sequence of separators, like multiple spaces or newlines. The new logic ensures that a sequence like that will still end up in the dictionary and function like a separator.

So my prediction is that the new heuristic will bias the prefixes to converge on the semantically meaningful sub-strings (which are thus more likely to re-appear), leading to better compression. And since it doesn't really matter which separator character came before a semantically meaningful word, I bet this means we can really go nuts with the set of separator characters.
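A rough sketch of both split rules, just to make the comparison concrete; `SEPARATORS`, `isSeparator` and the state variables are illustrative, not the actual lz-string code:

```js
// Illustrative only - the shape of the decision, not the real compressor.
const SEPARATORS = new Set([",", " ", "\n"]);
const isSeparator = (ch) => SEPARATORS.has(ch);

// Current heuristic: force a split when the incoming char is a separator,
// unless the substring we are currently building already starts with one.
function forceSplitCurrent(currentSubstring, nextChar) {
  return isSeparator(nextChar) && !isSeparator(currentSubstring[0]);
}

// New heuristic: split when we step off the end of a separator run
// (previous char was a separator, this one is not), unless the last split
// was itself forced by a separator.
function forceSplitNew(prevChar, nextChar, lastSplitWasForced) {
  return isSeparator(prevChar) && !isSeparator(nextChar) && !lastSplitWasForced;
}
```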
So yeah, I'm pretty hopeful about this :D (Bonus thought: this heuristic may be even better for compound languages like German or my native Dutch.)
@JobLeonard - it is encouraging to see somebody motivated like this. Thank you for your enthusiasm. I have tried to use multiple markers.

A side remark about the compression: to my knowledge, there is no mathematical problem which could not be solved computationally. The only problem is the time it takes to solve it. Therefore, most mathematicians work on ways to do computations in a limited (not infinite) time. As to the algorithms we are discussing here - we need to decide what we want: ideal compression, or fast speed with the most reasonable compression. Any heuristic is bound to have edge cases. Let us find the best one for us.
I did a quick test of my own idea of splitting after the separator. So far it seems to actually compress worse than your heuristic, sometimes even worse than default LZString :(

After thinking some more about why that might be, it started to make sense. Think of natural language and the words written in it. Which character is statistically the likeliest to appear just before a word? Right: the space character. With that in mind, making that space character part of the prefix is statistically more likely to result in longer substrings, and hence better compression (I can imagine the LZMW/LZAP algorithms behaving a little bit differently here, but that would also have to be tested).
To respond to your original comment, it is not at all clear why this would improve compression in general.
Let me disagree. Your statement might be true for an infinite homogeneous data set, but the details depend strongly on the input data. The more we know about the input data, the better we can compress it with a specialized algorithm (proven multiple times in the community).
I agree with you, and I'm not saying this is a bad idea at all. As far as releasing is concerned, I would be a bit more cautious than you here. People have gazillions of data snippets compressed in browsers all over the world, and releasing a version that is unable to decompress them could have big impacts. That said, if all unit tests pass, let's go indeed.
The thing is that we know this won't break that, because it doesn't touch the decompression code. Forcing a node-split is not a change in the spec, it's a change in how the compression code decides to encode a string. Even on a first-principles basis it makes sense: the LZ algorithm is a schema for a self-unpacking dictionary, but we're free to stuff it with whatever tokens we want.

Offtopic discussion about LZ/LZMW/LZAP
Yeah, that is actually one of the reasons why LZMW can perform worse than regular LZ, apparently (and why LZAP sometimes does better). I wonder if a hybrid of the two would work.
The idea is that LZMW has a major benefit over LZ when we have very repetitive patterns. To be precise: once LZMW has a dictionary entry for a repeating pattern, the compression rate of that repeating pattern improves at the speed of the Fibonacci sequence (a rough trace of that growth is sketched below). However, it can miss the short substrings, as Pieroxy hinted at. For data-sets where the best patterns are limited to many short substrings, LZ wins. It could just as well perform worse: it always adds two tokens instead of one, so that means we have one extra bit per token in our output stream (but since LZAP is even worse here and still compresses as well as LZ and LZMW in many real-world circumstances, we can't say for sure without testing).
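To illustrate that Fibonacci-style growth, here is a tiny, self-contained LZMW-style trace (greedy longest match, then add previousMatch + currentMatch to the dictionary). It is only a toy, not the lz-string or LZstr implementation:

```js
// Toy LZMW trace: at each step, take the longest dictionary match at the
// current position, then add (previous match + current match) as a new entry.
function lzmwMatchLengths(input) {
  const dict = new Set(input); // seed with the single characters
  const lengths = [];
  let prev = "";
  let i = 0;
  while (i < input.length) {
    // Find the longest substring starting at i that is already in the dict.
    let match = input[i];
    for (let len = 2; i + len <= input.length; len++) {
      const candidate = input.slice(i, i + len);
      if (dict.has(candidate)) match = candidate;
    }
    lengths.push(match.length);
    if (prev) dict.add(prev + match); // the LZMW step: concat the last two matches
    prev = match;
    i += match.length;
  }
  return lengths;
}

// On a very repetitive input, the match lengths grow roughly like Fibonacci:
console.log(lzmwMatchLengths("ab".repeat(32)));
// → [1, 1, 2, 2, 4, 6, 10, 16, 16, 6]
```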
Here is the source code: https://github.com/gloryknight/LZstr/blob/master/LZstr2.source.js
So, say that I have a case like the one above. Are you saying that pure separator splitting would produce worse results there?
You are right. In this particular case pure separator splitting will produce worse results, but we are using a trick and do not split entries starting with a separator. This improves the result (due to faster convergence).

This does lead to many extra dictionary entries.
Ah right, it's been a while so I forgot about the trick :) Also, if we're breaking anyway, we could prevent duplicate entries being added to the dictionary with an extra check. A bit slower, but also not an unrealistic real-world case, I think.
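A minimal sketch of what that extra check could look like, assuming the dictionary is a `Map` from substring to token id; the function name is made up and this is not the actual lz-string code:

```js
// Hypothetical guard around the "add new entry" step of the compressor.
function addEntry(dictionary, substring, nextId) {
  if (dictionary.has(substring)) {
    // Forced splits can try to re-add a substring that is already known;
    // skipping it keeps the dictionary (and thus the token width) smaller.
    return nextId;
  }
  dictionary.set(substring, nextId);
  return nextId + 1;
}
```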
So @gloryknight came up with a very interesting and simple heuristic that apparently works really well in many JSON files: whenever a `,` is encountered (this can be set to another character, like a space or newline), LZ is forced to stop the current substring and start anew (with a leading comma), except when the current substring starts with a comma. See these two pull-requests:

JobLeonard#3
JobLeonard#4

The reason for this being effective is a bit subtle: imagine that we have a string we are scanning through, and the next set of characters will be `abcdefg`. Furthermore, our dictionary already has the substrings `abc`, `abcd` and `defg` (plus the necessary substrings to get to this point), but not `efg`. Obviously, the ideal combination of tokens would be `abc` + `defg`. Instead we'll get `abcd` + `e` + `f` + `g` (the greedy-matching sketch at the end of this post shows exactly this). This can happen quite often in LZ. So how do we avoid this? Well, I guess gloryknight's insight was that not all characters are created equal here; they can have special functions. One of those functions is being a separator character. Think of natural language: our words are separated by spaces, so if we split on the space character (and similar separators like newlines, dots and commas) we would converge on identical substrings much quicker.

Since LZString is most commonly used when compressing JSON, which strips out all unnecessary whitespace, the `,` is the option that seems to improve compression performance (although maybe `{` and `:` also make sense, or maybe all three?). In his tests it gave significant compression benefits at a small performance cost.

The best bit? This is perfectly backwards compatible with previously compressed data: the output can be decompressed by the same function as before.
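To make the `abcd` + `e` + `f` + `g` failure mode concrete, here is a tiny greedy-longest-match tokenizer over a fixed dictionary (purely illustrative; real LZ builds the dictionary as it scans):

```js
// Greedy longest-match tokenization against a fixed dictionary, to show how
// a greedy parser can miss the ideal split.
function greedyTokens(input, dict) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    let match = input[i]; // single characters are always available
    for (let len = 2; i + len <= input.length; len++) {
      const candidate = input.slice(i, i + len);
      if (dict.has(candidate)) match = candidate;
    }
    tokens.push(match);
    i += match.length;
  }
  return tokens;
}

const dict = new Set(["abc", "abcd", "defg"]);
console.log(greedyTokens("abcdefg", dict));
// → ["abcd", "e", "f", "g"]  (greedy takes "abcd", so "defg" can never match)
// The ideal parse, ["abc", "defg"], needs only two tokens.
```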