Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selective node splitting #120

Open
JobLeonard opened this issue Sep 19, 2018 · 15 comments
Open

Selective node splitting #120

JobLeonard opened this issue Sep 19, 2018 · 15 comments

Comments

@JobLeonard
Copy link
Collaborator

So @gloryknight came up with a very interesting and simple heuristic that apparently works really well in many JSON files: whenever a , is encountered (can be set to another character, like a space or newline), LZ is forced to stop the current substring, and start anew (with a leading comma), except when the current substring starts with a comma. See these two pull-requests:

JobLeonard#3

JobLeonard#4

The reason for this being effective is a bit subtle: imagine that we have a string we are scanning through, and the next set of characters will be abcdefg. Furthermore, our dictionary already has the substrings abc, abcd and defg (plus the necessary substrings to get to this point), but not efg. Obviously, the ideal combination of tokens would be abc + defg. Instead we'll get abcd + e + f + g. This can happen quite often in LZ. So how to avoid this? Well, I guess gloryknight's insight was that not all characters are created equal here; they can have special functions. One of those is as separator characters. Think of natural language: or words are separated by spaces, so if we split on the space character (and similar separator like newlines, dots, commas) we would converge on identical substrings much quicker.

Since LZString is most commonly used whn compressing JSON, which strips out all unnecessary whitespace, the , is the option that seems to improve compression performance (although maybe { and : also make sense, or maybe all three?). In his tests it gave significant compression benefits at a small perf cost.

The best bit? This is perfectly backwards compatible with previous codes: the output can be decompressed by the same function as before.

@JobLeonard
Copy link
Collaborator Author

The reason I'm opening this here is: gloryknight suggested adding 0 as the default separator char (meaning the actual the value zero, which is an unused char code, not "0", which has the char code 48), and a function to allow people to change this. I think that's one possible way to add this.

Another approach would be to add an optional second argument to compress that lets us set this separator char.

Do you guys think this is a good addition, and what do you think is the right API for this?

@Rycochet
Copy link
Collaborator

I can see that being far better for the data - though I'd think that the double quote character may give slightly better compression - there's likely going to be one token of ": " at least - but also all keys might get found, and not just the second onwards. - really not got time to look at it properly, but it "feels" right as a technique regardless of the character(s) used.

Saying that - I think it should be behind a flag to use, it might be non-breaking, but it's also very much meant for JSON data, and might make non-JSON a little bit larger (though a quote character may be less subject to that etc).

@gloryknight
Copy link

what do you think is the right API for this?

Second argument for the compress would be a bit awkward, since the argument should be added to all of the functions (also binary ones) and passed down in the calls. Also, users with existing code will have to change their code (not always welcome).

I think it should be behind a flag

If we use zero value for the split character than the value might be the flag. Implementing a separate flag may result in performance penalty.

the , is the option that seems to improve compression performance (although maybe { and : also make sense, or maybe all three?)

Sure, let us try different characters, but, we must be careful not to overdo with this. Having too many splitting characters may lead to wrong string splitting again (but this depends on payload and should be tested in real live).

Consider this example: after comma in JSON usually comes a property name followed by :. Properties with the same name will most probably have the same type and usually contain the same data. Splitting with : or both may lead to the case where the value of the property is forced to be stored as a separate dictionary entry. But these are just my speculations. First tests indicate that the " results in best compression so far.

@JobLeonard
Copy link
Collaborator Author

JobLeonard commented Sep 20, 2018

Second argument for the compress would be a bit awkward, since the argument should be added to all of the functions (also binary ones) and passed down in the calls. Also, users with existing code will have to change their code (not always welcome).

Good points, I had not thought of situations like [ /* ... */ ].map(LZString.compress) that would be broken. Another would be existing tests that expect the compressed output to be the same (even though I don't think we promise that the output remains the same between versions, that's still a consideration).

Saying that - I think it should be behind a flag to use, it might be non-breaking, but it's also very much meant for JSON data, and might make non-JSON a little bit larger (though a quote character may be less subject to that etc).

So I think I came up with an optimization which works in a way that is almost guaranteed to be better in all circumstances except randomly generated characters.

Current heuristic: force split on separator, unless current prefix substring leads with a separator

New heuristic: split if previous char was a separator and this char is not, unless the last split was also forced by a separator

My prediction is that the results will be even better, work with more types payloads and with multiple separators.

The reason is that the separator is not really part of the words we are splitting, so forcing the first character of a prefix to start with a separator is sub-optimal. However, putting it at the end of a prefix is not a problem, as the prefix can easily grow into a different char from the char before it. Furthermore, sometimes you can encounter a sequence of separators, like multiple spaces or newlines. The new logic ensures that a sequence like that will still end up in the dictionary and function like a separator.

So my prediction is the new heuristic will bias the prefixes to converge on the semantically meaningful sub-strings (which are thus more likely to re-appear), leading to better compression. Since it doesn't really matter which separator character came before a semantically meaningful word, I bet this means we can really go nuts with the set of separator characters: _+-*='",;:.?/\[]{}()<> plus all whitespace characters are pretty much guaranteed to be points of separation in natural languages and programming languages. Just make a big switch statement and it doesn't even have to have a high perf cost.

A string like "two hundred fifty six (two hundred fifty five?)" should be more likely to match on sensible sub-strings. Worst-case I can think of is that if we already have "two" and "hundred" in our dictionaries, this will result in new entries "two ", "hundred ", then "two h", plus whichever sub-strings of "undred" already exist in the dictionary. This looks sub-optimal, but keep in mind:

  • we would have hit an identical sub-optimal case with the old heuristic, and more often and earlier
  • in general we would have hit such a sub-optimal case even earlier and even more often without these heuristics
  • if the text contains "two hundred" very often, we actually want to grow a prefix for that eventually.

So yeah, I'm pretty hopeful about this :D

(bonus thought: this heuristic may be even better for compound languages like German or my native Dutch. A sentence like "two hundred fifty six" should be translated as "tweehonderdzesenvijftig" (twee-honderd-zes-en-vijftig, "en" is Dutch for "and"). Assuming words for "two" and "hundred" are used earlier in the text, then old heuristic is biased to converge to the substrings " twee", " honderd", " zes", " " en", " vijftig". So again, the inclusion of the separator at the start of the prefix would lead to a lot of missed sub-strings, but the new one avoids this issue)

@gloryknight
Copy link

@JobLeonard - it is encouraging to see somebody motivated like this. Thank you for you enthusiasm.

I have tried to use multiple markers ,{ ": at the same time. By replacing breakCode with an array and using if (breakCode.indexOf(c)>=0 && forceBreak) { Again, depending on payload it was an improvement or not. For real JSON data (about 15kB) there was marginal improvement in compression. I will make extensive comparison when I find time.

A side remark about the compression: To my knowledge, there is no mathematical problem which could not be solved computationally. The only problem is the time which it takes to solve the problem. Therefore, most of mathematicians work on the ways to do computations in a limited (not infinite) time. As to the algorithms which we are discussing here - we need to decide what we want: ideal compression or fast speed (with most reasonable compression). Any heuristic is bound to have edge cases. Let us find the best for us.
Maybe interesting to check LZ4 vs LZ4_HC https://lz4.github.io/lz4/ which address similar problem.

@JobLeonard
Copy link
Collaborator Author

I did a quick test of my own idea of splitting after the separator. So far it seems to actually compress worse than your heuristic, sometimes even worse than default LZString :(

After thinking some more about why that might be, it started to makes sense. Think of natural language, and words written in it. Which character is statistically the likeliest to appear just before a word? Right: the space character.

With that in mind, making that space character part of the prefix would is statistically more likely to result in longer substrings, and hence better compression (I can imagine the LZMW/LZAP algorithms behaving a little bit differently here though, but that also would have to be tested).

@pieroxy
Copy link
Owner

pieroxy commented Sep 24, 2018

To respond to your original comment, it is not at all clear why abc + defg would be an obvious choice compared to abcd + e + f + g. Sure, the tokens emmited are smaller but the compression dictionary failed to learn on e + f + g so subsequent compression could be worse. Remember it's not about outputting the smallest possible pattern, it's about learning common sequences on the input.

@gloryknight
Copy link

Remember it's not about outputting the smallest possible pattern, it's about learning common sequences on the input.

Let me disagree. Your statement might be true for an infinite homogeneous data set. Details depend strongly on the input data. The more is known about the input data the better we can compress it by a specialized algorithm (proven multiple times in the community).
I sought that the aim of any compression algorithm is to find the smallest possible pattern for given input. In our particular case (compression in javascript) there is some practical limitation on time which may be consumed for compression (to keep the page responsive) and, therefore, on size of the input. This means that the faster we find convergence the better compression will be reached (except maybe for some edge cases).
If we compare to other compression formats ( say LZ4 ) we will see that there is a reason for using chunks. The chunks allow for adaptation to change in data stream and limit dictionary to a reasonable size (with a price of building new dictionary every time).
I suggest to update this repository with the changes from array-lookup branch and create a ES6 version too (The increase in speed and backward compatibility justify it). There is nothing to be afraid - just release it (and do it often) - this will drive innovation. People like new releases. If someone needs an old version they can easily get it from CDN.

@pieroxy
Copy link
Owner

pieroxy commented Sep 24, 2018

I agree with you. I'm not saying this is a bad idea at all, what I'm just saying is that abc + defg is not obviously a better choice than to abcd + e + f + g. As you said, it depends.

As far as releasing is concerned, I would be a bit more cautious than you here. People have gazillions of data snippets compressed in browsers all over the world, and releasing a version that is unable to decompress it could have big impacts. That said, if all unit tests pass, let's go indeed.

@JobLeonard
Copy link
Collaborator Author

As far as releasing is concerned, I would be a bit more cautious than you here. People have gazillions of data snippets compressed in browsers all over the world, and releasing a version that is unable to decompress it could have big impacts. That said, if all unit tests pass, let's go indeed.

The thing is that we know this won't break that, because it doesn't touch the decompression code. Forcing a node-split is not a change in spec, it's a change in decision making on how to encode a string in the compression code.

Even on a first-principle basis it makes sense: the LZ-algorithm is a schema for a self-unpacking dictionary, but we're free to stuff it with whatever tokens we want. The current implementation even lets us output a string of infinite new 8 bit char tokens followed by 65 (ASCII code for A), which is nonsensical but decompresses just fine all the same.


Offtopic discussion abou LZ/LZMW/LZAP

Remember it's not about outputting the smallest possible pattern, it's about learning common sequences on the input.

Yeah, that is actually one of the reasons why LZMW can perform worse than regular LZ, apparently (and why LZAP sometimes does better). I wonder if a hybrid would work: say I already have abcd and efgh in the dictionary, then upon encountering abcdefgh the following would happen

  • LZ: abcde gets added to the dictionary
  • LZMW: abcdefgh gets added to the dictionary
  • LZAP: abcde, abcdef, abcdefg, and abcdefgh get added to the dictionary. The reason this doesn't always go terribly wrong is because the nr of bits per token size grows with log2(dictionary entries), and the benefit here that many valid substrings are found quickly apparently weighs against that in many situations. But not always.
  • possible hybrid: abcde and abcdefgh get added to the dictionary.

The idea is that LZMW has a major benefit over LZ when we have very repetitive patterns - to be precise: once LZMW has a dictionary entry of a repeating pattern, the compression rate of that repeating pattern improves with the speed of the Fibonnacci sequence. However, it can miss the short substrings, as Pieroxy hinted at. For data-sets where the best patterns are limited to many short substrings to compress, LZ wins. It could just as well perform worse: it always adds two tokens instead of one, so that means we have one extra bit per token in our output stream (but since LZAP is even worse here and compresses as good as LZ and LZMW in many real-world circumstances we can't say for sure without testing)

@gloryknight
Copy link

Another breakthrough in selective node splitting (unfortunately breaking change) : since the dictionary entries just before splitting symbol are never used there is no need to add them to the dictionary. This reduces dictionary size as shown in the following graph (dictionary size vs input size in bytes):
image

This results in better compression (Compression ratio vs input size in bytes):
image

Compression speed is slightly better (Compression speed in megabytes/second vs input size in bytes):
image

Decompression speed is increases too (Decompression speed in megabytes/second vs input size in bytes):
image

Code will follow shortly.

@gloryknight
Copy link

Here is the source code: https://github.com/gloryknight/LZstr/blob/master/LZstr2.source.js
I switched to unsafe implementation due to speed improvements.
Here is the compression speed comparison between original, unsafe and splitting with clean dictionary (Compression speed in megabytes/second vs input size in bytes):
image

@JobLeonard
Copy link
Collaborator Author

since the dictionary entries just before splitting symbol are never used there is no need to add them to the dictionary.

So, say that I have:

AAA,AAA,AAA,AAA,AAA,AAA,AAA,AAA,AAA

Are you saying that A, and such are not added to the dictionary? I guess you're right that this makes sense, since after a short while the compression will be "saturated" and always use ,AAA (btw, as a result this would be an example where the separator could produce worse compression sizes)

@gloryknight
Copy link

You are right. In this particular case pure separator splitting will produce worse results but we are using a trick and do not split entries starting with a separator. This improves the result (due to faster convergence).
A more complex case could be:

AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA

This does lead to many extra ,AAA entries in the dictionary. The last post eliminates these repetitions. Again, it all depends on input data. Sometimes the gain is negative :)

@JobLeonard
Copy link
Collaborator Author

Ah right, it's been a while so I forgot about the trick :)

Also, if we're breaking anyway, we could prevent duplicate entries being added to the dictionary with an extra check. Bit slower but also not unrealistic real-world case I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants