Proposal: Reduce the dictionary size footprint #254

Open
lifthrasiir opened this issue Feb 25, 2024 · 6 comments
Labels: enhancement (New feature or request)

lifthrasiir commented Feb 25, 2024

To my knowledge, the size of the word lists remains a major pain point for the original zxcvbn (e.g. dropbox/zxcvbn#196, dropbox/zxcvbn#268, dropbox/zxcvbn#285, dropbox/zxcvbn#338). Even though Brotli Content-Encoding does allow for more compression than gzip, the footprint is significant enough that it is worthwhile to tackle in the library itself as well.

I have written up two concrete proposals that can significantly reduce the size footprint of the built-in dictionary. I believe they are easy to review and apply without any specialized knowledge of compression. Both also present some decisions for the maintainers to make, so I have tried to give enough information to aid that process.

Rank quantization

For example, packages/languages/en/src/commonWords.json starts with the following:

["you","the","to","'s","it","and","that","'t","of","is","in","what","we","me","this",...

...and ends with the following:

...,"battalions","motorcade","kelley","beaks","tomás","t-minus","sleek","vividly","koch"]

This list of 26,758 words is automatically generated from a frequency list of the top 50K words. The cutoff is currently set to at least 500 occurrences, so the final entry koch should have occurred exactly 500 times. But there are many other words with exactly the same number of occurrences; in fact that is the case for all of the words shown above. Data compression algorithms don't and can't realize that, and will try to reproduce this exact order; you need to preprocess the data into a form that the algorithms can handle more easily.

Furthermore, while the public API does return the exact rank as a part of Feedback, that rank is probably meaningless beyond a certain point for the aforementioned reason. We can for example imagine that the ranks of the first 100 words are exact, but the next 100 words all share the same rank (to be displayed as the "101–200th most common word"), and so on. This kind of rank quantization allows for arbitrary reordering within the same rank and can greatly improve the compression without any additional code.

Benchmarks

I've experimented with several quantization schemes. I assumed that the relative rank of the first 100 words is significant enough and never altered it. Beyond that point, each tested scheme uses the following quantized ranks:

  • Groups of 100: 101, 201, 301, 401, ..., 100n + 1, ... (100 words each)
  • 2 significant digits: 101, 111, 121, 131, ..., 981, 991 (10 words each so far), 1001, 1101, 1201, ..., 9801, 9901 (100 words each so far), 10001, 11001, ...
  • 1 significant digit: 101, 201, 301, ..., 901 (100 words each so far), 1001, 2001, ..., 9001 (1000 words each so far), 10001, 20001, ...
  • ½ significant digits: 101, 201, 501, 1001, 2001, 5001, 10001, 20001, ...
  • 0 significant digits (i.e. magnitude only): 101, 1001, 10001, ...

Words with the same quantized rank have simply been sorted in Unicode codepoint order, so that similar words are placed much closer together. While other orderings are possible, this is good enough for our purposes and also helps the next proposal.
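
To make the schemes concrete, here is a small sketch (the function name is mine, not from the generator) of how a raw rank would map to a quantized rank under the ½-significant-digit scheme:

// A sketch of ½-significant-digit quantization: ranks 1-100 stay exact, then each
// rank snaps to the start of its 1/2/5 bucket (101, 201, 501, 1001, 2001, 5001, ...).
const quantizeRank = (rank) => {
  if (rank <= 100) return rank
  let step = 100
  while (rank > step * 10) step *= 10
  if (rank > step * 5) return step * 5 + 1
  if (rank > step * 2) return step * 2 + 1
  return step + 1
}

quantizeRank(137)  // 101
quantizeRank(600)  // 501
quantizeRank(2500) // 2001
quantizeRank(7480) // 5001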

I also wanted to make sure that this optimization is visible across different compression algorithms and parameters, so I have used the following methods:

  • gzip -1: GNU gzip 1.9 with the fastest setting. This corresponds to nginx's default.
  • gzip -6: GNU gzip 1.9 with the default CLI setting.
  • gzip -9: GNU gzip 1.9 with the slowest setting.
  • ECT -3: gzip -9 followed by Efficient Compression Tool 0.9.5 with the default CLI setting. This simulates a conscious effort to optimize static assets without changing the compression algorithm itself. (There are other similar tools including zopfli, but I picked ECT because it is generally faster and more efficient than other tools.)
  • ECT -9: gzip -9 followed by ECT 0.9.5 with the slowest setting. This approximates the theoretically optimal compression.
  • Brotli -11: Brotli 1.0.9 with the slowest and default CLI setting.

Results for @zxcvbn-ts/language-en:

Quantization Uncomp. Gzip -1 Gzip -6 Gzip -9 ECT -3 ECT -9 Brotli -11
Baseline 1,216,551 617,735 566,285 565,783 519,082 513,281 488,580
Groups of 100 0.0% -0.2% -0.1% -0.1% -0.6% -0.3% -1.0%
2 sig digits 0.0% -1.8% -1.2% -1.1% -2.9% -2.9% -2.8%
1 sig digit 0.0% -8.0% -8.8% -8.8% -12.7% -12.5% -11.0%
½ sig digits 0.0% -11.9% -13.2% -13.1% -18.8% -18.5% -17.2%
0 sig digits 0.0% -16.9% -19.8% -19.7% -26.5% -26.4% -24.5%

Combined results for all @zxcvbn-ts/language-* packages:

Quantization Uncomp. Gzip -1 Gzip -6 Gzip -9 ECT -3 ECT -9 Brotli -11
Baseline 7,739,269 3,945,996 3,503,060 3,496,903 3,325,270 3,291,158 2,993,904
Groups of 100 0.0% -0.6% -0.4% -0.4% -0.7% -0.7% -1.5%
2 sig digits 0.0% -2.9% -1.7% -1.7% -4.2% -4.1% -3.9%
1 sig digit 0.0% -10.3% -10.1% -10.1% -14.7% -14.5% -13.0%
½ sig digits 0.0% -15.2% -14.4% -14.3% -20.9% -20.6% -19.5%
0 sig digits 0.0% -19.8% -20.6% -20.4% -26.4% -27.3% -25.9%

The uncompressed size remains identical because the only change here is reordering. Still, it is possible to reduce the compressed size by 10% with a reasonable rank quantization, and that reduction is more or less consistent for all tested parameters. (There is a reason for this consistency, but that's way too technical to explain here.)

Delta coding

It should be noted that packages/languages/en/src/firstnames.json among others is already sorted:

["aaren","aarika","abagael","abagail","abbe","abbey","abbi","abbie","abby","abbye","abigael",...

...and many consecutive words share the same prefix (for example, 6 words start with abb). We can exploit this by simple delta coding like this:

["aaren","___ika","_bagael","_____il","__be","____y","___i","____e","___y","____e","__igael",...

Here the number of leading underscores represents the number of initial characters to retain from the previous word. It's also not necessary to spell out each underscore; we can use one reserved character for a single underscore, another for two consecutive underscores, and so on. Since uppercase letters are removed during list generation, let's use them for these underscore runs:

["aaren","Cika","Abagael","Eil","Bbe","Dy","Ci","De","Cy","De","Bigael",...

At this point we don't even need any comma, because those reserved characters can only occur at the beginning of each word. So we can further simplify the scheme by replacing a lone , with A, ,_ with B, ,__ with C and so on instead:

"AaarenEikaCbagaelGilDbeFyEiFeEyFeDigael...

While compression algorithms can pick up this kind of repetition by themselves, we already know the list is sorted, so we can exploit that fact much more efficiently in this case.

Benchmarks

Results for @zxcvbn-ts/language-en (each quantization row includes delta coding on top, and all percentages are relative to the baseline above):

Quantization Uncomp. Gzip -1 Gzip -6 Gzip -9 ECT -3 ECT -9 Brotli -11
Delta coding -20.1% -12.6% -12.8% -12.8% -8.4% -7.9% -9.0%
Groups of 100 -25.7% -17.1% -16.4% -16.4% -11.7% -11.5% -11.6%
2 sig digits -29.6% -21.3% -20.4% -20.3% -20.3% -15.3% -11.5%
1 sig digit -39.8% -32.1% -31.0% -30.9% -30.8% -26.5% -23.1%
½ sig digits -45.0% -38.0% -37.0% -37.0% -33.4% -33.2% -33.0%
0 sig digits -51.3% -45.4% -44.6% -44.5% -41.1% -41.3% -41.2%

Combined results for all @zxcvbn-ts/language-* packages (same convention):

Quantization Uncomp. Gzip -1 Gzip -6 Gzip -9 ECT -3 ECT -9 Brotli -11
Delta coding -7.1% -3.2% -3.0% -2.9% -1.6% -1.5% -1.3%
Groups of 100 -18.3% -11.9% -9.9% -9.8% -8.1% -8.1% -6.2%
2 sig digits -25.0% -18.5% -16.2% -16.1% -20.6% -13.9% -6.1%
1 sig digit -39.8% -33.5% -30.8% -30.7% -34.5% -28.7% -22.3%
½ sig digits -45.5% -39.6% -37.2% -37.1% -35.8% -35.9% -34.2%
0 sig digits -51.6% -46.6% -44.6% -44.6% -43.5% -43.5% -42.3%

In general this delta coding scheme shaves another 1–20% off the compressed size on top of rank quantization. The best case results in a 50% reduction!

Caveats

It should be noted that such a transformation has to be inverted every time the script is loaded. The following is the actual delta coding code I've used; you can see that it returns a string containing an immediately invoked function expression that does the job.

const prefixCompress = (parsed) => {
  if (!Array.isArray(parsed)) {
    return
  }
  if (!parsed.every((e) => typeof e === 'string' && isCompactDoubleQuotedString(e))) {
    // Should be rare enough, so don't bother escaping them.
    return
  }

  const deltas = []
  let last = ''
  for (const e of parsed) {
    // Length of the prefix shared with the previous word, capped at 25 so that
    // the marker character always stays within A-Z.
    let prefixLen = 0
    const maxPrefixLen = Math.min(e.length, last.length, 25)
    while (prefixLen < maxPrefixLen && e.charAt(prefixLen) === last.charAt(prefixLen)) {
      ++prefixLen
    }
    // 'A' (65) encodes a shared prefix of length 0, 'B' of length 1, and so on.
    deltas.push(String.fromCharCode(65 + prefixLen) + e.slice(prefixLen))
    last = e
  }

  return `((compressed) => {
  const parts = compressed.split(/([A-Z])/g)
  const recons = []
  let last = '', i
  for (i = 1; i < parts.length; i += 2) {
    last = last.slice(0, parts[i].charCodeAt(0) - 65) + parts[i + 1]
    recons.push(last)
  }
  return recons
})("${deltas.join('')}")`
}

const isCompactDoubleQuotedString = (s) => {
  return !s.match(/[\x00-\x1f\u2028\u2029\\"]/)
}
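
As a quick sanity check of the round trip (the eval here is purely illustrative; in the package the emitted expression would simply be the generated module body):

// Round-trip check for the delta coder above; eval is used only for illustration.
const words = ['aaren', 'aarika', 'abagael', 'abagail', 'abbe']
const emitted = prefixCompress(words)
// emitted is roughly: ((compressed) => { ... })("AaarenDikaBbagaelFilCbe")
console.log(JSON.stringify(eval(emitted)) === JSON.stringify(words)) // true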

In my testing these decompressors are not very slow, but not extremely quick either. For @zxcvbn-ts/language-en, both browsers I've tested (Firefox and Chrome) took 5–10 ms to undo the transformation, even with some additional optimizations to avoid memory allocation.

Amazingly enough, Chrome's V8 does seem to handle String.split with a string argument specially: it was 10,000x faster than any other tested code, while in Firefox String.split was only ~3x faster. I haven't tested a direct character-wise iteration, but I doubt it would be as fast as String.split in V8, so that's one thing to take into account.

Raw data for benchmarks

No delta coding:
Baseline:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          6,862     3,172     2,978     2,978     2,908     2,712     2,496
common    465,160   253,525   232,312   232,228   220,834   220,155   204,319
cs        599,591   301,578   269,458   269,247   254,141   251,068   210,749
de        816,936   420,148   366,005   365,077   350,751   349,062   308,770
en      1,216,551   617,735   566,285   565,783   519,082   513,281   488,580
es-es     745,200   385,911   340,519   339,767   328,460   324,498   287,752
fi        475,403   234,329   202,773   202,091   193,452   191,702   178,976
fr        614,101   309,871   271,975   271,536   257,617   253,922   233,795
id        175,088    90,338    81,153    80,870    76,898    76,160    70,747
it        641,884   320,221   281,083   280,840   268,337   264,190   239,377
ja        215,448   105,875    94,967    94,710    89,440    88,576    85,722
nl-be     426,557   221,474   196,483   195,992   188,467   186,850   174,940
pl        839,810   424,912   371,342   370,328   357,855   354,141   316,316
pt-br     500,678   256,907   225,727   225,456   217,028   214,841   191,365
        7,739,269 3,945,996 3,503,060 3,496,903 3,325,270 3,291,158 2,993,904

Groups of 100:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          6,862     3,166     2,974     2,974     2,900     2,710     2,492
common    465,160   251,590   231,024   230,942   219,245   217,998   200,476
cs        599,591   299,168   269,894   269,505   255,800   252,421   216,448
de        816,936   418,006   364,545   363,699   348,722   346,532   306,308
en      1,216,551   616,410   565,279   564,799   515,573   511,732   483,680
es-es     745,200   382,802   338,713   338,043   325,261   321,401   280,483
fi        475,403   232,636   201,296   200,551   191,040   189,392   174,582
fr        614,101   307,248   270,857   270,430   254,427   252,001   230,718
id        175,088    89,101    80,549    80,288    75,736    75,030    68,763
it        641,884   317,731   279,667   279,425   264,140   261,822   233,248
ja        215,448   104,799    94,625    94,384    88,449    87,589    83,864
nl-be     426,557   219,271   194,869   194,470   187,976   184,267   171,667
pl        839,810   421,867   369,324   368,332   354,910   350,759   307,957
pt-br     500,678   254,855   224,509   224,250   214,618   212,695   185,902
        7,739,269 3,918,650 3,488,125 3,482,092 3,298,797 3,266,349 2,946,588

2 significant digits:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          6,862     3,169     2,973     2,973     2,916     2,710     2,516
common    465,160   243,823   225,843   225,777   207,165   205,044   191,448
cs        599,591   291,512   265,358   265,045   244,986   243,077   211,266
de        816,936   404,747   356,866   356,063   335,343   332,622   299,353
en      1,216,551   606,514   559,486   559,136   503,519   498,395   474,734
es-es     745,200   371,497   333,565   333,053   310,601   307,862   271,463
fi        475,403   229,256   199,767   199,020   187,205   185,512   172,704
fr        614,101   299,844   267,411   267,017   245,319   243,280   224,533
id        175,088    88,606    80,296    80,061    75,071    74,134    68,562
it        641,884   308,820   276,025   275,888   254,654   251,230   225,678
ja        215,448   101,746    93,473    93,234    85,159    84,040    81,378
nl-be     426,557   216,906   194,020   193,668   183,182   181,464   170,471
pl        839,810   412,888   365,058   364,230   343,009   339,877   299,569
pt-br     500,678   248,487   222,080   221,853   206,600   204,692   180,523
        7,739,269 3,827,815 3,442,221 3,437,018 3,184,729 3,153,939 2,874,198

1 significant digit:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          6,862     3,166     2,974     2,974     2,900     2,710     2,492
common    465,160   227,374   206,990   206,899   185,039   183,093   172,964
cs        599,591   263,767   234,211   233,893   210,779   208,834   186,302
de        816,936   377,582   329,738   329,035   304,288   302,713   278,473
en      1,216,551   567,736   515,950   515,651   452,781   449,091   434,440
es-es     745,200   339,659   300,275   299,749   271,897   269,173   242,292
fi        475,403   213,047   185,951   185,624   169,979   168,462   159,369
fr        614,101   276,255   243,559   243,241   218,588   216,815   205,086
id        175,088    85,358    77,808    77,594    70,999    70,281    65,289
it        641,884   284,432   251,137   250,908   224,151   222,013   203,944
ja        215,448    94,879    86,035    85,927    75,757    74,826    73,295
nl-be     426,557   205,356   183,666   183,415   169,521   168,037   160,520
pl        839,810   374,975   330,014   329,432   301,064   298,535   260,761
pt-br     500,678   224,974   197,883   197,672   178,285   176,650   159,036
        7,739,269 3,538,560 3,146,191 3,142,014 2,836,028 2,811,233 2,604,263

1/2 significant digits:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          6,862     3,166     2,974     2,974     2,903     2,710     2,501
common    465,160   211,645   191,940   191,909   165,063   162,380   156,620
cs        599,591   249,818   223,776   223,796   198,139   196,313   174,689
de        816,936   348,022   304,559   304,435   275,453   274,445   256,759
en      1,216,551   543,654   491,415   491,459   421,118   418,104   404,452
es-es     745,200   316,712   281,992   282,006   248,449   246,718   222,356
fi        475,403   199,913   177,060   176,997   156,336   155,294   146,301
fr        614,101   263,578   234,472   234,486   206,445   205,173   192,309
id        175,088    81,267    75,076    75,149    65,913    65,024    60,588
it        641,884   264,977   234,664   234,739   203,738   202,171   185,148
ja        215,448    92,055    86,051    86,015    73,110    72,138    70,275
nl-be     426,557   194,692   176,509   176,475   157,189   156,193   148,774
pl        839,810   357,031   319,144   319,153   283,590   282,204   235,176
pt-br     500,678   217,628   195,786   195,859   172,797   171,644   151,244
        7,739,269 3,344,158 2,995,418 2,995,452 2,630,243 2,610,511 2,407,192

0 significant digits:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          6,862     3,166     2,974     2,974     2,903     2,710     2,501
common    465,160   203,789   181,763   181,750   155,077   152,311   146,830
cs        599,591   234,179   204,668   204,705   178,522   176,590   157,874
de        816,936   322,684   275,616   275,568   246,098   244,849   234,118
en      1,216,551   513,335   453,962   454,070   381,236   377,510   368,680
es-es     745,200   298,468   260,406   260,532   226,392   224,656   202,361
fi        475,403   190,923   166,539   166,634   146,463   145,153   136,588
fr        614,101   251,444   220,591   220,602   192,802   191,794   181,129
id        175,088    77,785    71,029    71,038    61,701    60,530    55,979
it        641,884   253,397   221,070   221,179   221,179   188,405   173,049
ja        215,448    88,616    81,396    81,426    68,110    67,293    65,623
nl-be     426,557   186,540   166,964   166,992   148,094   146,594   140,863
pl        839,810   335,182   294,333   294,394   259,041   257,091   214,149
pt-br     500,678   203,970   179,437   179,505   156,609   155,620   137,027
        7,739,269 3,163,478 2,780,748 2,781,369 2,444,227 2,391,106 2,216,771

With delta coding:

Delta coding:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          5,952     2,995     2,844     2,844     2,844     2,703     2,395
common    427,961   244,548   225,573   225,483   217,631   216,629   202,722
cs        512,849   280,210   248,352   248,230   240,148   236,534   201,686
de        808,393   420,810   367,975   367,145   354,163   351,625   311,635
en        970,809   539,374   493,416   493,075   475,386   472,508   444,471
es-es     738,536   389,363   344,042   343,294   331,028   328,165   292,054
fi        462,127   232,893   203,051   202,597   194,327   192,454   180,247
fr        562,489   297,784   263,446   263,069   253,626   251,115   230,025
id        168,036    89,687    80,900    80,638    77,353    76,525    71,125
it        596,056   311,698   276,478   276,233   266,748   263,404   238,464
ja        203,599   103,849    93,402    93,109    89,780    88,051    85,346
nl-be     414,537   220,930   197,559   197,279   189,950   188,161   176,431
pl        817,934   424,504   372,420   371,583   358,786   355,708   322,545
pt-br     495,596   259,299   228,047   227,637   219,558   217,401   194,258
        7,184,874 3,817,944 3,397,505 3,392,216 3,271,328 3,240,983 2,953,404

Delta coding + Groups of 100:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          5,951     2,989     2,837     2,837     2,837     2,701     2,412
common    378,884   222,848   209,194   209,115   201,781   201,006   190,410
cs        461,079   256,644   233,166   233,166   227,708   223,532   200,127
de        713,140   386,208   343,947   343,545   332,838   329,836   298,950
en        902,873   511,549   472,945   472,798   457,868   453,897   431,870
es-es     632,710   347,429   315,228   315,145   305,944   302,234   275,502
fi        403,824   211,224   187,439   187,248   180,436   178,561   169,254
fr        492,253   270,056   244,804   244,599   236,360   234,107   218,788
id        140,303    78,424    72,244    72,161    69,382    68,828    65,240
it        516,372   280,249   254,771   254,735   246,357   244,029   225,286
ja        169,893    90,168    82,812    82,680    79,200    78,460    76,080
nl-be     360,453   199,783   181,979   181,905   175,889   174,255   165,529
pl        716,865   385,419   345,635   345,186   334,366   331,351   302,916
pt-br     423,393   231,215   209,117   209,056   201,898   200,290   183,343
        6,317,993 3,474,205 3,156,118 3,154,176 3,052,864 3,023,087 2,805,707

Delta coding + 2 significant digits:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          5,955     2,995     2,841     2,841     2,420     2,841     2,701
common    338,426   200,621   189,422   189,352   173,131   183,933   182,262
cs        415,302   232,802   213,280   213,269   186,939   205,946   204,353
de        641,897   352,267   315,595   315,253   277,877   304,325   302,651
en        855,734   486,084   450,663   450,517   413,547   434,286   432,291
es-es     567,100   314,419   286,879   286,799   254,932   277,032   275,112
fi        383,173   201,630   180,539   180,368   164,521   173,715   171,918
fr        448,506   247,724   225,763   225,598   204,630   217,808   216,154
id        138,549    77,372    71,462    71,395    64,584    68,628    68,068
it        463,062   254,057   232,231   232,178   207,856   223,683   222,541
ja        155,846    82,927    76,398    76,307    70,434    72,989    72,384
nl-be     347,330   193,080   176,312   176,256   161,401   170,355   168,738
pl        654,522   355,302   321,305   320,876   284,271   309,553   307,606
pt-br     383,912   211,434   192,464   192,406   170,865   185,586   184,205
        5,799,314 3,212,714 2,935,154 2,933,415 2,637,408 2,830,680 2,810,984

Delta coding + 1 significant digit:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          5,951     2,989     2,837     2,837     2,412     2,837     2,701
common    278,349   165,438   157,906   157,847   143,465   152,699   151,100
cs        320,429   182,852   169,702   169,694   150,797   164,382   163,110
de        518,285   291,627   264,803   264,558   235,696   255,815   254,384
en        732,162   419,187   390,543   390,423   358,769   376,886   375,261
es-es     439,746   248,185   229,008   228,934   203,920   221,240   219,741
fi        301,417   162,183   147,033   146,927   134,901   141,588   140,413
fr        359,584   202,790   187,098   187,002   170,907   180,521   179,309
id        117,900    66,716    62,170    62,117    56,190    59,570    59,155
it        363,387   203,342   187,989   187,941   169,044   181,921   180,498
ja        127,283    68,009    62,735    62,697    57,608    60,041    59,463
nl-be     286,597   162,999   150,597   150,551   138,592   145,621   144,205
pl        506,053   280,206   257,472   257,040   217,449   248,274   246,616
pt-br     298,716   167,513   154,134   154,069   137,090   152,447   147,831
        4,655,859 2,624,036 2,424,027 2,422,637 2,176,840 2,343,842 2,323,787

Delta coding + 1/2 significant digits:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          5,949     2,988     2,836     2,836     2,836     2,702     2,395
common    252,431   149,594   141,318   141,259   137,072   135,083   129,016
cs        288,685   164,360   152,036   152,018   147,651   145,843   135,056
de        459,440   260,720   237,803   237,606   230,221   228,732   213,399
en        668,994   382,902   356,459   356,319   345,330   342,618   326,910
es-es     391,075   221,192   203,774   203,700   197,195   195,446   181,332
fi        270,546   146,063   132,571   132,471   127,718   126,495   120,898
fr        329,572   186,803   172,519   172,438   166,727   165,391   156,983
id        108,650    61,639    57,650    57,588    55,227    54,810    51,846
it        322,857   181,067   167,010   166,984   161,967   160,491   149,620
ja        120,103    63,994    58,826    58,802    56,679    55,877    53,955
nl-be     261,392   149,850   138,994   138,912   135,104   133,130   127,684
pl        458,568   254,462   233,488   233,043   226,774   223,675   191,412
pt-br     277,916   156,547   143,891   143,831   141,306   137,989   127,034
        4,216,178 2,382,181 2,199,175 2,197,807 2,131,807 2,108,282 1,967,540

Delta coding + 0 significant digits:
             orig       gz1       gz6       gz9      ect3      ect9      br11
ar          5,949     2,988     2,836     2,836     2,836     2,702     2,395
common    234,476   137,901   130,470   130,404   125,944   124,730   118,382
cs        250,061   139,348   127,060   127,045   123,310   121,864   112,799
de        386,356   220,600   202,084   201,972   195,969   194,579   182,401
en        592,195   337,236   313,632   313,493   305,343   301,280   286,992
es-es     342,790   192,403   176,264   176,194   170,197   169,016   155,128
fi        247,194   133,099   120,543   120,460   115,908   114,976   109,059
fr        297,438   168,573   155,579   155,522   150,607   149,326   141,729
id         99,359    56,475    53,048    53,018    50,948    50,467    46,941
it        292,112   162,726   149,216   149,171   144,397   143,259   133,073
ja        110,296    58,480    53,484    53,439    51,077    50,634    48,654
nl-be     239,907   138,243   128,757   128,712   124,790   123,206   117,931
pl        404,003   220,682   201,161   200,657   194,919   192,415   162,160
pt-br     242,596   135,444   124,146   124,082   121,224   118,980   107,927
        3,744,732 2,104,198 1,938,280 1,937,005 1,877,469 1,857,434 1,725,571
MrWook added the enhancement (New feature or request) label on Feb 25, 2024
MrWook (Collaborator) commented Feb 25, 2024

Hey @lifthrasiir,

First off, thank you very much for your thorough analysis of the issue with the dictionaries. It would greatly benefit this package to reduce the size of these dictionaries.

Since you've already conducted benchmarks, could you please share the code for Rank Quantization with a repository, a random JavaScript site, or any other resource? This would greatly enhance my understanding of the process and how to implement it into this package.

The delta coding already has a code example, which makes it quite easy to grasp. However, it's important to note that the implementation needs to support different written languages, such as Polish or German with their special characters, which may not be covered by the current regex pattern /([A-Z])/g.

Additionally, we should not assume that the dictionaries are already sorted alphabetically, as they ideally should be sorted by occurrence. Unfortunately, not every government provides a neat open-source list of such data online, so we've sourced some of the data from other places, which is then hosted on GitHub.

You don't need to worry about the inverse transformation during startup, as the dictionaries are already being built at that time, which only takes a few milliseconds. Adding one more small script wouldn't make much of a difference, and it would justify this step to some extent.

This would be the perfect opportunity to implement such changes, as I'm currently working on the next major version, and this would definitely be a breaking change if we transition from JSON files to JavaScript files with a long string inside that needs to be parsed. A good place for implementation would be directly inside the List Generator. The inversion part could be handled within setRankedDictionaries in the Options file. The only downside is that we wouldn't provide simple JSON files for others to import anymore.

I'm always happy to have more contributors on board!

MrWook added this to the 4.0.0 milestone on Feb 25, 2024
lifthrasiir (Author) commented:

> Since you've already conducted benchmarks, could you please share the code for Rank Quantization with a repository, a random JavaScript site, or any other resource? This would greatly enhance my understanding of the process and how to implement it into this package.

I didn't want to alter the generator itself for testing, so I naively did the rank quantization in the prefixCompress function. For example, a quantization to ½ significant digits would need the following code:

   if (!parsed.every((e) => typeof e === 'string' && isCompactDoubleQuotedString(e))) {
     // Should be rare enough, so don't bother escape them.
     return
   }
 
+  for (let step = 100; step < parsed.length; step *= 10) {
+    // Remove at most `M * step` items starting from the offset `N * step`, sort them
+    // lexicographically, and splice them back in at the same offset. (The inner splice has
+    // already removed the items, so the outer splice must not delete anything again.)
+    // `Array.splice` gracefully handles out-of-bound ranges, so no additional check is needed.
+    parsed.splice(1 * step, 0, ...parsed.splice(1 * step, 1 * step).sort())
+    parsed.splice(2 * step, 0, ...parsed.splice(2 * step, 3 * step).sort())
+    parsed.splice(5 * step, 0, ...parsed.splice(5 * step, 5 * step).sort())
+  }
 
   const deltas = []
   let last = ''
   for (const e of parsed) {

You'd probably want to streamline this more in the actual code, e.g. by using a generator.
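
For instance, it could be driven by a small generator that yields the [offset, length] ranges to sort (a sketch; the helper name is mine):

// Sketch of a generator-driven version: yield the [offset, length] ranges that make up
// the 1/2/5 buckets and sort each range in place. slice/splice clamp out-of-bound ranges.
function* quantizedRanges(length, exactCount = 100) {
  for (let step = exactCount; step < length; step *= 10) {
    yield [1 * step, 1 * step]
    yield [2 * step, 3 * step]
    yield [5 * step, 5 * step]
  }
}

for (const [start, len] of quantizedRanges(parsed.length)) {
  const sorted = parsed.slice(start, start + len).sort()
  parsed.splice(start, sorted.length, ...sorted)
}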

> The delta coding already has a code example, which makes it quite easy to grasp. However, it's important to note that the implementation needs to support different written languages, such as Polish or German with their special characters, which may not be covered by the current regex pattern /([A-Z])/g.

You may have thought of capitalizing each word and removing the commas (foo,bar,quux becoming FooBarQuux)? Here the reserved characters replace the comma and act as separators carrying additional information, so you can use any set of characters that does not appear in the word list.

I originally used the control characters /([\x01-\x09])/g, which are also unlikely to appear and don't have to be escaped in string literals (the next character, \x0a, i.e. \n, would have to be), until I realized that every current word list is converted to lowercase, so [A-Z] is guaranteed to always be available. The original pattern can be used as an alternative for lists that do contain uppercase Latin letters.

> Additionally, we should not assume that the dictionaries are already sorted alphabetically, as they ideally should be sorted by occurrence.

That's probably true for the first hundred or thousand words, but the long-tailed nature of such a dictionary limits the accuracy of any further ranks. For example, most would agree that the difference between 500 and 600 occurrences is not worth preserving. (For comparison, there are 2,734 such English words in the aforementioned frequency list.) It's more debatable for 500 vs. 1,000 occurrences though, which would eventually have to be decided.

While I didn't do so in my testing, I agree the actual quantization should be done in the list generator in order to take the actual frequencies into account.

> You don't need to worry about the inverse transformation during startup, as the dictionaries are already being built at that time, which only takes a few milliseconds. Adding one more small script wouldn't make much of a difference, and it would justify this step to some extent.

I think so too, but I was surprised to learn that String.split was much faster than I imagined, so I couldn't ignore the minuscule possibility that some users may have unconsciously relied on this performance behavior. A passing mention in the release notes may be enough.

> This would be the perfect opportunity to implement such changes, as I'm currently working on the next major version, and this would definitely be a breaking change if we transition from JSON files to JavaScript files with a long string inside that needs to be parsed. [...] The only downside is that we wouldn't provide simple JSON files for others to import anymore.

Good to hear, though I think the only necessary change here is to expose a function instead of an array, so that the decompression can happen lazily. I have also briefly considered converting every field in the dictionary object to an accessor property, so that such lazy decompression happens automatically:

import lazyCommonWords from './commonWords'

const cache = {}
const dictionary = {
  // Decompress on first access and memoize the result.
  get ['commonWords-en']() {
    return cache.commonWords ??= lazyCommonWords()
  }
}

The pre-compressed file would be placed in commonWords.js, so the existing commonWords.json might still be retained. (But I'm not sure how many users directly depend on those JSON files.)
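
For illustration, such a generated commonWords.js might look roughly like this (the payload and the exact file layout are assumptions on my part; the decoder mirrors the IIFE shown earlier):

// Hypothetical shape of a generated commonWords.js: export a function rather than
// an array, so the delta-coded payload is only decoded when actually requested.
const compressed = 'AaarenDikaBbagaelFilCbe' // ...truncated delta-coded payload

export default () => {
  const parts = compressed.split(/([A-Z])/g)
  const recons = []
  let last = ''
  for (let i = 1; i < parts.length; i += 2) {
    last = last.slice(0, parts[i].charCodeAt(0) - 65) + parts[i + 1]
    recons.push(last)
  }
  return recons
}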

> The inversion part could be handled within setRankedDictionaries in the Options file.

I would rather put the inversion outside of setRankedDictionaries, because otherwise you have to document the compression format and it can no longer be customized for each input. My proposed schemes are extremely general and applicable to most human text, but you can do better if you know exactly which inputs to expect.

As an extreme example, I once wrote a self-decompressing JavaScript packer named Roadroller. While it is far too computation-intensive to be suitable for general use, Roadroller combined with gzip performs much better than even Brotli, partly because it can tune various parameters to the given input, which makes up for the added size of the built-in decompressor. But Roadroller can't (yet) see the underlying human assumptions, so it doesn't outperform the rank quantization and delta coding described above.

Of course, only a small number of compression strategies would be in use at any point in time, so the shared decompression code can still be refactored into its own module if needed. In my testing that code was duplicated for every list, but I made sure that the duplicated code always stays identical, so it can't significantly worsen the compression (an additional function expression was sufficient for that).

mimi89999 (Contributor) commented:

Also, I think it would be better to ship the dictionaries pre-compressed in the packages. Web servers often do the compression on the fly and therefore use low compression levels. Precompressing would make it possible to use the maximum gzip compression level. The dictionaries could then be decompressed in the browser using a built-in API: https://developer.mozilla.org/en-US/docs/Web/API/Compression_Streams_API

mimi89999 (Contributor) commented:

By using gzip -9 and encoding the compressed content with base64, I still got a compression ratio of 1.6. That's quite impressive.

mimi89999 (Contributor) commented:

By using gzip at level 9 and base85 encoding, I managed to achieve a compression ratio of around 2.0, which means the file size is cut in half.

The Compression Streams API is supported by about 90% of browsers in use (https://caniuse.com/mdn-api_compressionstream) and can decompress gzip, so no additional library is required. The base85 implementation is about 4 KB unminified.
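
The browser-side decode path could then look something like this sketch (decodeBase85 is my assumed counterpart to the encodeBase85 used in the POC below, and the 'gzip' format has to match whatever the build step actually produced):

// Sketch: base85-decode the embedded payload to bytes, gunzip it with the built-in
// DecompressionStream, and parse the resulting JSON word list.
import { decodeBase85 } from '@alttiri/base85'

const decompressDictionary = async (encoded) => {
  const bytes = decodeBase85(encoded)
  const stream = new Blob([bytes]).stream().pipeThrough(new DecompressionStream('gzip'))
  return JSON.parse(await new Response(stream).text())
}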

Here is a simple POC for compressing the dictionaries:

diff --git a/data-scripts/_helpers/runtime.ts b/data-scripts/_helpers/runtime.ts
index 4340aea..67f2687 100644
--- a/data-scripts/_helpers/runtime.ts
+++ b/data-scripts/_helpers/runtime.ts
@@ -1,5 +1,7 @@
 import * as fs from 'fs'
+import * as fflate from 'fflate';
 import * as path from 'path'
+import { encodeBase85 } from "@alttiri/base85";
 import { execSync } from 'child_process'
 
 export interface LooseObject {
@@ -58,6 +60,12 @@ export default class ListHandler {
           path.join(folder, `${options.filename}.json`),
           JSON.stringify(data),
         )
+        const buf = fflate.strToU8(JSON.stringify(data))
+        const compressed = fflate.compressSync(buf, { level: 9, mem: 12 })
+        const compressedString = fflate.strFromU8(compressed, true)
+
+        console.log(buf.length / btoa(compressedString).length)
+        console.log(buf.length / encodeBase85(compressed).length)
       }
       // eslint-disable-next-line no-console
       console.timeEnd(options.filename)

mimi89999 (Contributor) commented:

I implemented it in #276
