rling -q cw drops some records during counting #47

Open
roycewilliams opened this issue Oct 10, 2024 · 0 comments

roycewilliams commented Oct 10, 2024

Example: here are the top frequency counts of TLDs from a domain dump, produced with a Perl script:

198174606:com
16123338:net
13445547:org
8481074:top
5972916:xyz
4899113:info
4250349:online
3173095:shop
2285882:site
1977737:store
1616455:app
1592094:biz
1142129:icu
1126259:vip

... but no matter what I do, the top of rling -q cw's output starts here, with every TLD from com through shop missing (icu is also absent further down):

  Count Line
 2285882 site
 1977737 store
 1616455 app
 1592094 biz
 1126259 vip
 1017868 cfd 
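
To decide which of the two listings is right, it helps to have a third count that shares no code with either tool. The Perl script is not shown here, so the following is only a stand-in sketch: `freqcount_ref` is a hypothetical helper name, and it assumes the same `count:line` output format as the first listing. It keeps every unique line in memory, which should be acceptable for a list of TLDs.

```shell
# Reference frequency count, independent of rling (hypothetical helper).
# Prints "count:line" with the highest count first, like the first listing.
# Reads the named files, or stdin when no file is given.
freqcount_ref() {
    awk '{ seen[$0]++ }
         END { for (line in seen) print seen[line] ":" line }' "$@" |
        LC_ALL=C sort -t: -k1,1nr -k2
}
```

Running `freqcount_ref tlds.txt | head -n 20` (with `tlds.txt` standing for the extracted TLD list, one per line) gives a listing that can be compared directly against both outputs above.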

A potentially related mismatch can be reproduced with a file that contains only a single repeated line. rling reports 1 unique line and "Wrote 1 lines", but prints no count row:

$ yes | head -n 4M >yes4m.list
$ wc -l yes4m.list 
4194304 yes4m.list
$ rling -q cw yes4m.list stdout
Reading "yes4m.list"...8388608 bytes total in 0.0064 seconds
Counting lines...Found 4194304 lines in 0.1082 seconds
Estimated memory required: 213,909,536 (204.00Mbytes)
Sorting... took 0.0000 seconds
Frequency:  1 unique (4194303 duplicate lines) in 0.3733 seconds

0 total lines matched in 0.3733 seconds
Input file had 4,194,304 lines, with lengths from 1 to 1
Writing analysis to "stdout"
   Count Line

Wrote 1 lines in 0.0000 seconds
Total runtime 0.4880 seconds

There also seems to be a threshold at 100,000 repetitions (and the ww row is missing from every run):

$ rm test.dat; yes 'ww' | head -n 100000 >test.dat; yes 'ff' | head -n 100000 >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line

$ rm test.dat; yes 'ww' | head -n 10000 >test.dat; yes 'ff' | head -n 10000 >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line
   10000 ff

$ count=99999; rm test.dat; yes 'ww' | head -n ${count} >test.dat; yes 'ff' | head -n ${count} >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line
   99999 ff

$ count=100000; rm test.dat; yes 'ww' | head -n ${count} >test.dat; yes 'ff' | head -n ${count} >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line
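
The manual runs above could be wrapped in a small check so that the threshold is easier to probe. This is a sketch, not a tested harness: `count_cmd` and `check_counts` are hypothetical names, the rling invocation is copied from the commands above, and the row parsing assumes the `Count Line` layout shown in the output (count in the first column, line text in the second).

```shell
# Run rling exactly as in the report. Redefine this function to try
# another counter against the same check.
count_cmd() {
    rling -q cw "$1" stdout 2>/dev/null
}

# Build a file with n lines of "ww" followed by n lines of "ff", run the
# counter on it, and report any of the two expected rows that is missing.
# Returns 0 when both rows are present with count n, 1 otherwise.
check_counts() {
    n=$1
    file=$(mktemp) || return 2
    awk -v n="$n" 'BEGIN {
        for (i = 0; i < n; i++) print "ww"
        for (i = 0; i < n; i++) print "ff"
    }' > "$file"
    out=$(count_cmd "$file")
    rm -f "$file"
    rc=0
    for word in ww ff; do
        # A row matches when column 1 is the count and column 2 is the word.
        if ! printf '%s\n' "$out" | awk -v n="$n" -v w="$word" \
            '$1 == n && $2 == w { found = 1 } END { exit !found }'; then
            echo "n=$n: missing row for $word"
            rc=1
        fi
    done
    return $rc
}
```

A sweep such as `for n in 10000 99999 100000 100001; do check_counts "$n"; done` would then print one line per missing row, which makes it easier to see where rows start disappearing.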
