-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bias term using NCD? #44
Comments
hi @Maimonator, this is an interesting finding! The metadata occupying the denominator is larger than I thought. |
Yes I I think that removing the bias increases the accuracy but I wouldn't use anything higher than |
Yeah, you are right - subtracting more than 20 may increase the chance of zero division. |
I'll try and find a dataset with class imbalance where I think the impact could be prominent in false negatives/false positives. |
Hey!
I've been playing around with this method and I've noticed something interesting.
When calculating the NCD you use
And Cx1x2, Cx1 and Cx2 are computed using
gzip.compress
The thing is, that the output of
gzip.compress
contains not only the compressed data but metadata on it.This includes signature, compression method, flags, modification time, crc32 and more. Depends on the configuration this is at least 20 bytes that I think should not be a part of the NCD calculation as they don't represent any actual information of the string that was compressed.
On my WSL (Ubuntu 22.04) the default behavior is 20 bytes of metadata.
To be clear, essentially what happens is this:
Basically all the distances are calculated with a bias term in the denominator of +20, as the metadata size is nullified in the numerator.
This means that all samples receive a bias towards being closer to each other, and not in the same way for each two samples that the NCD is calculated for.
I've tested various values for metadata_size by adding a constant to the denominator on the AG News dataset.
Note that I specifically used majority vote for accuracy with different K's so the results are not the same as the paper's
When the denominator constant is
0
the value is the same as in the original implementation of NCD,-20
is how I suspect it should be.I also tested this with increments of
20
up to200
It seems that for AG News the higher the denominator constant the lower the accuracy.
Does this make sense or am I missing something?
The text was updated successfully, but these errors were encountered: