Bias term using NCD? #44

Maimonator · 2023-08-03T17:14:42Z

Hey!
I've been playing around with this method and I've noticed something interesting.

When calculating the NCD you use

ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1,Cx2)

And Cx1x2, Cx1 and Cx2 are computed using gzip.compress
The thing is, that the output of gzip.compress contains not only the compressed data but metadata on it.
This includes signature, compression method, flags, modification time, crc32 and more. Depends on the configuration this is at least 20 bytes that I think should not be a part of the NCD calculation as they don't represent any actual information of the string that was compressed.
On my WSL (Ubuntu 22.04) the default behavior is 20 bytes of metadata.

To be clear, essentially what happens is this:

ncd = (Cx1x2 + metadata_size - min(Cx1,Cx2) - metadata_size) / (max(Cx1, Cx2) + metadata_size)

Basically all the distances are calculated with a bias term in the denominator of +20, as the metadata size is nullified in the numerator.
This means that all samples receive a bias towards being closer to each other, and not in the same way for each two samples that the NCD is calculated for.

I've tested various values for metadata_size by adding a constant to the denominator on the AG News dataset.
Note that I specifically used majority vote for accuracy with different K's so the results are not the same as the paper's

When the denominator constant is 0 the value is the same as in the original implementation of NCD, -20 is how I suspect it should be.

I also tested this with increments of 20 up to 200

It seems that for AG News the higher the denominator constant the lower the accuracy.

Does this make sense or am I missing something?

The text was updated successfully, but these errors were encountered:

bazingagin · 2023-08-04T02:50:48Z

hi @Maimonator, this is an interesting finding! The metadata occupying the denominator is larger than I thought.
It seems removing the bias increases the accuracy a bit? (Maybe needs to confirm by using e.g., -40 as the bias term and see if the accuracy decreases?)

Maimonator · 2023-08-04T05:12:15Z

hi @Maimonator, this is an interesting finding! The metadata occupying the denominator is larger than I thought. It seems removing the bias increases the accuracy a bit? (Maybe needs to confirm by using e.g., -40 as the bias term and see if the accuracy decreases?)

Yes I I think that removing the bias increases the accuracy but I wouldn't use anything higher than -20 as you increase the risk of zero division or getting negative values which as I understand the paper, the idea is just to normalize the distance to be between 0 and 1.
Also it's important to note that this has a larger impact on shorter strings .

bazingagin · 2023-08-04T09:54:44Z

Yeah, you are right - subtracting more than 20 may increase the chance of zero division.
This bias term will have more impact on shorter strings but I'm curious whether this "relatively higher" impact on the shorter string affects the preference for the length of the training sample.

Maimonator · 2023-08-04T10:11:06Z

Yeah, you are right - subtracting more than 20 may increase the chance of zero division. This bias term will have more impact on shorter strings but I'm curious whether this "relatively higher" impact on the shorter string affects the preference for the length of the training sample.

I'll try and find a dataset with class imbalance where I think the impact could be prominent in false negatives/false positives.
Maybe I'll do it later today 🤗

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bias term using NCD? #44

Bias term using NCD? #44

Maimonator commented Aug 3, 2023

bazingagin commented Aug 4, 2023

Maimonator commented Aug 4, 2023

bazingagin commented Aug 4, 2023

Maimonator commented Aug 4, 2023

Bias term using NCD? #44

Bias term using NCD? #44

Comments

Maimonator commented Aug 3, 2023

bazingagin commented Aug 4, 2023

Maimonator commented Aug 4, 2023

bazingagin commented Aug 4, 2023

Maimonator commented Aug 4, 2023