
Compute and expose χ² probability in EntropyReport #995

Merged: 3 commits merged into main from feat-chisquare-entropy, Nov 8, 2024

Conversation

@qkaiser (Contributor) commented Oct 27, 2024

Introduce another entropy measure based on Chi Square probability by using unblob-native's chi_square_probability function. This function returns the Chi Square distribution probability.

Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

The chi-square distribution is calculated for the stream of bytes in the chunk and expressed as an absolute number and a percentage, which indicates how frequently a truly random sequence would exceed the calculated value. Only the percentage is of interest from unblob's perspective, so that is the only value we return.

According to ent doc⁰:

We [can] interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect. Percentages between 90% and 95% and 5% and 10% indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/
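As a rough illustration of what this measure computes (a sketch only, not unblob-native's actual Rust implementation), the chi-square statistic over byte counts and its tail probability can be approximated in Python using the Wilson-Hilferty approximation to the chi-square survival function:

```python
import math
from collections import Counter

def chi_square_probability(data: bytes) -> float:
    """Approximate probability that a truly random byte sequence would
    exceed the chi-square statistic observed for `data` (0.0 .. 1.0)."""
    n = len(data)
    expected = n / 256  # uniform expectation for each of the 256 byte values
    counts = Counter(data)
    chi2 = sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))
    k = 255  # degrees of freedom
    # Wilson-Hilferty approximation: (chi2/k)^(1/3) is roughly normal
    z = ((chi2 / k) ** (1 / 3) - (1 - 2 / (9 * k))) / math.sqrt(2 / (9 * k))
    return 0.5 * math.erfc(z / math.sqrt(2))
```

Per the ent interpretation quoted above, values near 0% or 100% indicate a decidedly non-random byte distribution, while mid-range values are consistent with random (e.g. encrypted) data.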

This entropy measure is introduced by modifying the EntropyReport class so that it contains two EntropyMeasures:

  • shannon: for Shannon entropy, which was already there
  • chi_square: for Chi Square entropy, which we introduce

The format_entropy_plot function has been adjusted to display two lines within the entropy graph: one for Shannon, the other for Chi Square.
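The reshaped report described above might look roughly like this; the class and field names follow the PR description, but the actual unblob definitions may differ:

```python
from dataclasses import dataclass

# Illustrative sketch only -- field names are assumptions based on the
# PR description, not unblob's actual source.
@dataclass
class EntropyMeasurements:
    percentages: list  # per-block values (floats), as drawn in the plot

@dataclass
class EntropyReport:
    shannon: EntropyMeasurements     # Shannon entropy per block
    chi_square: EntropyMeasurements  # chi-square probability per block
```

Keeping both measures side by side lets the plot (and any consumer of the report) compare them block by block.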

@qkaiser qkaiser added the enhancement New feature or request label Oct 27, 2024
@qkaiser qkaiser self-assigned this Oct 27, 2024
@qkaiser qkaiser marked this pull request as draft October 27, 2024 12:00
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch from 65c20e2 to aeaa3d2 Compare October 27, 2024 13:41
@qkaiser qkaiser changed the title Compute and expose Chi Square entropy levels Compute and expose χ² probability entropy levels Oct 27, 2024
@qkaiser qkaiser changed the title Compute and expose χ² probability entropy levels Compute and expose χ² probability in EntropyReport Oct 27, 2024
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch from aeaa3d2 to 0459ac8 Compare October 27, 2024 13:48
@qkaiser (Contributor, Author) commented Oct 27, 2024

Entropy levels for a plaintext lorem ipsum:

[entropy plot screenshot]

Entropy levels of that file, now XOR'ed:

[entropy plot screenshot]

Entropy levels of that plaintext file, now gzip'ed:

[entropy plot screenshot]

Entropy levels of that plaintext file, now AES encrypted:

[entropy plot screenshot]

The χ² probability conveys more precise information about the nature of the analyzed data. We can spot weak encryption like XOR even when Shannon entropy reports constantly high entropy levels, and we can differentiate compressed from encrypted data through the χ² probability distribution, even though the two share similar Shannon entropy levels.
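One reason Shannon entropy alone falls short: it only looks at the byte-frequency histogram, so any relabeling of byte values, such as a single-byte XOR, leaves it unchanged. A quick sketch (illustrative, not unblob's code):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 .. 8.0)."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

plaintext = b"lorem ipsum dolor sit amet " * 100
xored = bytes(b ^ 0x5A for b in plaintext)

# A single-byte XOR merely permutes byte values, so the histogram -- and
# therefore Shannon entropy -- is unchanged. A chi-square test against
# the uniform distribution is what exposes such non-random structure.
assert abs(shannon_entropy(plaintext) - shannon_entropy(xored)) < 1e-9
```

This is why the screenshots above show Shannon reporting similar levels for data that χ² separates cleanly.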

@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch from 0459ac8 to b0228db Compare October 27, 2024 13:56
@qkaiser qkaiser marked this pull request as ready for review October 27, 2024 13:56
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch from b0228db to 448c685 Compare October 27, 2024 14:03
@qkaiser qkaiser requested review from e3krisztian and vlaci October 29, 2024 07:52
@e3krisztian (Contributor) left a comment


I would prefer a wording change both in the commit message and in sources to use randomness or randomness measure instead of entropy.


In the commit message of 448c685, chi^2 probability is categorized as a kind of entropy, probably because we already have entropy as a randomness measure.

Review thread on unblob/report.py (outdated, resolved)
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch 2 times, most recently from 93aec8d to d123265 Compare October 30, 2024 11:02
@qkaiser qkaiser requested a review from e3krisztian November 3, 2024 11:19
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch 2 times, most recently from cceeb8f to 42ad2a6 Compare November 5, 2024 10:25
Review thread on unblob/processing.py (outdated, resolved)
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch 3 times, most recently from 3ed5c17 to aaa7361 Compare November 7, 2024 17:08
@qkaiser (Contributor, Author) commented Nov 7, 2024

@e3krisztian now using unblob-native version 0.1.5 that was released today. Ready to be merged I think.

@vlaci (Contributor) commented Nov 7, 2024

> @e3krisztian now using unblob-native version 0.1.5 that was released today. Ready to be merged I think.

We should update the required version to 0.1.5 as well, since that is what matters when people install unblob from PyPI.

@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch 3 times, most recently from 0b9cb0d to 7f034e2 Compare November 7, 2024 20:35
@qkaiser (Contributor, Author) commented Nov 7, 2024

@vlaci done. Also, I hate this caret notation.

…sReport

Introduce another randomness measure based on Chi Square probability by
using unblob-native's chi_square_probability function. This function
returns the Chi Square distribution probability.

Chi-square tests are effective for distinguishing compressed from
encrypted data because they evaluate the uniformity of byte
distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to
patterns that still exist (albeit less detectable), resulting in a
non-uniform distribution. Encrypted data, by contrast, exhibits nearly
perfect uniformity, as each byte value from 0–255 is expected to appear
with almost equal frequency, making it harder to detect any discernible
patterns.

The chi-square distribution is calculated for the stream of bytes in the
chunk and expressed as an absolute number and a percentage which
indicates how frequently a truly random sequence would exceed the value
calculated. The percentage is the only value that is of interest from
unblob's perspective, so that's why we only return it.

According to ent doc⁰:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/

This randomness measure is introduced by modifying the EntropyReport class
so that it contains two RandomnessMeasurements:
- shannon: for Shannon entropy, which was already there
- chi_square: for Chi Square probability, which we introduce

EntropyReport is renamed to RandomnessReport to reflect that not all
measurements are entropy related.

The format_entropy_plot has been adjusted to display two lines within
the entropy graph. One for Shannon, the other for Chi Square.

This commit breaks the previous API by converting
entropy_depth and entropy_plot to randomness_depth and randomness_plot
in ExtractionConfig. The '--entropy-depth' CLI option is replaced by
'--randomness-depth'.
@qkaiser qkaiser force-pushed the feat-chisquare-entropy branch from 7f034e2 to 8e2e11b Compare November 8, 2024 08:32
@vlaci vlaci merged commit 5bec244 into main Nov 8, 2024
15 checks passed
@vlaci vlaci deleted the feat-chisquare-entropy branch November 8, 2024 10:38