Compute and expose χ² probability in EntropyReport #995
Conversation
Force-pushed from 65c20e2 to aeaa3d2
Force-pushed from aeaa3d2 to 0459ac8
[Plot: entropy levels for a plaintext lorem ipsum file]
[Plot: entropy levels of that file, now XOR'ed]
[Plot: entropy levels of that plaintext file, now gzip'ed]
[Plot: entropy levels of that plaintext file, now AES encrypted]

The χ² probability conveys more precise information about the nature of the analyzed data. We can spot weak encryption such as XOR even when Shannon entropy reports constantly high levels of entropy, and we can differentiate compressed from encrypted data through the χ² probability distribution, even though they share similar Shannon entropy levels.
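As a hedged sketch of the comparison described above (not unblob's code): it assumes scipy is installed and uses `scipy.stats.chisquare` instead of unblob-native's `chi_square_probability`; the sample data and key are made up for illustration.

```python
# Hedged sketch: compare Shannon entropy with χ² probability for XOR'ed,
# compressed and random-looking data. Assumes scipy is available; this is
# not unblob's implementation.
import math
import os
import zlib
from collections import Counter

from scipy.stats import chisquare


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    counts = Counter(data)
    return -sum((n / len(data)) * math.log2(n / len(data)) for n in counts.values())


def chi_square_probability(data: bytes) -> float:
    """Probability that a uniform byte source yields a χ² statistic at least
    this large (255 degrees of freedom over the 256 possible byte values)."""
    counts = Counter(data)
    observed = [counts.get(byte, 0) for byte in range(256)]
    return chisquare(observed).pvalue


plaintext = " ".join(f"lorem ipsum dolor sit amet {i}" for i in range(5000)).encode()
key = b"\x42\x13\x37"  # illustrative repeating XOR key
samples = {
    "xor'ed": bytes(c ^ key[i % len(key)] for i, c in enumerate(plaintext)),
    "gzip'ed": zlib.compress(plaintext),
    "random": os.urandom(len(plaintext)),  # stand-in for AES ciphertext
}

for name, blob in samples.items():
    print(
        f"{name:>8}: shannon={shannon_entropy(blob):.2f} bits/byte, "
        f"chi2 p={chi_square_probability(blob) * 100:.2f}%"
    )
```

Comparing the two columns of the output illustrates the point: Shannon entropy alone cannot separate compressed from encrypted data, while the χ² probability reacts to the remaining non-uniformity.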
Force-pushed from 0459ac8 to b0228db
Force-pushed from b0228db to 448c685
I would prefer a wording change, both in the commit message and in the sources, to use randomness or randomness measure instead of entropy.
In the commit message of 448c685, chi^2 probability is categorized as a kind of entropy, probably because we already have entropy as a randomness measure.
Force-pushed from 93aec8d to d123265
Force-pushed from cceeb8f to 42ad2a6
Force-pushed from 3ed5c17 to aaa7361
@e3krisztian now using unblob-native version 0.1.5, which was released today. Ready to be merged, I think.
We should update the required version to 0.1.5 as well, as that is what matters when people install unblob from PyPI.
Force-pushed from 0b9cb0d to 7f034e2
@vlaci done. Also, I hate this caret notation.
…sReport

Introduce another randomness measure based on Chi Square probability by using unblob-native's chi_square_probability function. This function returns the Chi Square distribution probability.

Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

The chi-square distribution is calculated for the stream of bytes in the chunk and expressed as an absolute number and as a percentage, which indicates how frequently a truly random sequence would exceed the calculated value. The percentage is the only value of interest from unblob's perspective, which is why we only return it.

According to the ent documentation [0]:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is "almost suspect".

[0] - https://www.fourmilab.ch/random/

This randomness measure is introduced by modifying the EntropyReport class so that it contains two RandomnessMeasurements:

- shannon: for Shannon entropy, which was already there
- chi_square: for Chi Square probability, which we introduce

EntropyReport is renamed to RandomnessReport to reflect that not all measurements are entropy related.

format_entropy_plot has been adjusted to display two lines within the entropy graph: one for Shannon, the other for Chi Square.

This commit breaks the previous API by converting entropy_depth and entropy_plot to randomness_depth and randomness_plot in ExtractionConfig. The '--entropy-depth' CLI option is replaced by '--randomness-depth'.
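For orientation, a hedged sketch of the reshaped report follows; only the names RandomnessReport, RandomnessMeasurement, shannon and chi_square come from the commit message, while the per-block percentage field and the mean property are assumptions made for illustration, not unblob's actual class definitions.

```python
# Hedged sketch of the reshaped report described above, NOT unblob's actual
# classes. The per-block `percentages` field and `mean` property are
# assumptions for illustration.
import statistics
from dataclasses import dataclass, field


@dataclass
class RandomnessMeasurement:
    percentages: list[float] = field(default_factory=list)  # one value per block, in percent

    @property
    def mean(self) -> float:
        return statistics.mean(self.percentages) if self.percentages else 0.0


@dataclass
class RandomnessReport:
    shannon: RandomnessMeasurement     # Shannon entropy, already present before this change
    chi_square: RandomnessMeasurement  # χ² probability, introduced by this change
```

On the command line, the API break described in the last paragraph means an invocation such as `unblob --entropy-depth 1 firmware.bin` becomes `unblob --randomness-depth 1 firmware.bin` (the option names are stated in the commit message; the rest of the invocation is only an example).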
Force-pushed from 7f034e2 to 8e2e11b
Introduce another entropy measure based on Chi Square probability by using unblob-native's chi_square_probability function. This function returns the Chi Square distribution probability.
Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.
In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.
The chi-square distribution is calculated for the stream of bytes in the chunk and expressed as an absolute number and as a percentage, which indicates how frequently a truly random sequence would exceed the calculated value. The percentage is the only value of interest from unblob's perspective, which is why we only return it.
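A hedged illustration of the relationship between the absolute statistic and the percentage follows; unblob gets the value from unblob-native's chi_square_probability, while this sketch assumes scipy is available and uses its chi2 survival function.

```python
# Hedged illustration of the statistic/percentage relationship described
# above; not unblob-native's implementation. Assumes scipy is available.
from collections import Counter

from scipy.stats import chi2


def chi_square_statistic_and_percentage(data: bytes) -> tuple[float, float]:
    counts = Counter(data)
    expected = len(data) / 256  # uniform expectation for each of the 256 byte values
    # absolute χ² statistic: sum of (observed - expected)^2 / expected
    statistic = sum(
        (counts.get(byte, 0) - expected) ** 2 / expected for byte in range(256)
    )
    # survival function with 255 degrees of freedom: how often a truly random
    # sequence would exceed this statistic, expressed as a percentage
    percentage = chi2.sf(statistic, df=255) * 100
    return statistic, percentage
```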
According to the ent documentation [0]:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is "almost suspect".

[0] - https://www.fourmilab.ch/random/
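A hypothetical helper that just encodes those bands (the function name and return labels are illustrative, not part of ent or unblob):

```python
# Hypothetical helper encoding the ent interpretation bands quoted above;
# the name and labels are illustrative only.
def interpret_chi_square_percentage(percentage: float) -> str:
    if percentage > 99 or percentage < 1:
        return "almost certainly not random"
    if 95 <= percentage <= 99 or 1 <= percentage <= 5:
        return "suspect"
    if 90 <= percentage <= 95 or 5 <= percentage <= 10:
        return "almost suspect"
    return "consistent with random data"
```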
This entropy measure is introduced by modifying the EntropyReport class so that it contains two EntropyMeasures:

- shannon: for Shannon entropy, which was already there
- chi_square: for Chi Square probability, which we introduce

The format_entropy_plot has been adjusted to display two lines within the entropy graph: one for Shannon, the other for Chi Square.
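A minimal sketch of the "two lines in one graph" idea follows; it uses matplotlib purely for illustration and is not how unblob renders its plot.

```python
# Illustration only: plot the two per-block measurements, both in percent,
# as two lines in one graph. Not unblob's actual plotting code.
import matplotlib.pyplot as plt


def plot_randomness(shannon_percentages: list[float], chi_square_percentages: list[float]) -> None:
    plt.plot(shannon_percentages, label="Shannon entropy (%)")
    plt.plot(chi_square_percentages, label="Chi Square probability (%)")
    plt.xlabel("block index")
    plt.ylabel("percent")
    plt.ylim(0, 100)
    plt.legend()
    plt.show()
```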