Add chi square calculation function to math module. #69
We should update the benchmarks as well. I am curious how the throughput compares between the algorithms.
Done! Throughput is quite similar.
Very nice!
When using version 1.65, the following error message is received when running cargo:

> error: package `clap_lex v0.7.2` cannot be built because it requires rustc 1.74 or newer, while the currently active rustc version is 1.65.0
> Either upgrade to rustc 1.74 or newer, or use
> cargo update -p clap_lex@0.7.2 --precise ver
> where `ver` is the latest version of `clap_lex` supporting rustc 1.65.0

When using version 1.74 and statrs we get:

> error: unsupported output in build script of libm v0.2.9: cargo::rustc-check-cfg=cfg(assert_no_panic)
> Found a cargo::key=value build directive which is reserved for future use.
> Either change the directive to cargo:key=value syntax (note the single :) or upgrade your version of Rust.
Add chi_square_probability function to math_tools module.

This function returns the Chi Square distribution probability. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy. In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

The chi-square distribution is calculated for the stream of bytes in the chunk and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated. The percentage is the only value that is of interest from unblob's perspective, so that's why we only return it.

According to ent doc [0]:

> We [can] interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect. Percentages between 90% and 95% and 5% and 10% indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/
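For readers who want to see the "absolute number" part of the calculation spelled out, here is a minimal sketch of a Pearson chi-square statistic over the 256 possible byte values. The function name `chi_square_statistic` is hypothetical and is not the `chi_square_probability` function added by this commit. Turning the statistic into the percentage described above additionally requires the chi-square cumulative distribution with 255 degrees of freedom, which would normally come from a statistics crate and is left out here.

```rust
/// Pearson chi-square statistic of `data` against a uniform distribution
/// over the 256 possible byte values. Illustrative only: this is not the
/// code from the commit, and it stops before the probability step.
fn chi_square_statistic(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }

    // Count how often each byte value occurs.
    let mut counts = [0u64; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }

    // Under the uniform hypothesis every value is expected len / 256 times.
    let expected = data.len() as f64 / 256.0;

    // Sum of (observed - expected)^2 / expected over all 256 buckets.
    counts
        .iter()
        .map(|&observed| {
            let diff = observed as f64 - expected;
            diff * diff / expected
        })
        .sum()
}
```

A perfectly balanced input (every byte value equally often) yields 0, while an input made of a single repeated byte yields 255 times its length, so larger values mean a larger departure from uniformity.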
Currently, unblob-native exposes `shannon_entropy` to calculate entropy levels of chunks. While valid, this approach is limited in that we cannot rely on it to differentiate between compressed and encrypted data streams.

An improved approach involves Chi-square tests. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.
In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.
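The difference in sensitivity can be illustrated with a small sketch. All names below are hypothetical and the histogram is synthetic, chosen only for illustration: half of the byte values occur 2720 times and the other half 2400 times. Shannon entropy for this histogram stays just below 8 bits per byte, whereas the chi-square statistic comes out around ten times the 255 one would expect on average for uniformly random bytes.

```rust
/// Shannon entropy in bits per byte, computed from a 256-bucket histogram.
fn entropy_from_histogram(counts: &[u64; 256]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Pearson chi-square statistic of the histogram against a uniform
/// distribution over 256 byte values.
fn chi_square_from_histogram(counts: &[u64; 256]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let expected = total as f64 / 256.0;
    counts
        .iter()
        .map(|&c| {
            let diff = c as f64 - expected;
            diff * diff / expected
        })
        .sum()
}

/// Synthetic, mildly skewed histogram (an assumption made for this example):
/// byte values 0..=127 occur 2720 times, values 128..=255 occur 2400 times.
fn skewed_histogram() -> [u64; 256] {
    let mut counts = [2400u64; 256];
    for c in counts.iter_mut().take(128) {
        *c = 2720;
    }
    counts
}
```

Feeding `skewed_histogram()` to both functions shows an entropy that is hard to tell apart from random data next to a chi-square value that clearly flags the skew.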
According to ent: