Add chi square calculation function to math module. #69

Merged on Nov 6, 2024 (5 commits)


@qkaiser qkaiser commented Oct 25, 2024

Currently, unblob-native exposes the shannon_entropy function to calculate the entropy of chunks. While valid, this approach is limited: we cannot rely on it to differentiate between compressed and encrypted data streams.

An improved approach involves Chi-square tests. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

According to ent:

> The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated.
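
As a rough illustration of the statistic described above, here is a minimal sketch that computes the chi-square value of a byte slice against a uniform distribution over the 256 byte values. The function name `chi_square_statistic` is hypothetical and is not the function added by this PR; it only shows the "absolute number" part of the calculation, not the probability.

```rust
/// Chi-square statistic of a byte slice against a uniform distribution
/// over the 256 possible byte values. Returns 0.0 for empty input.
/// NOTE: illustrative helper, not part of the unblob-native API.
pub fn chi_square_statistic(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Histogram of observed byte values.
    let mut counts = [0u64; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }
    // Under the uniform hypothesis every value is expected n / 256 times.
    let expected = data.len() as f64 / 256.0;
    counts
        .iter()
        .map(|&observed| {
            let diff = observed as f64 - expected;
            diff * diff / expected
        })
        .sum()
}
```

A perfectly flat histogram yields a value of zero, while a chunk made of a single repeated byte yields a very large value. Truly random data should land somewhere in between, which is why the statistic is converted into a probability before being interpreted.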

@qkaiser qkaiser added enhancement New feature or request rust Pull requests that update Rust code labels Oct 25, 2024
@qkaiser qkaiser requested a review from vlaci October 25, 2024 11:34
@qkaiser qkaiser self-assigned this Oct 25, 2024
@qkaiser qkaiser force-pushed the chisquare branch 4 times, most recently from bec8800 to 96d7d42 Compare October 27, 2024 13:33

vlaci commented Oct 29, 2024

We should update the benchmarks as well. I am curious how the throughput compares between the algorithms.

@qkaiser qkaiser force-pushed the chisquare branch 2 times, most recently from b721665 to 816251a Compare October 29, 2024 13:04

qkaiser commented Oct 29, 2024

> We should update the benchmarks as well. I am curious how the throughput compares between the algorithms.

Done! Throughput is quite similar:

Shannon entropy/1048576 time:   [932.18 µs 953.68 µs 981.25 µs]
                        thrpt:  [1019.1 MiB/s 1.0240 GiB/s 1.0476 GiB/s]


Chi square probability/1048576
                        time:   [921.64 µs 971.12 µs 1.0295 ms]
                        thrpt:  [971.32 MiB/s 1.0056 GiB/s 1.0596 GiB/s]


qkaiser commented Oct 29, 2024

I'll adapt the Rust test cases to be similar to the ones we do from Python. Done.

@vlaci vlaci left a comment

Very nice!

When using Rust 1.65, running cargo fails with the following error
message:

> error: package `clap_lex v0.7.2` cannot be built because it requires
> rustc 1.74 or newer, while the currently active rustc version is
> 1.65.0
> Either upgrade to rustc 1.74 or newer, or use
> cargo update -p clap_lex@0.7.2 --precise ver
> where `ver` is the latest version of `clap_lex` supporting rustc 1.65.0

When using Rust 1.74 together with statrs, we get:

> error: unsupported output in build script of libm v0.2.9:
> cargo::rustc-check-cfg=cfg(assert_no_panic) Found a cargo::key=value
> build directive which is reserved for future use. Either change the
> directive to cargo:key=value syntax (note the single :) or upgrade your
> version of Rust.

Add a chi_square_probability function to the math_tools module. This
function returns the chi-square distribution probability.

Chi-square tests are effective for distinguishing compressed from
encrypted data because they evaluate the uniformity of byte
distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to
patterns that still exist (albeit less detectable), resulting in a
non-uniform distribution. Encrypted data, by contrast, exhibits nearly
perfect uniformity, as each byte value from 0–255 is expected to appear
with almost equal frequency, making it harder to detect any discernible
patterns.

The chi-square distribution is calculated for the stream of bytes in the
chunk and expressed as an absolute number and a percentage which
indicates how frequently a truly random sequence would exceed the value
calculated. The percentage is the only value of interest from unblob's
perspective, which is why we only return it.
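
For reference, the calculation described above can be written out as follows. This is the standard textbook form, not a transcription of the code in this commit; it assumes 256 equally likely byte values and therefore k = 255 degrees of freedom, which the commit message does not state explicitly. Here O_i is the observed count of byte value i, n is the chunk length, and Q is the regularized upper incomplete gamma function.

```latex
\chi^2 = \sum_{i=0}^{255} \frac{(O_i - E)^2}{E},
\qquad E = \frac{n}{256}

p = \Pr\left[X \ge \chi^2\right]
  = Q\!\left(\frac{k}{2}, \frac{\chi^2}{2}\right),
\qquad X \sim \chi^2_k,\quad k = 255
```

The value p, expressed as a percentage, is the quantity that says how frequently a truly random sequence would exceed the calculated statistic.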

According to the ent documentation [0]:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/
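
The thresholds quoted above can be sketched as a small classifier. This is only an illustration: the `Randomness` enum and the `interpret` function are hypothetical names, the input is assumed to be a percentage in the 0 to 100 range (the scale of the value returned by chi_square_probability should be checked before reusing this), and the handling of values exactly on a boundary is an assumption since the ent wording leaves it open.

```rust
/// Verdicts following the ent documentation's reading of the percentage.
/// NOTE: hypothetical type, not part of the unblob-native API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Randomness {
    AlmostCertainlyNotRandom,
    Suspect,
    AlmostSuspect,
    NoEvidenceAgainstRandomness,
}

/// Classify a chi-square percentage (assumed to be in 0.0..=100.0).
pub fn interpret(percentage: f64) -> Randomness {
    // The thresholds are symmetric, so only the distance to the
    // nearest extreme (0% or 100%) matters.
    let tail = percentage.min(100.0 - percentage);
    // Assumption: boundary values (exactly 1, 5 or 10) fall into the
    // less severe bucket.
    if tail < 1.0 {
        Randomness::AlmostCertainlyNotRandom
    } else if tail < 5.0 {
        Randomness::Suspect
    } else if tail < 10.0 {
        Randomness::AlmostSuspect
    } else {
        Randomness::NoEvidenceAgainstRandomness
    }
}
```

With this sketch, `interpret(0.5)` and `interpret(99.5)` both report data that is almost certainly not random, while `interpret(50.0)` reports no evidence against randomness, which is what one would hope to see for encrypted data.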
@vlaci vlaci merged commit 618b011 into main Nov 6, 2024
26 checks passed
@vlaci vlaci deleted the chisquare branch November 6, 2024 15:14