Add chi square calculation function to math module. #69

qkaiser · 2024-10-25T11:34:10Z

Currently, unblob-native exposes shannon_entropy to calculate the entropy of chunks. While valid, this approach is limited: we cannot rely on it to differentiate between compressed and encrypted data streams.

An improved approach involves Chi-square tests. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

According to ent:

> The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated.
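To make the "absolute number" mentioned above concrete, here is a minimal, standard-library-only sketch of the chi-square statistic for a byte stream tested against a uniform distribution over the 256 byte values. The function name `chi_square_statistic` and the choice to return 0.0 for empty input are assumptions made for illustration; this is not the code from this PR, which also converts the statistic into a probability (the commits suggest the statrs crate is used for that step).

```rust
/// Pearson's chi-square statistic of `data` against a uniform distribution
/// over the 256 possible byte values.
///
/// Hypothetical helper for illustration only; it is not the function added
/// by this PR. Returns 0.0 for empty input (an arbitrary choice here).
pub fn chi_square_statistic(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }

    // Count how often each byte value occurs.
    let mut counts = [0u64; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }

    // Under the uniformity hypothesis, every value is expected len / 256 times.
    let expected = data.len() as f64 / 256.0;

    // Sum of (observed - expected)^2 / expected over all 256 bins.
    counts
        .iter()
        .map(|&observed| {
            let diff = observed as f64 - expected;
            diff * diff / expected
        })
        .sum()
}
```

A perfectly flat histogram yields a statistic of 0, while a chunk made of a single repeated byte yields 255 times its length. The percentage quoted by ent is the probability that a chi-square variable with 255 degrees of freedom exceeds this statistic.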

vlaci · 2024-10-29T10:47:46Z

We should update the benchmarks as well. I am curious how the throughput compares between the algorithms.

qkaiser · 2024-10-29T14:33:41Z

> We should update the benchmarks as well. I am curious how the throughput compares between the algorithms.

Done! Throughput is quite similar:

Shannon entropy/1048576 time:   [932.18 µs 953.68 µs 981.25 µs]
                        thrpt:  [1019.1 MiB/s 1.0240 GiB/s 1.0476 GiB/s]


Chi square probability/1048576
                        time:   [921.64 µs 971.12 µs 1.0295 ms]
                        thrpt:  [971.32 MiB/s 1.0056 GiB/s 1.0596 GiB/s]

qkaiser · 2024-10-29T16:53:28Z

~~I'll adapt the Rust test cases to be similar to the ones we do from Python.~~ Done.

vlaci

Very nice!

src/math_tools.rs

Cargo.toml

When using version 1.65, the following error message is received when running cargo:

> error: package `clap_lex v0.7.2` cannot be built because it requires
> rustc 1.74 or newer, while the currently active rustc version is
> 1.65.0
> Either upgrade to rustc 1.74 or newer, or use
> cargo update -p clap_lex@0.7.2 --precise ver
> where `ver` is the latest version of `clap_lex` supporting rustc 1.65.0

When using version 1.74 and statrs we get:

> error: unsupported output in build script of libm v0.2.9:
> cargo::rustc-check-cfg=cfg(assert_no_panic)
> Found a cargo::key=value build directive which is reserved for future use.
> Either change the directive to cargo:key=value syntax (note the single :)
> or upgrade your version of Rust.

Add chi_square_probability function to math_tools module.

This function returns the Chi Square distribution probability. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

The chi-square distribution is calculated for the stream of bytes in the chunk and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated.

The percentage is the only value that is of interest from unblob's perspective, so that's why we only return it.

According to ent doc[0]:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/
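The interpretation rules quoted from the ent documentation can be sketched as a small classifier. The enum `Randomness` and the function `interpret_chi_square_percentage` are hypothetical names, not part of unblob-native. The sketch assumes the percentage is on a 0 to 100 scale as in ent's wording, and since the quote does not say how exact boundary values (1%, 5%, 10%) are treated, they fall into the less suspicious bucket here.

```rust
/// Verdicts following the ent documentation's wording.
/// Hypothetical type for illustration; not part of unblob-native.
#[derive(Debug, PartialEq, Eq)]
pub enum Randomness {
    AlmostCertainlyNotRandom,
    Suspect,
    AlmostSuspect,
    NoEvidenceAgainst,
}

/// Classify a chi-square percentage (assumed to be in 0.0..=100.0).
pub fn interpret_chi_square_percentage(percent: f64) -> Randomness {
    // The rules are symmetric around 50%, so fold the upper tail onto the
    // lower one: `tail` is the distance to the nearest extreme (0% or 100%).
    let tail = percent.min(100.0 - percent);

    if tail < 1.0 {
        // Greater than 99% or less than 1%.
        Randomness::AlmostCertainlyNotRandom
    } else if tail < 5.0 {
        // Between 99% and 95%, or between 1% and 5%.
        Randomness::Suspect
    } else if tail < 10.0 {
        // Between 90% and 95%, or between 5% and 10%.
        Randomness::AlmostSuspect
    } else {
        // The quoted thresholds raise no suspicion in this range.
        Randomness::NoEvidenceAgainst
    }
}
```

If the function added by this PR returns a value on a different scale (for example 0.0 to 1.0), it would need to be rescaled before being passed to a classifier like this one.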

qkaiser added the enhancement and rust labels Oct 25, 2024

qkaiser requested a review from vlaci October 25, 2024 11:34

qkaiser self-assigned this Oct 25, 2024

qkaiser mentioned this pull request Oct 25, 2024

Add Chi square measure to EntropyReport onekey-sec/unblob#993

Closed

qkaiser force-pushed the chisquare branch 4 times, most recently from bec8800 to 96d7d42 Compare October 27, 2024 13:33

qkaiser force-pushed the chisquare branch 2 times, most recently from b721665 to 816251a Compare October 29, 2024 13:04

qkaiser force-pushed the chisquare branch from 816251a to af4bf51 Compare October 30, 2024 08:37

vlaci requested changes Nov 6, 2024


qkaiser force-pushed the chisquare branch from af4bf51 to 4ec28e2 Compare November 6, 2024 13:22

qkaiser added 5 commits November 6, 2024 14:26

chore(deps): add statrs 0.17.1

edec78f

chore(deps): run pdm lock --refresh

04deb66

tests(math): drop assert_relative_eq in favor of assert_eq

c4a11cf

qkaiser force-pushed the chisquare branch from 4ec28e2 to c4a11cf Compare November 6, 2024 13:26

vlaci approved these changes Nov 6, 2024


vlaci merged commit 618b011 into main Nov 6, 2024
26 checks passed

vlaci deleted the chisquare branch November 6, 2024 15:14
