feat(math): add χ² probability and convert EntropyReport to RandomnessReport

Introduce another randomness measure based on Chi Square probability by
using unblob-native's chi_square_probability function. This function
returns the Chi Square distribution probability.

Chi-square tests are effective for distinguishing compressed from
encrypted data because they evaluate the uniformity of byte
distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values because
some structure survives compression (albeit less detectably), resulting
in a non-uniform distribution. Encrypted data, by contrast, exhibits
nearly perfect uniformity: each byte value from 0 to 255 is expected to
appear with almost equal frequency, which leaves no discernible
pattern.
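
As a rough illustration of the point above, the following sketch builds
a deliberately skewed byte sample and measures it both ways. The helper
names and the synthetic sample are assumptions made for this example;
they are not code from unblob or unblob-native.

```python
import math
from collections import Counter


def shannon_entropy_percent(data: bytes) -> float:
    """Shannon entropy of the byte distribution, as a percentage of 8 bits."""
    total = len(data)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )
    return 100.0 * entropy / 8.0


def chi_square_statistic(data: bytes) -> float:
    """Pearson chi-square statistic against a uniform byte distribution."""
    expected = len(data) / 256
    counts = Counter(data)  # missing byte values count as 0
    return sum((counts[value] - expected) ** 2 / expected for value in range(256))


# Hypothetical sample: the lower 128 byte values occur 120 times each,
# the upper 128 only 80 times each (a 20% skew around the mean of 100).
skewed = bytes(
    value for value in range(256) for _ in range(120 if value < 128 else 80)
)

print(f"Shannon entropy: {shannon_entropy_percent(skewed):.2f}%")
print(f"Chi-square statistic: {chi_square_statistic(skewed):.1f}")
```

Shannon entropy stays above 99% for this sample, while the chi-square
statistic lands far above 255, the value it averages for truly random
data with 255 degrees of freedom.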

The chi-square statistic is calculated for the stream of bytes in the
chunk and expressed as an absolute number and a percentage which
indicates how frequently a truly random sequence would exceed the value
calculated. The percentage is the only value of interest from unblob's
perspective, so it is the only one returned.
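
A minimal sketch of that calculation follows. It is not the
unblob-native implementation: the function names are hypothetical, and
the percentage is obtained with the Wilson-Hilferty normal
approximation of the chi-square upper tail, chosen here only because it
needs nothing beyond the standard library.

```python
import math
from collections import Counter

DEGREES_OF_FREEDOM = 255  # 256 possible byte values minus one


def chi_square_statistic(data: bytes) -> float:
    """Pearson chi-square statistic against a uniform byte distribution."""
    if not data:
        raise ValueError("data must not be empty")
    expected = len(data) / 256
    counts = Counter(data)  # missing byte values count as 0
    return sum((counts[value] - expected) ** 2 / expected for value in range(256))


def chi_square_percentage(data: bytes) -> float:
    """Approximate share (in %) of random sequences exceeding the statistic.

    Assumption: uses the Wilson-Hilferty approximation rather than an
    exact evaluation of the chi-square distribution.
    """
    statistic = chi_square_statistic(data)
    variance = 2.0 / (9.0 * DEGREES_OF_FREEDOM)
    cube_root = (statistic / DEGREES_OF_FREEDOM) ** (1.0 / 3.0)
    z = (cube_root - (1.0 - variance)) / math.sqrt(variance)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z)
    return 100.0 * upper_tail
```

A perfectly flat sample such as `bytes(range(256)) * 4` yields a
statistic of 0 and a percentage close to 100, while a constant sample
such as `b"\x00" * 2560` yields a percentage close to 0. Both extremes
fall in the "almost certainly not random" range.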

According to ent doc⁰:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/
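
The thresholds quoted above can be sketched as a small classifier. The
function name, the handling of exact boundary values, and the label for
the middle range are assumptions of this example; neither ent nor
unblob is known to expose such a helper.

```python
def classify_chi_square_percentage(percentage: float) -> str:
    """Map a chi-square percentage to the wording used in the ent documentation."""
    if not 0.0 <= percentage <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if percentage > 99.0 or percentage < 1.0:
        return "almost certainly not random"
    if percentage >= 95.0 or percentage <= 5.0:
        return "suspect"
    if percentage >= 90.0 or percentage <= 10.0:
        return "almost suspect"
    # Label chosen for this sketch; the quote does not name this range.
    return "no indication of non-randomness"


print(classify_chi_square_percentage(0.23))   # almost certainly not random
print(classify_chi_square_percentage(52.76))  # no indication of non-randomness
```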

This randomness measure is introduced by modifying the EntropyReport
class so that it contains two RandomnessMeasurements:
- shannon: for Shannon entropy, which was already there
- chi_square: for Chi Square probability, which we introduce
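
The resulting shape can be pictured roughly as below. This is a sketch
with plain dataclasses, not the actual class definitions from unblob:
the field names `percentages` and `block_size` and the derived
properties are assumptions inferred from the values shown in the debug
log (block_size, highest, lowest, mean).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RandomnessMeasurements:
    """Per-block percentages for one measure (hypothetical field layout)."""

    percentages: List[float]
    block_size: int

    def __post_init__(self) -> None:
        if not self.percentages:
            raise ValueError("at least one block percentage is required")

    @property
    def mean(self) -> float:
        return sum(self.percentages) / len(self.percentages)

    @property
    def highest(self) -> float:
        return max(self.percentages)

    @property
    def lowest(self) -> float:
        return min(self.percentages)


@dataclass(frozen=True)
class RandomnessReport:
    """Groups both measures computed for one unknown chunk."""

    shannon: RandomnessMeasurements
    chi_square: RandomnessMeasurements


report = RandomnessReport(
    shannon=RandomnessMeasurements([99.98, 99.99], block_size=0x20000),
    chi_square=RandomnessMeasurements([3.17, 97.88], block_size=0x20000),
)
print(report.chi_square.lowest, report.chi_square.highest)
```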

EntropyReport is renamed to RandomnessReport to reflect that not all
measurements are entropy.

format_entropy_plot has been adjusted to display two lines within the
graph: one for Shannon, the other for Chi Square.

This commit breaks the previous API by converting
entropy_depth and entropy_plot to randomness_depth and randomness_plot
in ExtractionConfig. The '--entropy-depth' CLI option is replaced by
'--randomness-depth'.
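
For callers that still build their configuration with the old keyword
names, the rename could be bridged along these lines. The helper
`migrate_config_kwargs` is hypothetical and not part of unblob; only
the old and new option names are taken from this commit.

```python
from typing import Any, Dict

# Old ExtractionConfig keyword -> new keyword, as renamed by this commit.
RENAMED_OPTIONS = {
    "entropy_depth": "randomness_depth",
    "entropy_plot": "randomness_plot",
}


def migrate_config_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with the pre-rename option names translated."""
    migrated: Dict[str, Any] = {}
    for name, value in kwargs.items():
        new_name = RENAMED_OPTIONS.get(name, name)
        if new_name in migrated:
            raise ValueError(f"option {new_name!r} given under both old and new names")
        migrated[new_name] = value
    return migrated


old_style = {"extract_root": "/tmp/out", "entropy_depth": 0, "entropy_plot": False}
print(migrate_config_kwargs(old_style))
```

The translated dictionary uses the keyword names that ExtractionConfig
accepts after this change, as seen in the updated tests.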
qkaiser committed Nov 5, 2024
1 parent e9108ab commit 42ad2a6
Showing 11 changed files with 220 additions and 159 deletions.
104 changes: 53 additions & 51 deletions docs/guide.md
@@ -114,10 +114,10 @@ $ cat alpine-report.json
]
```

### Entropy calculation
### Randomness calculation

If you are analyzing an unknown file format, it might be useful to know the
entropy of the contained files, so you can quickly see for example whether the
randomness of the contained files, so you can quickly see for example whether the
file is **encrypted** or contains some random content.

Let's make a file with fully random content at the start and end:
@@ -128,59 +128,61 @@ $ dd if=/dev/random of=random2.bin bs=10M count=1
$ cat random1.bin alpine-minirootfs-3.16.1-x86_64.tar.gz random2.bin > unknown-file
```

A nice ASCII entropy plot is drawn on verbose level 3:
A nice ASCII randomness plot is drawn on verbose level 3:

```console
$ unblob -vvv unknown-file | grep -C 15 "Entropy distribution"

2022-07-30 07:58.16 [debug ] Ended searching for chunks all_chunks=[0xa00000-0xc96196] pid=19803
2022-07-30 07:58.16 [debug ] Removed inner chunks outer_chunk_count=1 pid=19803 removed_inner_chunk_count=0
2022-07-30 07:58.16 [warning ] Found unknown Chunks chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=19803
2022-07-30 07:58.16 [info ] Extracting unknown chunk chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=19803
2022-07-30 07:58.16 [debug ] Carving chunk path=unknown-file_extract/0-10485760.unknown pid=19803
2022-07-30 07:58.16 [debug ] Calculating entropy for file path=unknown-file_extract/0-10485760.unknown pid=19803 size=0xa00000
2022-07-30 07:58.16 [debug ] Entropy calculated highest=99.99 lowest=99.98 mean=99.98 pid=19803
2022-07-30 07:58.16 [warning ] Drawing plot pid=19803
2022-07-30 07:58.16 [debug ] Entropy chart chart=
Entropy distribution
┌---------------------------------------------------------------------------┐
100┤•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••│
90┤ │
80┤ │
70┤ │
60┤ │
50┤ │
40┤ │
30┤ │
20┤ │
10┤ │
0┤ │
└┬---┬---┬---─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬┘
1 4 7 12 16 20 24 29 33 37 41 46 50 54 59 63 67 71 76 80
[y] entropy % [x] mB
pid=19803
2022-07-30 07:58.16 [info ] Extracting unknown chunk chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=19803
2022-07-30 07:58.16 [debug ] Carving chunk path=unknown-file_extract/13197718-23683478.unknown pid=19803
2022-07-30 07:58.16 [debug ] Calculating entropy for file path=unknown-file_extract/13197718-23683478.unknown pid=19803 size=0xa00000
2022-07-30 07:58.16 [debug ] Entropy calculated highest=99.99 lowest=99.98 mean=99.98 pid=19803
2022-07-30 07:58.16 [warning ] Drawing plot pid=19803
2022-07-30 07:58.16 [debug ] Entropy chart chart=
Entropy distribution
┌---------------------------------------------------------------------------┐
100┤•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••│
90┤ │
80┤ │
70┤ │
60┤ │
50┤ │
40┤ │
30┤ │
20┤ │
10┤ │
0┤ │
└┬---┬---┬---─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬┘
1 4 7 12 16 20 24 29 33 37 41 46 50 54 59 63 67 71 76 80
[y] entropy % [x] mB
2024-10-30 10:52.03 [debug ] Calculating chunk for pattern match handler=arc pid=1963719 real_offset=0x1685f5b start_offset=0x1685f5b
2024-10-30 10:52.03 [debug ] Header parsed header=<arc_head archive_marker=0x1a, header_type=0x1, name=b'8\xa7i&po\xc77\xd5h\x9a\x9d\xf1', size=0x26d171fa, date=0x1bfd, time=0xe03f, crc=-0x3b95, length=0x349997d5> pid=1963719
2024-10-30 10:52.03 [debug ] Ended searching for chunks all_chunks=[0xa00000-0xc96196] pid=1963719
2024-10-30 10:52.03 [debug ] Removed inner chunks outer_chunk_count=1 pid=1963719 removed_inner_chunk_count=0
2024-10-30 10:52.03 [warning ] Found unknown Chunks chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=1963719
2024-10-30 10:52.03 [info ] Extracting unknown chunk chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=1963719
2024-10-30 10:52.03 [debug ] Carving chunk path=unknown-file_extract/0-10485760.unknown pid=1963719
2024-10-30 10:52.03 [debug ] Calculating randomness for file path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug ] Shannon entropy calculated block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug ] Chi square probability calculated block_size=0x20000 highest=97.88 lowest=3.17 mean=52.76 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug ] Entropy chart chart=
Randomness distribution
┌───────────────────────────────────────────────────────────────────────────┐
100┤ •• Shannon entropy (%) •••••••••♰••••••••••••••••••••••••••••••••••│
90┤ ♰♰ Chi square probability (%) ♰ ♰ ♰♰♰♰ ♰ ♰ ♰ │
80┤♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰♰♰♰♰♰♰♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰♰ │
70┤♰♰♰♰ ♰ ♰ ♰ ♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰♰♰♰♰♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰♰ ♰♰♰♰♰♰ │
60┤♰♰♰♰ ♰♰ ♰♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰ ♰ ♰♰♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰♰ │
50┤ ♰♰♰ ♰♰ ♰♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰ ♰ ♰ ♰ ♰♰♰♰♰ ♰ ♰♰♰ ♰ ♰♰♰♰♰ ♰ │
40┤ ♰♰ ♰♰ ♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰ ♰♰ ♰♰ ♰♰♰♰♰♰ ♰ ♰♰♰ ♰ ♰♰♰♰ ♰♰ ♰│
30┤ ♰ ♰♰ ♰♰ ♰♰♰♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰♰♰ ♰ ♰ ♰♰ ♰ ♰♰♰ ♰♰ ♰ │
20┤ ♰♰ ♰♰ ♰♰♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ │
10┤ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰ ♰ ♰♰ │
0┤ ♰ ♰ │
└─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
0 2 5 7 11 16 20 23 27 30 34 38 42 47 51 56 60 63 68 71 76 79
131072 bytes
path=unknown-file_extract/0-10485760.unknown pid=1963719
2024-10-30 10:52.03 [info ] Extracting unknown chunk chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=1963719
2024-10-30 10:52.03 [debug ] Carving chunk path=unknown-file_extract/13197718-23683478.unknown pid=1963719
2024-10-30 10:52.03 [debug ] Calculating randomness for file path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug ] Shannon entropy calculated block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug ] Chi square probability calculated block_size=0x20000 highest=99.03 lowest=0.23 mean=42.62 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug ] Entropy chart chart=
Randomness distribution
┌───────────────────────────────────────────────────────────────────────────┐
100┤ •• Shannon entropy (%) •••••••••••••••••••••♰••••••••••••••••••••••│
90┤ ♰♰ Chi square probability (%) ♰ ♰♰ ♰ │
80┤♰♰ ♰♰ ♰♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰♰ │
70┤♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰♰ ♰♰♰ ♰ ♰♰ ♰♰ │
60┤ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰♰♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰♰ ♰♰ ♰ ♰ ♰♰ ♰ │
50┤ ♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰ ♰♰♰ ♰ ♰ ♰ ♰♰♰ ♰♰ ♰ ♰ ♰♰ ♰♰ ♰ │
40┤ ♰♰♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰ ♰♰♰ ♰♰♰ ♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰ ♰♰ │
30┤ ♰♰♰♰ ♰♰ ♰♰ ♰♰ ♰♰ ♰♰ ♰♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰│
20┤ ♰♰♰ ♰ ♰ ♰♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰♰ ♰ ♰♰ ♰♰ ♰ ♰ │
10┤ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰ │
0┤ ♰ ♰ ♰♰ ♰ ♰♰ │
└─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
0 2 5 7 11 16 20 23 27 30 34 38 42 47 51 56 60 63 68 71 76 79
131072 bytes
```

### Skip extraction with file magic
4 changes: 2 additions & 2 deletions fuzzing/search_chunks_fuzzer.py
@@ -40,8 +40,8 @@ def test_search_chunks(data):
config = ExtractionConfig(
extract_root=Path("/dev/shm"), # noqa: S108
force_extract=True,
entropy_depth=0,
entropy_plot=False,
randomness_depth=0,
randomness_plot=False,
skip_magic=[],
skip_extension=[],
skip_extraction=False,
8 changes: 4 additions & 4 deletions tests/test_cleanup.py
@@ -50,7 +50,7 @@ def test_remove_extracted_chunks(input_file: Path, output_dir: Path):
input_file.write_bytes(ZIP_BYTES)
config = ExtractionConfig(
extract_root=output_dir,
entropy_depth=0,
randomness_depth=0,
)

all_reports = process_file(config, input_file)
@@ -62,7 +62,7 @@ def test_keep_all_problematic_chunks(input_file: Path, output_dir: Path):
input_file.write_bytes(DAMAGED_ZIP_BYTES)
config = ExtractionConfig(
extract_root=output_dir,
entropy_depth=0,
randomness_depth=0,
)

all_reports = process_file(config, input_file)
@@ -75,7 +75,7 @@ def test_keep_all_unknown_chunks(input_file: Path, output_dir: Path):
input_file.write_bytes(b"unknown1" + ZIP_BYTES + b"unknown2")
config = ExtractionConfig(
extract_root=output_dir,
entropy_depth=0,
randomness_depth=0,
)

all_reports = process_file(config, input_file)
@@ -97,7 +97,7 @@ def test_keep_chunks_with_null_extractor(input_file: Path, output_dir: Path):
input_file.write_bytes(b"some text")
config = ExtractionConfig(
extract_root=output_dir,
entropy_depth=0,
randomness_depth=0,
handlers=(_HandlerWithNullExtractor,),
)
all_reports = process_file(config, input_file)
8 changes: 4 additions & 4 deletions tests/test_cli.py
@@ -184,7 +184,7 @@ def test_dir_for_file(tmp_path: Path):


@pytest.mark.parametrize(
"params, expected_depth, expected_entropy_depth, expected_process_num, expected_verbosity, expected_progress_reporter",
"params, expected_depth, expected_randomness_depth, expected_process_num, expected_verbosity, expected_progress_reporter",
[
pytest.param(
[],
@@ -233,7 +233,7 @@ def test_dir_for_file(tmp_path: Path):
def test_archive_success(
params,
expected_depth: int,
expected_entropy_depth: int,
expected_randomness_depth: int,
expected_process_num: int,
expected_verbosity: int,
expected_progress_reporter: Type[ProgressReporter],
@@ -263,8 +263,8 @@ def test_archive_success(
config = ExtractionConfig(
extract_root=tmp_path,
max_depth=expected_depth,
entropy_depth=expected_entropy_depth,
entropy_plot=bool(expected_verbosity >= 3),
randomness_depth=expected_randomness_depth,
randomness_plot=bool(expected_verbosity >= 3),
process_num=expected_process_num,
handlers=BUILTIN_HANDLERS,
verbose=expected_verbosity,