Implement Custom UTF-8 Decoder #885

Arker123 · 2023-09-26T06:51:41Z

This PR introduces a custom UTF-8 string decoder to replace the existing binary2strings conversion.
Reference: https://en.wikipedia.org/wiki/UTF-8
Fixes: #867

williballenthin · 2023-09-26T08:51:03Z

@Arker123 would you include some performance benchmarks as you propose these changes? eg. does FLOSS performance noticeably change when using this implementation?

don't have to address this immediately, just before we consider it for merging

Arker123 · 2023-09-27T04:22:12Z

For rust-hello64.exe, the current coverage has dropped from 96% to 71%, which I'm investigating. That's the reason this PR is marked as a draft. On the plus side, it correctly extracts the string 'invalid args' as shown in issue #867.

Arker123 · 2023-09-27T04:26:33Z

Additionally, we still need to address the consideration of 'characters' in this implementation.

Source: https://en.wikipedia.org/wiki/UTF-8

Arker123 · 2023-10-02T03:42:15Z

Presently, we have achieved coverage levels of >= 96% in 5 out of 6 test files. However, one file, specifically https://github.com/mandiant/flare-floss-testfiles/blob/master/language/rust/rust-unknown-binaries/bin/1.69.0/i386/200c308a793630e4f3686dd846f0d55b6368834a859875970b4135f3ca487f46, is currently only at 86% coverage. :(

Arker123 · 2023-10-02T05:15:46Z

After conducting further research, I have discovered additional xrefs within the .rdata section of the i386 architecture. The current coverage for the aforementioned file stands at 91%.

You can view the findings here:

mr-tz · 2023-10-09T15:55:34Z

Let's not mix updates to the extraction algorithm and the custom UTF-8 string finder in this PR.

For performance here, let's focus on the time it takes to find just all the UTF-8 strings from files.

Arker123 · 2023-11-10T09:48:10Z

Here are some performance statistics:

File name: 1.59.0_amd64_4d41c02164ee4aeacd5e3fd67ecdc097233b8c6a725fab588d1cdde0461fd8cc
File size: 114.84571838378906 MB
perf: 8.53998875617981 s

File name: 1.49.0_i386_13af976c769e583e63665cc29ddeb6dc1a81c447ce7ea87bc10e01941576318b
File size: 112.57718658447266 MB
perf: 11.280088901519775 s

File name: 1.67.1_amd64_286164c8db3976443ca605209c182e93d2fee5854504d5e9fc32a0b701b78aa8
File size: 125.58065795898438 MB
perf: 20.27586317062378 s

File name: 1.53.0_amd64_6b4292500c9603d0f9647ae53adaa4cbf2840ae243d9efa3f14048f079e2d32a
File size: 65.56964874267578 MB
perf: 4.034202814102173 s

File name: 1.53.0_amd64_63fa08f342be3c2a9c13f3d4c02f5a808913b291d0dc2691424fbb6e50fadb0d
File size: 65.56991577148438 MB
perf: 4.46514892578125 s

File name: 1.65.0_amd64_6f0a952ebd0fd544bf27a13686a0f3494e9be102654a58996bedfe3bcc6f61ab
File size: 83.82809448242188 MB
perf: 9.418390274047852 s

File name: 1.53.0_amd64_bcca0fd4d94764ef5a1ade5e81e749b11d87ab1c6c490d9065040e20da9dcf73
File size: 65.56964874267578 MB
perf: 4.185543060302734 s

File name: 1.59.0_amd64_a09acc46cc64b945ba7e5b1fcd449b6cad435b31d45c7cdc9a4b8dc3c5d4aa04
File size: 115.15755462646484 MB
perf: 8.811774015426636 s

File name: 1.53.0_amd64_15d603b12906f96d4891164febef7b587667363b7c91011ee100c81cad3a0472
File size: 65.56964874267578 MB
perf: 4.438174724578857 s

File name: 1.64.0_amd64_8b4a259da875e73e61b72c87191b8e7ba35c29da6db4fc1f7805c5f5ef93d42e
File size: 74.84070587158203 MB
perf: 5.529693603515625 s

File name: 1.69.0_amd64_c5f8c0a07123fc97c7b673b22b5f54cf415d66454022827f849a76dbe7ab2853
File size: 17.7099609375 MB
perf: 2.7266366481781006 s

File name: 1.69.0_amd64_0eba113dde8ba85afd96116d39444fc4fcc7a280b45d0d15544642f290a5e72a
File size: 31.06982421875 MB
perf: 0.1291041374206543 s

File name: 1.67.1_amd64_a585d992715c9d482fea1d047ea938d884f2888fa5d037aeca8c14ee6f7636d1
File size: 125.58039093017578 MB
perf: 19.494038581848145 s

mr-tz · 2023-11-10T10:05:00Z

Are these the performance stats just for running the UTF-8 extraction code?

Can you clean up the PR to only include those changes, please?

Arker123 · 2023-11-10T11:28:59Z

Are these the performance stats just for running the UTF-8 extraction code?

Yes

Can you clean up the PR to only include those changes, please?

Okay😄

mr-tz · 2023-11-10T14:31:46Z

floss/language/rust/decode_utf8.py

+    raise ValueError("no .rdata section found")
+
+
+def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Tuple[str, int, int]]:


can you add a function that's signature is analog to extract_ascii_strings, i.e. operate on buffers?
this can then be wrapped by this or a similar function.

mr-tz · 2023-11-10T14:32:17Z

floss/language/rust/decode_utf8.py

+
+    for i in range(0, len(strings)):
+        # for 1 byte
+        if strings[i] & 0x80 == 0x00:


please add tests for various encoded strings to show they're properly decoded

… HEAD

floss/language/rust/decode_utf8.py

VascoSch92 · 2024-06-23T15:46:49Z

floss/language/rust/decode_utf8.py

+    return strings
+
+
+def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:


Suggested change

def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:

def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Optional[List[Any]]]:

the return type is wrong

It doesn't pass in mypy like this. Is there another approach?

williballenthin

how does this impact performance?

Co-authored-by: Vasco Schiavo <[email protected]>

Arker123 · 2024-06-24T05:16:45Z

how does this impact performance?

There is an increase in extraction percentage from 88% to 91%.

here is a table for time taken to run tests for rust

Test_name	original_time	current_time
test_language_rust_coverage.py	11.146s	11.637s
test_language_rust_known_binary.py	36.969s	34.953s
test_language_extract_rust.py	4.321s	4.472s
test_utf8_decoder.py	12.366s	13.194s

*Tests are ran on Core i5 12Gen

williballenthin

Please see inline comments.

I think the core algorithm probably works well, but some enhancements to the structure and documentation will ensure it remains maintainable. I like that we'll be able to remove the external dependency!

floss/language/rust/decode_utf8.py

williballenthin · 2024-06-24T07:09:48Z

floss/language/rust/decode_utf8.py

+            temp = buf[i] << 24 | buf[i + 1] << 16 | buf[i + 2] << 8 | buf[i + 3]
+            character = temp.to_bytes(4, "big").decode("utf-8", "ignore")
+            i += 3
+            character_and_index.append([character, i, 4])


what about the other cases? are there any?

either way, please add an else branch to handle them, either logging or doing an assertion.

williballenthin · 2024-06-24T07:12:33Z

floss/language/rust/decode_utf8.py

+        # for 2 bytes
+        elif buf[i] & 0xE0 == 0xC0:
+            temp = buf[i] << 8 | buf[i + 1]
+            character = temp.to_bytes(2, "big").decode("utf-8", "ignore")


why do you use ignore here? wouldn't we want to handle the case that invalid UTF-8 data is encountered (and not extract a string there)?

I assume that your algorithm works pretty well, since you've opened the PR, but I can't quite follow how it works. Would you please add some comments explaining the design, and definitely a few test cases that exercise each of the branch arms?

Hi, the tests for each branch are in tests/test_utf8_decoder.py. Let me know if anything else is required.

williballenthin · 2024-06-24T07:14:07Z

floss/language/rust/decode_utf8.py

+    return strings
+
+
+def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:


Suggested change

def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:

def extract_rdata_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:

tests/test_language_rust_coverage.py

Co-authored-by: Willi Ballenthin <[email protected]>

Arker123 added 2 commits September 26, 2023 12:17

Implement UTF-8 Decoder

cfeb127

Tweaks

e083376

Minor changes

4a54532

Arker123 added 2 commits October 2, 2023 10:42

Discovered more i386 xrefs

775f1ce

Enhance percentage extraction to 91% 😄

18e6080

Arker123 mentioned this pull request Nov 9, 2023

More i386 xrefs #899

Open

Merge branch 'master' into utf8_decoder

6d8e314

Clean up

8515899

Arker123 marked this pull request as ready for review November 10, 2023 13:08

mr-tz reviewed Nov 10, 2023

View reviewed changes

Arker123 added 7 commits June 23, 2024 18:57

Added extract_utf8_strings_from_buffer

51525ae

Code style

1f5f3eb

Merge branch 'master' of https://github.com/Arker123/flare-floss into…

3d1093a

… HEAD

Tweaks

a5e46ae

Minor bug

3105843

Tweaks

7481274

Add tests

60b3ca6

Arker123 requested review from mr-tz and williballenthin June 23, 2024 15:27

VascoSch92 reviewed Jun 23, 2024

View reviewed changes

williballenthin reviewed Jun 23, 2024

View reviewed changes

Arker123 and others added 2 commits June 24, 2024 10:22

Update floss/language/rust/decode_utf8.py

272770d

Co-authored-by: Vasco Schiavo <[email protected]>

Tweaks

770955c

williballenthin requested changes Jun 24, 2024

View reviewed changes

Arker123 and others added 2 commits June 26, 2024 09:39

Update tests/test_language_rust_coverage.py

a354b30

Co-authored-by: Willi Ballenthin <[email protected]>

Several Refinements

960f2c0

Arker123 requested a review from williballenthin June 30, 2024 20:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Custom UTF-8 Decoder #885

Implement Custom UTF-8 Decoder #885

Arker123 commented Sep 26, 2023

williballenthin commented Sep 26, 2023

Arker123 commented Sep 27, 2023 •

edited

Loading

Arker123 commented Sep 27, 2023

Arker123 commented Oct 2, 2023

Arker123 commented Oct 2, 2023

mr-tz commented Oct 9, 2023

Arker123 commented Nov 10, 2023

mr-tz commented Nov 10, 2023

Arker123 commented Nov 10, 2023

mr-tz Nov 10, 2023

mr-tz Nov 10, 2023

VascoSch92 Jun 23, 2024

Arker123 Jun 24, 2024

williballenthin left a comment

Arker123 commented Jun 24, 2024

williballenthin left a comment

williballenthin Jun 24, 2024

williballenthin Jun 24, 2024

Arker123 Jun 26, 2024

williballenthin Jun 24, 2024

		raise ValueError("no .rdata section found")


		def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Tuple[str, int, int]]:

		return strings


		def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:

	def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:
	def extract_rdata_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:

Implement Custom UTF-8 Decoder #885

Are you sure you want to change the base?

Implement Custom UTF-8 Decoder #885

Conversation

Arker123 commented Sep 26, 2023

williballenthin commented Sep 26, 2023

Arker123 commented Sep 27, 2023 • edited Loading

Arker123 commented Sep 27, 2023

Arker123 commented Oct 2, 2023

Arker123 commented Oct 2, 2023

mr-tz commented Oct 9, 2023

Arker123 commented Nov 10, 2023

Here are some performance statistics:

mr-tz commented Nov 10, 2023

Arker123 commented Nov 10, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

williballenthin left a comment

Choose a reason for hiding this comment

Arker123 commented Jun 24, 2024

williballenthin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Arker123 commented Sep 27, 2023 •

edited

Loading