-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Custom UTF-8 Decoder #885
base: master
Are you sure you want to change the base?
Conversation
@Arker123 would you include some performance benchmarks as you propose these changes? eg. does FLOSS performance noticeably change when using this implementation? don't have to address this immediately, just before we consider it for merging |
For rust-hello64.exe, the current coverage has dropped from 96% to 71%, which I'm investigating. That's the reason this PR is marked as a draft. On the plus side, it correctly extracts the string 'invalid args' as shown in issue #867. |
Additionally, we still need to address the consideration of 'characters' in this implementation. |
Presently, we have achieved coverage levels of >= 96% in 5 out of 6 test files. However, one file, specifically https://github.com/mandiant/flare-floss-testfiles/blob/master/language/rust/rust-unknown-binaries/bin/1.69.0/i386/200c308a793630e4f3686dd846f0d55b6368834a859875970b4135f3ca487f46, is currently only at 86% coverage. :( |
Let's not mix updates to the extraction algorithm and the custom UTF-8 string finder in this PR. For performance here, let's focus on the time it takes to find just all the UTF-8 strings from files. |
Here are some performance statistics:File name: 1.59.0_amd64_4d41c02164ee4aeacd5e3fd67ecdc097233b8c6a725fab588d1cdde0461fd8cc File name: 1.49.0_i386_13af976c769e583e63665cc29ddeb6dc1a81c447ce7ea87bc10e01941576318b File name: 1.67.1_amd64_286164c8db3976443ca605209c182e93d2fee5854504d5e9fc32a0b701b78aa8 File name: 1.53.0_amd64_6b4292500c9603d0f9647ae53adaa4cbf2840ae243d9efa3f14048f079e2d32a File name: 1.53.0_amd64_63fa08f342be3c2a9c13f3d4c02f5a808913b291d0dc2691424fbb6e50fadb0d File name: 1.65.0_amd64_6f0a952ebd0fd544bf27a13686a0f3494e9be102654a58996bedfe3bcc6f61ab File name: 1.53.0_amd64_bcca0fd4d94764ef5a1ade5e81e749b11d87ab1c6c490d9065040e20da9dcf73 File name: 1.59.0_amd64_a09acc46cc64b945ba7e5b1fcd449b6cad435b31d45c7cdc9a4b8dc3c5d4aa04 File name: 1.53.0_amd64_15d603b12906f96d4891164febef7b587667363b7c91011ee100c81cad3a0472 File name: 1.64.0_amd64_8b4a259da875e73e61b72c87191b8e7ba35c29da6db4fc1f7805c5f5ef93d42e File name: 1.69.0_amd64_c5f8c0a07123fc97c7b673b22b5f54cf415d66454022827f849a76dbe7ab2853 File name: 1.69.0_amd64_0eba113dde8ba85afd96116d39444fc4fcc7a280b45d0d15544642f290a5e72a File name: 1.67.1_amd64_a585d992715c9d482fea1d047ea938d884f2888fa5d037aeca8c14ee6f7636d1 |
Are these the performance stats just for running the UTF-8 extraction code? Can you clean up the PR to only include those changes, please? |
Yes
Okay😄 |
floss/language/rust/decode_utf8.py
Outdated
raise ValueError("no .rdata section found") | ||
|
||
|
||
def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Tuple[str, int, int]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a function that's signature is analog to extract_ascii_strings
, i.e. operate on buffers?
this can then be wrapped by this or a similar function.
floss/language/rust/decode_utf8.py
Outdated
|
||
for i in range(0, len(strings)): | ||
# for 1 byte | ||
if strings[i] & 0x80 == 0x00: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add tests for various encoded strings to show they're properly decoded
floss/language/rust/decode_utf8.py
Outdated
return strings | ||
|
||
|
||
def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]: | |
def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Optional[List[Any]]]: |
the return type is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't pass in mypy like this. Is there another approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this impact performance?
Co-authored-by: Vasco Schiavo <[email protected]>
There is an increase in extraction percentage from 88% to 91%. here is a table for time taken to run tests for rust
*Tests are ran on Core i5 12Gen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see inline comments.
I think the core algorithm probably works well, but some enhancements to the structure and documentation will ensure it remains maintainable. I like that we'll be able to remove the external dependency!
floss/language/rust/decode_utf8.py
Outdated
temp = buf[i] << 24 | buf[i + 1] << 16 | buf[i + 2] << 8 | buf[i + 3] | ||
character = temp.to_bytes(4, "big").decode("utf-8", "ignore") | ||
i += 3 | ||
character_and_index.append([character, i, 4]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the other cases? are there any?
either way, please add an else
branch to handle them, either logging or doing an assertion.
# for 2 bytes | ||
elif buf[i] & 0xE0 == 0xC0: | ||
temp = buf[i] << 8 | buf[i + 1] | ||
character = temp.to_bytes(2, "big").decode("utf-8", "ignore") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you use ignore
here? wouldn't we want to handle the case that invalid UTF-8 data is encountered (and not extract a string there)?
I assume that your algorithm works pretty well, since you've opened the PR, but I can't quite follow how it works. Would you please add some comments explaining the design, and definitely a few test cases that exercise each of the branch arms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, the tests for each branch are in tests/test_utf8_decoder.py
. Let me know if anything else is required.
floss/language/rust/decode_utf8.py
Outdated
return strings | ||
|
||
|
||
def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]: | |
def extract_rdata_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]: |
Co-authored-by: Willi Ballenthin <[email protected]>
This PR introduces a custom UTF-8 string decoder to replace the existing binary2strings conversion.
Reference: https://en.wikipedia.org/wiki/UTF-8
Fixes: #867