
Implement Custom UTF-8 Decoder #885

Open · Arker123 wants to merge 18 commits into master
Conversation

Arker123 (Collaborator)

This PR introduces a custom UTF-8 string decoder to replace the existing binary2strings conversion.
Reference: https://en.wikipedia.org/wiki/UTF-8
Fixes: #867

williballenthin (Collaborator)

@Arker123 would you include some performance benchmarks as you propose these changes? E.g., does FLOSS performance noticeably change when using this implementation?

We don't have to address this immediately, just before we consider it for merging.

Arker123 (Collaborator, Author) commented Sep 27, 2023

For rust-hello64.exe, coverage has dropped from 96% to 71%, which I'm investigating; that's why this PR is marked as a draft. On the plus side, it correctly extracts the string 'invalid args', as shown in issue #867.

Arker123 (Collaborator, Author)

Additionally, we still need to address how multi-byte 'characters' are handled in this implementation. For reference, the UTF-8 encoding scheme (source: https://en.wikipedia.org/wiki/UTF-8):

| Code point range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000..U+007F | 0xxxxxxx | | | |
| U+0080..U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800..U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000..U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
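The UTF-8 length ranges can be sanity-checked in a few lines. This is an illustrative snippet (not part of the PR), using Python's own codec as the reference implementation:

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode code point cp in UTF-8, per the standard ranges."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

# cross-check against the stdlib encoder, one code point per range
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
```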

Arker123 (Collaborator, Author) commented Oct 2, 2023

Presently, we have achieved coverage levels of >= 96% in 5 out of 6 test files. However, one file, specifically https://github.com/mandiant/flare-floss-testfiles/blob/master/language/rust/rust-unknown-binaries/bin/1.69.0/i386/200c308a793630e4f3686dd846f0d55b6368834a859875970b4135f3ca487f46, is currently only at 86% coverage. :(

Arker123 (Collaborator, Author) commented Oct 2, 2023

After further research, I discovered additional xrefs within the .rdata section of the i386 binary. Coverage for the aforementioned file now stands at 91%.

You can view the findings here: [screenshots: Annotation 2023-10-02 051352, Annotation 2023-10-02 051425]

mr-tz (Collaborator) commented Oct 9, 2023

Let's not mix updates to the extraction algorithm and the custom UTF-8 string finder in this PR.

For performance here, let's focus on just the time it takes to find all the UTF-8 strings in the files.

Arker123 mentioned this pull request Nov 9, 2023
Arker123 (Collaborator, Author)

Here are some performance statistics:

| File name | Size (MB) | Time (s) |
|---|---|---|
| 1.59.0_amd64_4d41c02164ee4aeacd5e3fd67ecdc097233b8c6a725fab588d1cdde0461fd8cc | 114.85 | 8.54 |
| 1.49.0_i386_13af976c769e583e63665cc29ddeb6dc1a81c447ce7ea87bc10e01941576318b | 112.58 | 11.28 |
| 1.67.1_amd64_286164c8db3976443ca605209c182e93d2fee5854504d5e9fc32a0b701b78aa8 | 125.58 | 20.28 |
| 1.53.0_amd64_6b4292500c9603d0f9647ae53adaa4cbf2840ae243d9efa3f14048f079e2d32a | 65.57 | 4.03 |
| 1.53.0_amd64_63fa08f342be3c2a9c13f3d4c02f5a808913b291d0dc2691424fbb6e50fadb0d | 65.57 | 4.47 |
| 1.65.0_amd64_6f0a952ebd0fd544bf27a13686a0f3494e9be102654a58996bedfe3bcc6f61ab | 83.83 | 9.42 |
| 1.53.0_amd64_bcca0fd4d94764ef5a1ade5e81e749b11d87ab1c6c490d9065040e20da9dcf73 | 65.57 | 4.19 |
| 1.59.0_amd64_a09acc46cc64b945ba7e5b1fcd449b6cad435b31d45c7cdc9a4b8dc3c5d4aa04 | 115.16 | 8.81 |
| 1.53.0_amd64_15d603b12906f96d4891164febef7b587667363b7c91011ee100c81cad3a0472 | 65.57 | 4.44 |
| 1.64.0_amd64_8b4a259da875e73e61b72c87191b8e7ba35c29da6db4fc1f7805c5f5ef93d42e | 74.84 | 5.53 |
| 1.69.0_amd64_c5f8c0a07123fc97c7b673b22b5f54cf415d66454022827f849a76dbe7ab2853 | 17.71 | 2.73 |
| 1.69.0_amd64_0eba113dde8ba85afd96116d39444fc4fcc7a280b45d0d15544642f290a5e72a | 31.07 | 0.13 |
| 1.67.1_amd64_a585d992715c9d482fea1d047ea938d884f2888fa5d037aeca8c14ee6f7636d1 | 125.58 | 19.49 |

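For context, numbers like these can be produced with a small harness along the following lines (a hypothetical sketch; `bench` and the `extractor` argument are my own, not code from this PR):

```python
import pathlib
import time

def bench(path: str, extractor) -> float:
    """Time a single pass of `extractor` over the file's raw bytes."""
    data = pathlib.Path(path).read_bytes()
    t0 = time.perf_counter()
    extractor(data)
    elapsed = time.perf_counter() - t0
    print(f"File name: {pathlib.Path(path).name}")
    print(f"File size: {len(data) / (1024 * 1024):.2f} MB")
    print(f"perf: {elapsed:.3f} s")
    return elapsed
```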

mr-tz (Collaborator) commented Nov 10, 2023

Are these the performance stats just for running the UTF-8 extraction code?

Can you clean up the PR to only include those changes, please?

Arker123 (Collaborator, Author)

> Are these the performance stats just for running the UTF-8 extraction code?

Yes.

> Can you clean up the PR to only include those changes, please?

Okay 😄

Arker123 marked this pull request as ready for review November 10, 2023 13:08
```python
    raise ValueError("no .rdata section found")


def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Tuple[str, int, int]]:
```
Collaborator
Can you add a function whose signature is analogous to extract_ascii_strings, i.e., one that operates on buffers? That function can then be wrapped by this or a similar function.
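One possible shape for that refactor is sketched below. This is illustrative, not the PR's code: the names mirror the PR, but the regex-based scan (matching well-formed UTF-8 sequences, with the 1-byte case narrowed to printable ASCII) is my own stand-in for the PR's decoder.

```python
import re
from typing import List, Tuple

MIN_STR_LEN = 4  # assumed minimum string length

# well-formed UTF-8 byte sequences; 1-byte case narrowed to printable ASCII
UTF8_RUN = re.compile(
    rb"(?:[\x20-\x7e]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"
    rb"|\xed[\x80-\x9f][\x80-\xbf]"
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{3})+"
)

def extract_utf8_strings_from_buffer(buf: bytes, min_length: int = MIN_STR_LEN) -> List[Tuple[str, int, int]]:
    """Operate on a raw buffer, analogous to extract_ascii_strings."""
    results = []
    for m in UTF8_RUN.finditer(buf):
        s = m.group().decode("utf-8")
        if len(s) >= min_length:
            results.append((s, m.start(), m.end()))
    return results

def extract_utf8_strings(pe, min_length: int = MIN_STR_LEN) -> List[Tuple[str, int, int]]:
    """Thin PE wrapper: locate .rdata and delegate to the buffer-based scan."""
    for section in pe.sections:
        if section.Name.rstrip(b"\x00") == b".rdata":
            return extract_utf8_strings_from_buffer(section.get_data(), min_length)
    raise ValueError("no .rdata section found")
```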


```python
for i in range(0, len(strings)):
    # for 1 byte
    if strings[i] & 0x80 == 0x00:
```
Collaborator

please add tests for various encoded strings to show they're properly decoded
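Such tests could exercise each sequence length plus a few malformed inputs, along these lines (illustrative; the stdlib codec stands in here for the custom decoder under test):

```python
# one case per UTF-8 sequence length
CASES = [
    (b"A", "A"),                  # 1 byte,  U+0041
    (b"\xc3\xa9", "é"),           # 2 bytes, U+00E9
    (b"\xe2\x82\xac", "€"),       # 3 bytes, U+20AC
    (b"\xf0\x9f\x98\x80", "😀"),  # 4 bytes, U+1F600
]

def test_valid_sequences():
    for raw, expected in CASES:
        assert raw.decode("utf-8") == expected

def test_invalid_sequences_rejected():
    # truncated sequences and corrupt continuation bytes must not decode
    for raw in (b"\xc3", b"\xc3\x28", b"\xe2\x82", b"\xf0\x9f\x98"):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            continue
        raise AssertionError(f"{raw!r} unexpectedly decoded")
```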

(resolved review thread on floss/language/rust/decode_utf8.py)
```python
    return strings


def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:
```

Suggested change:

```diff
-def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:
+def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Optional[List[Any]]]:
```

The return type is wrong.

Arker123 (Collaborator, Author)

It doesn't pass in mypy like this. Is there another approach?
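One way to sidestep the `List[List[Any]]` vs. `List[Optional[List[Any]]]` debate entirely (a suggestion on my part, not what the PR does) is to return a typed record, which mypy checks cleanly:

```python
from typing import List, NamedTuple, Tuple

class UTF8String(NamedTuple):
    string: str
    offset: int
    size: int  # encoded length in bytes

def as_records(raw: List[Tuple[str, int, int]]) -> List[UTF8String]:
    """Wrap plain (string, offset, size) tuples in the typed record."""
    return [UTF8String(*t) for t in raw]
```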

williballenthin (Collaborator) left a comment

how does this impact performance?

Arker123 (Collaborator, Author)

> how does this impact performance?

There is an increase in extraction percentage from 88% to 91%.

Here is a table of the time taken to run the Rust tests:

| Test name | Original time | Current time |
|---|---|---|
| test_language_rust_coverage.py | 11.146 s | 11.637 s |
| test_language_rust_known_binary.py | 36.969 s | 34.953 s |
| test_language_extract_rust.py | 4.321 s | 4.472 s |
| test_utf8_decoder.py | 12.366 s | 13.194 s |

*Tests were run on a 12th Gen Intel Core i5.

williballenthin (Collaborator) left a comment

Please see inline comments.

I think the core algorithm probably works well, but some enhancements to the structure and documentation will ensure it remains maintainable. I like that we'll be able to remove the external dependency!

(resolved review threads on floss/language/rust/decode_utf8.py)
```python
temp = buf[i] << 24 | buf[i + 1] << 16 | buf[i + 2] << 8 | buf[i + 3]
character = temp.to_bytes(4, "big").decode("utf-8", "ignore")
i += 3
character_and_index.append([character, i, 4])
```
Collaborator

what about the other cases? are there any?

either way, please add an else branch to handle them, either logging or doing an assertion.
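Such an else branch could look like this (an illustrative sketch; the function name and logger are my own, not the PR's):

```python
import logging

logger = logging.getLogger(__name__)

def sequence_length(b: int) -> int:
    """Classify a leading byte; 0 means it cannot start a UTF-8 sequence."""
    if b & 0x80 == 0x00:
        return 1
    elif b & 0xE0 == 0xC0:
        return 2
    elif b & 0xF0 == 0xE0:
        return 3
    elif b & 0xF8 == 0xF0:
        return 4
    else:
        # continuation bytes (0b10xxxxxx) and invalid leads (0xF8-0xFF)
        logger.debug("not a UTF-8 leading byte: 0x%02x", b)
        return 0
```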

```python
# for 2 bytes
elif buf[i] & 0xE0 == 0xC0:
    temp = buf[i] << 8 | buf[i + 1]
    character = temp.to_bytes(2, "big").decode("utf-8", "ignore")
```
Collaborator

Why do you use "ignore" here? Wouldn't we want to handle the case where invalid UTF-8 data is encountered (and not extract a string there)?

I assume that your algorithm works pretty well, since you've opened the PR, but I can't quite follow how it works. Would you please add some comments explaining the design, and definitely a few test cases that exercise each of the branch arms?
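What strict handling could look like (a sketch of the alternative being suggested, not the PR's code): decode with the default strict error handler and treat failure as "no string here".

```python
from typing import Optional

def decode_strict(chunk: bytes) -> Optional[str]:
    """Return the decoded character(s), or None if chunk is not valid UTF-8."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None
```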

Arker123 (Collaborator, Author)

Hi, the tests for each branch are in tests/test_utf8_decoder.py. Let me know if anything else is required.

```python
    return strings


def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:
```
Collaborator

Suggested change:

```diff
-def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:
+def extract_rdata_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[List[Any]]:
```

(resolved review thread on tests/test_language_rust_coverage.py)
Arker123 requested a review from williballenthin June 30, 2024 20:34
Successfully merging this pull request may close these issues.

Improve b2s or replace with custom code (unneeded wide strings support)