
Modify entropy calculation to use mmap instead of read_bytes() to reduce memory usage for large files #138

Open. Wants to merge 1 commit into base: master.
Conversation

@eclipsotic (Contributor)

This is primarily relevant when extracting very large (e.g. 32GB+) firmware images.

There are other uses of read_bytes() and read() in the code that could be refactored to use mmap, but this is the only one that seemed strictly necessary. If you guys are interested in changing the other instances, I'd be glad to look into it as part of this PR.

@maringuu (Collaborator)

Can you explain why this is an improvement? While I agree with you in general that memory mapping is a good idea (e.g. unblob does it this way), I do not see how it is an improvement here.
Currently the file is read into memory, which is also what memory mapping does. How is this different?

@jstucke (Collaborator) commented Aug 15, 2024

I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy. However, it is part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first.

@eclipsotic (Contributor, Author)

> Can you explain why this is an improvement? While I agree with you in general that memory mapping is a good idea (e.g. unblob does it this way), I do not see how it is an improvement here. Currently the file is read into memory, which is also what memory mapping does. How is this different?

Memory mapping does not read the entire file into memory; that's the crucial difference. mmap uses lazy loading, causing overall RAM usage to be much, much lower than reading the full file into memory. As a result, switching to mmap greatly reduces the RAM requirements when running the extractor on large (e.g. 32GB+) firmware images.
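As a rough illustration of the substitution being discussed (this is a sketch, not code from the PR; `avg_entropy` is passed in as a parameter here and is assumed to accept any buffer-like object such as `bytes`, `mmap`, or `memoryview`):

```python
import mmap
from pathlib import Path


def entropy_of_file(path: Path, avg_entropy) -> float:
    # Before (hypothetical original): the whole file is materialized in RAM:
    #     return avg_entropy(path.read_bytes())
    #
    # After: the kernel maps the file and pages it in lazily on access,
    # so resident memory stays much lower for very large files.
    with path.open('rb') as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return avg_entropy(mapped)
```

Since `mmap` objects support the buffer protocol and sequence-style access, most code that only slices or scans the input works unchanged.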

@eclipsotic (Contributor, Author)

> I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy. However, it is part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first.

This PR produces that behavior (i.e. chunks are only read as they are accessed). What you're describing is what mmap does, which is why I wanted to switch to it 😄

@jstucke (Collaborator) commented Aug 16, 2024

>> I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy. However, it is part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first.
>
> This PR produces that behavior (i.e. chunks are only read as they are accessed). What you're describing is what mmap does, which is why I wanted to switch to it 😄

That does not seem to be entirely correct. I just tested all variants: mmap only uses less memory if you read just a chunk of the file. If you read the entire file chunk by chunk through the mapping, you end up using the same amount of memory as when reading in the entire file up front (the only difference is that usage increases gradually as the file is lazily read, instead of all at once). Only reading the file chunk-wise into the same buffer really seems to reduce the memory footprint.
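The last approach described above (a single reusable buffer) could look roughly like the following sketch. It is not code from the PR or from common_helper_unpacking_classifier; the chunk size is an illustrative value, and the entropy computed here is plain Shannon entropy in bits per byte:

```python
import math
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # illustrative 1 MiB buffer, not a value from the PR


def chunked_entropy(path: Path) -> float:
    """Shannon entropy (bits per byte) of a file, read chunk by chunk
    into one preallocated buffer so peak memory stays near CHUNK_SIZE
    regardless of file size."""
    counts = [0] * 256
    total = 0
    buffer = bytearray(CHUNK_SIZE)
    with path.open('rb') as fp:
        while True:
            n = fp.readinto(buffer)  # reuse the same buffer every iteration
            if not n:
                break
            total += n
            for byte in memoryview(buffer)[:n]:
                counts[byte] += 1
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)
```

Because only the byte-frequency table (256 counters) and one buffer are kept between iterations, the memory footprint is bounded, unlike scanning a full mapping where previously touched pages stay resident until the kernel reclaims them.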
