
Modify entropy calculation to use mmap instead of read_bytes() to reduce memory usage for large files #138

Open. Wants to merge 1 commit into base: master.
Conversation

@eclipsotic (Contributor)

This is primarily relevant when extracting very large (e.g. 32GB+) firmware images.

There are other uses of read_bytes() and read() in the code that could be refactored to use mmap, but this is the only one that seemed strictly necessary. If you guys are interested in changing the other instances, I'd be glad to look into it as part of this PR.

@maringuu (Collaborator)

Can you explain why this is an improvement? While I agree with you in general that memory mapping is a good idea (e.g. unblob does it this way), I do not see how it is an improvement here.
Currently the file is read into memory, which is also what memory mapping does. How is this different?

@jstucke (Collaborator) commented Aug 15, 2024

I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy. However, it is part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first.

@eclipsotic (Contributor, Author)

> Can you explain why this is an improvement? While I agree with you in general that memory mapping is a good idea (e.g. unblob does it this way), I do not see how it is an improvement here. Currently the file is read into memory, which is also what memory mapping does. How is this different?

Memory mapping does not read the entire file into memory; that's the crucial difference. mmap uses lazy loading, causing overall RAM usage to be much, much lower than reading the full file into memory. As a result, switching to mmap greatly reduces the RAM requirements when running the extractor on large (e.g. 32GB+) firmware images.
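As a rough illustration of the substitution being discussed (this is a sketch, not code from the PR; `avg_entropy` is passed in as a parameter here and is assumed to accept any buffer-like object such as `bytes`, `mmap`, or `memoryview`):

```python
import mmap
from pathlib import Path


def entropy_of_file(path: Path, avg_entropy) -> float:
    # Before (hypothetical original): the whole file is materialized in RAM:
    #     return avg_entropy(path.read_bytes())
    #
    # After: the kernel maps the file and pages it in lazily on access,
    # so resident memory stays much lower for very large files.
    with path.open('rb') as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return avg_entropy(mapped)
```

Since `mmap` objects support the buffer protocol and sequence-style access, most code that only slices or scans the input works unchanged.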

@eclipsotic (Contributor, Author)

> I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy. However, it is part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first.

This PR produces that behavior (i.e. chunks are only read as they are accessed). What you're describing is what mmap does, which is why I wanted to switch to it 😄

@jstucke (Collaborator) commented Aug 16, 2024

>> I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy. However, it is part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first.
>
> This PR produces that behavior (i.e. chunks are only read as they are accessed). What you're describing is what mmap does, which is why I wanted to switch to it 😄

That does not seem to be entirely correct. I just tested all variants: mmap only uses less memory if you read just a chunk of the file. If you read the entire file chunk by chunk through the mapping, you end up using the same amount of memory as when reading in the entire file up front (the only difference is that usage increases gradually as the file is lazily read, instead of all at once). Only reading the file chunk-wise into the same buffer really seems to reduce the memory footprint.
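The last approach described above (a single reusable buffer) could look roughly like the following sketch. It is not code from the PR or from common_helper_unpacking_classifier; the chunk size is an illustrative value, and the entropy computed here is plain Shannon entropy in bits per byte:

```python
import math
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # illustrative 1 MiB buffer, not a value from the PR


def chunked_entropy(path: Path) -> float:
    """Shannon entropy (bits per byte) of a file, read chunk by chunk
    into one preallocated buffer so peak memory stays near CHUNK_SIZE
    regardless of file size."""
    counts = [0] * 256
    total = 0
    buffer = bytearray(CHUNK_SIZE)
    with path.open('rb') as fp:
        while True:
            n = fp.readinto(buffer)  # reuse the same buffer every iteration
            if not n:
                break
            total += n
            for byte in memoryview(buffer)[:n]:
                counts[byte] += 1
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)
```

Because only the byte-frequency table (256 counters) and one buffer are kept between iterations, the memory footprint is bounded, unlike scanning a full mapping where previously touched pages stay resident until the kernel reclaims them.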
