Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading is very slow for larger files #57

Open
jon-zu opened this issue Sep 15, 2024 · 4 comments
Open

Reading is very slow for larger files #57

jon-zu opened this issue Sep 15, 2024 · 4 comments

Comments

@jon-zu
Copy link

jon-zu commented Sep 15, 2024

I'm using the msi crate, but after some profiling, I think this crate might be the better place to report that.

Code:

        let setup_file = BufReader::new(File::open("out_is/Setup.msi").unwrap());
        let mut msi = msi::Package::open(setup_file).unwrap();
        let mut stream = msi.read_stream("Data1.cab").unwrap();
        let mut max = stream.take(1024 * 1024 * 50); 
        let mut out_dir = BufWriter::new(File::create("out_is/Data1.cab").unwrap());
        std::io::copy(&mut max, &mut out_dir).unwrap();

Reading 50 Mb of that cab stream takes roughly one minute on my pc. I've also created a flame graph, but I'm not really sure where the problem lies. Is the loop in Chain::new taking a lot of time maybe?
flamegraph

@mdsteele
Copy link
Owner

Just to check, is this timing coming from a --release build?

@jon-zu
Copy link
Author

jon-zu commented Sep 17, 2024

Just to check, is this timing coming from a --release build?

Yes, I can see later today If I can make a small, reproducible example. I'm making a small tool to extract InstallShield installer, where I extract the .msi from and exe.

@jon-zu
Copy link
Author

jon-zu commented Sep 18, 2024

I've built a small example here. 7z can extract this in a few seconds, It seems the StreamReader is slower than 1mb/s.

MSI file(~1gb): https://drive.google.com/file/d/1TqfK1UDhhv1pbfKZe0yWDAHZ-_azVsoT/view?usp=drive_link

Flamegraph for the example(I've disabled printing for that one):
flamegraph

It seems the svg is not really working here, I've also added It to the repo.

@mdsteele
Copy link
Owner

Hmm...it looks like the current implementation re-creates the Chain struct every time it needs to refill the Stream's internal buffer, and then throws the Chain away again. And Chain::new walks the entire FAT chain to the end, even if the caller only wants to read some piece of data from the beginning or middle of the chain. So we end up with this quadratic behavior where, to read all of a very long stream, we have to walk the same chain sectors over and over and over again.

Fixing this might not be straightforward; IIRC part of the reason the Stream doesn't persist the Chain is that sectors could potentially be added/removed while the Stream is still open, and it should be robust against that.

One possibility to experiment with would be that perhaps the Chain shouldn't be keeping a list of sector_ids at all; it should just walk the FAT chain on demand when instructed to seek to a specific position in the chain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants