Package.getFileData is a bottleneck sometimes #111

Open · arrowd opened this issue Apr 13, 2023 · 7 comments

arrowd (Contributor) commented Apr 13, 2023

While testing my own asgen backend I tried processing a package for the Stellarium project. It took more than 2 hours to process. A little debugging revealed that the package

  1. is quite large by itself,
  2. has a gazillion files, and
  3. has a lot of Qt translation files.

When asgen runs compose on the package, its callbacks repeatedly call Package.getFileData(), which is implemented in the following way for all backends:

    override
    const(ubyte[]) getFileData (string fname)
    {
        if (!pkgArchive.isOpen)
            pkgArchive.open (this.getFilename);

        return pkgArchive.readData(fname);
    }

Reading the ArchiveDecompressor.readData() code reveals that it iterates over the whole archive looking for the requested file. This results in O(n*m) complexity, where n is the number of files in the package and m is the number of translation files.

This clearly needs some caching/optimization, which could be made backend-agnostic. But since this is the first time I'm working with D, I decided to report the issue first and hear your options or suggestions on that.
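
To illustrate what a backend-agnostic cache could look like, here is a minimal sketch (the fileDataCache field is hypothetical and not part of the current code; note it only helps when the same file is requested more than once, it does not remove the first full scan per distinct file):

    // sketch only: memoize extracted file data; fileDataCache is hypothetical
    private const(ubyte)[][string] fileDataCache;

    override
    const(ubyte[]) getFileData (string fname)
    {
        // serve repeated requests from memory, without another archive scan
        if (auto cached = fname in fileDataCache)
            return *cached;

        if (!pkgArchive.isOpen)
            pkgArchive.open (this.getFilename);

        auto data = pkgArchive.readData (fname);
        fileDataCache[fname] = data;
        return data;
    }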

ximion (Owner) commented Apr 14, 2023

That's the annoying thing with tar files: They don't have an index.
So far this hasn't been much of an issue, as there are only a few large packages, but since we have now started to parse tons of translations it will probably indeed become an issue.

If libarchive has the right API for that (which I assume it does), we could read through the archive in its entirety once and create a hash map with the right offsets. We could then immediately jump to the correct positions.
This would slow down processing slightly for shorter archives, but massively increase speed for large ones.
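
For illustration, a rough D sketch of that single-pass idea (it stores the entry data itself rather than raw offsets, since seeking back into a compressed stream is the part that would still need checking; the extern(C) prototypes mirror libarchive's C API, while buildFileIndex and the "wanted" filter are made up for this sketch):

    import std.string : fromStringz, toStringz;

    // minimal extern(C) prototypes mirroring libarchive's C API (sketch only;
    // asgen's own bindings may differ in detail)
    extern (C) {
        struct archive;
        struct archive_entry;
        archive* archive_read_new ();
        int archive_read_support_filter_all (archive*);
        int archive_read_support_format_all (archive*);
        int archive_read_open_filename (archive*, const(char)* fname, size_t blockSize);
        int archive_read_next_header (archive*, archive_entry**);
        int archive_read_data_skip (archive*);
        int archive_read_free (archive*);
        const(char)* archive_entry_pathname (archive_entry*);
        long archive_entry_size (archive_entry*);
        ptrdiff_t archive_read_data (archive*, void* buf, size_t len);
    }
    enum ARCHIVE_OK = 0;

    // hypothetical helper: walk the archive exactly once and keep the contents
    // of the entries we expect to be asked for (e.g. *.qm translations)
    ubyte[][string] buildFileIndex (string archiveFname, bool delegate (string) wanted)
    {
        ubyte[][string] index;

        auto ar = archive_read_new ();
        scope (exit) archive_read_free (ar);
        archive_read_support_filter_all (ar);
        archive_read_support_format_all (ar);
        if (archive_read_open_filename (ar, archiveFname.toStringz, 16 * 1024) != ARCHIVE_OK)
            return index;

        archive_entry* entry;
        while (archive_read_next_header (ar, &entry) == ARCHIVE_OK) {
            immutable fname = archive_entry_pathname (entry).fromStringz.idup;
            if (!wanted (fname)) {
                archive_read_data_skip (ar); // cheaply skip data we don't need
                continue;
            }

            auto data = new ubyte[cast(size_t) archive_entry_size (entry)];
            size_t done = 0;
            while (done < data.length) {
                immutable r = archive_read_data (ar, data.ptr + done, data.length - done);
                if (r <= 0)
                    break;
                done += cast(size_t) r;
            }
            index[fname] = data[0 .. done];
        }

        return index;
    }

getFileData() could then be answered with a plain hash lookup, so the m translation lookups would cost one archive pass plus m O(1) map hits instead of m full scans.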

arrowd (Contributor, Author) commented Apr 14, 2023

Or maybe just extract the archive if we detect that getFileData() is called too often?

ximion (Owner) commented Apr 14, 2023

Skipping over large data chunks is waaaaaay faster than extracting anything and dealing with disk write and read I/O.
Also, what would "too often" even be? ;-)
Creating a hash table for files above a certain size, or even for all of them, makes much more sense :-)

arrowd (Contributor, Author) commented Apr 14, 2023

Should this be done inside ArchiveDecompressor then?

ximion (Owner) commented Apr 14, 2023

I'd say yes, because the data structure needed would hold very specific information about the individual archive.
But I'll have to look again at how the decompressor is used (I haven't looked at that code in detail in ages).

arrowd (Contributor, Author) commented May 3, 2023

> If libarchive has the right API for that (which I assume it does)

I was unable to find such an API. The only function that actually uses an archive_entry returned by archive_read_next_header is archive_read_extract, which extracts data to disk.

arrowd (Contributor, Author) commented May 13, 2023

Maybe we can count the number of translation files on the first readContents() call and then somehow decide if we should extract to a temporary location?
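
A rough sketch of that heuristic (everything here is made up for illustration: the .qm filter, the threshold, and the extractTo call in the usage comment do not exist in asgen in this form):

    import std.algorithm : count, endsWith, filter;

    // sketch only: decide on the first readContents() call whether per-file
    // archive scans are likely to be too slow; the threshold is arbitrary
    enum size_t translationThreshold = 200;

    bool shouldExtractToDisk (string[] contents)
    {
        // Qt translation catalogs are the files compose asks for one by one
        immutable nTranslations = contents.filter!(f => f.endsWith (".qm")).count;
        return nTranslations > translationThreshold;
    }

    // usage sketch inside a backend's readContents():
    //   auto contents = pkgArchive.readContents ();
    //   if (shouldExtractToDisk (contents))
    //       this.extractTo (someTempDir);   // extractTo is hypothetical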

arrowd mentioned this issue Jun 6, 2023