Package.getFileData is a bottleneck sometimes #111

Open · arrowd opened this issue Apr 13, 2023 · 7 comments

arrowd (Contributor) commented Apr 13, 2023

While testing my own asgen backend I tried processing a package for the Stellarium project. It took more than 2 hours to process. A little debugging revealed that the package

  1. is quite large by itself,
  2. has a gazillion files, and
  3. has a lot of Qt translation files.

When asgen runs compose on the package, its callbacks repeatedly call Package.getFileData(), which is implemented in the following way for all backends:

    override
    const(ubyte[]) getFileData (string fname)
    {
        if (!pkgArchive.isOpen)
            pkgArchive.open (this.getFilename);

        return pkgArchive.readData(fname);
    }

Reading the ArchiveDecompressor.readData() code reveals that it iterates over the whole archive looking for the requested file. This results in O(n*m) complexity, where n is the number of files in the package and m is the number of translation files.

This clearly needs some caching/optimization, which could be made backend-agnostic. But since this is the first time I'm working with D, I decided to report the issue first and hear your options or suggestions on that.
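
To illustrate what a backend-agnostic cache could look like, here is a minimal sketch (the fileDataCache field is hypothetical and not part of the current code; note it only helps when the same file is requested more than once, it does not remove the first full scan per distinct file):

    // sketch only: memoize extracted file data; fileDataCache is hypothetical
    private const(ubyte)[][string] fileDataCache;

    override
    const(ubyte[]) getFileData (string fname)
    {
        // serve repeated requests from memory, without another archive scan
        if (auto cached = fname in fileDataCache)
            return *cached;

        if (!pkgArchive.isOpen)
            pkgArchive.open (this.getFilename);

        auto data = pkgArchive.readData (fname);
        fileDataCache[fname] = data;
        return data;
    }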

ximion (Owner) commented Apr 14, 2023

That's the annoying thing with tar files: They don't have an index.
So far this hasn't been much of an issue, as there are only a few large packages, but since we have now started to parse tons of translations it will probably indeed become an issue.

If libarchive has the right API for that (which I assume it does), we could read through the archive in its entirety once and create a hash map with the right offsets. We could then immediately jump to the correct positions.
This would slow down processing slightly for shorter archives, but massively increase speed for large ones.
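
For illustration, a rough D sketch of that single-pass idea (it stores the entry data itself rather than raw offsets, since seeking back into a compressed stream is the part that would still need checking; the extern(C) prototypes mirror libarchive's C API, while buildFileIndex and the "wanted" filter are made up for this sketch):

    import std.string : fromStringz, toStringz;

    // minimal extern(C) prototypes mirroring libarchive's C API (sketch only;
    // asgen's own bindings may differ in detail)
    extern (C) {
        struct archive;
        struct archive_entry;
        archive* archive_read_new ();
        int archive_read_support_filter_all (archive*);
        int archive_read_support_format_all (archive*);
        int archive_read_open_filename (archive*, const(char)* fname, size_t blockSize);
        int archive_read_next_header (archive*, archive_entry**);
        int archive_read_data_skip (archive*);
        int archive_read_free (archive*);
        const(char)* archive_entry_pathname (archive_entry*);
        long archive_entry_size (archive_entry*);
        ptrdiff_t archive_read_data (archive*, void* buf, size_t len);
    }
    enum ARCHIVE_OK = 0;

    // hypothetical helper: walk the archive exactly once and keep the contents
    // of the entries we expect to be asked for (e.g. *.qm translations)
    ubyte[][string] buildFileIndex (string archiveFname, bool delegate (string) wanted)
    {
        ubyte[][string] index;

        auto ar = archive_read_new ();
        scope (exit) archive_read_free (ar);
        archive_read_support_filter_all (ar);
        archive_read_support_format_all (ar);
        if (archive_read_open_filename (ar, archiveFname.toStringz, 16 * 1024) != ARCHIVE_OK)
            return index;

        archive_entry* entry;
        while (archive_read_next_header (ar, &entry) == ARCHIVE_OK) {
            immutable fname = archive_entry_pathname (entry).fromStringz.idup;
            if (!wanted (fname)) {
                archive_read_data_skip (ar); // cheaply skip data we don't need
                continue;
            }

            auto data = new ubyte[cast(size_t) archive_entry_size (entry)];
            size_t done = 0;
            while (done < data.length) {
                immutable r = archive_read_data (ar, data.ptr + done, data.length - done);
                if (r <= 0)
                    break;
                done += cast(size_t) r;
            }
            index[fname] = data[0 .. done];
        }

        return index;
    }

getFileData() could then be answered with a plain hash lookup, so the m translation lookups would cost one archive pass plus m O(1) map hits instead of m full scans.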

arrowd (Contributor, Author) commented Apr 14, 2023

Or maybe just extract the archive if we detect that getFileData() is called too often?

ximion (Owner) commented Apr 14, 2023

Skipping over large data chunks is waaaaaay faster than extracting anything and dealing with disk write and read I/O.
Also, what would "too often" even be? ;-)
Creating a hash table for files above a certain size, or even for all of them, makes much more sense :-)

arrowd (Contributor, Author) commented Apr 14, 2023

Should this be done inside ArchiveDecompressor then?

ximion (Owner) commented Apr 14, 2023

I'd say yes, because the data structure needed would hold very specific information about the individual archive.
But I'll have to look again at how the decompressor is used (I haven't looked at that code in detail in ages).

arrowd (Contributor, Author) commented May 3, 2023

> If libarchive has the right API for that (which I assume it does)

I was unable to find such an API. The only function that actually uses an archive_entry returned by archive_read_next_header is archive_read_extract, which extracts data to disk.

arrowd (Contributor, Author) commented May 13, 2023

Maybe we can count the number of translation files on the first readContents() call and then somehow decide if we should extract to a temporary location?
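
A rough sketch of that heuristic (everything here is made up for illustration: the .qm filter, the threshold, and the extractTo call in the usage comment do not exist in asgen in this form):

    import std.algorithm : count, endsWith, filter;

    // sketch only: decide on the first readContents() call whether per-file
    // archive scans are likely to be too slow; the threshold is arbitrary
    enum size_t translationThreshold = 200;

    bool shouldExtractToDisk (string[] contents)
    {
        // Qt translation catalogs are the files compose asks for one by one
        immutable nTranslations = contents.filter!(f => f.endsWith (".qm")).count;
        return nTranslations > translationThreshold;
    }

    // usage sketch inside a backend's readContents():
    //   auto contents = pkgArchive.readContents ();
    //   if (shouldExtractToDisk (contents))
    //       this.extractTo (someTempDir);   // extractTo is hypothetical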

arrowd mentioned this issue Jun 6, 2023