
read/readall dumps the decompressed files to memory, instead of streaming them #579

Open
jaboja opened this issue Mar 17, 2024 · 4 comments
Labels
enhancement · for extraction · help wanted

Comments

@jaboja

jaboja commented Mar 17, 2024

There is a problem with reading large files whose decompressed form exceeds the available RAM:

The library (namely the read/readall methods) first decompresses the file into memory using BytesIO and then returns that BytesIO object. While that works well for small files, it fails with an out-of-memory error for bigger ones.

It would be better if the library streamed the files, just like standard file IO.

To Reproduce

  1. Download a huge 7z file, e.g. this Wikipedia dump:
wget https://dumps.wikimedia.org/plwiki/20240301/plwiki-20240301-pages-meta-history1.xml-p1p6814.7z
  2. Try to read it:
import py7zr
archive = py7zr.SevenZipFile('plwiki-20240301-pages-meta-history1.xml-p1p6814.7z', mode='r')
for _ignore, content in archive.readall().items():
    print(content.read(10))
  3. If the machine has less than 36 GB of memory, the script will keep allocating memory until it runs out, and then it crashes.

Expected behavior
The library should allocate only as much memory as is really needed for the data requested, and allow files to be streamed even if their decompressed form exceeds the available memory and disk space.

Environment:

  • OS: Ubuntu 22.04.3 LTS
  • Python 3.10.12
  • py7zr version: 0.21.0
  • Disk space: 10 GB
  • Memory: 2 GB

(the Wikipedia dump file used as an example is 246.6 MB in compressed form, and 36 GB when decompressed)

@miurahr added the enhancement, help wanted and for extraction labels on Mar 20, 2024
@miurahr
Owner

miurahr commented Apr 2, 2024

There is a main loop in SevenZipFile#_extract which looks like this:

    for f in self.files:
        if memory_extraction:
            # memory extraction: register an in-memory buffer as the target
            _buf = io.BytesIO()
            self.worker.register_filelike(f.id, MemIO(_buf))
        else:
            # default: register an output file name as the target
            self.worker.register_filelike(f.id, outfilename)

    # the extraction index is now prepared; pass the 7z file pointer
    # and the target path to the worker
    self.worker.extract(self.fp, path, parallel=...)

With this structure, the Worker class creates threads and extracts solid blocks in multiple threads when possible.
To support streaming, this would need to change significantly. If you have an idea for how to improve it, please tell me.

py7zr originally extracted files to the file system; @Zoynels contributed the memory IO feature in #111.
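To make the memory-vs-file distinction concrete, here is a minimal illustration (not py7zr's actual code) of why a BytesIO sink grows with the decompressed size even when decompression itself runs chunk by chunk, while a real file sink keeps memory bounded:

    import io
    import lzma

    # Illustration only (not py7zr internals): the decompressor is fed the
    # compressed data in small chunks, but every decompressed chunk written
    # into a BytesIO sink stays in RAM, so memory grows with the output size.
    def decompress_into(sink, compressed, chunk_size=1 << 16):
        decomp = lzma.LZMADecompressor()
        for i in range(0, len(compressed), chunk_size):
            sink.write(decomp.decompress(compressed[i:i + chunk_size]))

    data = lzma.compress(b"x" * 10_000_000)   # stand-in for a solid block

    mem_sink = io.BytesIO()                   # grows to the full 10 MB output
    decompress_into(mem_sink, data)

    with open("out.bin", "wb") as file_sink:  # memory use stays bounded
        decompress_into(file_sink, data)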

@starkapandan

+1 on this issue. Probably not an easy fix, but on large 7z archives, being able to properly stream the contents is the right approach; otherwise the only alternative is to extract the archive first, which is neither elegant nor quick. The Python tarfile and zipfile libraries, for example, achieve this properly: no memory crash when reading a large file.

[screenshot: the code path where memory fills up]
I don't know how the change needs to be implemented, so I'm leaving that to someone with more experience in this library, but the screenshot above shows the part where it exhausts memory. Normally this is not an issue for regular file handles, since a chunk is read and freed after it is written. The memory-read design, however, feeds a memory stream into the existing decompress function, presumably for simplicity. The problem is that this memory object is, as the name states, always in memory: even if the decompress function does its job correctly (reads and frees its input), the way the memory object is used makes it accumulate everything up to the end, so memory fills up.

The ideal solution for this kind of stream reading is to expose the control the decompress method has over the buffered reader (the one actually reading the 7z source) and build an abstraction over it with a regular BytesIO or another memory stream, so that the end user (the implementer) controls what gets loaded into memory, improving memory efficiency drastically.
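For reference, this is roughly how the stdlib modules mentioned above expose streaming (the archive and member names below are placeholders): zipfile.ZipFile.open, like tarfile.TarFile.extractfile, returns a file-like object that decompresses on demand, so the caller reads it in chunks and memory stays bounded.

    import zipfile

    # For comparison only; "big.zip" and "huge.xml" are placeholder names.
    # ZipFile.open() returns a file-like object that decompresses on demand,
    # so memory stays bounded no matter how large the member is.
    total = 0
    with zipfile.ZipFile("big.zip") as zf:
        with zf.open("huge.xml") as member:        # streaming, not pre-buffered
            while chunk := member.read(1 << 20):   # read 1 MiB at a time
                total += len(chunk)                # stand-in for real processing
    print(total)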

@itanfeng

I'm facing the same OOM problem with large 7z files. Hoping for an improvement!

@miurahr
Owner

miurahr commented Oct 10, 2024

(Quoting the Apr 2, 2024 comment above.)

There is an idea to optionally extend the SevenZipFile#read method to accept a callback that returns a memory IO object py7zr can write into. All memory handling of the output stream would then be delegated to the caller.

I would like to try this in a topic branch, and I hope @itanfeng, @starkapandan and @jaboja can act as testers and reviewers of the change.

This may be a compatibility-breaking change, and I want to deprecate the old read and readall argument API in version 1.0.0.
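As a sketch of what such a callback-based API could look like (hypothetical; the factory argument does not exist in py7zr today), the caller would return a file-like object per member and thereby decide how much ends up in memory:

    import tempfile
    import py7zr

    # Hypothetical sketch of the proposed extension; the "factory" keyword is
    # not part of the current py7zr API. py7zr would call the factory once per
    # archive member and write the decompressed data into whatever file-like
    # object it returns, so the caller controls memory use.
    def make_sink(filename):
        # spill to a temporary file once a member exceeds 64 MiB
        return tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024)

    with py7zr.SevenZipFile("plwiki-20240301-pages-meta-history1.xml-p1p6814.7z", "r") as archive:
        archive.read(factory=make_sink)   # proposed API, not available yet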
