Opening the same file for the second time via SeekableArchive raises a RuntimeError #23

Open
mxmlnkn opened this issue Aug 13, 2022 · 2 comments


mxmlnkn commented Aug 13, 2022

What I'm trying to do is open a file object for a file inside an archive and read from it. When seeking back, I want to reopen it from the start. My first guess was to simply call close on the file object and open it anew, but the second open throws.

The problem seems to be that calling close on the file object for one of the many files inside the archive actually closes the whole archive. This is unexpected, and the missing Archive.close method only adds to the confusion.
According to the source code, the close method actually only calls a deferred close. I'm not sure how well-behaved that is, especially when calling readstream multiple times and having several of those file objects open at the same time ...

Workflow:

import libarchive
a = libarchive.SeekableArchive('single-file.tar.bz2')
files = list(a)
f = a.readstream(files[0].pathname)
print(len(f.read()))  # 4
f.close()
f = a.readstream(files[0].pathname)
f.read()  # RuntimeError
RuntimeError                              Traceback (most recent call last)
<ipython-input-27-571e9fb02258> in <module>
----> 1 f.read()

~/.local/lib/python3.10/site-packages/libarchive/__init__.py in read(self, bytes)
    230             bytes = self.size - self.bytes
    231         # Read requested bytes
--> 232         data = _libarchive.archive_read_data_into_str(self.archive._a, bytes)
    233         self.bytes += len(data)
    234         return data

~/.local/lib/python3.10/site-packages/libarchive/_libarchive.py in archive_read_data_into_str(archive, len)
    583 
    584 def archive_read_data_into_str(archive, len):
--> 585     return __libarchive.archive_read_data_into_str(archive, len)
    586 
    587 def archive_write_data_from_str(archive, str):

RuntimeError: could not read requested data.

Looking at the code, the context manager for EntryReadStream should behave correctly, i.e., not call close. But I can't use it because the required lifetime is longer than a simple with block, and Python unfortunately has no C++-like RAII.
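
For lifetimes longer than a single with block, the standard library's contextlib.ExitStack can hold a context manager open until an explicit close. The following is a minimal sketch, not something tested against python-libarchive: io.BytesIO stands in for the stream returned by readstream, and StreamHolder is a hypothetical name. Whether EntryReadStream's context manager behaves well under this pattern is an open assumption.

```python
import contextlib
import io


class StreamHolder:
    """Keeps a context-managed stream open beyond a single with block."""

    def __init__(self, open_stream):
        # open_stream: zero-argument callable returning a context manager,
        # e.g. (assumption) lambda: a.readstream(files[0].pathname)
        self._stack = contextlib.ExitStack()
        self.stream = self._stack.enter_context(open_stream())

    def close(self):
        # Runs the stream's __exit__, just as leaving a with block would.
        self._stack.close()


# Stand-in for the archive stream; BytesIO is itself a context manager.
holder = StreamHolder(lambda: io.BytesIO(b"foo\n"))
first = holder.stream.read(2)
rest = holder.stream.read()
holder.close()
```

This only moves the point at which __exit__ runs; it does not address the reopen failure itself.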


mxmlnkn commented Aug 13, 2022

Actually, it is even worse. I am not able to open a file inside the archive a second time:

import libarchive
a = libarchive.SeekableArchive('single-file.tar.bz2')
files = list(a)
with a.readstream(files[0].pathname) as f:
    print(f.read(2))  # "fo"
with a.readstream(files[0].pathname) as f:
    print(f.read(2))  # "o\n" would have expected "fo"!!
with a.readstream(files[0].pathname) as f:
    print(f.read(2))  # RuntimeError!!

The RuntimeError is the same as above.
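
One possible workaround, sketched here under assumptions and not verified against python-libarchive, is to create a fresh archive object every time an entry has to be read from the start again. ReopeningEntryReader and FakeArchive are hypothetical names; the sketch only relies on the readstream(pathname) and read() calls shown above, and the fake archive exists solely so the example runs without the library. For compressed archives this would mean decompressing from the beginning on every rewind.

```python
import io


class ReopeningEntryReader:
    """Reads one archive entry and rewinds by reopening the archive."""

    def __init__(self, open_archive, pathname):
        # open_archive: zero-argument callable returning a fresh archive
        # object with a readstream(pathname) method, e.g. (assumption)
        # lambda: libarchive.SeekableArchive('single-file.tar.bz2')
        self._open_archive = open_archive
        self._pathname = pathname
        self.rewind()

    def rewind(self):
        # The previous archive and stream are only dereferenced, not
        # closed, because the close semantics are what is in question.
        self._archive = self._open_archive()
        self._stream = self._archive.readstream(self._pathname)

    def read(self, size=None):
        if size is None:
            return self._stream.read()
        return self._stream.read(size)


class FakeArchive:
    """Stand-in archive so the sketch runs without python-libarchive."""

    def __init__(self, files):
        self._files = files

    def readstream(self, pathname):
        return io.BytesIO(self._files[pathname])


reader = ReopeningEntryReader(lambda: FakeArchive({"bar": b"foo\n"}), "bar")
first = reader.read(2)
reader.rewind()
again = reader.read(2)
remainder = reader.read()
```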

Weirdly enough it even works with an archive with two files where I first open the file appearing later in the archive and then the one appearing before it. But, then again, opening any of the files a second time leads to an exception again.

echo foo > bar
echo foo2 > bar2
tar -cjf two-files.tar.bz2 bar bar2
a = libarchive.SeekableArchive('two-files.tar.bz2')
files = list(a)
print(a.readstream(files[1].pathname).read())  # b'foo2\n'
print(a.readstream(files[0].pathname).read())  # b'foo\n'
print(a.readstream(files[0].pathname).read())  # RuntimeError
print(a.readstream(files[1].pathname).read())  # b'foo2\n'
print(a.readstream(files[1].pathname).read())  # RuntimeError
print(a.readstream(files[1].pathname).read())  # RuntimeError
print(a.readstream(files[0].pathname).read())  # b'foo\n'
print(a.readstream(files[0].pathname).read())  # RuntimeError
print(a.readstream(files[1].pathname).read())  # b'foo2\n'
  • It looks like it is able to reopen the archive when opening a file that comes earlier than the last opened one. But it isn't intelligent enough to reopen the archive when trying to reopen the same file again!

  • Also, the interface for Archive looks even worse according to the readme:

    with libarchive.Archive('my_archive.zip') as a:
        for entry in a:
            if entry.pathname == 'my_file.txt':
                print 'File Contents:', a.read()

    Why would I iterate over the archive to get entries but then call read on the archive rather than on the entry? That is hella confusing.

  • Similarly, why would I want to call readstream( pathname ) on the archive when I already have unique Entry objects with all necessary information? For example, what would happen if there are files with duplicate names? According to TAR, the last appearance of that file should be used but I highly doubt that is what will be returned.

    echo foo > bar
    tar -cf updated-file.tar bar
    echo FOO > bar
    tar -uf updated-file.tar bar
    bzip2 updated-file.tar
    a = libarchive.SeekableArchive('updated-file.tar.bz2')
    files = list(a)
    print([f.pathname for f in files])  # ['bar', 'bar']
    print(a.readstream('bar').read())  # b'foo\n'
    print(a.readstream('bar').read())  # RuntimeError
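
For comparison, Python's standard tarfile module (a different library, used here only to illustrate the "last appearance wins" convention) resolves a duplicate name to its last occurrence, while lookups by member object stay unambiguous. This is a minimal sketch with an in-memory archive; add_bytes is a hypothetical helper.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Append an in-memory file to an open TarFile."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    add_bytes(tar, "bar", b"foo\n")
    add_bytes(tar, "bar", b"FOO\n")  # second copy, similar to `tar -uf`

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    names = tar.getnames()
    # Lookup by name picks the last occurrence of the member.
    by_name = tar.extractfile("bar").read()
    # Lookup by member object reads each copy individually.
    by_entry = [tar.extractfile(m).read() for m in tar.getmembers()]
```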

mxmlnkn changed the title from "API for EntryReadStream.close is confusing because it closes the whole archive" to "Opening the same file for the second time via SeekableArchive raises a RuntimeError" on Aug 13, 2022

kensouchen commented Feb 23, 2023

Hi, Could we reopen this?

I'm still experiencing the same issue with:

# python-libarchive == 4.2.1
print(libarchive.version())  # 3.6.1

The issue is that neither __iter__ nor read()/readstream() behaves correctly.

  1. In __iter__: the self.entries object is not maintained correctly. The code below shows why:
file = libarchive.SeekableArchive('updated-file.tar.bz2')
print([f.pathname for f in file])  # ['bar']
print([f.pathname for f in file])  # ['bar', 'bar']
  2. In read(): it reads from the same Entry, which is already closed. That is why you get the error message:
# RuntimeError: could not read requested data.

==========
The issues above could be fixed by a quick patch, but there is actually another issue in the C library. If you run

print([f.pathname for f in file])  # for the 3rd time:
# Exception: Problem executing function, message is: INTERNAL ERROR: Function 'archive_read_next_header' invoked with archive structure in state 'eof', should be in state 'header/data'.

This might need further investigation.
