EOFError when reading a second file in an archive #240

girardbaptiste · 2020-09-16T13:50:09Z

Describe the bug
The first file read with zip_file.read() is returned correctly, but the second call raises an EOFError.

To Reproduce

    import py7zr

    zip_file = py7zr.SevenZipFile(r"py7zr-0.9.5\tests\data\lzma2_1.7z",
                                  mode='r')

    for file in zip_file.files:
        if not file.emptystream:
            file_dict = zip_file.read(file.filename)
            for line in file_dict[file.filename].readlines():
                print(line)
            file_dict[file.filename].close()

The first file content is printed:
b'#!/usr/bin/env python\n'
b'\n'
b'import sys\n'
b'\n'
b'from py7zr import main\n'
b"if __name__ == '__main__':\n"
b'    sys.exit(main())\n'
b'\n'

But the second returns an EOFError exception:

Traceback (most recent call last):
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 735, in _extract
    self.worker.extract(self.fp, parallel=(not self.password_protected and not self._filePassed))
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 954, in extract
    self.extract_single(fp, self.files, self.src_start, src_end, q)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 1024, in extract_single
    exc_q.put(exc_tuple)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 1015, in extract_single
    if f.crc32 is not None and crc32 != f.crc32:
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 1058, in decompress
    tmp = decompressor.decompress(inp, max_length)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\compressor.py", line 763, in decompress
    folder_data = self.cchain.decompress(data, max_length=max_length)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\compressor.py", line 687, in decompress
    tmp = self._decompress(data, max_length)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\compressor.py", line 668, in _decompress
    raise EOFError
EOFError

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please complete the following information):

OS: Windows 10
Python 3.8.3
py7zr version: 0.9.5

Test data(please attach in the report):
The 7zip file test comes from git repo: py7zr-0.9.5\tests\data\lzma2_1.7z

Additional context


miurahr · 2020-09-17T06:54:25Z

Once SevenZipFile.read() has been called, all the files have been processed and the file pointer is at EOF.
If you want to read again from the start, SevenZipFile.reset() resets the file pointer and the decompressor state.

Caution: if you handle a 1 GB archive and call read(), reset() and read(), you read 2 GB from disk.
It is recommended to call read(file-spec) only once,

for example:

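    import py7zr
    import re

    # Placeholder pattern: replace it with a regex matching the wanted archive members.
    filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
    with py7zr.SevenZipFile('archive.7z', 'r') as archive:
        allfiles = archive.getnames()
        # Keep only the member names that match the pattern.
        selective_files = [f for f in allfiles if filter_pattern.match(f)]
        # A single read() call extracts every selected member in one pass.
        dict_data = archive.read(targets=selective_files)
        for entry in dict_data:
            target_data = dict_data[entry].read()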
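
For the read(), reset(), read() sequence described above, a minimal sketch might look like the following. It assumes a py7zr release from the 0.9.x era discussed in this thread, where SevenZipFile.read() and SevenZipFile.reset() are available; the helper name read_two_members is hypothetical and only serves the illustration.

```python
def read_two_members(archive_path, first_name, second_name):
    """Read two members with separate read() calls, rewinding in between."""
    # Imported lazily so this sketch can be loaded without py7zr installed.
    import py7zr

    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        # The first read() leaves the archive file pointer at EOF.
        first = archive.read(targets=[first_name])[first_name].read()
        # Rewind the file pointer and decompressor state; without this the
        # second read() is the call that fails with EOFError.
        archive.reset()
        second = archive.read(targets=[second_name])[second_name].read()
    return first, second
```

Each reset() makes the next read() decompress the payload again from the start, so when the list of wanted names is known up front, one read(targets=[...]) call with every name is the cheaper option.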

girardbaptiste · 2020-09-17T07:24:15Z

Thank you for this advice and for the time spent on this 7z module. My application provides an interface to access the content of zip files without unzipping everything. The list of files is defined dynamically and the archives are quite big. In the standard zipfile module, a single file can be opened with the archive.open('file_name', 'r') function. Is it not possible to read only some files in the archive without reading all of them? What is the difference between read() and readall() if read() also reads all the files?

miurahr · 2020-09-17T07:49:54Z

7-Zip uses "solid compression": https://en.wikipedia.org/wiki/Solid_compression
That is why read() has to process the whole payload chunk to extract even a part of the archived files.

py7zr's design is that readall() and read(name-spec) both read all chunks, and return either all of the archived files or only some of them.
When read() runs, it reads the payload from the start to the end of a single chunk; data that was not requested is dropped, and only the data of the specified files is returned.

By contrast, the ZIP format is not solid, because it stores each file compressed separately,
so it allows random access to an archived file without reading the other parts.
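
The difference can be illustrated with the standard library alone. This is only an analogy, not the 7z container format or py7zr's internals: the "solid" side is simulated by compressing the concatenated members as one lzma stream, and the names files, sizes and read_from_solid are made up for the example.

```python
import io
import lzma
import zipfile

files = {"a.txt": b"alpha\n" * 1000, "b.txt": b"bravo\n" * 1000}
sizes = {name: len(data) for name, data in files.items()}

# Solid style: concatenate all members and compress them as ONE stream.
solid_blob = lzma.compress(b"".join(files.values()))


def read_from_solid(blob, sizes, wanted):
    """Return (member bytes, number of bytes that had to be decompressed)."""
    offset = 0
    for name, size in sizes.items():
        if name == wanted:
            break
        offset += size
    else:
        raise KeyError(wanted)
    end = offset + sizes[wanted]
    # The stream can only be decoded from its start, so everything before
    # `offset` is decompressed and then thrown away.
    decoded = lzma.LZMADecompressor().decompress(blob, max_length=end)
    return decoded[offset:end], len(decoded)


solid_b, solid_cost = read_from_solid(solid_blob, sizes, "b.txt")

# ZIP style: every member is compressed on its own, so one can be opened alone.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
with zipfile.ZipFile(zip_buf) as zf:
    with zf.open("b.txt") as member:
        zip_b = member.read()
```

Here solid_cost counts the bytes of a.txt as well as b.txt, even though only b.txt was wanted, while the ZIP side opens b.txt directly.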

girardbaptiste · 2020-09-17T08:45:39Z

That's clear. A single file cannot be accessed without reading all the other files.
But is it possible to reset a single file that has already been read, without resetting the complete archive?

For example:

    import py7zr
    import re

    filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
    with py7zr.SevenZipFile('archive.7z', 'r') as archive:
        allfiles = archive.getnames()
        selective_files = [f for f in allfiles if filter_pattern.match(f)]
        dict_data = archive.read(targets=selective_files)
        for entry in dict_data:
            target_data = dict_data[entry]
            for line in target_data.readlines():
                print(line)

            # Candidate ways to rewind this single file (which one, if any, works?):
            # target_data.reset()
            # or
            # target_data.seek(0)
            # or
            # target_data = open(target_data, 'r')

            # and then
            for line in target_data.readlines():
                print(line)

miurahr · 2020-09-17T09:03:07Z

dict_data[entry] in the example is a BytesIO object, so you can do:

    bio = dict_data[entry]
    data = bio.read()
    bio.seek(0)
    data = bio.read()
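
The same rewind can be tried without any archive. In this small sketch a hand-made io.BytesIO stands in for dict_data[entry]; the variable names are illustrative only.

```python
import io

# Stand-in for dict_data[entry]: an in-memory buffer holding one file's bytes.
bio = io.BytesIO(b"first line\nsecond line\n")

first_pass = bio.readlines()   # consumes the buffer up to its end
leftover = bio.read()          # nothing is left to read at this point
bio.seek(0)                    # rewind only this buffer, not the archive
second_pass = bio.readlines()  # the same lines are available again
```

Because the buffer already holds the decompressed bytes in memory, seek(0) does not touch the archive on disk and nothing is decompressed a second time.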

miurahr · 2020-09-19T08:07:48Z

Question is answered.

miurahr added the "for extraction" (Issue on extraction, decompression or decryption) and "invalid" (This doesn't seem right) labels Sep 17, 2020

miurahr added the question Further information is requested label Sep 17, 2020

miurahr closed this as completed Sep 19, 2020

miurahr pinned this issue Oct 28, 2020

Repository owner locked and limited conversation to collaborators Jan 31, 2024

miurahr converted this issue into discussion #573 Jan 31, 2024
