warcio recompress adds WARC-Block-Digest fields to records without one #161

Open
acidus99 opened this issue Jan 7, 2024 · 0 comments

It appears that warcio recompress will add WARC-Block-Digest fields to records that do not already have that field.

The attached ZIP contains 2 WARC files:
example-warcs.zip

In orig.warc, the warcinfo record at the start does not have a WARC-Block-Digest field at all. However, if you run:

warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz

and look at the resulting warcio-recompress.warc, you will see that the warcinfo record now has a WARC-Block-Digest field with a SHA1 hash. (I included a copy of the recompressed WARC in the ZIP.)
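
To make the comparison easier to reproduce, here is a minimal sketch that lists the WARC-Type of every record lacking a WARC-Block-Digest header. It uses only the Python standard library rather than warcio, and the function names (`iter_warc_headers`, `records_without_block_digest`) are hypothetical helpers written for this report. It assumes an uncompressed, well-formed WARC with no folded header lines.

```python
import io


def iter_warc_headers(stream):
    """Yield a dict of header fields for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip over the record block using its declared length.
        stream.seek(int(headers["content-length"]), io.SEEK_CUR)
        yield headers


def records_without_block_digest(stream):
    """Return the WARC-Type of each record that has no WARC-Block-Digest header."""
    return [
        headers.get("warc-type")
        for headers in iter_warc_headers(stream)
        if "warc-block-digest" not in headers
    ]
```

Running `records_without_block_digest(open("orig.warc", "rb"))` and then the same call on the gunzipped warcio-recompress.warc should show whether the set of records without a digest changed.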

While I suppose more digests aren't a bad thing:

  • I would not expect a recompression operation to alter the records in the WARC.
  • This behavior isn't documented.
  • It (very slightly) increases the size of the WARC.

My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.
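
To illustrate what I mean by not altering the records, here is a rough sketch of a recompression that copies each record's bytes verbatim and writes each one as its own gzip member. This is not how warcio implements recompress; `iter_raw_records` and `recompress_unchanged` are hypothetical names, and the code assumes an uncompressed, well-formed input where every record ends with two CRLFs.

```python
import gzip
import io


def iter_raw_records(stream):
    """Yield the exact bytes of each record: headers, block, and trailing CRLFs."""
    while True:
        start = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        length = None
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")
        stream.seek(length, io.SEEK_CUR)  # skip the record block
        stream.readline()  # first CRLF after the block
        stream.readline()  # second CRLF after the block
        end = stream.tell()
        stream.seek(start)
        yield stream.read(end - start)  # leaves the stream positioned at `end`


def recompress_unchanged(src, dst):
    """Write each record from src to dst as a separate gzip member, bytes untouched."""
    for raw in iter_raw_records(src):
        dst.write(gzip.compress(raw))
```

With this approach, gunzipping the output gives back the input byte for byte, so no header such as WARC-Block-Digest can appear that was not already there.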
