Problem: orphan chunks may be left in DuraCloud when a compressed AIP is re-ingested #1713

Open
5 tasks
shij13 opened this issue Oct 22, 2024 · 0 comments

shij13 commented Oct 22, 2024

Expected behaviour
When an AIP is re-ingested, the original AIP is replaced in its entirety in storage.

Current behaviour
We use DuraCloud for storage and store AIPs as 7z packages compressed with bzip2. Some re-ingested AIPs come out smaller than the originals. That seems plausible given that we use the fastest compression level, though is a difference of 869.92 MB too much to chalk up to the compression level?

In some cases, the size drops below a whole-GB boundary (e.g. from 3.05 GB to 2.98 GB, as opposed to from 3.05 GB to 3.01 GB). Since DuraCloud stores packages in 1 GB chunks, the re-ingested AIP then needs fewer chunks than the original, and the surplus chunks of the original AIP are left orphaned in storage instead of the original AIP being replaced entirely.
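The boundary effect can be illustrated with a little arithmetic. This is only a sketch: it assumes a fixed chunk size of exactly 1 GB and that the chunk count is the package size divided by the chunk size, rounded up. `chunk_count` and `orphaned_chunk_count` are hypothetical helpers, not part of DuraCloud or Archivematica, and the real chunk size and units may differ.

```python
import math

# Assumption: content is split into fixed-size chunks of 1 GB.
CHUNK_SIZE_GB = 1.0


def chunk_count(size_gb, chunk_size_gb=CHUNK_SIZE_GB):
    """Return how many chunks a package of the given size would occupy."""
    return math.ceil(size_gb / chunk_size_gb)


def orphaned_chunk_count(original_gb, reingested_gb):
    """Return how many chunks of the original would not be overwritten."""
    return max(0, chunk_count(original_gb) - chunk_count(reingested_gb))


if __name__ == "__main__":
    # 3.05 GB needs 4 chunks, 2.98 GB needs 3: one chunk is left behind.
    print(orphaned_chunk_count(3.05, 2.98))  # 1
    # 3.05 GB and 3.01 GB both need 4 chunks: nothing is left behind.
    print(orphaned_chunk_count(3.05, 3.01))  # 0
```

Under these assumptions, a re-ingest only leaves orphans when the new size needs fewer chunks than the old one, which is why a small size change sometimes triggers the problem and sometimes does not.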

For example:
Original ingest size (2024-03-04): 36155.86 MB (35.3 GB)
  • number of associated chunks per dura-manifest: 38 (dura-chunk-0000 to dura-chunk-0037)
Re-ingest size (2024-04-10): 35450.7 MB (34.62 GB)
  • number of associated chunks per dura-manifest: 37 (dura-chunk-0000 to dura-chunk-0036)
  • number of associated chunks in storage: 38 (dura-chunk-0000 to dura-chunk-0037)

  • last modified date of dura-chunk-0037: 2024-03-04
  • last modified date of all other dura-chunks and dura-manifest: 2024-04-10
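The mismatch in this example can be found mechanically by comparing the chunks listed in the dura-manifest with the chunks actually present in storage. The following is a minimal sketch, assuming both lists have already been retrieved by some means; `chunk_names` and `find_orphans` are hypothetical helpers, and the chunk names follow the form used in this report rather than the full content IDs in storage.

```python
def chunk_names(first, last):
    """Build chunk names in the form used in this report, e.g. dura-chunk-0037."""
    return [f"dura-chunk-{i:04d}" for i in range(first, last + 1)]


def find_orphans(manifest_chunks, stored_chunks):
    """Return stored chunks that the current manifest does not reference."""
    return sorted(set(stored_chunks) - set(manifest_chunks))


# 37 chunks listed in the dura-manifest after re-ingest.
manifest_chunks = chunk_names(0, 36)
# 38 chunks still present in storage.
stored_chunks = chunk_names(0, 37)

print(find_orphans(manifest_chunks, stored_chunks))  # ['dura-chunk-0037']
```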

From the DuraCloud audit log, it looks like the re-ingested AIP overwrites the original AIP in place rather than replacing it, so any additional chunks left over from the original AIP are not accounted for.
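To make the suspected mechanism concrete, here is a toy in-memory model of an overwrite-in-place upload. It is an assumption about the behaviour inferred from the audit log, not DuraCloud or Storage Service code; `upload_chunks` and the `store` dictionary are hypothetical.

```python
def upload_chunks(store, chunk_count, modified):
    """Write chunk_count chunks into store, overwriting same-named chunks only."""
    for i in range(chunk_count):
        store[f"dura-chunk-{i:04d}"] = modified
    store["dura-manifest"] = modified


store = {}
upload_chunks(store, 38, "2024-03-04")  # original ingest: 38 chunks
upload_chunks(store, 37, "2024-04-10")  # re-ingest: one chunk fewer

# Anything whose last modified date differs from the manifest's is left over.
stale = {name: date for name, date in store.items() if date != store["dura-manifest"]}
print(stale)  # {'dura-chunk-0037': '2024-03-04'}
```

In this model nothing ever deletes a chunk, so the final chunk of the larger original survives the re-ingest with its original date, matching the last modified dates observed above.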

Steps to reproduce

  1. Set the processing configuration to "compression algorithm: 7z using bzip2" and "compression level: 1 - fastest level"
  2. Ingest a package that is several GB in size (DuraCloud splits content into 1 GB chunks)
  3. Observe the size, number of chunks, and last modified date associated with the ingested package (e.g. in the browser interface or via the dura-manifest or audit log)
  4. Re-ingest the package using either the metadata-only or partial (normalize for access only) workflow
  5. Observe the size, number of chunks, and last modified date associated with the re-ingested package (e.g. in the browser interface or via the dura-manifest or audit log)

Note: it may take a few tries to produce a package whose size drops across a whole-GB boundary after re-ingest (e.g. from 3.05 GB to 2.98 GB, as opposed to from 3.05 GB to 3.01 GB)

Your environment (version of Archivematica, operating system, other relevant details)

  • Archivematica (AM) 1.14, Storage Service (SS) 0.20
  • DuraCloud 7.1
  • CentOS 7

For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged (if applicable)
  • Details about this issue have been added to the release notes (if applicable)