Wrong package download after on-demand sync #5725

Open · pedro-psb opened this issue Aug 19, 2024 · 5 comments · May be fixed by #6064

@pedro-psb (Member)

Originally reported here.
I'm bringing it here for visibility in the RemoteArtifact bucket of issues.

Version

Confirmed to happen with at least:

  • pulpcore==3.21, pulp_rpm==3.18
  • pulpcore==3.39, pulp_rpm==3.23

Describe the bug

When:

  • Repos A and B sync the same Content (on-demand) from two independent Remotes
  • In Remote B, the Content is replaced (e.g., in RPM: same NEVRA, different checksum) and synced again (on-demand)
  • A request is made for that Content through the Distribution associated with Repo A

Then:

  • Pulp raises a digest validation error and retries the download from the same (wrong) remote
  • The client can't get the content at all

To Reproduce

This is flaky to reproduce. Most likely it depends on the RemoteArtifacts being sorted by uuid4, which is random.
I'll write a Pulp test and share it here to make it easier to re-run.

  1. Serve a signed package in two different "external" (non-Pulp) repos. E.g., download sample, run createrepo and make sure it's reachable from the Pulp instance.
  2. For each of the two repos (see the sketch after this list):
    • Create a Repository (autopublish=True), a Remote (policy=on_demand and the corresponding url) and a Distribution
    • Sync (policy=mirror_content_only)
  3. Change one of the external repos to serve the unsigned version of the package. E.g., delete the signed one, put the unsigned one in its place, run createrepo and sync the corresponding Pulp repo again.
  4. Request the package from the distribution whose package hasn't changed (it should still contain the signed one) and observe the error.
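
For convenience, a hedged sketch of steps 1-2 as a script. It assumes a local Pulp instance with admin credentials and REST endpoint paths matching recent pulpcore/pulp_rpm; the external repo URLs and names are placeholders, and waiting on spawned tasks is omitted.

```python
# Hedged sketch of reproduction steps 1-2 against the Pulp REST API.
# Assumptions: a local Pulp instance at BASE with admin credentials, and
# endpoint paths matching recent pulpcore/pulp_rpm. The external repo URLs
# are placeholders; polling spawned tasks is omitted for brevity.
import requests

BASE = "http://localhost:5001"
AUTH = ("admin", "password")

def post(path, payload):
    r = requests.post(f"{BASE}{path}", json=payload, auth=AUTH)
    r.raise_for_status()
    return r.json()

for name, url in [("repo-a", "http://example.com/external-a/"),
                  ("repo-b", "http://example.com/external-b/")]:
    repo = post("/pulp/api/v3/repositories/rpm/rpm/",
                {"name": name, "autopublish": True})
    remote = post("/pulp/api/v3/remotes/rpm/rpm/",
                  {"name": name, "url": url, "policy": "on_demand"})
    # Returns a task href; a real script would poll it to completion.
    post(repo["pulp_href"] + "sync/",
         {"remote": remote["pulp_href"], "sync_policy": "mirror_content_only"})
    post("/pulp/api/v3/distributions/rpm/rpm/",
         {"name": name, "base_path": name, "repository": repo["pulp_href"]})
```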

Expected behavior

The right package should be delivered to the client, as it's present and reachable in Remote A.

Additional context

  • On previous versions (core/rpm 3.16/3.17), the client would effectively get the wrong package, but that's no longer the case in supported versions.
  • Overview of what is happening in Pulp:

[image: overview diagram]

@lubosmj (Member) commented Sep 5, 2024

We can adjust the order_by_acs() function or add a new sorting (sketched below): 1. an ACS wins if available; 2. then the remote associated with the repo/distribution; 3. the rest is random (randomizing all the remaining remotes).

The discussion around copying content will come.
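
A minimal Python sketch of that ordering, with the two predicates as hypothetical stand-ins; this is not pulpcore's actual order_by_acs() implementation:

```python
import random

def is_acs_remote(remote) -> bool:
    # Hypothetical predicate: is this remote backed by an Alternate Content Source?
    return getattr(remote, "is_acs", False)

def belongs_to(remote, distribution) -> bool:
    # Hypothetical predicate: is this the remote of the distribution's repository?
    return getattr(distribution, "remote_pk", None) == getattr(remote, "pk", None)

def sort_remote_artifacts(remote_artifacts, distribution):
    """Proposed priority: ACS-backed remotes first, then the remote tied to
    the requested distribution, then everything else in random order."""
    acs, associated, rest = [], [], []
    for ra in remote_artifacts:
        if is_acs_remote(ra.remote):
            acs.append(ra)
        elif belongs_to(ra.remote, distribution):
            associated.append(ra)
        else:
            rest.append(ra)
    random.shuffle(rest)
    return acs + associated + rest
```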

@pedro-psb (Member, Author) commented Oct 9, 2024

Hey @dralley,
Just wanted to 'officially' ask you to be my reviewer for the wrong package being served in on-demand.
I see two possible steps on this:

  1. In pulpcore:
    • ensure the server is handling those failures correctly (close the connection, as we can't roll back streamed data; see the sketch after this list).
    • ensure that the client eventually gets the content after retries, even if it hits a 'corrupted' remote first.
    • caveat: the randomization helps with the above, but if there are ACSs and the first ACS server has the wrong binary, the content is unreachable.
  2. In pulp_rpm (independent, for later):
    • This case shows that even after a second sync against the server, we are left with old, unreachable RemoteArtifacts for a package NEVRA if the server changed that package's binary (e.g., signed vs unsigned).
    • Cleaning this up on sync would reduce the number of possibly corrupted remotes being tried in the content handler. Removing things is always delicate, so I thought about marking them as unreachable so we could ignore them when serving content.
    • Just a note: if we didn't sync and the remote changed the package, then we can't do much.
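
A minimal aiohttp-flavored sketch of the first bullet, assuming a streaming handler that receives the chunk iterator and expected digest from its caller; pulpcore's actual content handler differs in detail:

```python
import hashlib

from aiohttp import web

async def stream_remote_artifact(request, chunks, expected_sha256):
    """Once the 200 status and headers are sent, a digest mismatch can no
    longer become an HTTP error; closing the connection is the only way to
    tell the client not to trust the body."""
    response = web.StreamResponse(status=200)
    await response.prepare(request)  # status line + headers hit the wire here
    digest = hashlib.sha256()
    async for chunk in chunks:  # chunks: async iterator of bytes from the remote
        digest.update(chunk)
        await response.write(chunk)
    if digest.hexdigest() != expected_sha256:
        # We can't roll back streamed data; abort so the client sees a
        # failed/truncated transfer instead of a "successful" wrong body.
        response.force_close()
        if request.transport is not None:
            request.transport.close()
        return response
    await response.write_eof()
    return response
```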

Currently I'm working on (1), by looking into HTTP/1.1 and our stream implementation details and writing unit tests using the content.Handler.

edit: In the end, (1) is about fixing #5012 first

@pedro-psb (Member, Author)

An alternative approach to randomization is to add a cache to the ContentArtifact about the health of its remotes. I guess that was mentioned in the f2f but was not very well received.

Anyway, I'll share my naive idea here:

  • If a digest validation error occurs, we add an entry to the cache saying that Remote-N failed for CA-Y at time T.
  • If a future retry falls within a timespan X, we skip that remote. If not, we unmark it and try again.
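
A naive in-process sketch of that cache, assuming an illustrative TTL for the timespan X; a real implementation would more likely live in Redis or on the model:

```python
import time

FAILURE_TTL = 300.0  # seconds; an illustrative value for the timespan "X"

# Hypothetical in-memory cache: (content_artifact_pk, remote_pk) -> failure time T.
_failed_at: dict[tuple[str, str], float] = {}

def mark_failed(ca_pk: str, remote_pk: str) -> None:
    """Record that Remote-N failed digest validation for CA-Y at time T."""
    _failed_at[(ca_pk, remote_pk)] = time.monotonic()

def should_skip(ca_pk: str, remote_pk: str) -> bool:
    """Skip a remote while its failure is fresher than the TTL; otherwise
    clear the mark so the remote gets retried."""
    t = _failed_at.get((ca_pk, remote_pk))
    if t is None:
        return False
    if time.monotonic() - t < FAILURE_TTL:
        return True
    del _failed_at[(ca_pk, remote_pk)]
    return False
```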

@pedro-psb (Member, Author)

@dralley has this idea for a similar case, but I think it fits even better here.

maybe we should have the content app - if it tries a URL and then fails - launch a task to try all of the remaining remote artifacts in the background to download the artifact. That won't stop the first one from failing but it will ensure that afterwards it will be downloaded as if the original had succeeded.

IMHO this is definitely better than randomization. The idea of streaming is optimizing speed/transfer for the best case, when we can get the right package. If we fail at that, I believe we should aim to minimize the worst case, which currently is the client not getting the package at all (even if it's somehow reachable).

I believe that with that approach and some improvements we can make the worst case be: (a) the client gets the package on the second request (at most), if the package is reachable; (b) a meaningful error is reported if it's unreachable.
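
A hedged sketch of that background task using pulpcore's plugin API (dispatch, RemoteArtifact, Artifact); the exact wiring back to the ContentArtifact is an assumption, not confirmed behavior:

```python
import asyncio

from pulpcore.plugin.models import Artifact, ContentArtifact, RemoteArtifact
from pulpcore.plugin.tasking import dispatch

def fetch_from_remaining_remotes(content_artifact_pk, failed_remote_pk):
    """Task body: try every other RemoteArtifact until one validates."""
    ca = ContentArtifact.objects.get(pk=content_artifact_pk)
    candidates = RemoteArtifact.objects.filter(content_artifact=ca).exclude(
        remote__pk=failed_remote_pk
    )
    for ra in candidates:
        downloader = ra.remote.get_downloader(remote_artifact=ra)
        try:
            # The downloader validates expected digests/size as it runs.
            result = asyncio.run(downloader.run())
        except Exception:
            continue  # validation or network failure: try the next remote
        artifact = Artifact.init_and_validate(result.path)
        artifact.save()
        ca.artifact = artifact
        ca.save()
        return  # first valid download wins

# In the content app, after the streamed response fails:
# dispatch(fetch_from_remaining_remotes, args=(ca.pk, bad_remote.pk))
```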

Worst Case

A Pessimistic Scenario

This still doesn't cover the worst case, where an ACS is corrupted and the client gives up before the task completes (this is a very pessimistic case, but I've learned those usually happen to us).

[image: diagram of the pessimistic scenario]

Improvement idea

An idea to make it more robust:

[image: diagram of the improvement idea]

pedro-psb self-assigned this Oct 30, 2024
@dralley (Contributor) commented Oct 30, 2024

I definitely agree with what's alluded to in your diagram: we do need to handle the unreachable case in such a way that it won't constantly consume needlessly large amounts of resources if the RAs are unreachable.
