Wrong package download after on-demand sync #5725

Open · pedro-psb opened this issue Aug 19, 2024 · 5 comments · May be fixed by #6064

@pedro-psb (Member)

Originally reported here.
I'm bringing it here for visibility in the RemoteArtifact bucket of issues.

Version

Confirmed to happen with at least:

  • pulpcore==3.21, pulp_rpm==3.18
  • pulpcore==3.39, pulp_rpm==3.23

Describe the bug

When:

  • Repos A and B sync the same Content (on-demand) from two independent Remotes
  • In Remote B, the Content is replaced (e.g., in RPM: same NEVRA, different checksum) and synced again (on-demand)
  • A request is made for that Content through the Distribution associated with Repo A

Then:

  • Pulp raises a digest validation error and retries the download from the same (wrong) remote
  • The client can't get the content at all

To Reproduce

This is flaky to reproduce. Most likely it depends on the RemoteArtifacts being sorted by uuid4, which is random.
I'll write a Pulp test and share it here to make it easier to re-run.

  1. Serve a signed package in two different "external" (non-Pulp) repos. E.g., download sample, run createrepo and make sure it's reachable from the Pulp instance.
  2. For each of the two repos (see the sketch after this list):
    • Create a Repository (autopublish=True), a Remote (policy=on_demand and the corresponding url) and a Distribution
    • Sync (policy=mirror_content_only)
  3. Change one of the external repos to serve the unsigned version of the package. E.g., delete the signed one, put the unsigned one in its place, run createrepo and sync the corresponding Pulp repo again.
  4. Request the package from the distribution whose package hasn't changed (it should still contain the signed one) and observe the error.
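
For convenience, a hedged sketch of steps 1-2 as a script. It assumes a local Pulp instance with admin credentials and REST endpoint paths matching recent pulpcore/pulp_rpm; the external repo URLs and names are placeholders, and waiting on spawned tasks is omitted.

```python
# Hedged sketch of reproduction steps 1-2 against the Pulp REST API.
# Assumptions: a local Pulp instance at BASE with admin credentials, and
# endpoint paths matching recent pulpcore/pulp_rpm. The external repo URLs
# are placeholders; polling spawned tasks is omitted for brevity.
import requests

BASE = "http://localhost:5001"
AUTH = ("admin", "password")

def post(path, payload):
    r = requests.post(f"{BASE}{path}", json=payload, auth=AUTH)
    r.raise_for_status()
    return r.json()

for name, url in [("repo-a", "http://example.com/external-a/"),
                  ("repo-b", "http://example.com/external-b/")]:
    repo = post("/pulp/api/v3/repositories/rpm/rpm/",
                {"name": name, "autopublish": True})
    remote = post("/pulp/api/v3/remotes/rpm/rpm/",
                  {"name": name, "url": url, "policy": "on_demand"})
    # Returns a task href; a real script would poll it to completion.
    post(repo["pulp_href"] + "sync/",
         {"remote": remote["pulp_href"], "sync_policy": "mirror_content_only"})
    post("/pulp/api/v3/distributions/rpm/rpm/",
         {"name": name, "base_path": name, "repository": repo["pulp_href"]})
```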

Expected behavior

The right package should be delivered to the client, as it's present and reachable in Remote A.

Additional context

  • On previous versions (core/rpm 3.16/3.17), the client would effectively get the wrong package, but that's no longer the case in supported versions.
  • Overview of what is happening in Pulp:

[image: overview diagram]

@lubosmj (Member) commented Sep 5, 2024

We can adjust the order_by_acs() function or add a new sorting (sketched below): 1. an ACS wins if available; 2. then the remote associated with the repo/distribution; 3. the rest is random (randomizing all the remaining remotes).

The discussion around copying content will come.
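
A minimal Python sketch of that ordering, with the two predicates as hypothetical stand-ins; this is not pulpcore's actual order_by_acs() implementation:

```python
import random

def is_acs_remote(remote) -> bool:
    # Hypothetical predicate: is this remote backed by an Alternate Content Source?
    return getattr(remote, "is_acs", False)

def belongs_to(remote, distribution) -> bool:
    # Hypothetical predicate: is this the remote of the distribution's repository?
    return getattr(distribution, "remote_pk", None) == getattr(remote, "pk", None)

def sort_remote_artifacts(remote_artifacts, distribution):
    """Proposed priority: ACS-backed remotes first, then the remote tied to
    the requested distribution, then everything else in random order."""
    acs, associated, rest = [], [], []
    for ra in remote_artifacts:
        if is_acs_remote(ra.remote):
            acs.append(ra)
        elif belongs_to(ra.remote, distribution):
            associated.append(ra)
        else:
            rest.append(ra)
    random.shuffle(rest)
    return acs + associated + rest
```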

@pedro-psb (Member, Author) commented Oct 9, 2024

Hey @dralley,
Just wanted to 'officially' ask you to be my reviewer for the wrong package being served in on-demand.
I see two possible steps on this:

  1. In pulpcore:
    • ensure the server is handling those failures correctly (close the connection, as we can't roll back streamed data; see the sketch after this list).
    • ensure that the client eventually gets the content after retries, even if it hits a 'corrupted' remote first.
    • caveat: the randomization helps with the above, but if there are ACSs and the first ACS server has the wrong binary, the content is unreachable.
  2. In pulp_rpm (independent, for later):
    • This case shows that even after a second sync against the server, we are left with old, unreachable RemoteArtifacts for a package NEVRA if the server changed that package's binary (e.g., signed vs unsigned).
    • Cleaning this up on sync would reduce the number of possibly corrupted remotes being tried in the content handler. Removing things is always delicate, so I thought about marking them as unreachable so we could ignore them when serving content.
    • Just a note: if we didn't sync and the remote changed the package, then we can't do much.
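
A minimal aiohttp-flavored sketch of the first bullet, assuming a streaming handler that receives the chunk iterator and expected digest from its caller; pulpcore's actual content handler differs in detail:

```python
import hashlib

from aiohttp import web

async def stream_remote_artifact(request, chunks, expected_sha256):
    """Once the 200 status and headers are sent, a digest mismatch can no
    longer become an HTTP error; closing the connection is the only way to
    tell the client not to trust the body."""
    response = web.StreamResponse(status=200)
    await response.prepare(request)  # status line + headers hit the wire here
    digest = hashlib.sha256()
    async for chunk in chunks:  # chunks: async iterator of bytes from the remote
        digest.update(chunk)
        await response.write(chunk)
    if digest.hexdigest() != expected_sha256:
        # We can't roll back streamed data; abort so the client sees a
        # failed/truncated transfer instead of a "successful" wrong body.
        response.force_close()
        if request.transport is not None:
            request.transport.close()
        return response
    await response.write_eof()
    return response
```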

Currently I'm working on (1), by looking into HTTP/1.1 and our stream implementation details and writing unit tests using the content.Handler.

edit: In the end, (1) is about fixing #5012 first

@pedro-psb (Member, Author)

An alternative approach to randomization is to add a cache to the ContentArtifact about the health of its remotes. I guess that was mentioned in the f2f but was not very well received.

Anyway, I'll share my naive idea here:

  • If a digest validation error occurs, we add an entry to the cache saying that Remote-N failed for CA-Y at time T.
  • If a future retry falls within a timespan X, we skip that remote. If not, we unmark it and try again.
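
A naive in-process sketch of that cache, assuming an illustrative TTL for the timespan X; a real implementation would more likely live in Redis or on the model:

```python
import time

FAILURE_TTL = 300.0  # seconds; an illustrative value for the timespan "X"

# Hypothetical in-memory cache: (content_artifact_pk, remote_pk) -> failure time T.
_failed_at: dict[tuple[str, str], float] = {}

def mark_failed(ca_pk: str, remote_pk: str) -> None:
    """Record that Remote-N failed digest validation for CA-Y at time T."""
    _failed_at[(ca_pk, remote_pk)] = time.monotonic()

def should_skip(ca_pk: str, remote_pk: str) -> bool:
    """Skip a remote while its failure is fresher than the TTL; otherwise
    clear the mark so the remote gets retried."""
    t = _failed_at.get((ca_pk, remote_pk))
    if t is None:
        return False
    if time.monotonic() - t < FAILURE_TTL:
        return True
    del _failed_at[(ca_pk, remote_pk)]
    return False
```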

@pedro-psb (Member, Author)

@dralley has this idea for a similar case, but I think it fits even better here.

maybe we should have the content app - if it tries a URL and then fails - launch a task to try all of the remaining remote artifacts in the background to download the artifact. That won't stop the first one from failing but it will ensure that afterwards it will be downloaded as if the original had succeeded.

IMHO this is definitely better than randomization. The idea of streaming is optimizing speed/transfer for the best case, when we can get the right package. If we fail at that, I believe we should aim to minimize the worst case, which currently is the client not getting the package at all (even if it's somehow reachable).

I believe that with that approach and some improvements we can make the worst case be: (a) the client gets the package on the second request (at most), if the package is reachable; (b) a meaningful error is reported if it's unreachable.
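
A hedged sketch of that background task using pulpcore's plugin API (dispatch, RemoteArtifact, Artifact); the exact wiring back to the ContentArtifact is an assumption, not confirmed behavior:

```python
import asyncio

from pulpcore.plugin.models import Artifact, ContentArtifact, RemoteArtifact
from pulpcore.plugin.tasking import dispatch

def fetch_from_remaining_remotes(content_artifact_pk, failed_remote_pk):
    """Task body: try every other RemoteArtifact until one validates."""
    ca = ContentArtifact.objects.get(pk=content_artifact_pk)
    candidates = RemoteArtifact.objects.filter(content_artifact=ca).exclude(
        remote__pk=failed_remote_pk
    )
    for ra in candidates:
        downloader = ra.remote.get_downloader(remote_artifact=ra)
        try:
            # The downloader validates expected digests/size as it runs.
            result = asyncio.run(downloader.run())
        except Exception:
            continue  # validation or network failure: try the next remote
        artifact = Artifact.init_and_validate(result.path)
        artifact.save()
        ca.artifact = artifact
        ca.save()
        return  # first valid download wins

# In the content app, after the streamed response fails:
# dispatch(fetch_from_remaining_remotes, args=(ca.pk, bad_remote.pk))
```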

Worst Case

A Pessimistic Scenario

This still doesn't cover the worst case, where an ACS is corrupted and the client gives up before the task completes (this is a very pessimistic case, but I've learned those usually happen to us).

[image: diagram of the pessimistic scenario]

Improvement idea

An idea to make it more robust:

[image: diagram of the improvement idea]

pedro-psb self-assigned this Oct 30, 2024
@dralley (Contributor) commented Oct 30, 2024

I definitely agree with what's alluded to in your diagram: we do need to handle the unreachable case in such a way that it won't constantly consume needlessly large amounts of resources if the RAs are unreachable.
