Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove orphan tasks failing on Katello 4.9 #865

Closed
lravelo opened this issue Jul 31, 2023 · 23 comments · Fixed by pulp/pulpcore#4295
Closed

Remove orphan tasks failing on Katello 4.9 #865

lravelo opened this issue Jul 31, 2023 · 23 comments · Fixed by pulp/pulpcore#4295
Labels
.bugfix CHANGES/<issue_number>.bugfix Katello For bugs and issues known to affect Katello

Comments

@lravelo
Copy link

lravelo commented Jul 31, 2023

Version
katello 4.9

python39-pulp-rpm-3.19.7-1.el8.noarch
python39-pulpcore-3.22.7-1.el8.noarch

Describe the bug
remove orphan tasks in foreman are failing with the following type of message:

("Cannot delete some instances of model 'Content' because they are referenced through protected foreign keys: 'Package.content_ptr'.", {<RepositoryContent: pk=ff32c510-ad1c-4288-918a-442fba668125>})

To Reproduce
Steps to reproduce the behavior: running any remove orphan task will create this error

Expected behavior
I would expect the task to remove orphans successfully

Additional context

https://bugzilla.redhat.com/show_bug.cgi?id=2164551

journalctl -u pulpcore-worker@*

Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]: pulp [b3a7bad0-751c-4973-8432-9a7eb92d6452]: pulpcore.tasking.pulpcore_worker:INFO:   File "/usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py", line 44>
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     result = func(*args, **kwargs)
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 486, in synchronize
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     remote_url = fetch_remote_url(remote, url)
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 305, in fetch_remote_url
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     remote_url = fetch_mirror(remote)
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 254, in fetch_mirror
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     result = downloader.fetch()
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulpcore/download/base.py", line 186, in fetch
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     return done.pop().result()
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 273, in run
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     return await download_wrapper()
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/backoff/_async.py", line 151, in retry
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     ret = await target(*args, **kwargs)
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 258, in download_wrapper
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     return await self._run(extra_data=extra_data)
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulp_rpm/app/downloaders.py", line 117, in _run
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     self.raise_for_status(response)
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib/python3.9/site-packages/pulp_rpm/app/downloaders.py", line 102, in raise_for_status
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     response.raise_for_status()
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:   File "/usr/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
Jul 19 00:00:24 foreman.example.com pulpcore-worker-2[152361]:     raise ClientResponseError(

PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager showmigrations rpm

rpm
 [X] 0001_initial
 [X] 0002_updaterecord_reboot_suggested
 [X] 0003_DATA_incorrect_json
 [X] 0004_add_metadata_signing_service_fk
 [X] 0005_optimize_sync
 [X] 0006_opensuse_support
 [X] 0007_checksum_types
 [X] 0008_advisory_pkg_sumtype_as_int
 [X] 0009_revision_null
 [X] 0010_revision_null_redo
 [X] 0011_rpmremote_sles_auth_token
 [X] 0012_remove_pkg_group_env_cat_related_pkgs
 [X] 0013_RAW_rpm_evr_extension
 [X] 0014_rpmrepository_package_retention_policy
 [X] 0015_repo_metadata
 [X] 0016_dist_tree_nofk
 [X] 0017_merge_advisory_collections
 [X] 0018_updatecollection__update_record
 [X] 0019_migrate_updatecollection_data
 [X] 0020_remove_updatecollection_m2m
 [X] 0021_rename_updatecollection_update_record
 [X] 0022_add_collections_related_name
 [X] 0023_increase_distribution_release_short
 [X] 0024_change_subrepo_relation_properties
 [X] 0025_remove_orphaned_subrepos
 [X] 0026_add_gpgcheck_options
 [X] 0027_checksum_null
 [X] 0028_rpmrepository_last_sync_repomd_cheksum
 [X] 0029_rpmpublication_sqlite_metadata
 [X] 0030_DATA_fix_updaterecord
 [X] 0031_modulemd_static_context
 [X] 0032_ulnremote
 [X] 0033_new_distribution_model
 [X] 0034_auto_publish
 [X] 0035_fix_auto_publish
 [X] 0036_checksum_type
 [X] 0037_DATA_remove_rpmrepository_sub_repo
 [X] 0037_update_json_field
 [X] 0038_fix_sync_optimization
 [X] 0039_disttree_digest
 [X] 0040_rpmalternatecontentsource
 [X] 0041_modulemdobsolete
 [X] 0042_alter_repometadatafile_data_type
 [X] 0043_textfield_conversion
 [X] 0044_noartifact_modules
 [X] 0045_modulemd_fields
 [X] 0046_rbac_perms
 [X] 0047_modulemd_datefield
 [X] 0048_artifacts_dependencies_fix
 [X] 0049_profiles_fix

Task Output:

{"pulp_tasks"=>
  [{"pulp_href"=>"/pulp/api/v3/tasks/2996831a-d553-4195-982a-12938f7ee779/",
    "pulp_created"=>"2023-07-31T13:10:02.449+00:00",
    "state"=>"failed",
    "name"=>"pulpcore.app.tasks.orphan.orphan_cleanup",
    "logging_cid"=>"5b70908fc43f4d14942561f8319dcc49",
    "started_at"=>"2023-07-31T13:10:02.642+00:00",
    "finished_at"=>"2023-07-31T13:10:07.675+00:00",
    "error"=>
     {"traceback"=>
       "  File \"/usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py\", line 444, in _perform_task\n" +
       "    result = func(*args, **kwargs)\n" +
       "  File \"/usr/lib/python3.9/site-packages/pulpcore/app/tasks/orphan.py\", line 66, in orphan_cleanup\n" +
       "    c.delete()\n" +
       "  File \"/usr/lib/python3.9/site-packages/django/db/models/query.py\", line 745, in delete\n" +
       "    collector.collect(del_query)\n" +
       "  File \"/usr/lib/python3.9/site-packages/django/db/models/deletion.py\", line 302, in collect\n" +
       "    raise ProtectedError(\n",
      "description"=>
       "(\"Cannot delete some instances of model 'Content' because they are referenced through protected foreign keys: 'Package.content_ptr'.\", {<RepositoryContent: pk=ff32c510-ad1c-4288-918a-442fba668125>})"},
    "worker"=>"/pulp/api/v3/workers/92aeba58-871a-40e8-aac1-042e4a0887c6/",
    "child_tasks"=>[],
    "progress_reports"=>
     [{"message"=>"Clean up orphan Content",
       "code"=>"clean-up.content",
       "state"=>"running",
       "total"=>19089,
       "done"=>2000}],
    "created_resources"=>[],
    "reserved_resources_record"=>["/pulp/api/v3/orphans/cleanup/"]}],
 "task_groups"=>[],
 "poll_attempts"=>{"total"=>21, "failed"=>1}}
@dralley
Copy link
Contributor

dralley commented Jul 31, 2023

@quba42
Copy link
Collaborator

quba42 commented Aug 1, 2023

We had a very similar issue (but with a different model) in pulp_deb: #690

Just thought I would drop a link in case anyone wants to have a look.

@gerrod3
Copy link

gerrod3 commented Aug 1, 2023

Another possible scenario is that orphaned content was added to a repository while the orphan cleanup task was running and thus was no longer orphaned, but still on the list of content to try to delete.

@lravelo
Copy link
Author

lravelo commented Aug 1, 2023

Another possible scenario is that orphaned content was added to a repository while the orphan cleanup task was running and thus was no longer orphaned, but still on the list of content to try to delete.

if this is the case, how could I go about searching for the repo in question so that I can delete and re-add?

@ipanova
Copy link
Member

ipanova commented Aug 1, 2023

@gerrod3 should we then just try catching the ProtectedError, log and proceed to the next "orphaned content"?

@gerrod3
Copy link

gerrod3 commented Aug 1, 2023

if this is the case, how could I go about searching for the repo in question so that I can delete and re-add?

Not sure it is possible through the API. If you can access the shell you could look up the RepositoryContent object referenced in the traceback. That object will have a reference to the repository the content is a part of.

pulpcore-manager shell:

from pulpcore.app.models import RepositoryContent
rc = RepositoryContent.objects.get(pk="ff32c510-ad1c-4288-918a-442fba668125")
repo = rc.repository
print(repo.name)

@gerrod3 should we then just try catching the ProtectedError, log and proceed to the next "orphaned content"?

Probably, since the task can be ran concurrently with others we should expect that sometimes orphans will no longer be orphans.

@lravelo
Copy link
Author

lravelo commented Aug 1, 2023

thanks @gerrod3
I was able to find the repo using this. Turns out its an Ubuntu Focal Updates repo

@dralley
Copy link
Contributor

dralley commented Aug 2, 2023

Hmm, that's interesting.

@quba42 Have you seen this before? I have once, but not in the Ubuntu context, and not in a way that is easily repeatable.

@lravelo
Copy link
Author

lravelo commented Aug 2, 2023

@dralley @quba42 I've yet to delete the repo from my foreman server. If either of you would like some extra info that could help solve this issue, I'm willing to provide it. No issues on my end.

@quba42
Copy link
Collaborator

quba42 commented Aug 3, 2023

@dralley In the pulp_deb case we had a foreign key relation between two types of metadata content, and under certain (reproducible) conditions, it was possible for one of the metadata to be orphaned, while the other was not orphaned and still referencing the orphaned one. The solution was to get rid of that foreign key relation entirely (we weren't actually using it for anything). This was a case of pulp_deb had a foreign key relation it should not have had.

This case is different, since it simply makes no sense for a RepositoryContent to reference a Content that gets picked up by orphan cleanup. (Isn't that a direct contradiction with the definition of "orphaned"?) So I suspect this current case was probably the result of some race condition between parallel tasks, or some inconsistent state as a result of some failed task. Which means this is most likely going to be much harder to reproduce...

@quba42
Copy link
Collaborator

quba42 commented Aug 17, 2023

Looks like more users are running into this: https://community.theforeman.org/t/remove-orphan-tasks-failing-on-katello-4-9/34456/4

Could this be some effect where something is an orphan when the remove orphans task is started, but is then re-added to a repository by the time the orphan cleanup actually tries to delete it? If so, then it should be possible to simply re-run the orphan task without running into the same error again. Can anyone with the problem confirm or deny?

ipanova referenced this issue in ipanova/pulpcore Aug 17, 2023
…rallel.

closes #4209
Orphan clean up can fail when other tasks like sync or content upload
might be rinning in paralell.
ipanova referenced this issue in ipanova/pulpcore Aug 17, 2023
…rallel.

closes #4209
Orphan clean up can fail when other tasks like sync or content upload
might be rinning in paralell.
ipanova referenced this issue in ipanova/pulpcore Aug 18, 2023
…rallel.

closes #4209
Orphan clean up can fail when other tasks like sync or content upload
might be rinning in paralell.
ipanova referenced this issue in ipanova/pulpcore Aug 21, 2023
…rallel.

closes #4209
Orphan clean up can fail when other tasks like sync or content upload
might be rinning in paralell.
@ipanova
Copy link
Member

ipanova commented Aug 21, 2023

The root cause for this bug report ( even if rpm plugin version is provided, rpm plugin has nothing to do with it...yes.. debian also has Package model) and the one described in the foreman blogpost is reincarnation of this debian issue #690, just this time it is here https://github.com/pulp/pulp_deb/blob/main/pulp_deb/app/models/content/structure_content.py#L91, @quba42 please more eyes here I might be wrong since I did not look in depth into debian content structure. I don't think the referenced bugzila is related either.

Note that the user has pulp-deb 2.20
If this would be some pulpcore race condition issue, where the content would be be part of a repo the error would specify the list of foreign keys including RepositoryContent foreign key like this:

ProtectedError: ("Cannot delete some instances of model 'Content' because they are referenced through protected foreign keys: 'RepositoryContent.content', 'Package.content_ptr'.", {<RepositoryContent: pk=018a1839-3ff9-77c9-9524-f4b49cdd4434>, <RepositoryContent: pk=018a1839-3e91-79f6-a116-41b4e59f2339>})

Notice that in both bug report and foreman blogpost the error is:

("Cannot delete some instances of model 'Content' because they are referenced through protected foreign keys: 'Package.content_ptr'.", {<RepositoryContent: pk=ff32c510-ad1c-4288-918a-442fba668125>})

So re-running orphan clean-up might not help in case this issue is of the same nature as with ReleaseFile.

I did notice that the fix for the ReleaseFile issue, that contained a migration got backported into 2.20.2 390e7ad. We should not backport migrations.

@ipanova
Copy link
Member

ipanova commented Aug 21, 2023

The proposed fix should unblock the orphan clean up task, just these Content instances will always be in the limbo state and won't be removed (i.e. with every orphan clean up task there will be logged non removable content).
The fix is primarily targeted for the case when orphan clean up is running in parallel with other tasks like sync.

@quba42
Copy link
Collaborator

quba42 commented Aug 22, 2023

@ipanova You are right. Given that @lravelo confirmed the affected content came from an Ubuntu repo, and a PackageReleaseComponent is the only thing that has a foreign key relation on a pulp_deb Package, this issue must be caused, by a pulp_deb Package that is an orphan, with a PackageReleaseComponent pointing at it that is not an orphan.

It follows this is not a pulpcore, but a pulp_deb issue.

Unlike with #690 the solution won't be to get rid of the foreign key relation in question. In fact, a pulp_deb repo version should never contain a PackageReleaseComponent unless it also contains the Package being referenced, since having just a PackageReleaseComponent in a repo version isn't meaningful. The solution will probably involve two parts:

1.) Preventitive: Identify any pulp_deb actions that can result in this inconsistent state and modify them so they cannot produce such inconsistent states. (I don't know what caused this state in this current case, but I think I could force this result myself on a test system, so it is certainly possible).
2.) Add a DB cleanup script for already affected systems, that will remove any dangling PackageReleaseComponent content from the existing repo versions. This would mean if the accompanying Package is an orphan, the PackageReleaseComponent would also become an orphan, thus unblocking orphan cleanup.

It is worth noting that pulp_deb has had this relation pretty much unchanged since the beginning of Pulp 3, so it would be interesting to understand how we were able to avoid this issue up to now. (What workflow suddenly started producing these cases?)

@quba42
Copy link
Collaborator

quba42 commented Aug 22, 2023

Is it possible to move this issue to pulp_deb, or do I just open a new one and link to this one?

@ipanova
Copy link
Member

ipanova commented Aug 22, 2023

@ipanova You are right. Given that @lravelo confirmed the affected content came from an Ubuntu repo, and a PackageReleaseComponent is the only thing that has a foreign key relation on a pulp_deb Package, this issue must be caused, by a pulp_deb Package that is an orphan, with a PackageReleaseComponent pointing at it that is not an orphan.

It follows this is not a pulpcore, but a pulp_deb issue.

Unlike with pulp/pulp_deb#690 the solution won't be to get rid of the foreign key relation in question. In fact, a pulp_deb repo version should never contain a PackageReleaseComponent unless it also contains the Package being referenced, since having just a PackageReleaseComponent in a repo version isn't meaningful.

In pulp-container we have quite a similar concept, where it does not make sense to keep Blob without it's corresponding Manifest in the repo version, hence we have implemented a 'recursive removal/add logic' maybe debian could use similar concepts? https://docs.pulpproject.org/pulp_container/workflows/manage-content.html#add-content-to-a-repository

The solution will probably involve two parts:

1.) Preventitive: Identify any pulp_deb actions that can result in this inconsistent state and modify them so they cannot produce such inconsistent states. (I don't know what caused this state in this current case, but I think I could force this result myself on a test system, so it is certainly possible). 2.) Add a DB cleanup script for already affected systems, that will remove any dangling PackageReleaseComponent content from the existing repo versions. This would mean if the accompanying Package is an orphan, the PackageReleaseComponent would also become an orphan, thus unblocking orphan cleanup.

It is worth noting that pulp_deb has had this relation pretty much unchanged since the beginning of Pulp 3, so it would be interesting to understand how we were able to avoid this issue up to now. (What workflow suddenly started producing these cases?)

That's a good question, I don't think we have changed orphan removal logic except for disallowing running in parallel 2+ orphan cleanup tasks. I agree that starting with a reproducer is a step number 1.
Maybe if looking from another angle - did something change on the debian side? maybe some manual Package content removal from repo + retain_repo_version=1 leads to this. Or maybe sync in mirror mode automatically removes the Package from repo version but not the other dependent content?

@ipanova
Copy link
Member

ipanova commented Aug 22, 2023

Is it possible to move this issue to pulp_deb, or do I just open a new one and link to this one?

I have a PR attached to it, but I can create a new issue since the fix has a bit different scope from the original root cause

@quba42
Copy link
Collaborator

quba42 commented Aug 22, 2023

@ipanova Do you think it would make sense to mitigate this (and similar issues) by adding something like a force mode to orphan cleanup. The idea would be for orphan cleanup to catch the ProtectedError, skip the content that can't be removed, but continue to cleanup any other orphans it has yet to clean up, that are not affected by the error.

@ipanova
Copy link
Member

ipanova commented Aug 22, 2023

That's basically what my PR is doing, if you look at it.
There is another side of the coin to it though, by catching the ProtectedError we'd just log the error, which is not as visible as raising a traceback, so some issues, like this one with Debian will more likely to be covered, ignored, etc.

@dralley
Copy link
Contributor

dralley commented Aug 22, 2023

The root cause for this bug report ( even if rpm plugin version is provided, rpm plugin has nothing to do with it...yes.. debian also has Package model)

I have a tiny bit of doubt about this because there was a BZ filed, against Satellite, presumably without the Debian plugin. But we have no actionable information from that report and it's one single report in a long long time. Whatever the problem is does seem to occur mostly with the Debian plugin?

So I have no issue with moving it to the Debian plugin. We can do that

About the PR, it seems the discussion is along the lines of:

  • don't merge it, let the orphan cleanup fail but figure out how to communicate to the user that they should use a script / tool we provide to repair the situation
  • merge it, bypass "problematic" content units, just print it in the logs, and maybe also provide that script

I think we should see what we can do in terms of repair before we necessarily make a decision either way in terms of the PR.

@ipanova
Copy link
Member

ipanova commented Aug 22, 2023

@dralley idk about the repair part in both bullet points of yours, the script should be plugin self aware otherwise how it should know what to do with e.g. debian Package, or PackageReleaseComponent? Blindly removing FK does not seem like a good idea.

@ipanova
Copy link
Member

ipanova commented Aug 22, 2023

Let me open a separate issue for my PR and move the discussion there, I will copy over revelant comments

@ipanova ipanova transferred this issue from pulp/pulpcore Aug 22, 2023
@quba42 quba42 added .bugfix CHANGES/<issue_number>.bugfix Katello For bugs and issues known to affect Katello labels Aug 23, 2023
@quba42
Copy link
Collaborator

quba42 commented Aug 23, 2023

1.) Preventitive: Identify any pulp_deb actions that can result in this inconsistent state and modify them so they cannot produce such inconsistent states. (I don't know what caused this state in this current case, but I think I could force this result myself on a test system, so it is certainly possible).
2.) Add a DB cleanup script for already affected systems, that will remove any dangling PackageReleaseComponent content from the existing repo versions. This would mean if the accompanying Package is an orphan, the PackageReleaseComponent would also become an orphan, thus unblocking orphan cleanup.

Looks like we already have an issue for 1: #785

Which means this issue can be used for 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.bugfix CHANGES/<issue_number>.bugfix Katello For bugs and issues known to affect Katello
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants