Treat OSErrors as exceptions, not payload failures #35461

Merged
gherceg merged 1 commit into master from gh/repeat-records/handle-os-error on Dec 3, 2024

@gherceg (Contributor) commented Dec 2, 2024

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-16323

Occasionally we have stale Celery workers that run into an OSError when trying to access code that no longer exists, because the release they are running on has been cleaned up. See this stack trace, for example:

Traceback (most recent call last):
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/tasks.py", line 192, in _process_repeat_record
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/models.py", line 1196, in fire
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/models.py", line 453, in fire_for_record
  File "/home/cchq/www/production/releases/2024-11-21_08.58/python_env-3.9/lib/python3.9/site-packages/memoized.py", line 20, in _memoized
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/expression/repeaters.py", line 93, in get_payload
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/expression/repeater_generators.py", line 19, in get_payload
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/motech/repeaters/expression/repeater_generators.py", line 71, in _generate_payload
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 453, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 901, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 850, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/extension_expressions.py", line 61, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 605, in __call__
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/expressions/specs.py", line 626, in get_value
  File "/home/cchq/www/production/releases/2024-11-21_08.58/corehq/apps/userreports/decorators.py", line 50, in _inner
  File "/usr/lib/python3.9/inspect.py", line 1024, in getsource
    lines, lnum = getsourcelines(object)
  File "/usr/lib/python3.9/inspect.py", line 1006, in getsourcelines
    lines, lnum = findsource(object)
  File "/usr/lib/python3.9/inspect.py", line 835, in findsource
    raise OSError('could not get source code')
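The failure mode in this trace can be reproduced locally with a minimal sketch (not part of the PR): import a function from a throwaway "release" directory, delete that directory as a release cleanup would, then ask `inspect` for the function's source. The module name `stale_release_demo` and the helper `reproduce_stale_source_error` are hypothetical, chosen only for this illustration. The in-memory code object still points at the deleted file, so `inspect.getsource` should raise an OSError like the one at the bottom of the trace.

```python
import importlib
import inspect
import linecache
import os
import shutil
import sys
import tempfile
from typing import Optional

MODULE_NAME = "stale_release_demo"  # hypothetical module name for this demo


def reproduce_stale_source_error() -> Optional[OSError]:
    """Simulate a stale worker whose release directory was deleted after import.

    Returns the OSError raised by inspect.getsource, or None if the source
    could still be read.
    """
    release_dir = tempfile.mkdtemp()
    with open(os.path.join(release_dir, MODULE_NAME + ".py"), "w") as f:
        f.write("def expression():\n    return 42\n")

    sys.path.insert(0, release_dir)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
    finally:
        sys.path.remove(release_dir)

    # "Clean up the release": the function object stays in memory, but the
    # source file its code object refers to no longer exists on disk.
    shutil.rmtree(release_dir)
    linecache.clearcache()

    try:
        inspect.getsource(module.expression)
    except OSError as exc:
        return exc
    finally:
        sys.modules.pop(MODULE_NAME, None)
    return None
```

Calling `reproduce_stale_source_error()` returns the exception instead of raising it, so it is easy to inspect; the worker itself keeps running, exactly as a stale Celery worker would.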

Since this was happening while fetching the payload, it was treated as a payload error and therefore was not retried. However, this is an issue on our end. While this isn't the perfect solution (ideally we wouldn't get into this state in the first place), we should at least treat OSErrors as non-payload failures that will be retried, since there is a chance the next attempt will succeed.
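The intended behavior can be sketched roughly as below. This is not the PR's diff: `Outcome`, `classify_payload_failure`, and `attempt_send` are hypothetical stand-ins for the repeater code under `corehq/motech/repeaters`, used only to show the idea that an OSError raised while generating the payload is classified as retryable, while any other exception is still recorded as a payload error.

```python
import enum
from typing import Callable


class Outcome(enum.Enum):
    """Hypothetical result categories for one delivery attempt."""
    SUCCESS = "success"
    PAYLOAD_ERROR = "payload_error"  # problem with the payload itself: not retried
    RETRYABLE = "retryable"          # problem on our side: retried later


def classify_payload_failure(exc: Exception) -> Outcome:
    """Decide how an exception raised while building a payload is recorded."""
    if isinstance(exc, OSError):
        # e.g. inspect.getsource() failing because a stale worker's release
        # directory was deleted; a later attempt may run on a fresh worker.
        return Outcome.RETRYABLE
    return Outcome.PAYLOAD_ERROR


def attempt_send(get_payload: Callable[[], str], send: Callable[[str], None]) -> Outcome:
    """Build the payload and send it, classifying payload-generation failures."""
    try:
        payload = get_payload()
    except Exception as exc:
        return classify_payload_failure(exc)
    send(payload)
    return Outcome.SUCCESS
```

Because `OSError` is a broad base class (`FileNotFoundError`, `ConnectionError`, and `TimeoutError` all subclass it), a check like this also retries those; that is the trade-off of matching on the exception type rather than on the specific "could not get source code" message.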

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

No

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@gherceg gherceg marked this pull request as ready for review December 2, 2024 17:32
@gherceg gherceg requested a review from kaapstorm as a code owner December 2, 2024 17:32
@AmitPhulera (Contributor) left a comment

Looks good.
One question: this will notify us about the issues we faced with repeat records. Can this situation happen with other queues as well?

@gherceg gherceg merged commit 140d56c into master Dec 3, 2024
13 checks passed
@gherceg gherceg deleted the gh/repeat-records/handle-os-error branch December 3, 2024 19:27
Labels
None yet
Projects
None yet

4 participants