Log job failure even when there are retries configured #6169
base: 8.3.x
73714c8 to 8f20ab0
Unfortunately, I don't think it's this simple.
- I think this diff means that polled log messages for task failure will go back to being duplicated.
- This only covers failure, but submission failure also has retries so may require similar treatment.
- I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.
I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?
I think that submission failure is already handled correctly - it certainly is in the simplistic case where you feed it
These are logged at critical - and I think they should be?
This would be consistent with submit failure...
2c7e480 to 3cedf2f
No, I'm not thinking of the other log message duplication issue. The change made here bypassed logic that was used for suppressing duplicate log messages (the 8f20ab0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930 However, in your more recent "fix" commit, you have put this back the way it was before: 3cedf2f#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
This does not apply to submit failure, because submit failure will always log a critical warning through the
3cedf2f to 1341355
(Test failures)
1341355 to ce0498e
I think that the correct level is error. @hjoliver - does your PR fall victim to any of Oliver's comments from #6169 (review)?
If execution/submission retry delays are configured, then execution/submission failures (respectively) are expected to occur, so such a failure should not be logged at CRITICAL level. Only if the retries are exhausted should it be a CRITICAL-level message?
I don't disagree with that, but it was kinda off-topic for this Issue - which is about not hiding the job failure from the log - unless we introduce a jarring inconsistency between the way submission and execution failures are logged. But OK, if we want to kill two birds with one stone, let's look at unhiding the job failure AND changing the level of both kinds of failure at once, to maintain consistency... I agree with @wxtim's assertion that the correct level (for both) is ERROR.
In that case, I would go with your approach @wxtim - but with some tweaks:
    if retries:
        LOG.error(message)
        ...
    else:
        LOG.critical(message)
        ...
[UPDATE]
    LOG.error(message)
    if retries:
        ...
    else:
        ...
(a job error is always an error, but whether it's critical or not depends on the graph... if it is, it will cause a stall that does get logged as CRITICAL)
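For illustration, the [UPDATE] shape above could be fleshed out roughly as follows. This is only a sketch under assumptions, not the code in this PR: `log_job_failure`, its parameters, the logger name, and the returned state strings are all hypothetical stand-ins, and the retry bookkeeping is reduced to a plain list of delays.

```python
from __future__ import annotations

import logging

# Hypothetical logger name, used only for this sketch.
LOG = logging.getLogger("job-failure-sketch")


def log_job_failure(job_id: str, try_num: int, retry_delays: list[float]) -> str:
    """Log a job failure and return a (hypothetical) resulting task state.

    try_num is 1-based; retry_delays holds one delay per configured retry.
    """
    # A job failure is always logged at ERROR, whether or not retries remain.
    LOG.error("[%s] failed (try %d)", job_id, try_num)
    if try_num <= len(retry_delays):
        # Retries remain: the failure was expected, so just announce the retry.
        delay = retry_delays[try_num - 1]
        LOG.warning("[%s] retrying in %ss", job_id, delay)
        return "waiting"
    # Retries exhausted: nothing more is logged here. Whether this is
    # critical depends on the graph and would surface as a stall.
    return "failed"
```

With one configured retry delay, the first failure yields an ERROR followed by a WARNING about the retry, and the second failure yields only the ERROR, so the job failure is never hidden from the log.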
@wxtim - is this ready to go again, in light of the above discussion?
In my head, yes, but I see that there are a load of test failures. Will draft now and undraft when ready.
5b5c391 to 0b26abd
These test failures were caused by a slight change in the nature of the message caused by moving it: By the time the code reaches |
One small typo found.
@wxtim this seems to have diverged a bit from what I thought was agreed above, which was:
Now, if there are retries we only get the retry warning. (Which I think is back to the problem we were trying to fix here, although the logging location has changed to the methods that would support the fix).
Co-authored-by: Hilary James Oliver <[email protected]> response to review
fc12804 to 5453cac
Closes #6151
Check List
- CONTRIBUTING.md and added my name as a Code Contributor.
- setup.cfg (and conda-environment.yml if present).
- CHANGES.md entry included if this is a change that can affect users.
- ?.?.x branch.