Local dispatcher improvements fixes #5569
Closed
johnhaddon wants to merge 36 commits into GafferHQ:main from johnhaddon:localDispatcherImprovementsFixes
Conversation
This is basically the window that the "Execute/View Local Jobs" menu item used to show, but refactored into a dockable editor panel. It still leaves a lot to be desired.
This removes a fair bit of duplicate code. There were quite a few weird little differences in the reporting logic in particular, which I could see no reason for, so they are gone. The two distinctions between foreground and background dispatches that remain are :

- Background execution runs tasks in a subprocess, and polls them while checking for cancellation (see the sketch below).
- Background execution suppresses exceptions (while still reporting them via the job log), while foreground execution propagates them.
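Roughly, the remaining background path might look like the sketch below. This is only an illustration of the two points above, not the actual LocalDispatcher code; `canceller` is assumed to be a `threading.Event`-style object, and `commandArgs` a pre-built command line.

```python
import subprocess
import time

def _runBatchInBackground( commandArgs, canceller ) :

    # Launch the task in a subprocess, then poll it, checking for
    # cancellation between polls.
    process = subprocess.Popen( commandArgs )
    while process.poll() is None :
        if canceller.is_set() :
            process.terminate()
            raise RuntimeError( "Dispatch cancelled" )
        time.sleep( 0.1 )

    if process.returncode != 0 :
        # Background execution suppresses this exception at a higher level,
        # reporting it via the job log; foreground execution propagates it.
        raise RuntimeError( "Task exited with status %d" % process.returncode )
```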
- This appears to be completely unused.
- A `jobStatusChangedSignal()` would be more generally useful.
- But signals about status changing would make more sense on the Jobs themselves. We'll deal with this in a later commit.
We were storing a status per batch, but in fact the only status we ever reported to the outside world (via `failed()` and `killed()`) was the status of the root batch. And to get the root batch into a failed state we were going via a convoluted route whereby `__reportFailed()` would mark the current batch as failed, and call `JobPool._failed()`, which would in turn call back to `Job._fail()` to mark the root batch as failed.

It is much easier to manage a single `self.__status` directly, and limit the per-batch data to a simple flag that tells us whether we've executed it yet. At the same time, we explicitly store the current batch instead of having to search for it every time. Note that we're assuming that attribute assignment (for `self.__currentBatch`) is atomic (it is for CPython) rather than using a mutex to synchronize access between the background and UI threads (see the sketch below).
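A minimal sketch of that arrangement, with illustrative names rather than the real implementation (`batch.execute()` stands in for whatever actually runs the batch):

```python
class Job :

    def __init__( self ) :

        # Written on the background thread, read on the UI thread. CPython
        # guarantees attribute assignment is atomic, so no mutex is needed.
        self.__currentBatch = None

    def currentBatch( self ) :

        return self.__currentBatch

    def _executeBatch( self, batch ) :

        self.__currentBatch = batch
        try :
            batch.execute()
            # Per-batch data is reduced to a flag recording that we've run it.
            batch.executed = True
        finally :
            self.__currentBatch = None
```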
Instead, just keep failed jobs in the main jobs list. This prevents them from jumping around in the LocalJobs UI, and simplifies the API too.
There was no benefit in that.
Also move icons into their own section, and spruce up the other icons a little bit.
Otherwise you get no opportunity to see the job logs in the LocalJobs panel, and for quick jobs it's not even clear if the job happened. Also add a `Job.status()` method since there was no public way of determining running/complete status before (it was assumed that anything in the list that wasn't killed or failed was running). Remove `Job.killed()` and `Job.failed()` since they are superseded by `Job.status()`.
This gives us a more natural cancellation mechanism, and means we're using a standard TBB worker thread and therefore are not creating unnecessary thread locals. Since cancellation operates via exceptions, it's more natural to also allow error exceptions to propagate out of `__executeWalk()` rather than using a `False` return value. This also means that all management of `self.__status` can be consolidated into one place in `__executeInternal()`.
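The resulting control flow might look roughly like this. `IECore.Cancelled` is the cancellation exception used elsewhere in Gaffer; the class layout and the `Status` values are assumptions for the sake of the sketch.

```python
import enum
import IECore

class Job :

    # Hypothetical status values; the real names may differ.
    Status = enum.Enum( "Status", [ "Running", "Complete", "Failed", "Killed" ] )

    def __init__( self, rootBatch ) :

        self.__rootBatch = rootBatch
        self.__status = Job.Status.Running

    def status( self ) :

        return self.__status

    def __executeWalk( self, batch ) :

        # The real method recurses over upstream batches; here we just
        # execute one, letting any exception propagate to the caller.
        batch()

    def __executeInternal( self ) :

        # The only place `self.__status` is assigned.
        try :
            self.__executeWalk( self.__rootBatch )
        except IECore.Cancelled :
            self.__status = Job.Status.Killed
        except Exception :
            self.__status = Job.Status.Failed
        else :
            self.__status = Job.Status.Complete
```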
This should not have been part of the public API - it only exists so that `doDispatch()` can initiate execution _after_ adding the job to the pool. Nobody needs to call it again after that, and definitely not client code.
And make `removeJob()` public since it is called by the UI, and add public `addJob()` method for symmetry.
The Log tab is more useful.
We'll be updating using our signal handlers anyway.
And refactor the other updates to make it clearer what is being updated and why.
Before, they were only updated when selecting a job, so you didn't get any streaming feedback as to what a job was doing.
- Prefix batch messages with the node name, not the job name.
- Include execution time in batch completion message.
- Output stack traces line-by-line to work better with `MessagesWidget.Role.Log` (see the sketch below).
- Add debug message containing subprocess command lines.
- Omit batch messages for root batch.
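For the stack-trace point, something along these lines would emit each traceback line as its own message. `IECore.msg()` is Cortex's standard message output; the function name and the assumption that `exceptionInfo` is a `sys.exc_info()` tuple are mine.

```python
import traceback
import IECore

def _reportExceptionLines( nodeName, exceptionInfo ) :

    # Emit each line of the formatted traceback separately, which reads
    # better in a log-style widget than one multi-line message would.
    for chunk in traceback.format_exception( *exceptionInfo ) :
        for line in chunk.rstrip().split( "\n" ) :
            IECore.msg( IECore.Msg.Level.Error, nodeName, line )
```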
That's what it is.
The LocalJobs Properties tab now contains only static properties of the job, so it is no longer a bug that we're not updating it dynamically when the current batch changes.
- Move just below the job listing, since that's what they operate on.
- Remove the draggable splitter handle separating the buttons from the widget above.
- Reduce size.
- Update appropriately when job status changes.
Without this, the BackgroundTask and `outputHandler` threads would continue running while Python was torn down around them on the main thread. This led to crashes when trying to release the GIL from the BackgroundTask.

The Changes.md entry doesn't refer to crashes, because the behaviour in 1.3 was different due to us using a Python thread instead of a BackgroundTask. In 1.3, Python would wait patiently for the Job thread to finish, even if that meant waiting a very, very long time. _Unless_ the user hit `Ctrl+C` again, in which case it would forcibly quit, leaving behind zombie task processes.

I was in two minds as to whether to apply the exit handler to all JobPools or just the default one. Since we don't have any non-default pools outside of the unit tests, I don't really have any information to base a decision on, so I went with only the default pool for now.
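A hedged sketch of the exit handling described above, assuming the pool exposes `jobs()` and the jobs expose `kill()` roughly as in the existing API:

```python
import atexit

def _registerExitHandler( jobPool ) :

    # Kill any still-running background jobs before the interpreter shuts
    # down, so their BackgroundTask and output-handling threads don't
    # outlive the Python state they depend on.
    def exitHandler() :
        for job in jobPool.jobs() :
            job.kill()

    atexit.register( exitHandler )
```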
Strictly speaking, no such View should ever exist, because the `in` plug is added in the View constructor. But `ImageView::insertConverter()` currently replaces it with another plug, meaning that it is missing when the replacement is added, which is when `cancelAffectedTasks()` is called. This was causing a crash in `GafferImageUITest.ImageViewTest.testDeriving`. The problem had gone unnoticed until now because previously there were no BackgroundTasks to cancel, so we never got as far as the bad check. But now that the LocalDispatcher uses a BackgroundTask, some completed tasks lie dormant waiting for garbage collection after LocalDispatcherTest has run, and we hit the null dereference.
This is probably neither here nor there in actual usage in the UI, but it is important for running the unit tests when the UI module has been imported. That's because `GafferUI.EventLoop` unconditionally installs a UIThreadCallHandler that just buffers the calls until an event loop is started. That prolonged Job lifetimes way outside of their associated tests, eventually causing this spooky action at a distance :

```
Traceback (most recent call last):
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1138, in testComputeNodeCastDoesntRequirePython
    node = CastChecker()
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1129, in __init__
    self["out"] = Gaffer.IntPlug( direction = Gaffer.Plug.Direction.Out )
IECore.Exception: Traceback (most recent call last):
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1133, in isInstanceOf
    raise Exception( "Cast to ComputeNode should not require Python" )
Exception: Cast to ComputeNode should not require Python
```

How did that happen? Well, a Job contains a BackgroundTask, and `self["out"]` is a graph edit, and graph edits cancel background tasks, and the cancellation code does a `runTimeCast()` on the subject of the edit. So if a BackgroundTask exists thanks to EventLoop, then the cancellation code runs and does a `runTimeCast()` on `CastChecker`, which throws. It really would be better if EventLoop scoped its UIThreadCallHandler between `start()` and `stop()`.
We were already doing this in a few specific tests but are now doing it for all. Our intention is still that our Python code should never make circular references between Python objects (and we have checks for that in `GafferUITest.TestCase.tearDown()` among other places), so we're not calling `gc.collect()`. But we can't do anything about the unfortunate circular references created by RefCountedBinding itself, and it's preferable to have those broken predictably at the end of the test rather than unpredictably when a later test runs.

The immediate motivation for doing this is to destroy the `LocalDispatcher` instances and their associated `JobPool` in `LocalDispatcherTest`, before we run any other tests. If we don't, then the completed BackgroundTasks in the jobs cause a failure in `BoxTest.testComputeNodeCastDoesntRequirePython`. See previous commit for more details.
If we use the default pool, then the internal jobs show up confusingly in the LocalJobs editor. This wasn't a problem before as completed jobs were being removed immediately.
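If I'm reading this right, the fix amounts to giving the internal dispatcher its own pool, along the lines of the sketch below. The `jobPool` constructor argument and `JobPool()` usage are assumptions based on the surrounding commits, not confirmed API.

```python
import GafferDispatch

# A private pool keeps these internal jobs out of the default pool, and
# therefore out of the LocalJobs editor.
__internalJobPool = GafferDispatch.LocalDispatcher.JobPool()

def __internalDispatcher() :

    return GafferDispatch.LocalDispatcher( jobPool = __internalJobPool )
```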
Sorry, meant to make this PR to my own repo for testing - closing.