Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local dispatcher improvements fixes #5569

Closed

Conversation

johnhaddon
Copy link
Member

Generally describe what this PR will do, and why it is needed

  • List specific new features and changes to project components

Related issues

  • List any Issues this PR addresses or solves

Dependencies

  • List any other unmerged PRs that this PR depends on

Breaking changes

  • List any breaking API/ABI changes (and apply the pr-majorVersion label)

Checklist

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable.
  • I have tested my change(s) in the test suite, and added new test cases where necessary.
  • My code follows the Gaffer project's prevailing coding style and conventions.

This is basically the window that the "Execute/View Local Jobs" menu item used to show, but refactored into a dockable editor panel. It still leaves a lot to be desired.
This removes a fair bit of duplicate code. There were quite a few weird little differences in the reporting logic in particular, which I could see no reason for, so they are gone. The two distinctions between foreground and background dispatches that remain are :

- Background execution runs tasks in a subprocess, and polls them while checking for cancellation.
- Background execution suppresses exceptions (while still reporting them via the job log), while foreground execution propagates them.
- This appears to be completely unused.
- A `jobStatusChangedSignal()` would be more generally useful.
- But signals about status changing would make more sense on the Jobs themselves. We'll deal with this in a later commit.
We were storing a status per batch, but in fact the only status we ever
reported to the outside world (via `failed()` and `killed()`) was the
status of the root batch. And to get the root batch into a failed state
we were going via a convoluted route whereby `__reportFailed()` would
mark the current batch as failed, and call `JobPool._failed()` which would
in turn call back to `Job._fail()` to mark the root batch as failed. Much
easier to manage a single `self.__status` directly, and limit the per-batch
data to a simple flag that tells us if we've executed it yet or not.

At the same time, we explicitly store the current batch instead of having
to do a search for it every time. Note that we're assuming that attribute assignment (for `self.__currentBatch`) is atomic (it is for CPython) rather than using a mutex to synchronize access between the background and UI threads.
Instead, just keep failed jobs in the main jobs list. This prevents them
from jumping around in the LocalJobs UI, and simplifies the API too.
Also move icons into their own section, and spruce up the other icons a little bit.
Otherwise you get no opportunity to see the job logs in the LocalJobs
panel, and for quick jobs its not even clear if the job happened. Also
add a `Job.status()` method since there was no public way of determining
running/complete status before (it was assumed that anything in the list
that wasn't killed or failed was running).

Remove `Job.killed()` and `Job.failed()` since they are superceded by
`Job.status()`.
This gives us a more natural cancellation mechanism, and means we're
using a standard TBB worker thread and therefore are not creating unnecessary thread locals.

Since cancellation operates via exceptions, it's more natural to also allow error exceptions to propagate out of `__executeWalk()` rather than using a `False` return value. This also means that all management of `self.__status` can be consolidated into one place in `__executeInternal()`.
This should not have been part of the public API - it only exists so that `doDispatch()` can initiate execution _after_ adding the job to the pool. Nobody needs to call it again after that, and definitely not client code.
And make `removeJob()` public since it is called by the UI, and add public `addJob()` method for symmetry.
The Log tab is more useful.
We'll be updating using our signal handlers anyway.
And refactor the other updates to make it clearer what is being updated and why.
Before they were only updated when selecting a job, so you didn't get any streaming feedback as to what a job was doing.
- Prefix batch messages with the node name, not the job name
- Include execution time in batch completion message
- Output stack traces line-by-line to work better with MessagesWidget.Role.Log.
- Add debug message containing subprocess command lines.
- Omit batch messages for root batch
The LocalJobs Properties tab now contains only static properties of the job, so it is no longer a bug that we're not updating it dynamically when the current batch changes.
- Move just below job listing, since that's what they operate on.
- Remove draggable splittable handle separating buttons from widget above.
- Reduce size.
- Update appropriately when job status changes.
Without this, the BackgroundTask and `outputHandler` threads would continue running while Python was torn down around them on the main thread. This led to crashes when trying to release the GIL from the BackgroundTask.

The Changes.md entry doesn't refer to crashes, because the behaviour in 1.3 was different due to us using a Python thread instead of a BackgroundTask. In 1.3 Python would wait patiently for the Job thread to finish, even if that meant waiting a very very very long time. _Unless_ the user hit `Ctrl+C` again, in which case it would forcibly quit, leaving behind zombie task processes.

I was in two minds as to whether or not to apply the exit handler to all JobPools or just the default one. Since we don't have any non-default pools outside of the unit tests I don't really have any information to base a decision on, so went with only the default pool for now.
Strictly speaking, no such View should ever exist, because the `in` plug is added in the View constructor. But `ImageView::insertConverter()` currently replaces it with another plug, meaning that it is missing when the replacement is added, which is when `cancelAffectedTasks()` is called. This was causing a crash in `GafferImageUITest.ImageViewTest.testDeriving`. The problem had gone unnoticed until now because previously there were no BackgroundTasks to cancel so we never got as far as the bad check. But now that the LocalDispatcher uses a BackgroundTask, some completed tasks lay dormant waiting for garbage collection after LocalDispatcherTest has run, and we hit the null dereference.
This is probably neither here nor there in actual usage in the UI, but it is
important for running the unit tests when the UI module has been imported. That's
because `GafferUI.EventLoop` unconditionally installs a UIThreadCallHandler that
just buffers the calls until an event loop is started. That prolonged Job lifetimes
way outside of their associated tests, eventually causing this spooky action at a
distance :

```
Traceback (most recent call last):
File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1138, in testComputeNodeCastDoesntRequirePython
    node = CastChecker()
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1129, in __init__
    self["out"] = Gaffer.IntPlug( direction = Gaffer.Plug.Direction.Out )
IECore.Exception: Traceback (most recent call last):
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1133, in isInstanceOf
    raise Exception( "Cast to ComputeNode should not require Python" )
Exception: Cast to ComputeNode should not require Python
```

How did that happen? Well, a Job contains a BackgroundTask, and `self["out"]` is
a graph edit, and graph edits cancel background tasks, and the cancellation code does a `runTimeCast()`
on the subject of the edit. So if a BackgroundTask exists thanks to EventLoop, then
the cancellation code runs and does a `runTimeCast()` on `CastChecker`, which throws.

It really would be better if EventLoop scoped its UIThreadCallHandler between `start()` and `stop()`.
We were already doing this in a few specific tests but are now doing it for all. Our intention is still that our Python code should never make circular references between Python objects (and we have checks for that in `GafferUITest.TestCase.tearDown()` among other places), so we're not calling `gc.collect()`. But we can't do anything about the unfortunate circular references created by RefCountedBinding itself, and it's preferable to have those broken predictably at the end of the test rather than unpredictably when a later test runs.

The immediate motivation for doing this is to destroy the `LocalDispatcher` instances and their associated `JobPool` in `LocalDispatcherTest`, before we run any other tests. If we don't, then the completed BackgroundTasks in the jobs cause a failure in `BoxTest.testComputeNodeCastDoesntRequirePython`. See previous commit for more details.
If we use the default pool, then the internal jobs show up confusingly in the LocalJobs editor. This wasn't a problem before as completed jobs were being removed immediately.
@johnhaddon
Copy link
Member Author

Sorry, meant to make this PR to my own repo for testing - closing.

@johnhaddon johnhaddon closed this Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant