TaskCollaboration improvements for review #5493
Conversation
I've now taken a look through all this, and I haven't been able to spot anything wrong.
It sounds like you either need to back out the isolated task group stuff though, or do the slightly goofy but probably effective "only use the isolated task group 90% of the time" approach?
```
#  / | \
# 9 10 11
#  \ | /
#   12      (for width 3 and depth 3)
```
As per offline conversation, I had originally just thought this diagram was upside down, but you corrected me that the numbers just go in a different order in this test.
At a basic level, it might help to label the root computation somehow ... since I don't think there is any global convention for whether the root goes at the top or bottom.
And actually, this is a bit more confusing, because as far as I can tell, the root computation is implicit, not given an explicit key? I can't see where testCollaborationTaskDistribution creates a root named "0", or where testFanOutGatherPerformance creates a root named "12" - it seems like the dependency dict starts one level away? So it's a bit deceptive to label it as "12" in one case, and "0" in the other? Could it just be in the diagram?
> At a basic level, it might help to label the root computation somehow ... since I don't think there is any global convention for whether the root goes at the top or bottom.
I was just following the Gaffer convention, where pretty much all graphs flow top-to-bottom, but I've pushed a fixup containing a comment to make this explicit : 91ee00c.
> I can't see where testCollaborationTaskDistribution creates a root named "0", or where testFanOutGatherPerformance creates a root named "12" - it seems like the dependency dict starts one level away?
In all cases, the root process is passed as an argument to `runTestProcess()`.
```cpp
	return result;
}

void runTestProcess( const Plug *plug, int expectedResult, dict dependenciesDict )
```
As a name, `GafferTest.runTestProcess` is about the least specific name I can think of ... maybe not a problem, but if I saw this name in isolation, I would have no idea what it's used for.
Do you have any suggestions for an improvement? It definitely does run a `TestProcess`, so I guess the problem is that it could be running some non-specific "test process". Would the solution be to rename `TestProcess` to something else? We do have a load of `TestFoo` classes which are instances of `Foo` designed only for testing `Foos` though, so the name seems reasonable to me.
I guess the reason it feels odd to me is that, even aside from "Process" being a generic name, this feels like it's for testing a specific aspect of Processes, and it's running a network of Processes, not a single one.
Something like `runTestDependencyProcesses` would feel clearer to me, but I don't feel strongly if you're happy with this name.
```cpp
}
else
{
	result = Process::acquireCollaborativeResult<HashProcess>( cacheKey, p, plug, computeNode );
```
This logic looks correct, but for some reason it felt like I had to read it more times than I should have to follow it.
I think I would find this a fair bit clearer ( though maybe the duplicated line is bad? ) :

```cpp
if( forceMonitoring )
{
	return Process::acquireCollaborativeResult<HashProcess>( cacheKey, p, plug, computeNode );
}

std::optional<IECore::MurmurHash> cachedValue = g_cache.getIfCached( cacheKey );
if( cachedValue )
{
	return *cachedValue;
}
else
{
	return Process::acquireCollaborativeResult<HashProcess>( cacheKey, p, plug, computeNode );
}
```

( This might be totally subjective, feel free to ignore it ... it just stuck out to me as something I was tripping over. )
I do appreciate more you wanting to just rip out the `forceMonitoring` stuff now ... though I do still feel like it has enough potential usefulness that it might be premature to do that now.
Yeah, this does feel a bit fiddly to me too. I don't think your suggestion quite works as-is though, because we don't want to `return` but rather assign `result =`. So we'd require another level of nesting, with the cache lookup in an `else` block, which would reduce its clarity a bit.
I did do what I did fairly consciously, because I noticed a pre-existing bug where we were failing to perform any kind of TaskCollaboration when forcing monitoring. So I felt the slightly more awkward construction was worth it, to preclude the possibility of that sort of thing happening again.
So on balance, I'd like to keep this as it is.
> I do appreciate more you wanting to just rip out the forceMonitoring stuff now

It's not the `forceMonitoring` stuff that I want to rip out - that's essential. What I'm fairly keen to rip out is the `HashCacheMode` stuff, but I thought that should at least wait until a new major version.
Good points. It makes sense that this logic is actually a bit complex - it probably should give a bit more pause when reading it than most two-`if` blocks would.
```cpp
	Default,
	/// Deprecated synonym for Default. Will be removed in a future
	/// release.
	Legacy = Default
```
`Default` seems like a reasonable name for how things currently function ... but if, in the future, we have better support for collaborations joining processes that are already working, then it's a bit unclear whether `Default` would be the best name, rather than something more explicit like `Lightweight`?
It feels a bit weird to suggest that "unlikely to be required from multiple threads concurrently" would be true by default?
Yeah, I do quite like the idea of something like `Lightweight`, but for now I was trying to stick more closely to the existing names, which are about what the cache does, rather than being explicitly about what the process does. If I can get this alternative scheduler off the ground then I think that might be the time for a wholesale renaming of the options.

> It feels a bit weird to suggest that "unlikely to be required from multiple threads concurrently" would be true by default?

It is the assumption we make though, so I think it should be advertised?
If we're not trying to anticipate future changes, these names definitely look fine for now.
I've backed it out for now - I reverted it in 2c845a4 so that we still have the history for it if we want to revisit later. It might be worth revisiting with the 10% trick, but the next step for me is to do some performance testing on production scenes, and for that I think it'll be nice to know that I'm just measuring any overhead of the new
force-pushed from 2c845a4 to e0bb9d6
Rebased to fix merge conflict - everything else is the same.
A variety of tests of the current state of this PR run on a random selection of IE production shots hasn't raised any red flags, so it feels like we're on the path to getting this merged... 🚀
force-pushed from e0bb9d6 to 684a5b9
This will be used to replace the TaskMutex collaboration mechanism currently used via the LRUCaches in ValuePlug. The underlying collaboration mechanism is still a `task_arena/task_group` combo, but this arrangement has a few key benefits :

- Moving the collaboration out from the guts of the LRUCache gives us much more control over how it is performed, particularly for collaborating threads.
- We are now tracking interprocess dependencies completely ourselves, so we are no longer vulnerable to the deadlocks that TaskMutex couldn't prevent.
- We are no longer using `task_arena_observer`, which gives us much greater freedom to change the implementation in future.
All these tests would hang if ValuePlug wasn't now using `Process::acquireCollaborativeResult()`.
Our rationale is as follows :

- Standard has no right to exist - if a compute is slow then it should be multithreaded and TaskCollaboration should be used.
- TaskIsolation has no right to exist either. If a compute is heavy enough to use TBB then it's essential that we can collaborate on it. And since we're no longer using a `task_arena` to perform collaboration, we have less concern about potential overhead.
- Legacy is actually a good policy for lightweight computes. True, it can mean redundant identical computes being performed concurrently. But this is better than making one thread wait for the other, because it allows both threads to collaborate on an upstream TaskCollaboration.
- Changing the behaviour of the Standard policy would be a breaking change, so Default is introduced as the new "blessed" name for Legacy.
This also disables the test for `Standard` and `TaskIsolation` policies because they are now the same as `TaskCollaboration`. This does highlight the fact that we now have the possibility of new stalls in the UI, but only for cache policies that aren't ideal anyway. We need to focus our energies into fixing the shared cancellation for TaskCollaboration, and making as many processes as possible use TaskCollaboration.
This provides some substantial performance improvements :

```
- testFanOutPerformance (GafferTest.CollectTest.CollectTest) : was 0.83s now 0.44s (47% reduction)
- testLoop (GafferTest.CollectTest.CollectTest) : was 1.64s now 0.09s (95% reduction)
- testDeepTreePerformance (GafferTest.ProcessTest.ProcessTest) : was 15.86s now 0.04s (100% reduction)
```

Full disclosure : it also shows a roughly 30% increase in runtime for `ProcessTest.testCollaborationTransitionPerformance`, and I'm not sure why. But that's small beans compared to the orders of magnitude in `testLoop` and `testDeepTreePerformance`.

Note that `isolated_task_group` is actually a TBB preview feature, so technically could be taken away from us in the future. But it is still present in the latest TBB release (2021.10.0) and is still in the latest source code at https://github.com/oneapi-src/oneTBB/blob/master/include/oneapi/tbb/task_group.h. It is also [used in Embree](https://github.com/embree/embree/blob/master/man/man3/rtcCommitScene.4embree4#L34-L38) where the documentation cautions against the alternative of using a `task_arena` "due to its high runtime overhead". So I think we should be OK. If it ever disappears, alternatives include :

- Rolling our own, which should be doable provided that TBB continues to expose `isolate_within_arena()`, or might also be doable using public API by constructing a regular `task_group` from inside `this_task_arena::isolate()`.
- Using `tbb::collaborative_call_once()`, which is not available in our current TBB version but is available in future versions.
This was a poor man's deadlock-avoidance scheme which is no longer needed now that ValuePlug is using `Process::acquireCollaborativeResult()`.
Now that WorkerRead recursion has been removed, writability is determined entirely by AcquireMode, so we can replace conditionals with asserts.
This reverts commit 148291d.
force-pushed from 684a5b9 to a62f892
Rebased to squash fixups in and get Changes.md in order. Should be ready to merge.
Following on from #5489, this presents a solution to issue #4978, in a form ready for review. As discussed in the meeting today, the question of whether to use `task_arena` or `isolated_task_group` is still an open one, with each working better in some scenarios, and in truth, both being a poor fit for what we need. But that shouldn't prevent a thorough review of the rest of the collaboration framework, which is largely independent of the exact method used to launch tasks.