-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instancer Capsule #5453
Instancer Capsule #5453
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Daniel! This is all feeling pretty good to me, and its pleasing that we're not really having to restructure anything particularly radically to achieve it.
Was I right to remove prototypeChildNamesPlug? I think the answer is definitely yes,
I concur.
though this is currently causing some small performance regressions ... I don't think they're actually important
I agree here too. I think losing a tiny bit of performance for the non-encapsulated case in return for huge wins for the encapsulated case is a good trade-off. The non-encapsulated case isn't the one people choose when performance is their number one priority.
but I will probably investigate further, just because it's deeply bothering that I don't understand why some of the tests have gotten slower
Seems worth a quick look, but the way you've structured things currently seems pretty clean and simple, and I'd be reluctant to make that much more complex just to claw back a few percent in the case where the user has opted out of the higher performance approach.
and how I should go about testing this
I'd definitely recommend looking at using CapturingRenderer, as it provides a pretty direct way of checking that you're getting what you want. We could perhaps also give CapturingRenderer a custom constructor argument to get it to pretend to be the Arnold renderer, so we could test the instancer command/object.
8003b22
to
d0dec40
Compare
Well, this has taken way too long, but I'm finally feeling like I might have gotten some decent test coverage and hunted down all the weird corner case bugs from before. The code I would look at with the most suspicion are:
I did do a quick investigation to figure out why without the Arnold specific code path, ArnoldRendererTest.testInstancerEncapsulatePerf is showing a 20% slowdown ( though with the special code path it goes 30X faster ). Turns out, the slowdown is because it is now completely single threaded - it is taking 20% more time on one thread than the previous code took while fully loading 16 threads. Comparing single thread evaluation, it is 10X faster without the Arnold specific code, and 150X faster with the special case. We probably should be able to add some multithreading here to improve performance when the render translation is bottlenecked on a single instancer. |
The tests aren't running because I had originally targetted main which resulted in a merge conflict. I retargeted to 1.3, but now can't see any way to run the tests - all the relevant ones passed locally however. |
73d2f0d
to
ea4ba88
Compare
Thanks Daniel, I've been through and checked and resolved all the comments from my previous review - thanks for addessing them. I haven't rereviewed the latest code as it looks like there are a few problems with it : CI is showing an Instancer-related crash in the Linux debug build, and a compilation error in the Windows build :
I also ran into a crash when doing a quick Arnold performance test here. Hopefully you can reproduce it with the file below and
The output I get is this :
While taking care of all that, could you squash in the fixups and reverts so I can review in the form we'll merge it (I think I've forgotten anything I might have known from the previous review anyway) please? Cheers... |
ea4ba88
to
a2d6581
Compare
Fixed one silly mistake and squashed. Tests are all passing locally, lets see what CI thinks ... |
a2d6581
to
623e5a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Daniel! As discussed in a side channel just now, I'm posting my half-complete review to give you a few simple things to deal with today, and a few performance measurements from my end to consider. I'll return to review the trickier stuff in Instancer.cpp
tomorrow, hopefully with a clearer head.
Cheers...
John
f640e63
to
e77ebb6
Compare
I think I've addressed all feedback. The one other significant change here - since you said you hadn't really looked at the changes to Instancer.cpp much, I've gone back and split the commit to pretend that the changes to remove __prototypeChildNames happened before InstancerCapsule was added. This change was a pre-requisite because for InstancerCapsule, we wanted the EngineData to contain all the information needed, rather than having InstancerCapsule need to evaluate multiple plugs. This should make it easier to review, and hopefully helps clarify why the performance for the non-encapsulated case might have changed ( if it's improved for you, that's great ). The remaining big concern is why we're not seeing the performance improvement scale properly with threads on your computer. |
That's great, thanks! When I first saw that huge commit I was wondering if it could be split like that to make it more digestible. |
This knocks around 30% off the runtime of `ArnoldRenderTest.testInstancerEncapsulatePerf` in PR GafferHQ#5453, testing with 64 threads.
This knocks around 30% off the runtime of `ArnoldRenderTest.testInstancerEncapsulatePerf` in PR GafferHQ#5453, testing with 64 threads.
This knocks around 30% off the runtime of `ArnoldRenderTest.testInstancerEncapsulatePerf` in PR GafferHQ#5453, testing with 64 threads.
e77ebb6
to
bfe43ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers Daniel - comments for remaining commits inline. Note that I haven't commented on your last commit as you indicated it was hacky work-in-progress.
This knocks around 30% off the runtime of `ArnoldRenderTest.testInstancerEncapsulatePerf` in PR GafferHQ#5453, testing with 64 threads.
9860216
to
543f020
Compare
OK, I think I've addressed all feedback - it is a bit of a mixture of separate commits and squashing in place ... some of the comments were about it not making sense how things were spread through the commits, so it wouldn't have helped much to add more fix commits to the picture ... and in the course of this rearranging, I've actually altered the history so that "omitDuplicateIds" comes before "Optimize encapsulated" ... I had to squash them together to address some feedback, and when splitting them, the other order made more sense. If you want me to, I can go through and annotate which comment is fixed where, because this is a bit more confusing currently than sometimes. The main commit that would probably be good to have you look at now is the last one, "Parallelize prototype creation when encapsulated". This requires some locking that is more complicated than I would have hoped, but I don't see a way to make it simpler without adding more contention. It is definitely useful, producing a 3X speedup on The remaining thing I haven't fixed is all the bugs introduced by outputting objects directly instead of capsules. |
I've fixed the most obvious issue with skipping internal capsules in favour of naked objects when the prototype has no children. I think there are still issues with light linking and such, but I'll talk to you tomorrow about what the requirements are here. |
adc3c34
to
6c0e2c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Daniel! Hopefully this is the final round of comments - mostly trivialities, but hopefully I've spotted a couple of more worthwhile things as well.
As I mentioned offline yesterday, I'm seeing really very substantial IPR startup improvements with this in production scenes, so it has most definitely been worth the effort. Other than the copy()
overhead I mentioned, the main remaining stuff in a profile was AiNode()
and a bunch of our ArnoldRenderer
internals - I think we can probably squeeze out some more improvements from that once this is merged.
I've also done a bit of interactive testing with side-by-side encapsulated and non-encapsulated copies of an instancer in IPR today, and so far everything has behaved perfectly (once I'd made the fix in #5532). I'm going to keep prodding at that a bit more, but it's looking like we're there now.
Thanks!
John
@@ -2528,82 +2529,112 @@ void Instancer::PrototypeScope::setPrototype( const EngineData *engine, const Sc | |||
namespace | |||
{ | |||
|
|||
struct Prototype | |||
// It shouldn't be necessary for this to be refcounted - but LRUCache is set up to make it impossible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up against something similar to this the other day, and although I decided not to dice with it, I think it might actually be possible to use Prototype *
as the value type, and then use a RemovalCallback
to do delete value
. Let's not worry about it until after we get this merged, but it might be worth playing with to see if it offers any benefit afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ( thought I ) had some spare time today, so I took a swing at this. It resulted in a 20% speedup in testInstancerManyPrototypesEncapsulatePerf ... this surprised me a bit - I had expected it to be more significant in testInstancerFewPrototypesEncapsulatePerf, since I was expecting the overhead in my code to play a greater part when the LRUCache just has one thing in it, and isn't working that hard. These results suggest that there's some overhead inside the LRUCache ... are we doing reference counting when the hash map resizes? I would have hoped that modern C++ could prevent that with an std::move
? Anyway, that might merit some investigation - but I think this faster code feels a bit cleaner to me anyway? Happy to omit this commit if it raises some uncertainty for you though that would prevent merging this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this explains what you're seeing, but I'd expect the LRUCache to perform better when queries are spread across a number of different keys. Internally it uses a binned map and mutex on each entry, so it sees much less contention if different threads are asking for different things than if they're all fighting over the one thing. Perhaps that fighting-over-one-thing drowns out any potential benefit from removing the ref-counting?
boost::python::objects::copy_class_object( | ||
type_id<Capsule>(), Instancer::instancerCapsuleTypeInfo() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "Selection Mask" menu in the viewer is broken now, always showing the highlighted "non default" icon, and refusing to tick the Capsules item. I suspect it's something to do with this, but haven't verified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll leave it to you to sort this one out, since you know the _SelectionMaskPlugValueWidget.
The cause of the break is pretty understandable. _SelectionMaskPlugValueWidget needs to find the types inheriting from a base class which are actually going to be rendered ... and it approximates this with _leafTypes, which returns any derived types which do not themselves have derived types. From a theoretical perspective, I think this is simply wrong ... just because something has a derived type doesn't mean it can't also be an object that gets rendered.
_SelectionMaskPlugValueWidget then compares sets of leaf types, but in the case of the hardcoded menu options, it compares leaf types to the hardcoded type ( rather than converting the hardcoded type to leaf types before comparing ). If this was fixed, then selecting the menu option would work ... but it would fail to actually mask Capsules that aren't InstancerCapsules.
So the trigger isn't the python binding stuff, but just the IE_CORE_DECLAREEXTENSIONOBJECT( InstancerCapsule, GafferScene::InstancerCapsuleTypeId, GafferScene::Capsule );
which declares something deriving from Capsule ... I don't think there's anything wrong with this though.
What I think a fix would look like is:
- rather than _leafTypes, it should use something like _descendantTypes, which includes a type itself and all descendants whether or not they are leaves. This would include some types which actually abstract rather than concrete renderable things ... I wouldn't think that would actually screw it up, but I might be missing something
- rather than
checked = currentTypes.issuperset( types )
it feels like it might be more consistent to dochecked = currentTypes.issuperset( _descendantTypes( types ) )
so that both sides of the comparison have been processed through _descendantTypes ... though I guess switching from _leafTypes to _descendantTypes means this isn't strictly necessary
Anyway, I can do the work if this sounds like the right approach, but I figured I check in before trying to change the meanings of some stuff I don't fully understand.
4b1ca2c
to
59ea28b
Compare
4 more fixup commits for you to take a look at. Should be just about there! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Daniel. Comments inline - I think we have a memory leak now. We can chat about this in the meeting today, but this has gone on long enough and we're close enough that I'm feeling inclined to drop "Use raw pointer" and "Switch to renderInstance" in favour of getting something out without further iteration and testing. Then we can follow up with additional improvements once we've actually got some results in people's hands.
src/GafferScene/Instancer.cpp
Outdated
@@ -2736,7 +2731,15 @@ void Instancer::InstancerCapsule::render( IECoreScenePreview::Renderer *renderer | |||
!hasAttributes | |||
); | |||
}, | |||
limits<size_t>::max() // Never evict, even if prototypes are all unique | |||
|
|||
std::numeric_limits<size_t>::max(), // Never evict, even if prototypes are all unique |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we should have used numeric_limits
in the original commit as well? What is limits
actually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not even sure what limits
was finding, other than that it must have been a large integer, since the tests passed. I only fixed here because it was a last minute fix to get CI going, and I was planning to squash anyway. I've now fixed it in the correct place.
src/GafferScene/Instancer.cpp
Outdated
@@ -2612,6 +2612,76 @@ struct Prototype | |||
} | |||
} | |||
|
|||
// This internal interface for rendering an instance from a prototype is mostly pretty clean, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like small_vector
might let us clean this up a bit? I suppose the other option would be thread_local
for our scratch space?
src/GafferScene/Instancer.cpp
Outdated
template< typename F > | ||
inline void renderInstance( | ||
const std::string &name, const std::vector<float> &transformSampleTimes, | ||
std::vector<M44f> &instanceTransforms, F &&attributeProcessor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it doesn't matter all that much since this is all internal, but this attributeProcessor
feels a bit awkward. I wonder if we could use the same "scratch space" idea from instanceTransforms
so that we pass CompoundObject &instanceAttributes
and then just insert into it from m_attributes
? At least then we only have one idiom for avoiding allocations? I don't think it will be as beneficial as it could be in this case as CompoundObject is using a map
internally, but I'd assume that if it was using unordered_map
we might avoid allocating at all after the first go round?
59ea28b
to
8fb87b9
Compare
OK, I've dropped the two troublesome commits, and added a hack for the selection mask thing. Hopefully that gets us to a reasonable place for merging. |
These tests are so sensitive that they fail constantly if run for fewer iterations, but are rather slow for normal runs.
It looks like this was a bug that somehow got copied into the test ( oops )
8fb87b9
to
3adc801
Compare
Thanks! I've tweaked the selection mask workaround to avoid adding the Merging! |
Based on #3998, the concept for this is that when an Instancer is set to "encapsulateInstanceGroups", it should be able to add special cases to communicate more directly with the render, improving performance.
My goal for today was to clear away enough of the construction sawdust for it to not be totally unreasonable for John to do a first pass review before his vacation.
Known issues:
Main questions:
Was I right to remove prototypeChildNamesPlug? I think the answer is definitely yes, though this is currently causing some small performance regressions ... I don't think they're actually important, but I will probably investigate further, just because it's deeply bothering that I don't understand why some of the tests have gotten slower ( testEngineDataPerf has gotten obviously slower for obvious reasons, but is not an important case in practice ). If there is going to be any plug that holds something that needs to be global to the Instancer, but doesn't get initialized when EngineData is initialized, then rather than prototypeChildNamesPlug, what would be nice to store separately is the computation of m_prototypeIndexRemap and m_instancesForPrototype which occur here:
gaffer/src/GafferScene/Instancer.cpp
Line 439 in 8003b22
... main time when this work would happen unnecessarily is when there are several transform segments, but the topology does not change within the shutter. Then rather than building separate mappings for each EngineData, you could just build a mapping for the first EngineData, and then compare the topology of each subsequent EngineData, and if they're equal, just use the mapping from the first one. Currently, in order to compare if they're identical, you'd have to get the EngineData for each sample, triggering all the mapping builds. Probably not worth actually doing, but I might try to benchmark the cost we're currently paying at least.
Where should InstancerCapsule live? Currently, EngineData is only forward-declared in Instancer.h ( and I think this is a good thing ), which means that in order to use EngineData, InstancerCapsule must be defined in Instancer.cpp. To achieve this, I've put InstancerCapsule inside Instancer, though I think things might be clearer if it was outside, and just declared as a friend in order to access what it needs. Would it be acceptable to have both GafferScene::Instancer and GafferScene::InstancerCapsule both defined in Instancer.cpp?
I guess those are my specific questions, but also just general feedback on how things are structured ... and how I should go about testing this ... it feels like there are a huge number of special cases.