Task Fusion #113
base: branch-24.03
Changes from 46 commits
TODO: all changes to this file will be reverted to what's in nv-legate.
All changes to this file will be reverted before merge.
I'm not a fan of this design. The fusion task shouldn't be any different from a normal Legate task. I believe the fusion metadata can be passed as a bunch of scalar arguments to the fusion task. If we do that, then we don't need to make every place in the core handle fusion. Making the default path aware of task fusion doesn't sound like a good design.
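As a rough illustration of this proposal, the metadata could be flattened into the task's ordinary scalar arguments. This is only a sketch: `add_scalar_arg` is a stand-in loosely modeled on Legate's launcher interface, and the list names are invented for illustration.

```python
# Sketch only: flatten the fusion metadata into plain scalar arguments
# so the fused task is launched like any other Legate task. The list
# names (op_ids, input_starts, ...) and `add_scalar_arg` are
# hypothetical stand-ins, not the actual legate.core API.
def pack_fusion_metadata_as_scalars(launcher, op_ids, input_starts,
                                    output_starts, scalar_starts):
    launcher.add_scalar_arg(len(op_ids))  # number of fused sub-tasks
    for lst in (op_ids, input_starts, output_starts, scalar_starts):
        for value in lst:
            launcher.add_scalar_arg(value)
```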
This is definitely worth discussing (I pondered this myself): I did think about making them scalars, but I wasn't sure whether there were other implications or downsides to storing all the metadata that way (e.g., is there a maximum number of scalars we can give a task?), since we'd be sending quite a few scalars to the task, or rather multiple lists, each of which can have more than 50 scalars; in four of these five lists, the number of scalars equals the length of the fused op, even with potential deduplication of stores.
If we did this, we'd be mixing "metadata" scalars (which are really a bunch of lists) into the task's `scalars` array alongside the actual scalars used by the sub-tasks, which doesn't really adhere to the abstraction of having a dedicated `scalars` array in the `context` data structure. I therefore chose to serialize the metadata as Legate task argument data. As you stated, the downside is that the core/default path has to be "fusion aware".
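For contrast, a minimal sketch of the serialized route described above, packing each metadata list as a length-prefixed run of 32-bit ints in one contiguous buffer; the layout is illustrative, not the actual wire format.

```python
import struct

def serialize_fusion_metadata(op_ids, input_starts,
                              output_starts, scalar_starts):
    # Each list is packed as: int32 length, then the int32 values.
    # The fused task's deserializer can then walk the buffer in one
    # pass without touching the regular `scalars` array.
    buf = bytearray()
    for lst in (op_ids, input_starts, output_starts, scalar_starts):
        buf += struct.pack(f"=i{len(lst)}i", len(lst), *lst)
    return bytes(buf)
```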
But if you don't see any potential downsides to making them all scalars, then yes, it could totally be implemented as scalar-based. We'd mostly just have to move the book-keeping logic from the deserializer into the fused Legate task.
I feel what's actually missing here is the ability to pass an arbitrary chunk of data (probably a `memoryview` object in Python) to a task, which is possible in theory, so we should probably extend the buffer builder interface to take an arbitrary memoryview object and pack it as a single contiguous chunk. The assumption here is that the task knows how to interpret those bytes. It shouldn't be too difficult for me to add it.
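For concreteness, a minimal sketch of what that extension could look like, assuming a toy buffer builder along the lines of (but not identical to) legate.core's `BufferBuilder`; `pack_buffer` is the hypothetical new method.

```python
import struct

class BufferBuilder:
    """Minimal stand-in for a buffer builder; pack_buffer is the
    hypothetical extension discussed above."""

    def __init__(self):
        self.fmt = ["="]   # native byte order, standard sizes
        self.args = []

    def pack_32bit_int(self, value):
        self.fmt.append("i")
        self.args.append(value)

    def pack_buffer(self, buf):
        # Pack an arbitrary memoryview as one contiguous,
        # length-prefixed chunk; the receiving task is assumed to
        # know how to interpret the bytes.
        raw = memoryview(buf).tobytes()
        self.pack_32bit_int(len(raw))
        self.fmt.append(f"{len(raw)}s")
        self.args.append(raw)

    def get_string(self):
        return struct.pack("".join(self.fmt), *self.args)
```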
This assertion is being removed for performance reasons. The new method `pack_32bit_int_arr` above is a much faster way of packing a set of 32-bit ints (rather than appending them individually), but it breaks the invariant that `len(fmtstr) == len(self.args) + 1`.
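In terms of the toy builder sketched a few comments up, the batched method amounts to something like the following (a sketch, not the exact implementation), which also shows why the invariant can no longer hold.

```python
def pack_32bit_int_arr(self, ints):
    # A single "<N>i" format token now covers the whole array, while
    # len(ints) values land in self.args, so the old invariant
    # len(fmtstr) == len(self.args) + 1 breaks.
    self.fmt.append(f"{len(ints)}i")
    self.args.extend(ints)
```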
I'll probably generalize this code, as I'm sure this utility isn't just for 32-bit integers.
Makes sense; I found it faster to batch the packing of ints this way.
This one (i.e., the hack) is interesting, and it will be removed. When creating a fused operation, if one reference to a store has a partition with tag=0, but another reference to the same store with the same partition has tag=1 within the same fused op, the runtime will complain that these two partitions are aliased, even though theoretically nothing is wrong. I will likely change this to ignore tags when coalescing within a fused op.
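A hedged sketch of that tag-agnostic coalescing, with `(store_id, part_id, tag)` triples standing in for the runtime's real requirement objects:

```python
from collections import defaultdict

def coalesce_requirements(reqs):
    # Key on (store, partition) only, so two references to the same
    # store and partition with different tags are merged instead of
    # being flagged as aliased within one fused op.
    merged = defaultdict(set)
    for store_id, part_id, tag in reqs:
        merged[(store_id, part_id)].add(tag)
    return merged
```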
This sounds like an auto-partitioner issue: the auto-parallelizer marks one partition as the "key" partition used to distribute tasks across processors, and it conveys that information in the tag. With task fusion, the auto-parallelizer shouldn't pick a tag for a particular task but for the whole window, so that the partition carries the same tag throughout the window.
Currently, a hash method is used in `solver.py` to map a store (or rather its constraints) to its partition. For a given store (among the inputs, outputs, or reductions), the hash ingests its task's absolute id, which is problematic for fusion: the fused task will have a different absolute id than the unfused tasks. To take care of this, we set `input_parts` to `self._unfused_input_parts`, so we can hash into the original "unfused" partitions. Otherwise, the method for hashing into a store's partition will complain that the partition for the current store doesn't exist for the fused op.
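A minimal sketch of the lookup problem and the workaround, assuming a partition table keyed by the op's absolute id; the class and field names are illustrative, not the actual `solver.py` code.

```python
class Strategy:
    def __init__(self, input_parts, unfused_input_parts=None):
        # input_parts is keyed on (task_abs_id, store). For a fused op
        # we swap in the per-op "unfused" table so hashes built from
        # the original absolute ids still resolve.
        self._input_parts = unfused_input_parts or input_parts

    def partition_for(self, task_abs_id, store):
        # Raises KeyError if the store was never partitioned under
        # this absolute id.
        return self._input_parts[(task_abs_id, store)]
```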
There are two ways of generating the set of partitions for a fused op:
- One is to create the fused op and then repartition all of its stores (even though they were already partitioned in their unfused form).
- The other is to build a `superstrategy` out of all the individual "unfused" partitions.
The second is much cheaper (partitioning the stores twice has significant, tangible overhead) and is what the code currently uses.
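A sketch of that cheaper route, merging the per-op strategies instead of partitioning the fused op's stores a second time; `store_partitions` is a hypothetical name for each strategy's store-to-partition map.

```python
def build_super_strategy(unfused_strategies):
    # Union the partition maps already computed for each unfused op,
    # avoiding a second partitioning pass over the same stores.
    fused_parts = {}
    for strategy in unfused_strategies:
        fused_parts.update(strategy.store_partitions)
    return fused_parts
```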