-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation of np.unique(return_index=True) #1138
base: branch-24.03
Are you sure you want to change the base?
Initial implementation of np.unique(return_index=True) #1138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, please add more comments to the code, especially in places where you have changed some logic in a non-trivial manner. Another good place for comments is when you introduce (reasonable) code duplication -- a small comment saying why this is the case is good for future readers.
cunumeric/array.py
Outdated
@@ -4168,8 +4170,17 @@ def unique(self) -> ndarray: | |||
Multiple GPUs, Multiple CPUs | |||
|
|||
""" | |||
thunk = self._thunk.unique() | |||
return ndarray(shape=thunk.shape, thunk=thunk) | |||
thunk = self._thunk.unique(return_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] -- i wouldn't call this thunk
anymore, since when return_index=True
it's not a thunk anymore, it's a tuple of thunks.
|
||
result = None | ||
# Assuming legate core will always choose GPU variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]: Add a comment as to why the branches here are creating the thunk in different ways (i.e. different types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra nit -- comments should be complete sentences, so punctuation and periods please.
if return_index: | ||
returned_indices = self.runtime.create_unbound_thunk(ty.int64) | ||
if self.runtime.num_gpus > 0: | ||
task.add_output(returned_indices.base) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, comment this asymmetry between the CPU and GPU implementations
) | ||
if return_index: | ||
task = self.context.create_auto_task(CuNumericOpCode.UNZIP) | ||
task.add_input(result.base) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an unclear way of writing the code, adding result.base
as input, overwriting result, and then adding result.base
as output.
int64_t index; | ||
}; | ||
|
||
// Surprisingly it seems as though thrust can't figure out this comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might not need this if you either:
- implement comparison overloads on
ZippedIndex
- use a builtin type like
std::pair
instead, which has equality implemented already
|
||
if (return_index) { | ||
indices.first.destroy(); | ||
other_index.first.destroy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you be early releasing this above like other_piece
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wanted to destroy the buffers here largely for the sake of clean code to minimize the number of special blocks for the return_index case. Are you saying we should free these buffers closer to where we release my_piece and other_piece to potentially free up space before
my_piece.first = output.create_output_buffer<VAL, 1>(buf_size);
in line 182?
sizeof(int64_t) * my_piece.second, | ||
cudaMemcpyDeviceToDevice, | ||
stream)); | ||
merged_index.destroy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think it is safe for you to do a cudaMemcpyAsync
and then destroy merged_index
before the cudaMemcpyAsync
is guaranteed to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating here it is presently safe to do this in Legate as all operations on a given GPU use the same stream. Since this is a tree reduction I think it's probably best to destroy without manually synchronizing to avoid excess latency and duplicating data across multiple iterations of the reduction.
@@ -105,7 +105,8 @@ std::vector<StoreMapping> CuNumericMapper::store_mappings( | |||
} | |||
case CUNUMERIC_MATMUL: | |||
case CUNUMERIC_MATVECMUL: | |||
case CUNUMERIC_UNIQUE_REDUCE: { | |||
case CUNUMERIC_UNIQUE_REDUCE: | |||
case CUNUMERIC_UNZIP_INDICES: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on looking at the implementation of unzip_indices
below, you don't need any special mapping for it and should fall to the default case at the bottom here.
…age when unzipping indices
for more information, see https://pre-commit.ci
|
||
result = None | ||
# Assuming legate core will always choose GPU variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra nit -- comments should be complete sentences, so punctuation and periods please.
PR motivating nv-legate/legate#942
I also decided to create a special task for unzipping indices and values as opposed to modifying legate.core's Reduce implementation after speaking with @magnatelee .
@rohany