-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Merge: work with sparse #2305
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2305 +/- ##
==========================================
- Coverage 73.33% 73.28% -0.05%
==========================================
Files 317 317
Lines 55447 55474 +27
==========================================
- Hits 40662 40656 -6
- Misses 14785 14818 +33 |
Waiting #2286 because |
#2286 is merged. Are there some changes still needed? Can we do something about radon? |
@nikicc : Radon does not complain anymore because this code now uses universal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orange/widgets/data/owmergedata.py
Outdated
(indices[1], arr2, right)): | ||
known = ind != -1 | ||
if sum(known): | ||
to_change[known] = lookup[ind[known]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not sufficient for sparse data. For dense, we initially set all values as missing and hence it's ok to set only known values. For sparse, we initially set all values as zeros. Hence we should, along setting known values, also set unknown values as nans.
Orange/widgets/data/owmergedata.py
Outdated
tpe = object if object in (left.dtype, right.dtype) else left.dtype | ||
left_width, right_width = left.shape[1], right.shape[1] | ||
arr = np.full((indices.shape[1], left_width + right_width), np.nan, tpe) | ||
|
||
sparse = sp.issparse(left) or sp.issparse(right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boolean flags should begin with is_
. https://stackoverflow.com/questions/1227998/naming-conventions-what-to-name-a-boolean-variable
is_sparse = sp.issparse(left) or sp.issparse(right)
if is_sparse:
...
Orange/widgets/data/owmergedata.py
Outdated
return np.full((height, width), np.nan, dtype=dtype) | ||
else: | ||
return sp.csr_matrix((height, width), dtype=dtype) | ||
|
||
tpe = object if object in (left.dtype, right.dtype) else left.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.find_common_type([left.dtype, right.dtype])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. In that case all tests fail.
Orange/widgets/data/owmergedata.py
Outdated
if not sparse: | ||
return np.full((height, width), np.nan, dtype=dtype) | ||
else: | ||
return sp.csr_matrix((height, width), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emits a warning in the for loop below:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Orange/widgets/data/owmergedata.py
Outdated
if not is_sparse: | ||
known = ind != -1 | ||
if sum(known): | ||
to_change[known] = lookup[ind[known]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re lint, does this work?
def _join_array_by_indices(left, right, indices, string_cols=None):
def prepare(arr, inds):
try:
newarr = arr[inds]
except IndexError:
newarr = np.full_like(arr, np.nan)
else:
newarr[inds == -1] = np.full(arr.shape[1], np.nan)
return newarr
res = hstack((prepare(left, indices[0]),
prepare(right, indices[1])))
if string_cols:
res[:, string_cols] = "" # IDK what this does
return res
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this does not work.
Orange/widgets/data/owmergedata.py
Outdated
arr1[:, [sc for sc in string_cols if sc < left_width]] = "" | ||
arr2[:, [sc - left_width for sc in string_cols if sc >= left_width]] = "" | ||
|
||
for ind, to_change, lookup in ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works!
def _join_array_by_indices(left, right, indices, string_cols=None):
def prepare(arr, inds, str_cols):
try:
newarr = arr[inds]
except IndexError:
newarr = np.full_like(arr, np.nan)
else:
empty = np.full(arr.shape[1], np.nan)
if str_cols:
assert arr.dtype == object
empty = empty.astype(object)
empty[str_cols] = ''
newarr[inds == -1] = empty
return newarr
left_width = left.shape[1]
str_left = [i for i in string_cols or () if i < left_width]
str_right = [i - left_width for i in string_cols or () if i >= left_width]
res = hstack((prepare(left, indices[0], str_left),
prepare(right, indices[1], str_right)))
return res
Well-done, @kernc ! |
@nikicc Could you please check and merge if this is done? :) |
Issue
Fixes #2155.
Description of changes
Includes