[FIX] Merge: work with sparse #2305

Merged

nikicc merged 2 commits into biolab:master from jerneju:sparse-merge

May 29, 2017

Contributor

jerneju commented May 10, 2017

Issue

Fixes #2155.

Description of changes

Includes

Code changes
Tests
Documentation

jerneju changed the title ~~Merge: works with sparse~~ [WIP][FIX] Merge: work with sparse

codecov-io commented May 10, 2017

Codecov Report

Merging #2305 into master will decrease coverage by 0.04%.
The diff coverage is 92.5%.

@@            Coverage Diff             @@
##           master    #2305      +/-   ##
==========================================
- Coverage   73.33%   73.28%   -0.05%     
==========================================
  Files         317      317              
  Lines       55447    55474      +27     
==========================================
- Hits        40662    40656       -6     
- Misses      14785    14818      +33

Contributor Author

jerneju commented May 12, 2017

Waiting for #2286 because hstack from util.py is needed.

Contributor

nikicc commented May 18, 2017

#2286 is merged. Are there some changes still needed? Can we do something about radon?

jerneju changed the title ~~[WIP][FIX] Merge: work with sparse~~ [FIX] Merge: work with sparse

Contributor Author

jerneju commented May 18, 2017

@nikicc: Radon no longer complains because this code now uses the universal hstack.

nikicc suggested changes

Contributor

nikicc left a comment

If I concatenate two BoW data sets, all variables on the output seem to have the name a. Please check.

Orange/widgets/data/owmergedata.py Outdated

+                              (indices[1], arr2, right)):
+                          known = ind != -1
+                          if sum(known):
+                              to_change[known] = lookup[ind[known]]

Contributor

nikicc May 19, 2017

This is probably not sufficient for sparse data. For dense data, we initially set all values as missing, so it is fine to set only the known values. For sparse data, we initially set all values to zero. Hence, along with setting the known values, we should also set the unknown values to NaN.
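The difference described above can be illustrated with a minimal sketch. It is not the code from this PR: the arrays, the `ind` lookup, and the variable names are made up for the example, and `-1` is assumed to mark a row with no match, as in the hunk above.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical lookup: row i of the result takes row ind[i] of `source`;
# -1 marks "no matching row".
source = np.array([[1.0, 2.0], [3.0, 4.0]])
ind = np.array([1, -1, 0])
known = ind != -1

# Dense: start from all-NaN, so filling only the known rows is enough.
dense = np.full((3, 2), np.nan)
dense[known] = source[ind[known]]

# Sparse: an empty matrix reads as zeros, so unmatched rows would look like
# genuine zeros unless they are explicitly set to NaN.
sparse = sp.lil_matrix((3, 2), dtype=float)
sparse[np.flatnonzero(known)] = source[ind[known]]
zeros_for_unknown = sparse.toarray()[1].copy()  # unmatched row before the fix
sparse[np.flatnonzero(~known)] = np.nan         # mark unmatched rows as missing
```

Note that the NaN entries become explicitly stored values, so a result with many unmatched rows is less sparse than the inputs.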

nikicc added the DH2017 label

nikicc assigned kernc

kernc reviewed

Orange/widgets/data/owmergedata.py Outdated

                       tpe = object if object in (left.dtype, right.dtype) else left.dtype
                       left_width, right_width = left.shape[1], right.shape[1]
-                      arr = np.full((indices.shape[1], left_width + right_width), np.nan, tpe)
+                      sparse = sp.issparse(left) or sp.issparse(right)

Contributor

kernc May 26, 2017

Boolean flags should begin with is_. https://stackoverflow.com/questions/1227998/naming-conventions-what-to-name-a-boolean-variable

is_sparse = sp.issparse(left) or sp.issparse(right)
if is_sparse:
    ...

Orange/widgets/data/owmergedata.py Outdated

+                              return np.full((height, width), np.nan, dtype=dtype)
+                          else:
+                              return sp.csr_matrix((height, width), dtype=dtype)
                       tpe = object if object in (left.dtype, right.dtype) else left.dtype

Contributor

kernc May 26, 2017

np.find_common_type([left.dtype, right.dtype], [])

Contributor Author

jerneju May 26, 2017

No. In that case all tests fail.

kernc suggested changes

Orange/widgets/data/owmergedata.py Outdated

+                          if not sparse:
+                              return np.full((height, width), np.nan, dtype=dtype)
+                          else:
+                              return sp.csr_matrix((height, width), dtype=dtype)

Contributor

kernc May 26, 2017

Emits a warning in the for loop below:

SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
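One common way to avoid this warning, sketched here under the assumption that the matrix is filled element by element or row by row, is to build it in LIL format and convert to CSR once at the end. The shape and values below are illustrative only and are not taken from the PR.

```python
import warnings

import numpy as np
import scipy.sparse as sp

height, width = 4, 3

# Assigning into an empty CSR matrix changes its sparsity structure,
# which is what triggers SparseEfficiencyWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    csr = sp.csr_matrix((height, width), dtype=float)
    csr[0, 1] = 5.0
csr_warned = any(issubclass(w.category, sp.SparseEfficiencyWarning)
                 for w in caught)

# LIL is designed for incremental construction: fill it first...
building = sp.lil_matrix((height, width), dtype=float)
building[0, 1] = 5.0
building[2, :] = np.nan

# ...then convert once to CSR for arithmetic and row slicing.
result = building.tocsr()
```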

kernc approved these changes

kernc reviewed

Orange/widgets/data/owmergedata.py Outdated

+                          if not is_sparse:
+                              known = ind != -1
+                              if sum(known):
+                                  to_change[known] = lookup[ind[known]]

Contributor

kernc May 26, 2017

Re lint, does this work?

def _join_array_by_indices(left, right, indices, string_cols=None):

    def prepare(arr, inds):
        try:
            newarr = arr[inds]
        except IndexError:
            newarr = np.full_like(arr, np.nan)
        else:
            newarr[inds == -1] = np.full(arr.shape[1], np.nan)
        return newarr

    res = hstack((prepare(left, indices[0]),
                  prepare(right, indices[1])))
    if string_cols:
        res[:, string_cols] = ""  # IDK what this does
    return res

Contributor Author

jerneju May 26, 2017

No, this does not work.

kernc suggested changes

Orange/widgets/data/owmergedata.py Outdated

+                          arr1[:, [sc for sc in string_cols if sc < left_width]] = ""
+                          arr2[:, [sc - left_width for sc in string_cols if sc >= left_width]] = ""
+                      for ind, to_change, lookup in (

Contributor

kernc May 26, 2017

This works!

    def _join_array_by_indices(left, right, indices, string_cols=None):
        def prepare(arr, inds, str_cols):
            try:
                newarr = arr[inds]
            except IndexError:
                newarr = np.full_like(arr, np.nan)
            else:
                empty = np.full(arr.shape[1], np.nan)
                if str_cols:
                    assert arr.dtype == object
                    empty = empty.astype(object)
                    empty[str_cols] = ''
                newarr[inds == -1] = empty
            return newarr

        left_width = left.shape[1]
        str_left = [i for i in string_cols or () if i < left_width]
        str_right = [i - left_width for i in string_cols or () if i >= left_width]
        res = hstack((prepare(left, indices[0], str_left),
                      prepare(right, indices[1], str_right)))
        return res
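To see what the function above returns, here is a small dense-only usage sketch. It restates the function so the example runs on its own, and it substitutes `np.hstack` for the project's universal `hstack`, which is an assumption made purely for illustration; the input arrays and the `indices` values are invented.

```python
import numpy as np


def join_array_by_indices(left, right, indices, string_cols=None):
    """Dense-only restatement of the suggestion above, for illustration."""
    def prepare(arr, inds, str_cols):
        try:
            newarr = arr[inds]  # fancy indexing copies; -1 picks the last row
        except IndexError:
            newarr = np.full_like(arr, np.nan)
        else:
            empty = np.full(arr.shape[1], np.nan)
            if str_cols:
                assert arr.dtype == object
                empty = empty.astype(object)
                empty[str_cols] = ''
            newarr[inds == -1] = empty  # overwrite rows that had no match
        return newarr

    left_width = left.shape[1]
    str_left = [i for i in string_cols or () if i < left_width]
    str_right = [i - left_width for i in string_cols or () if i >= left_width]
    # np.hstack stands in for the project's sparse-aware hstack (assumption).
    return np.hstack((prepare(left, indices[0], str_left),
                      prepare(right, indices[1], str_right)))


left = np.array([[1.0, 2.0], [3.0, 4.0]])
right = np.array([[10.0], [20.0], [30.0]])
# Row k of the result pairs left[indices[0][k]] with right[indices[1][k]];
# -1 means that side has no matching row.
indices = np.array([[0, 1, -1],
                    [2, -1, 0]])
res = join_array_by_indices(left, right, indices)
```

With these inputs the unmatched positions come out as NaN: the second row has no right-hand value and the third row has no left-hand values.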

jerneju added 2 commits May 26, 2017 14:31

Merge Data: work with sparse (c79669f)

Table: fix sparse indices (e461dc3)

kernc approved these changes

Contributor Author

jerneju commented May 26, 2017

Well done, @kernc!

Contributor

ajdapretnar commented May 29, 2017

@nikicc Could you please check and merge if this is done? :)

nikicc approved these changes

nikicc merged commit 554b885 into biolab:master

jerneju deleted the sparse-merge branch

May 29, 2017 10:38
