Sparse data may cause problems for some preprocessors #1719

Closed
BlazZupan opened this issue Nov 4, 2016 · 4 comments

@BlazZupan
Contributor

Orange version

3.3.8 on Windows 7

Expected behavior

Orange should seamlessly propagate sparse data through the pipeline.

Actual behavior

Some preprocessors convert sparse data to dense data, exhausting memory and crashing Orange with a MemoryError. An obvious example of such a preprocessor is imputation, which is invoked before (any?) scikit-learn learner.

A typical trace of the error is:

 File \"C:\\Python34\\lib\\site-packages\\Orange\\canvas\\scheme\\widgetsscheme.py\ line 823, in process_signals_for_widget 
handler(*args) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\widgets\\utils\\sql.py\ line 28, in new_f 
return f(widget, data, *args, **kwargs) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\widgets\\utils\\owlearnerwidget.py\ line 159, in set_data 
self.update_model() 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\widgets\\utils\\owlearnerwidget.py\ line 175, in update_model 
self.model = self.learner(self.data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 249, in __call__ 
m = super().__call__(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 44, in __call__ 
data = self.preprocess(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 239, in preprocess 
data = super().preprocess(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 68, in preprocess 
data = pp(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\preprocess\\preprocess.py\ line 168, in __call__ 
X = self.imputer.fit_transform(data.X) 
 File \"C:\\Python34\\lib\\site-packages\\sklearn\\base.py\ line 494, in fit_transform 
return self.fit(X, **fit_params).transform(X) 
 File \"C:\\Python34\\lib\\site-packages\\sklearn\\preprocessing\\imputation.py\ line 313, in transform 
force_all_finite=False, copy=self.copy) 
 File \"C:\\Python34\\lib\\site-packages\\sklearn\\utils\\validation.py\ line 382, in check_array 
array = np.array(array, dtype=dtype, order=order, copy=copy) 
MemoryError
Steps to reproduce the behavior

We have spotted this type of error from bug reports and inferred the cause from error traces.
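For context, the failure mode can be shown outside Orange with plain scipy: a matrix that is tiny in CSR form becomes enormous once a dense-only preprocessor materializes it. A minimal sketch (sizes are arbitrary, chosen so the dense form needs hundreds of GB):

```python
import scipy.sparse as sp

# ~1,000,000 x 100,000 matrix with ~1e6 non-zeros: a few MB as CSR.
X = sp.random(1_000_000, 100_000, density=1e-5, format="csr")
print(X.data.nbytes / 1e6, "MB of non-zero values")

# Densifying it needs ~800 GB (1e11 cells * 8 bytes float64), which is
# effectively what a dense-only preprocessor does internally:
X_dense = X.toarray()  # raises MemoryError on any ordinary machine
```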

@ajdapretnar
Contributor

@nikicc Do any of the recent changes address this? Imputation still seems to take a while when connected to Corpus.

@nikicc
Contributor

nikicc commented May 29, 2017

@ajdapretnar They probably do. As far as modeling goes, everything should be OK now: we use SklImpute as the default preprocessor before passing data to scikit-learn, and it seems to handle sparse data sets properly.

But there might still be some problems with our own imputers.
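One way to pin this down would be to run each imputer over a sparse table and check whether the output stays sparse. A sketch, assuming a Table whose X is sparse (the dataset name is hypothetical):

```python
import scipy.sparse as sp
from Orange.data import Table
from Orange.preprocess import Impute, SklImpute

data = Table("my_sparse_dataset")  # hypothetical: any Table with sparse X
assert sp.issparse(data.X)

# Run both imputers and report whether sparsity survives.
for pp in (SklImpute(), Impute()):
    out = pp(data)
    kind = "sparse" if sp.issparse(out.X) else "DENSE"
    print(type(pp).__name__, "->", kind)
```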

@ajdapretnar
Contributor

Related perhaps to #1713.

I know a lot of work has been done on sparse data since then. @BlazZupan Do you have an example we could test against?

@ajdapretnar
Contributor

We need better support for sparse data, but that is a bigger issue.
If anyone has a good example of how to reproduce this, please add it here.
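Lacking a real-world case, a synthetic one should exercise the same code path. A sketch, assuming Table.from_numpy accepts a scipy sparse X (it does in recent Orange versions); the sizes may need tuning so that the dense form overflows memory on the test machine:

```python
import numpy as np
import scipy.sparse as sp
from Orange.data import Domain, ContinuousVariable, DiscreteVariable, Table
from Orange.classification import LogisticRegressionLearner

n, m = 200000, 50000  # dense form would need ~80 GB as float64
X = sp.random(n, m, density=1e-5, format="csr")
y = np.random.randint(0, 2, n)

domain = Domain([ContinuousVariable("a%d" % i) for i in range(m)],
                DiscreteVariable("cls", values=("0", "1")))
data = Table.from_numpy(domain, X, y)

# If any default preprocessor densifies X, this call should raise MemoryError.
model = LogisticRegressionLearner()(data)
```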
