Sparse data may cause problems for some preprocessors #1719

Closed
BlazZupan opened this issue Nov 4, 2016 · 4 comments

@BlazZupan
Contributor

Orange version

3.3.8 on Windows 7

Expected behavior

Orange should seamlessly propagate sparse data through the pipeline.

Actual behavior

Some preprocessors convert sparse data to dense data, exhausting memory and crashing Orange with a MemoryError. An obvious example of such a preprocessor is imputation, which is invoked before (any?) scikit-learn learner.

A typical trace of the error is:

 File \"C:\\Python34\\lib\\site-packages\\Orange\\canvas\\scheme\\widgetsscheme.py\ line 823, in process_signals_for_widget 
handler(*args) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\widgets\\utils\\sql.py\ line 28, in new_f 
return f(widget, data, *args, **kwargs) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\widgets\\utils\\owlearnerwidget.py\ line 159, in set_data 
self.update_model() 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\widgets\\utils\\owlearnerwidget.py\ line 175, in update_model 
self.model = self.learner(self.data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 249, in __call__ 
m = super().__call__(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 44, in __call__ 
data = self.preprocess(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 239, in preprocess 
data = super().preprocess(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\base.py\ line 68, in preprocess 
data = pp(data) 
 File \"C:\\Python34\\lib\\site-packages\\Orange\\preprocess\\preprocess.py\ line 168, in __call__ 
X = self.imputer.fit_transform(data.X) 
 File \"C:\\Python34\\lib\\site-packages\\sklearn\\base.py\ line 494, in fit_transform 
return self.fit(X, **fit_params).transform(X) 
 File \"C:\\Python34\\lib\\site-packages\\sklearn\\preprocessing\\imputation.py\ line 313, in transform 
force_all_finite=False, copy=self.copy) 
 File \"C:\\Python34\\lib\\site-packages\\sklearn\\utils\\validation.py\ line 382, in check_array 
array = np.array(array, dtype=dtype, order=order, copy=copy) 
MemoryError
Steps to reproduce the behavior

We have spotted this type of error from bug reports and inferred the cause from error traces.
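For context, the failure mode can be shown outside Orange with plain scipy: a matrix that is tiny in CSR form becomes enormous once a dense-only preprocessor materializes it. A minimal sketch (sizes are arbitrary, chosen so the dense form needs hundreds of GB):

```python
import scipy.sparse as sp

# ~1,000,000 x 100,000 matrix with ~1e6 non-zeros: a few MB as CSR.
X = sp.random(1_000_000, 100_000, density=1e-5, format="csr")
print(X.data.nbytes / 1e6, "MB of non-zero values")

# Densifying it needs ~800 GB (1e11 cells * 8 bytes float64), which is
# effectively what a dense-only preprocessor does internally:
X_dense = X.toarray()  # raises MemoryError on any ordinary machine
```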

@ajdapretnar
Contributor

@nikicc Do any of the recent changes address this? Imputation still seems to take a while when connected to Corpus.

@nikicc
Contributor

nikicc commented May 29, 2017

@ajdapretnar They probably do. As far as modeling goes, everything should be OK now: we use SklImpute as the default preprocessor before passing data to scikit-learn, and it seems to handle sparse data sets properly.

But there might still be some problems with our own imputers.
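One way to pin this down would be to run each imputer over a sparse table and check whether the output stays sparse. A sketch, assuming a Table whose X is sparse (the dataset name is hypothetical):

```python
import scipy.sparse as sp
from Orange.data import Table
from Orange.preprocess import Impute, SklImpute

data = Table("my_sparse_dataset")  # hypothetical: any Table with sparse X
assert sp.issparse(data.X)

# Run both imputers and report whether sparsity survives.
for pp in (SklImpute(), Impute()):
    out = pp(data)
    kind = "sparse" if sp.issparse(out.X) else "DENSE"
    print(type(pp).__name__, "->", kind)
```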

@ajdapretnar
Contributor

Related perhaps to #1713.

I know a lot of work has been done on sparse data since then. @BlazZupan Do you have an example we could test against?

@ajdapretnar
Contributor

We need better support for sparse data, but that is a bigger issue.
If anyone has a good example of how to reproduce this, please add it here.
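Lacking a real-world case, a synthetic one should exercise the same code path. A sketch, assuming Table.from_numpy accepts a scipy sparse X (it does in recent Orange versions); the sizes may need tuning so that the dense form overflows memory on the test machine:

```python
import numpy as np
import scipy.sparse as sp
from Orange.data import Domain, ContinuousVariable, DiscreteVariable, Table
from Orange.classification import LogisticRegressionLearner

n, m = 200000, 50000  # dense form would need ~80 GB as float64
X = sp.random(n, m, density=1e-5, format="csr")
y = np.random.randint(0, 2, n)

domain = Domain([ContinuousVariable("a%d" % i) for i in range(m)],
                DiscreteVariable("cls", values=("0", "1")))
data = Table.from_numpy(domain, X, y)

# If any default preprocessor densifies X, this call should raise MemoryError.
model = LogisticRegressionLearner()(data)
```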
