Issue
After Orange 3.21 we introduced quite a few performance regressions to k-means, which became very slow on large data sets.
Description of changes
Preprocess data once. In master, the data set is preprocessed separately for each number of clusters and then again when computing silhouettes. When used with from-to, this created too many in-memory copies of the data, which caused crashes on big data due to memory usage. Note that sklearn's k-means makes another copy of the data.
Only compute approximate (sampled) silhouettes for big data, and compute them in worker threads. This fixes a bug introduced in Orange 3.22.
Fix O(|features|^2) complexity when creating centroids.
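The first two changes can be illustrated with a minimal sketch. This is not the code from this PR: `cluster_range` and `sample_limit` are hypothetical names, the threshold value is made up, and sklearn's `StandardScaler` stands in for Orange's preprocessing. It only shows the idea of preprocessing once for the whole from-to range and scoring silhouettes on a fixed random sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler


def cluster_range(X, k_from, k_to, sample_limit=5000, random_state=0):
    """Fit k-means for each k in [k_from, k_to] on data preprocessed once.

    Hypothetical helper; sample_limit is an assumed threshold, not the
    value used in Orange.
    """
    # Preprocess a single time and reuse the result for every k.
    X_pre = StandardScaler().fit_transform(X)

    # Choose the silhouette sample once so scores are comparable across k.
    rng = np.random.default_rng(random_state)
    n = X_pre.shape[0]
    if n > sample_limit:
        idx = rng.choice(n, size=sample_limit, replace=False)
    else:
        idx = np.arange(n)

    scores = {}
    for k in range(k_from, k_to + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_pre)
        sub_labels = labels[idx]
        # Silhouettes are undefined if the sample contains a single cluster.
        if len(np.unique(sub_labels)) < 2:
            scores[k] = float("nan")
            continue
        # Approximate silhouette: computed on the sampled rows only.
        scores[k] = float(silhouette_samples(X_pre[idx], sub_labels).mean())
    return scores
```

Computing silhouettes on the sample keeps the pairwise-distance cost bounded by `sample_limit` squared instead of growing with the full number of rows.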
There is a further obvious improvement that I did not tackle in this PR: preprocessing in a worker thread. Currently (in this branch, master, and the current release), preprocessing blocks the UI for about a minute on a data set with 98304 rows and 806 columns. Most of that time is spent on normalization; with normalization disabled, the UI is blocked for only a few seconds.
Benchmarks
Orange 3.21
Master
This branch
Still not quite Orange 3.21 performance, but it will do. The difference in [wide] is due to normalization.
Includes