K-means with Silhouette scoring crashes on insufficient RAM #1502
@klonuo I've tried with several different data sets and k-means works fine for me (latest version, Win). Tested with zoo.tab and voting.tab.

k-means works fine, but not silhouette scoring or the silhouette plot, as they crash Orange.
The silhouette scoring fails because you have given it 23k rows and you don't have enough RAM (or not enough RAM is available to a 32-bit process) to compute a 23k**2 distance matrix (some 4.3 GB needed). If you have enough RAM physically (e.g. >= 6 GB), try installing 64-bit Python 3.5 Anaconda and their 64-bit build of Orange 3. Alternatively, reduce your data set. According to silhouette, the optimal number of clusters is 5. But it's pretty tight.
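The arithmetic behind that estimate can be checked directly (assuming a dense float64 distance matrix, 8 bytes per entry):

```python
# Back-of-envelope size of a full n x n pairwise distance matrix.
n = 23_000                    # rows in the data set from the report
bytes_needed = n * n * 8      # 8 bytes per float64 entry
print(round(bytes_needed / 1e9, 2))  # → 4.23 (GB)
```

Close to the ~4.3 GB quoted above, and more than a 32-bit process can address in any case.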
What does the crash look like?
Thanks @kernc. Indeed, my laptop has 4 GB RAM.
A regular crash: a dialog pops up informing me that Python has crashed, and the Orange process ends.
So, if Python's catching of MemoryError works (which may not always be the case), can you test if wrapping that code in a try/except helps?
np, but I could not catch the exception. I also added a breakpoint in:
```python
class Silhouette(ClusteringScore):
    separate_folds = True

    def compute_score(self, results):
        try:
            return self.from_predicted(results, silhouette_score)
        except MemoryError:
            return 'whatever'  # never reached -- the process dies first
```
Yep, as mentioned above, the crash seems to happen before execution even arrives there.
In either case, having some 20k examples is not unreasonable, whereas crashing is. The potential fix, off the top of my head, is to replace all ...
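One way to keep silhouette scoring memory-bounded (not what the thread ultimately did, but `sample_size` is an actual parameter of `sklearn.metrics.silhouette_score`) is to score a random subsample, so the distance matrix is built only over the sampled rows:

```python
# Sketch: silhouette on a subsample instead of the full n x n matrix.
# Data and cluster count here are illustrative, not from the issue.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = rng.rand(2000, 2)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Only 500**2 pairwise distances are materialized, not 2000**2.
score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(round(score, 3))
```

For 23k rows, sampling a few thousand points shrinks the matrix from gigabytes to tens of megabytes, at the cost of a noisier score.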
Never mind. This is not it. 🎉 |
Ok. Python crashes on `orange3/Orange/clustering/kmeans.py`, line 29 (commit 074ceca).
Stepping inside, the crash happens on https://github.com/scikit-learn/scikit-learn/blob/3f37cb989af44c1f7ff8067cba176cf9b0c61eb7/sklearn/metrics/pairwise.py#L245, called from https://github.com/scikit-learn/scikit-learn/blob/3f37cb989af44c1f7ff8067cba176cf9b0c61eb7/sklearn/metrics/pairwise.py#L1078. But I couldn't analyze this loop...
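For what it's worth, later scikit-learn releases (0.20+, i.e. newer than the 0.17.1 in this thread) added `sklearn.metrics.pairwise_distances_chunked`, which yields the distance matrix in memory-capped chunks rather than materializing all n**2 entries at once. A minimal sketch:

```python
# Sketch: iterate pairwise distances in chunks bounded by working_memory.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(1000, 2)

total_rows = 0
for chunk in pairwise_distances_chunked(X, working_memory=4):  # ~4 MiB chunks
    total_rows += chunk.shape[0]   # each chunk holds a slice of rows
print(total_rows)  # → 1000
```

That is a downstream fix, though; it does not help the sklearn version Orange was pinned to here.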
Just to add this... I tried again with sklearn (0.17.1, from my 64-bit Python 3.5 shell) and my system froze; the Python process took exactly as much as you mentioned: 4.3 GB. Previously I must have used some sub-sample of the data... IMHO this is worse than just crashing Python, as I had no option other than a system reset. I don't know if it's feasible, but it would be nice if, for some demanding algorithms, we could validate available memory (with psutil, perhaps) against the user's data.
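The psutil idea could look roughly like this (a sketch, not Orange code; the helper name is made up, but `psutil.virtual_memory().available` is the real API):

```python
# Sketch: refuse to start an n x n computation that cannot fit in RAM.
import psutil

def enough_memory_for_pairwise(n_rows, itemsize=8, margin=1.2):
    """True if an n_rows x n_rows float matrix (plus ~20% slack) fits in
    currently available RAM."""
    needed = n_rows * n_rows * itemsize * margin
    return psutil.virtual_memory().available >= needed

print(enough_memory_for_pairwise(23_000))  # likely False on a 4 GB laptop
```

A widget could call such a check up front and show a friendly error instead of letting the OS start swapping until the machine freezes.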
Pinged sklearn, hoping for a reply on the issue. @kernc Is this something we can fix or do we have to wait for sklearn?
I guess, right before this line, we could add:

```python
_ = np.empty((X.shape[0] + 10,) * 2)  # +10 overhead margin
del _
```

When ...
Furthermore, is trying to catch MemoryError still needed? |
Ah, seems like it's mitigated, thanks. It would probably still crash if ≤ 200 MB of RAM were available, and the except block, as is, wouldn't catch it.
Read a file with two integer columns, then select k-means and try silhouette scoring: the result is a crash.
The same happens with the silhouette plot.
Doing the same directly with sklearn shows no issues.
Latest Orange 3.3.8 nightly build on Windows.