
[FIX] DataSampler: Fix crash when stratifying unbalanced datasets #1952

Merged · 2 commits merged into biolab:master from stratified-when-possible on Jan 26, 2017

Conversation

astaric (Member) commented on Jan 25, 2017

Issue

DataSampler crashed when stratification was selected but one class contained only one element. Bootstrap did not work (invalid signature).

Description of changes

  • If stratified sampling fails, show a warning and sample without stratification (sketched below).
  • Add a data argument to Bootstrap to satisfy the expected signature.

Includes
  • Code changes
  • Tests
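
A minimal, standalone sketch of the fallback behaviour described above (the helper names stratified_sample and plain_sample and the ValueError type are assumptions for illustration; the widget itself reports the failure through the could_not_stratify warning shown further down in the diff):

def sample_with_fallback(data, stratified_sample, plain_sample, warn):
    """Try stratified sampling; if it fails, warn and sample without it."""
    try:
        # Stratified sampling can fail when e.g. a class has only one instance.
        return stratified_sample(data)
    except ValueError as ex:
        warn("Could not stratify\n{}".format(ex))
        return plain_sample(data)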


astaric changed the title from "[FIX] DataSampler: Fix crashes" to "[FIX] DataSampler: Fix crash when stratifying unbalanced datasets" on Jan 25, 2017
astaric added this to the 3.3.11 milestone on Jan 25, 2017
ajdapretnar (Contributor):

This works well for me; however, Bootstrap always returns the same number of instances as the input. I don't think this is the intended behaviour.

codecov-io commented on Jan 25, 2017

Current coverage is 89.53% (diff: 100%)

Merging #1952 into master will not change coverage

@@             master      #1952   diff @@
==========================================
  Files            90         90          
  Lines          9179       9179          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           8218       8218          
  Misses          961        961          
  Partials          0          0          

Last update e0ec04e...ff90792

lanzagar (Contributor):

That is the intended behaviour. This method samples with replacement, so the size is always the same, but the instances in the sample are different (some are repeated).
But it is different from the other sampling options, and maybe we should make that clearer to the user (e.g. a tooltip; a horizontal rule after the other methods and/or a small caption "Sampling with replacement"; or even just renaming it to "Bootstrap sampling with replacement").

For this PR, however, I think this would be an excellent addition to the new tests for bootstrap: assert that len(sample) == len(input), so we will know in the future that this was indeed intended.
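
A self-contained sketch of such a test (the list-based sampling below is a stand-in for the widget's Bootstrap sampler, used only to illustrate the len(sample) == len(input) invariant):

import random
import unittest

class TestBootstrapSize(unittest.TestCase):
    def test_bootstrap_keeps_input_size(self):
        data = list(range(150))
        # Bootstrap = sampling with replacement: draw len(data) items,
        # so the sample is always exactly as large as the input.
        sample = [random.choice(data) for _ in data]
        self.assertEqual(len(sample), len(data))

if __name__ == "__main__":
    unittest.main()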

astaric (Member, Author) commented on Jan 25, 2017

Done. Also added comments clarifying the asserts.

self.assertGreater(len(in_sample), 0)
# The following assert has a really high probability of being true:
# 1-(1/150*2/150*...*145/150) ~= 1-2e-64
self.assertGreater(len(in_remaining), 0)
Contributor:

Shouldn't you use self.assertProbablyGreater here? :)

@@ -12,7 +12,7 @@ def setUpClass(cls):
         cls.iris = Table("iris")
 
     def setUp(self):
-        self.widget = self.create_widget(OWDataSampler)
+        self.widget = self.create_widget(OWDataSampler)  # type: OWDataSampler
Contributor:

This is cool. It always annoys me, but I never thought of helping PyCharm here...

Member Author:

I saw this in tests Aleš writes :)

@@ -46,6 +46,9 @@ class OWDataSampler(OWWidget):
     number_of_folds = Setting(10)
     selectedFold = Setting(1)
 
+    class Warning(OWWidget.Warning):
+        could_not_stratify = Msg("Could not stratify\n{}")
Contributor:

Could you turn this message into a sentence, something like "Stratification failed"?

self.assertEqual(len(in_sample & in_remaining), 0)
# Sampling with replacement will always produce at least one distinct
# instance in sample, and at least one instance in remaining with
# high probability (1-(1/150*2/150*...*145/150) ~= 1-2e-64)
Contributor:

Then you should probably use self.assertProbablyGreater? :)

Member Author:

I do not like the implicit nature of self.assertProbablyGreater, as it is not obvious how the check is performed. I am more inclined towards a decorator

@should_pass(3, out_of=5)["times"]

that would run the test for the required number of retries and then compute the probability of it being correct.
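
A minimal sketch of what such a decorator could look like (hypothetical; it was never added to the codebase, and the probability bookkeeping mentioned above is left out):

import functools

def should_pass(successes, out_of):
    """Hypothetical decorator: rerun a flaky test and fail only if it passes
    fewer than `successes` times out of `out_of` runs."""
    def decorator(test):
        @functools.wraps(test)
        def wrapper(self, *args, **kwargs):
            passed = 0
            for _ in range(out_of):
                try:
                    test(self, *args, **kwargs)
                    passed += 1
                except AssertionError:
                    pass
            if passed < successes:
                self.fail("test passed only %d of %d runs" % (passed, out_of))
        return wrapper
    return decorator

It would be applied to a test method as @should_pass(3, out_of=5).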

Member Author:

But just in case we ever need it:

def assertProbablyGreater(self, a, b, msg=None):
    """Assert that a is greater than b in at least one out of two runs."""
    import inspect

    try:
        return self.assertGreater(a, b, msg)
    except AssertionError:
        pass

    # Inspect the call stack (without this frame) to find the calling test.
    frames = inspect.stack()[1:]
    test_frames = [f for f in frames if f.function.startswith("test_")]
    if any(f.function == "assertProbablyGreater" for f in frames) or not test_frames:
        # Retried, but still not working, or nothing to retry. Fail for real.
        return self.assertGreater(a, b, msg)

    # Re-run the calling test once; a repeated failure ends up in the branch above.
    getattr(self, test_frames[0].function)()

Contributor:

First, your decorator, cunning as it is, still leaves a non-zero probability that the test fails. To increase the probability that it will work, use

@should_pass(9, out_of=10)["times"]

instead of

@should_pass(3, out_of=5)["times"]

Also, it should be called should_pass_at_least.

Second, I believe your formula is incorrect. Denominators should go to 150; the probability of failing is 150! / 150^150. (I guess you probably used the correct formula; this indeed gives 2.2e-64).
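
For reference, the corrected number is easy to check in standalone Python (not part of the PR):

from math import factorial

# Probability that 150 draws with replacement from 150 distinct instances
# yield every instance exactly once, i.e. that the remaining set ends up empty.
p_empty_remaining = factorial(150) / 150 ** 150
print(p_empty_remaining)  # ~2.2e-64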

Contributor:

> But just in case we ever need it:

I don't think so. :) Please just check the "third" remark before merging, the one about len(in_sample) being at least 1. I actually meant that one.

Member Author:

> Second, I believe your formula is incorrect.

Nice catch. I have updated the formula.

# Sampling with replacement will always produce at least one distinct
# instance in sample, and at least one instance in remaining with
# high probability (1-(1/150*2/150*...*145/150) ~= 1-2e-64)
self.assertGreater(len(in_sample), 0)
Contributor:

Third, isn't len(in_sample) always at least 1? :)

Member Author:

It is, and that is also stated in the comment above:

> will always produce at least one distinct instance in sample

The high-probability part refers to the second assertion:

> and at least one instance in remaining with high probability

When stratification is not possible, warn user and sample without.
lanzagar merged commit a8f71bc into biolab:master on Jan 26, 2017
astaric pushed a commit that referenced this pull request Feb 3, 2017
[FIX] DataSampler: Fix crash when stratifying unbalanced datasets
(cherry picked from commit a8f71bc)

 Conflicts:
	Orange/widgets/data/owdatasampler.py
astaric deleted the stratified-when-possible branch on September 8, 2017 08:38