
[FIX] DataSampler: Fix crash when stratifying unbalanced datasets #1952

Merged · 2 commits merged into biolab:master from stratified-when-possible on Jan 26, 2017

Conversation

astaric (Member) commented on Jan 25, 2017

Issue

DataSampler crashed when stratification was selected but one class contained only one element. Bootstrap did not work (invalid signature).

Description of changes

  • If stratified sampling fails, show a warning and sample without stratification (sketched below).
  • Add a data argument to Bootstrap to satisfy the expected signature.

Includes
  • Code changes
  • Tests
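
A minimal, standalone sketch of the fallback behaviour described above (the helper names stratified_sample and plain_sample and the ValueError type are assumptions for illustration; the widget itself reports the failure through the could_not_stratify warning shown further down in the diff):

def sample_with_fallback(data, stratified_sample, plain_sample, warn):
    """Try stratified sampling; if it fails, warn and sample without it."""
    try:
        # Stratified sampling can fail when e.g. a class has only one instance.
        return stratified_sample(data)
    except ValueError as ex:
        warn("Could not stratify\n{}".format(ex))
        return plain_sample(data)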


astaric changed the title from "[FIX] DataSampler: Fix crashes" to "[FIX] DataSampler: Fix crash when stratifying unbalanced datasets" on Jan 25, 2017
astaric added this to the 3.3.11 milestone on Jan 25, 2017
ajdapretnar (Contributor):

This works well for me; however, Bootstrap always returns the same number of instances as the input. I don't think this is the intended behaviour.

codecov-io commented on Jan 25, 2017

Current coverage is 89.53% (diff: 100%)

Merging #1952 into master will not change coverage

@@             master      #1952   diff @@
==========================================
  Files            90         90          
  Lines          9179       9179          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           8218       8218          
  Misses          961        961          
  Partials          0          0          

Last update e0ec04e...ff90792

lanzagar (Contributor):

That is the intended behaviour. This method samples with replacement, so the size is always the same, but the instances in the sample are different (some are repeated).
But it is different from the other sampling options, and maybe we should make that clearer to the user (e.g. a tooltip; a horizontal rule after the other methods and/or a small caption "Sampling with replacement"; or even just renaming it to "Bootstrap sampling with replacement").

For this PR, however, I think this would be an excellent addition to the new tests for bootstrap: assert that len(sample) == len(input), so we will know in the future that this was indeed intended.
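
A self-contained sketch of such a test (the list-based sampling below is a stand-in for the widget's Bootstrap sampler, used only to illustrate the len(sample) == len(input) invariant):

import random
import unittest

class TestBootstrapSize(unittest.TestCase):
    def test_bootstrap_keeps_input_size(self):
        data = list(range(150))
        # Bootstrap = sampling with replacement: draw len(data) items,
        # so the sample is always exactly as large as the input.
        sample = [random.choice(data) for _ in data]
        self.assertEqual(len(sample), len(data))

if __name__ == "__main__":
    unittest.main()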

astaric (Member, Author) commented on Jan 25, 2017

Done. Also added comments clarifying the asserts.

self.assertGreater(len(in_sample), 0)
# The following assert has a really high probability of being true:
# 1-(1/150*2/150*...*145/150) ~= 1-2e-64
self.assertGreater(len(in_remaining), 0)
Contributor:

Shouldn't you use self.assertProbablyGreater here? :)

@@ -12,7 +12,7 @@ def setUpClass(cls):
         cls.iris = Table("iris")
 
     def setUp(self):
-        self.widget = self.create_widget(OWDataSampler)
+        self.widget = self.create_widget(OWDataSampler)  # type: OWDataSampler
Contributor:

This is cool. It always annoys me, but I never thought of helping PyCharm here...

Member Author:

I saw this in tests Aleš writes :)

@@ -46,6 +46,9 @@ class OWDataSampler(OWWidget):
     number_of_folds = Setting(10)
     selectedFold = Setting(1)
 
+    class Warning(OWWidget.Warning):
+        could_not_stratify = Msg("Could not stratify\n{}")
Contributor:

Could you turn this message into a sentence, something like "Stratification failed"?

self.assertEqual(len(in_sample & in_remaining), 0)
# Sampling with replacement will always produce at least one distinct
# instance in sample, and at least one instance in remaining with
# high probability (1-(1/150*2/150*...*145/150) ~= 1-2e-64)
Contributor:

Then you should probably use self.assertProbablyGreater? :)

Member Author:

I do not like the implicit nature of self.assertProbablyGreater, as it is not obvious how the check is performed. I am more inclined towards a decorator

@should_pass(3, out_of=5)["times"]

that would run the test for the required number of retries and then compute the probability of it being correct.
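
A minimal sketch of what such a decorator could look like (hypothetical; it was never added to the codebase, and the probability bookkeeping mentioned above is left out):

import functools

def should_pass(successes, out_of):
    """Hypothetical decorator: rerun a flaky test and fail only if it passes
    fewer than `successes` times out of `out_of` runs."""
    def decorator(test):
        @functools.wraps(test)
        def wrapper(self, *args, **kwargs):
            passed = 0
            for _ in range(out_of):
                try:
                    test(self, *args, **kwargs)
                    passed += 1
                except AssertionError:
                    pass
            if passed < successes:
                self.fail("test passed only %d of %d runs" % (passed, out_of))
        return wrapper
    return decorator

It would be applied to a test method as @should_pass(3, out_of=5).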

Member Author:

But just in case we ever need it:

def assertProbablyGreater(self, a, b, msg=None):
    """Assert that a is greater than b in at least one out of two runs."""
    import inspect

    try:
        return self.assertGreater(a, b, msg)
    except AssertionError:
        pass

    # Inspect the call stack (without this frame) to find the calling test.
    frames = inspect.stack()[1:]
    test_frames = [f for f in frames if f.function.startswith("test_")]
    if any(f.function == "assertProbablyGreater" for f in frames) or not test_frames:
        # Retried, but still not working, or nothing to retry. Fail for real.
        return self.assertGreater(a, b, msg)

    # Re-run the calling test once; a repeated failure ends up in the branch above.
    getattr(self, test_frames[0].function)()

Contributor:

First, your decorator, cunning as it is, still leaves a non-zero probability that the test fails. To increase the probability that it will work, use

@should_pass(9, out_of=10)["times"]

instead of

@should_pass(3, out_of=5)["times"]

Also, it should be called should_pass_at_least.

Second, I believe your formula is incorrect. Denominators should go to 150; the probability of failing is 150! / 150^150. (I guess you probably used the correct formula; this indeed gives 2.2e-64).
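
For reference, the corrected number is easy to check in standalone Python (not part of the PR):

from math import factorial

# Probability that 150 draws with replacement from 150 distinct instances
# yield every instance exactly once, i.e. that the remaining set ends up empty.
p_empty_remaining = factorial(150) / 150 ** 150
print(p_empty_remaining)  # ~2.2e-64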

Contributor:

> But just in case we ever need it:

I don't think so. :) Please just check the "third" remark before merging, the one about len(in_sample) being at least 1. I actually meant that one.

Member Author:

> Second, I believe your formula is incorrect.

Nice catch. I have updated the formula.

# Sampling with replacement will always produce at least one distinct
# instance in sample, and at least one instance in remaining with
# high probability (1-(1/150*2/150*...*145/150) ~= 1-2e-64)
self.assertGreater(len(in_sample), 0)
Contributor:

Third, isn't len(in_sample) always at least 1? :)

Member Author:

It is, and that is also stated in the comment above:

> will always produce at least one distinct instance in sample

The high-probability part refers to the second assertion:

> and at least one instance in remaining with high probability

When stratification is not possible, warn user and sample without.
lanzagar merged commit a8f71bc into biolab:master on Jan 26, 2017
astaric pushed a commit that referenced this pull request Feb 3, 2017
[FIX] DataSampler: Fix crash when stratifying unbalanced datasets
(cherry picked from commit a8f71bc)

 Conflicts:
	Orange/widgets/data/owdatasampler.py
astaric deleted the stratified-when-possible branch on September 8, 2017 08:38