[FIX] DataSampler: Fix crash when stratifying unbalanced datasets #1952
Conversation
This works well for me; however, Bootstrap always returns the same number of instances as the input. I don't think this is the intended behaviour.
Current coverage is 89.53% (diff: 100%)

@@           master   #1952   diff @@
=====================================
  Files          90      90
  Lines        9179    9179
  Methods         0       0
  Messages        0       0
  Branches        0       0
=====================================
  Hits         8218    8218
  Misses        961     961
  Partials        0       0
That is the intended behaviour. This method samples with replacement, so the sample size always equals the input size, but the instances in the sample differ (some are repeated). For this PR, however, I think this would be an excellent addition to the new tests for bootstrap: assert that len(sample) == len(input), so we will know in the future that this was indeed intended.
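The suggested invariant can be sketched outside Orange; drawing random indices stands in for the widget's Bootstrap option here, and all names are illustrative, not the PR's code:

```python
# Illustrative sketch (not the PR's code): emulate sampling with
# replacement and assert the invariant suggested above.
import random

def bootstrap_indices(n, seed=None):
    """Draw n indices with replacement from range(n)."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

indices = bootstrap_indices(150, seed=0)
# The suggested test: the sample size always equals the input size.
assert len(indices) == 150
# Some instances repeat, so the set of distinct indices is smaller
# (all-distinct has probability ~2e-64 for n=150).
assert len(set(indices)) < 150
```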
Force-pushed from 49f5b26 to 27e5265
Done. Also added comments clarifying the asserts.
self.assertGreater(len(in_sample), 0)
# The following assert has a really high probability of being true:
# 1-(1/150*2/150*...*145/150) ~= 1-2e-64
self.assertGreater(len(in_remaining), 0)
Shouldn't you use self.assertProbablyGreater here? :)
@@ -12,7 +12,7 @@ def setUpClass(cls):
         cls.iris = Table("iris")

     def setUp(self):
-        self.widget = self.create_widget(OWDataSampler)
+        self.widget = self.create_widget(OWDataSampler)  # type: OWDataSampler
This is cool. It always annoys me, but I never thought of helping PyCharm here...
I saw this in the tests Aleš writes :)
@@ -46,6 +46,9 @@ class OWDataSampler(OWWidget):
     number_of_folds = Setting(10)
     selectedFold = Setting(1)

+    class Warning(OWWidget.Warning):
+        could_not_stratify = Msg("Could not stratify\n{}")
Could you turn this message into a sentence, something like "Stratification failed"?
self.assertEqual(len(in_sample & in_remaining), 0)
# Sampling with replacement will always produce at least one distinct
# instance in sample, and at least one instance in remaining with
# high probability (1-(1/150*2/150*...*145/150) ~= 1-2e-64)
Then you should probably use self.assertProbablyGreater? :)
I do not like the implicit nature of self.assertProbablyGreater, as it is not obvious how the check is performed. I am more inclined towards a decorator @should_pass(3, out_of=5) that would run the test for the required number of retries and then compute the probability of it being correct.
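A hypothetical sketch of such a decorator: the name should_pass and its exact semantics are assumptions taken from this comment, not code that exists anywhere in the repository. It reruns a flaky test up to out_of times and lets it pass once k runs have succeeded:

```python
# Hypothetical sketch of the proposed decorator (not code in the PR):
# rerun a flaky test and pass once `k` of at most `out_of` runs succeed.
import functools

def should_pass(k, out_of):
    def decorator(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            successes = 0
            last_error = None
            for _ in range(out_of):
                try:
                    test(*args, **kwargs)
                    successes += 1
                except AssertionError as ex:
                    last_error = ex
                if successes >= k:
                    return  # enough successes: the test passes
            raise AssertionError(
                "passed only {} of {} runs (needed {}): {}".format(
                    successes, out_of, k, last_error))
        return wrapper
    return decorator
```

A probabilistic test would then be written as a plain test method with @should_pass(3, out_of=5) applied, keeping the retry policy visible at the definition site instead of buried inside a custom assert.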
But just in case we ever need it:

def assertProbablyGreater(self, a, b, msg=None):
    """Assert that a is greater than b in at least one of two runs."""
    import inspect
    try:
        return self.assertGreater(a, b, msg)
    except AssertionError:
        pass
    frames = inspect.stack()[1:]
    test_frames = [f for f in frames if f.function.startswith("test_")]
    if any(f.function == "assertProbablyGreater" for f in frames) \
            or not test_frames:
        # Already retried but still failing, or nothing to retry: fail.
        return self.assertGreater(a, b, msg)
    # Rerun the calling test once; re-entry is caught by the check above.
    getattr(self, test_frames[0].function)()
First, your decorator, cunning as it is, still leaves a non-zero probability that the test fails. To increase the probability that it will work, use @should_pass(9, out_of=10) instead of @should_pass(3, out_of=5). Also, it should be called should_pass_at_least.

Second, I believe your formula is incorrect. The numerators should go to 150; the probability of failing is 150! / 150^150. (I guess you probably used the correct formula; this indeed gives 2.2e-64.)
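The corrected figure is easy to sanity-check numerically (a quick side computation, not part of the PR):

```python
# Probability that 150 draws with replacement from 150 instances are
# all distinct (i.e. that `remaining` ends up empty): 150! / 150**150.
from math import factorial

p_fail = factorial(150) / 150 ** 150
# Matches the ~2.2e-64 figure quoted in the discussion.
assert 2.0e-64 < p_fail < 2.5e-64
```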
But just in case we ever need it:
I don't think so. :) Please just check the "third" remark before merging, the one about len(in_sample) being at least 1. I actually meant that one.
Second, I believe your formula is incorrect.
Nice catch. I have updated the formula.
Force-pushed from 27e5265 to e49602c
# Sampling with replacement will always produce at least one distinct
# instance in sample, and at least one instance in remaining with
# high probability (1-(1/150*2/150*...*145/150) ~= 1-2e-64)
self.assertGreater(len(in_sample), 0)
Third, isn't len(in_sample) always at least 1? :)
It is; this is also stated in the comment above: "will always produce at least one distinct instance in sample". The high-probability part refers to the second assertion: "and at least one instance in remaining with high probability".
Force-pushed from e49602c to 2dab4dc
When stratification is not possible, warn user and sample without.
Force-pushed from 2dab4dc to ff90792
[FIX] DataSampler: Fix crash when stratifying unbalanced datasets

(cherry picked from commit a8f71bc)
Conflicts:
    Orange/widgets/data/owdatasampler.py
Issue
DataSampler crashed when stratified sampling was selected but one class contained only one instance. Bootstrap did not work (invalid signature).
Description of changes
If stratified sampling fails, show a warning and sample without stratification.
Add the argument data to Bootstrap to satisfy the signature.
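The fallback described above can be sketched in plain Python. This is an illustrative stand-in, not the widget's actual code; every name below is invented for the example:

```python
# Illustrative sketch of the fix: try stratified sampling, warn and
# fall back to plain sampling when the class distribution is too
# unbalanced. All names here are invented for the example.
import random
import warnings
from collections import Counter, defaultdict

def stratified_sample(data, labels, n):
    """Sample n items keeping class proportions; raise if a class is
    too small to stratify over."""
    counts = Counter(labels)
    if min(counts.values()) < 2:
        raise ValueError("a class has fewer than 2 instances")
    by_class = defaultdict(list)
    for item, label in zip(data, labels):
        by_class[label].append(item)
    sample = []
    for items in by_class.values():
        # Each class contributes roughly its share of the n items.
        k = max(1, round(n * len(items) / len(data)))
        sample.extend(random.sample(items, min(k, len(items))))
    return sample[:n]

def safe_sample(data, labels, n, stratified=True):
    """Prefer stratified sampling; warn and sample plainly on failure."""
    if stratified:
        try:
            return stratified_sample(data, labels, n)
        except ValueError as ex:
            warnings.warn("Stratification failed: {}".format(ex))
    return random.sample(data, n)
```

With nine instances of one class and a single instance of another (the crashing case from the issue), stratification fails, a warning is emitted, and safe_sample still returns a plain sample of the requested size instead of crashing.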
Includes