[FIX] normalize: Adjust number_of_decimals after scaling #4779
Actually, I don't think zero centering should impact the number of decimals. Imagine small values (e.g. below 1e-5) that are just centered - they are shifted around 0, but still small and need more decimals. And this can be done in the widget as well!
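To illustrate the point about small values, here is a minimal sketch in plain Python. The sample values and the variable names are made up for the illustration and are not taken from Orange. It centers three values of magnitude around 1e-6 and formats them with 3 and with 7 decimals.

```python
# Hypothetical sample values, all well below 1e-5.
values = [1.2e-6, 3.4e-6, 5.6e-6]

# Centering only shifts the values around 0; their magnitude stays tiny.
mean = sum(values) / len(values)
centered = [v - mean for v in values]

# With 3 decimals every centered value collapses to a zero string
# (possibly with a minus sign in front).
three_decimals = ["%.3f" % v for v in centered]

# With 7 decimals the three values remain distinguishable.
seven_decimals = ["%.7f" % v for v in centered]

print(three_decimals)
print(seven_decimals)
```

Centering alone leaves the scale of the data unchanged, so the number of decimals needed before centering is still needed afterwards.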
If we want to be even more fancy and complicate things: @markotoplak, @janezd - What do you think about just setting to 3 after rescaling or computing the needed number of decimals?
aebe793 to 8188592
I've added this line here which automatically adjusts the number of decimals based on the standard deviation. I've also had to add to this because otherwise, on standardized iris, I would get a mix of 0.000 and 0.0000 (4 zeros), which looked strange.
f3536d0 to 4a8fbfc
Let me know if you like this solution, or in what other way this could be solved. If we want to go with this, I'll fix the tests.
I like the solution, but it needs some more thought and polishing. I made a few line comments, but my suggestions still don't produce the results I would be completely satisfied with.
You can take a look at feature statistics on housing + preprocessing to immediately see some problems that still need solving.
Orange/preprocess/normalize.py
Outdated
if self.center:
compute_val = Norm(var, avg, 1 / sd)
num_decimals = 3
else:
compute_val = Norm(var, 0, 1 / sd)
return var.copy(compute_value=compute_val)
num_decimals = None
num_decimals += int(-np.floor(np.log10(sd)))
As I said previously, I don't think the if else about centering should have anything to do with num_decimals.
So probably the computation should be just:
num_decimals = var.number_of_decimals + correction
And it also looks like your correction of int(-np.floor(np.log10(sd))) was not right - for sd=100 it should increase by 2 decimals not decrease...
I think the correct formula is (and someone should doublecheck this!):
num_decimals = var.number_of_decimals + int(np.log10(sd))
Currently, you didn't change normalize_by_span, but I expect the final solution should have the same correction there (just using diff instead of sd)
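As a quick check of the two corrections discussed in this comment, the sketch below evaluates both for a few standard deviations. It uses `math.log10` and `math.floor` from the standard library in place of the `np.log10` and `np.floor` calls in the diff, and the function names are hypothetical, chosen only for this illustration.

```python
import math


def correction_from_diff(sd):
    # Correction as written in the outdated diff: int(-floor(log10(sd))).
    return int(-math.floor(math.log10(sd)))


def correction_proposed(sd):
    # Correction proposed in the review: int(log10(sd)).
    # Note that int() truncates toward zero, so for sd between 0.1 and 1
    # this yields 0 rather than -1.
    return int(math.log10(sd))


def adjusted_decimals(number_of_decimals, sd):
    # Proposed formula: original decimals plus the correction.
    return number_of_decimals + correction_proposed(sd)


for sd in (100, 10, 0.5, 0.05):
    print(sd, correction_from_diff(sd), correction_proposed(sd))

# A value such as 5.1 (1 decimal) divided by sd=100 becomes 0.051,
# which needs 3 decimals: 1 + 2.
print(adjusted_decimals(1, 100))
```

For sd=100 the diff's correction gives -2 while the proposed one gives +2, which matches the objection above. The behaviour for sd below 1 depends on how `int()` truncates, which may be worth including in the double check.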
@@ -42,20 +42,23 @@ def normalize(self, dist, var):
var = self.normalize_by_sd(dist, var)
elif self.norm_type == Normalize.NormalizeBySpan:
var = self.normalize_by_span(dist, var)
var.number_of_decimals = None
After removing the reset of num. dec. here, you can delete the next line and just return directly in the elifs above...
... is what I wanted to write, then I saw that there is no else and lint would probably complain :/
(so I guess just leave it as is)
4a8fbfc to cf5e816
Codecov Report
@@ Coverage Diff @@
## master #4779 +/- ##
==========================================
+ Coverage 83.84% 84.17% +0.32%
==========================================
Files 281 277 -4
Lines 56745 56500 -245
==========================================
- Hits 47577 47558 -19
+ Misses 9168 8942 -226
64c5b5e to f7ff577
Issue
Data, when explicitly centered through the Preprocess widget, would show up as having mean 1e-16 in the Feature Statistics widget (and potentially elsewhere). E.g. File (iris) > Preprocess (normalize to have mean=0) > Feature Statistics
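The display side of this issue can be reproduced without Orange. The sketch below is only an illustration: `format_value` is a hypothetical helper, not the function changed in this PR. It shows how a tiny negative residue left over from centering prints as a negative zero with 3 decimals, and one possible way to avoid that.

```python
def format_value(value, decimals=3):
    # Hypothetical helper, not Orange's implementation.
    # Rounding a tiny negative number gives -0.0; adding 0.0 turns
    # -0.0 into 0.0 and leaves every other value unchanged.
    rounded = round(value, decimals) + 0.0
    return "%.*f" % (decimals, rounded)


# A residue of about -1e-16 formatted directly keeps its sign.
naive = "%.3f" % -1e-16

# The helper drops the sign once the value rounds to zero.
fixed = format_value(-1e-16)

print(naive)  # -0.000
print(fixed)  # 0.000
print(format_value(-0.5))  # -0.500
```

Genuinely negative values keep their sign; only values that round to zero at the chosen precision are affected.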
Description of changes
number_of_decimals=3.
-0.000 because of rounding a tiny negative number, e.g. -1e-16, to zero. Zeros are now always positive.
Possible issues
This is impossible to do through the UI, but someone could potentially call normalize with an sd >> 1, so all the values would be scaled to something tiny, and then str_val would print 0 for everything. Again, this cannot happen in Orange, because there is no way to manually specify the standard deviation. Let me know if you think this is worth fixing.
Includes