Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] normalize: Adjust number_of_decimals after scaling #4779

Merged
merged 4 commits into from
Jun 19, 2020

Conversation

pavlin-policar
Copy link
Collaborator

Issue

Data, when explicitly centered through the preprocess widget, would show up as having mean 1-e16 in the Feature Statistics widget (and potentially elsewhere). E.g. File (iris) > Preprocess (normalize to have mean=0) > Feature Statistics

Description of changes
  1. Anything that is explicitly zero-centered will have number_of_decimals=3.
  2. Sometimes things would show up as -0.000 because of rounding a tiny negative number e.g. 1e-16 to zero. Zeros are now always positive.
Possible issues

This is impossible to do through the UI, but someone could potentially call normalize with an sd>>1, so all the values would be scaled to something tiny, and then str_val would print 0 for everything. Again, this cannot happen in Orange, because there is no way to manually specify the standard deviation. Let me know if you think this is worth fixing.

Includes
  • Code changes
  • Tests
  • Documentation

@lanzagar
Copy link
Contributor

Anything that is explicitly zero-centered will have number_of_decimals=3.

Actually, I don't think zero centering should impact the num. of decimals. Imagine small values (e.g. below 1e-5) that are just centered - they are shifted around 0, but still small and need more decimals. And this can be done in the widget as well!
Rather, it is the rescaling of the variable that should reset the num. of decimals - for normalization to std=1 as well as to 0-1 range it should be OK to use 3 decimals.
For just centering (without scaling) I think the correct fix is to keep the number of decimals of the original variable.

@lanzagar
Copy link
Contributor

If we want to be even more fancy and complicate things:
The new number of decimals after rescaling could even be computed and adjusted after rescaling.
E.g. original var had num_of_decimals=5, then we rescaled by dividing with std=10 -> the new var should have 6 decimals. If we divided by std=0.1 then it should have 4 decimals.

@markotoplak, @janezd - What do you think about just setting to 3 after rescaling or computing the needed number of decimals?

@pavlin-policar pavlin-policar force-pushed the feature-statistics-1 branch 2 times, most recently from aebe793 to 8188592 Compare May 15, 2020 12:21
@pavlin-policar
Copy link
Collaborator Author

I've added this line here

num_decimals += int(-np.floor(np.log10(sd)))

which automatically adjusts the number of decimals based on the standard deviation. I've also had to add to this

        if float(self._format_str % val) == 0:
             return "%.2f" % 0.0

because otherwise, on standardized iris, I would get a mix of 0.000 and 0.0000 (4 zeros), which looked strange.

@pavlin-policar pavlin-policar force-pushed the feature-statistics-1 branch 3 times, most recently from f3536d0 to 4a8fbfc Compare May 15, 2020 12:29
@pavlin-policar
Copy link
Collaborator Author

Let me know if you like this solution, or in what other way this could be solved. If we want to go with this, I'll fix the tests.

Copy link
Contributor

@lanzagar lanzagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the solution, but it needs some more thought and polishing. I made a few line comments, but my suggestions still don't produce the results I would be completely satisfied with.
You can take a look at feature statistics on housing + preprocessing to immediately see some problems that still need solving.

Comment on lines 51 to 57
if self.center:
compute_val = Norm(var, avg, 1 / sd)
num_decimals = 3
else:
compute_val = Norm(var, 0, 1 / sd)
return var.copy(compute_value=compute_val)
num_decimals = None
num_decimals += int(-np.floor(np.log10(sd)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said previously, I don't think the if else about centering should have anything to do with num_decimals.

So probably the computation should be just:
num_decimals = var.number_of_decimals + correction
And it also looks like your correction of int(-np.floor(np.log10(sd))) was not right - for sd=100 it should increase by 2 decimals not decrease...
I think the correct formula is (and someone should doublecheck this!):
num_decimals = var.number_of_decimals + int(np.log10(sd))

Currently, you didn't change normalize_by_span, but I expect the final solution should have the same correction there (just using diff instead of sd)

@@ -42,20 +42,23 @@ def normalize(self, dist, var):
var = self.normalize_by_sd(dist, var)
elif self.norm_type == Normalize.NormalizeBySpan:
var = self.normalize_by_span(dist, var)
var.number_of_decimals = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After removing the reset of num. dec. here, you can delete the next line and just return directly in the elifs above...
... is what I wanted to write, then I saw that there is no else and lint would probably complain :/
(so I guess just leave it as is)

@codecov
Copy link

codecov bot commented Jun 19, 2020

Codecov Report

Merging #4779 into master will increase coverage by 0.32%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4779      +/-   ##
==========================================
+ Coverage   83.84%   84.17%   +0.32%     
==========================================
  Files         281      277       -4     
  Lines       56745    56500     -245     
==========================================
- Hits        47577    47558      -19     
+ Misses       9168     8942     -226     

@lanzagar lanzagar changed the title normalize: set number_of_decimals=3 when centering data [FIX] normalize: Adjust number_of_decimals after scaling Jun 19, 2020
@lanzagar lanzagar merged commit 0cc522f into biolab:master Jun 19, 2020
@pavlin-policar pavlin-policar deleted the feature-statistics-1 branch June 19, 2020 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants