Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Shiny renewed widget Distributions #3896

Merged
merged 26 commits into from
Jul 13, 2019

Conversation

janezd
Copy link
Contributor

@janezd janezd commented Jun 19, 2019

  • The widget does not look nice. In particular, numeric variables are discretized and histograms look like histograms of discrete variables
  • Fixes Probability distribution fitting widget #1849: Adds curve fitting
  • Fixes OWDistributions: legible axis labels #3736: legible axis labels
  • Current non-parametric curves will become just one of possible curves
  • Showing histograms and fitted curves should not be exclusive
  • Widget's code should be improved and modernized

To do:

@janezd janezd changed the title [ENH] Widget Distributions: various improvements [WIP] [ENH] Widget Distributions: various improvements Jun 19, 2019
@janezd janezd force-pushed the owdistributions-fitting branch 3 times, most recently from f3687ba to 6d08361 Compare June 22, 2019 16:27
@codecov
Copy link

codecov bot commented Jun 24, 2019

Codecov Report

Merging #3896 into master will increase coverage by 0.02%.
The diff coverage is 83.06%.

@@            Coverage Diff             @@
##           master    #3896      +/-   ##
==========================================
+ Coverage   84.95%   84.97%   +0.02%     
==========================================
  Files         376      376              
  Lines       65907    66458     +551     
==========================================
+ Hits        55992    56476     +484     
- Misses       9915     9982      +67

@janezd janezd changed the title [WIP] [ENH] Widget Distributions: various improvements [ENH] Shiny renewed widget Distributions Jun 24, 2019
@janezd
Copy link
Contributor Author

janezd commented Jun 24, 2019

The widget includes a lot of user interaction, so the test coverage will not reach the requested threshold. @ajdapretnar can start torturing it when she finishes touring.

@lanzagar
Copy link
Contributor

First of all, I like that we are updating and improving a widget as important as Distributions - it really is one of the first things a user checks to explore new data.

The new version looks very nice, but here are some comments anyway:

  • I don't think the checkboxes under Column should be checked by default - currently the first figure that opens is the hardest to understand and requires the user to think a bit before she knows what she is looking at. I think both unchecked is a good default, but @janezd said he preferred one of them to be checked (forgot which and why?)
  • Number of bins slider is a bit confusing until someone explains how it is implemented and why it works the way it does. A regular user will be frustrated that the number of bins can't be increased from 4 to 5 or set it to 10 etc. Maybe this could be improved by renaming the slider to "Bin width" and show a tooltip over it while dragging the slider that shows the width. This way the user sees the bin width is changing from e.g. 0.5 to 1 to 2 etc. and understands the progression and why some numbers of bins are left out.
  • There are new options to fit various distributions to the data, but we lost the existing gaussian kernel density estimation. Since many users have been using this now and there is no similar alternative among the new options, I am hesitant about removing it completely. A big advantage of it compared to all new distributions was for example its ability to also handle multimodal data.
  • I don't know what I am looking at when I have Show probabilities checked and Fitted distribution=Normal selected :)
    Also, with Show probabilities checked, changing Fitted distribution between None and Normal changes the background figure/axes slightly, which does not happen otherwise.
  • Click and drag the figure up and down. This probably shouldn't be possible.
  • When using Split by, the last column changes from X >= val1 to val1 <= X < val2
  • When Fitted distribution = None and Split by = None there is no legend, but there is a small empty box top right
  • This is probably hard and might not be in the scope for this PR: but I like the dynamic labels very much (eliding based on available space) and the only missing step to an almost perfect solution is to be able to check the whole word. Often long values differ only in the suffixes and seeing only the first half is quite useless. If one could see the complete string in a tooltip while hovering over the elided value it would be great. I am asking about this with reuse of this class in other widgets in mind since this is one of the annoying problems that we want to solve in many places.

@janezd
Copy link
Contributor Author

janezd commented Jul 11, 2019

Fixed

I don't think the checkboxes under Column should be checked by default

I disabled both.

Number of bins slider is a bit confusing until someone explains

I like the suggestion. Tooltip doesn't look well, so I added the bin width as a label. It is a bit weird that dragging right decreases the width (because it increases the number of bins). I can of course reverse it (or rename width -> narrowness :), but then it will look strange in reverse. I prefer it as it is, but won't object if you prefer turning it arond.

I don't know what I am looking at when I have Show probabilities checked and Fitted distribution=Normal selected

The label now changes from "Fitted distributions" to "Fitted probability", and I also added a tooltip.

Click and drag the figure up and down.

And if you zoomed it, it even crashed.

When using Split by, the last column changes from X >= val1 to val1 <= X < val2

When Fitted distribution = None and Split by = None there is no legend, but there is a small empty box top right

Fixed and fixed.

Won't fix

with Show probabilities checked, changing Fitted distribution between None and Normal changes the background figure/axes slightly, which does not happen otherwise.

I don't know why this happens - the curves are computed the same way so they should not affect the x axis (although they are inserted into a different plot). Looks like a glitch that one can waste an afternoon on - but I wouldn't.

we lost the existing gaussian kernel density estimation

We (at least @BlazZupan and I :) agreed to remove it. I don't strongly oppose it and can easily add it later (or somebody else does it).

I like the dynamic labels very much (eliding based on available space) and the only missing step to an almost perfect solution is to be able to check the whole word

Tick labels are not separate QGraphicsItems, they are printed in paint after all the space computations and scalings and mappings. Implementing tooltips would require exploring "what is printed there", or, perhaps, guessing that the mouse is in the labels area and then compute the underlying label index from the mouses's x position... It sounds like a good nightmare, so let us not do it here.

@ajdapretnar
Copy link
Contributor

There is an issue with the selection. By default if you click above the histograms, nothing will be selected. Unless you have Show probabilities checked, then you can select the space above histograms.

I am slightly confused by the missing Gaussian distribution. We use it in many widgets and I think most people are used to it. I do not mind having only histograms, but it might be very strange having this widget without it.

I can displace the legend by pushing it too far out of the screen. 🙈 I don't know how to get it back. It is lost in time and space.

I think it is a little weird having the variable list jump up and down when changing attributes. I am not entirely against it, but I am playing with the idea of having just the Distributions box disabled for discrete.

🐛 Connect Twitter widget to Distributions and search for random 100 tweets.

Traceback (most recent call last):
  File "/Users/p/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 718, in __process_next
    self.process_queued()
  File "/Users/p/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 476, in process_queued
    self.process_node(node_update_front[0])
  File "/Users/p/orange/orange-canvas-core/orangecanvas/scheme/signalmanager.py", line 523, in process_node
    self.send_to_node(node, signals_in)
  File "/Users/p/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 774, in send_to_node
    self.process_signals_for_widget(node, widget, signals)
  File "/Users/p/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 808, in process_signals_for_widget
    handler(*args)
  File "/Users/p/orange/orange3/Orange/widgets/visualize/owdistributions.py", line 382, in set_data
    self.replot()
  File "/Users/p/orange/orange3/Orange/widgets/visualize/owdistributions.py", line 466, in replot
    self._call_plotting()
  File "/Users/p/orange/orange3/Orange/widgets/visualize/owdistributions.py", line 507, in _call_plotting
    self._cont_split_plot()
  File "/Users/p/orange/orange3/Orange/widgets/visualize/owdistributions.py", line 592, in _cont_split_plot
    fitters.append(self._fit_approximation(group_data))
  File "/Users/p/orange/orange3/Orange/widgets/visualize/owdistributions.py", line 629, in _fit_approximation
    fitted = dist.fit(y)
  File "/Users/p/miniconda3/envs/o3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 2261, in fit
    start = self._fitstart(data)
  File "/Users/p/miniconda3/envs/o3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 2100, in _fitstart
    loc, scale = self._fit_loc_scale_support(data, *args)
  File "/Users/p/miniconda3/envs/o3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 2331, in _fit_loc_scale_support
    data_a = np.min(data)
  File "/Users/p/miniconda3/envs/o3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2618, in amin
    initial=initial)
  File "/Users/p/miniconda3/envs/o3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity

@ajdapretnar
Copy link
Contributor

Upon additional exploration, we have some more comments (such ungrateful brats we are!):

  1. Prevent drawing variables with 1 bin. They are uninformative.
  2. Widget is quite a bit slower now due to the computation of bins.
  3. I vote for reversing the Bin width slider. Also the slider doesn't show large values properly. If it is over 1,000 the first part of the value would be hidden. And even integers have decimal zero. Perhaps avoid zero and show more digits, then go to scientific notation for large values?
  4. If I move the legend and it is over a column, a click on the legend selects that column. I am not sure whether this can be avoided, but it is quite strange.
  5. While each variable remembering its own settings seemed like a feature at first, it is indeed very confusing with many variables. I am quite skeptical introducing this before we have a nice 'Reset settings' button for each widget.

@janezd
Copy link
Contributor Author

janezd commented Jul 12, 2019

By default if you click above the histograms, nothing will be selected.

I was aware of this. Extending bounding boxes to the top of the plot would require each bar to know the heights of all other bars. I won't to this. The other two solutions I have require mapping of mouse coordinated. Ask @markotoplak or @VesnaT how complicated are pyqtgraph's mappings. I wasted a few (more) hours this morning, but couldn't. And won't. Making me nervous. Bad mood. Children tiptoeing around me already.

I am slightly confused by the missing Gaussian distribution.

Gaussian distribution is there, at the top, isn't it? See"Normal".

I can displace the legend by pushing it too far out of the screen.

You can do the same in the scatter plot. It's a general issue that should be fixed generally.

I think it is a little weird having the variable list jump up and down when changing attributes.

Now I disable instead of hiding.

Prevent drawing variables with 1 bin. They are uninformative.

Do what instead? Hide them from the list view and post an informative warning at the bottom of the widget? :)

Widget is quite a bit slower now due to the computation of bins.

Computation of thresholds or putting values into bins? The most time consuming operation in the former should be a call to np.unique, which is n log n. Computing bin sizes is linear. Please elaborate.

I vote for reversing the Bin width slider.

OK, reverted.

If I move the legend and it is over a column, a click on the legend selects that column.

Fixed.

Connect Twitter widget to Distributions and search for random 100 tweets.

Please provide another means to replicate (e.g. upload data). Twitter requires creating a developer account or something.

I could guess a problem and add a probable fix, but wouldn't like to pollute the code with unnecessary checks "just for the case".

While each variable remembering its own settings seemed like a feature at first, it is indeed very confusing with many variables.

Do what instead? Keep the slider at the same position? Can't, different variables have different number of possible binnings. Besides, the same position may mean a different number of bins; you would find it confusing, too. I also tried keeping the number of bins as similar as possible when switching variables, e.g. if you have 5 bins and switch to a variables that has 3, 4, 7, 10 or 12, you go to 4. This however doesn't work because it tends to diverge -- you move up and down the list and the number of bins would for some reason go to one of extremes.

Remembering individual settings is not a feature, but necessity.

@ajdapretnar
Copy link
Contributor

Long conversation. 😊

  • Selection: I can select above the histogram when Show probabilities is on. 🤔 I really (I mean really) don't want to cause any family drama, but I am just wondering why it works with relative and not with absolute view. It is only a problem when there are very few data instances, then you get a super-thin line. But I am sure you are aware of this and would fix it if possible (or at least not inhumanely and painstakingly tiresome).

  • With Gaussian I meant the previous continuous plot we had for numeric variables in Distributions.

  • Yeah, disregard the thing about a single bin.

  • Slowness: I was comparing Cell Cycle in mESC (Fluidigm) data from Single Cell Datasets. Distributions was much faster with the old version (switching between variables). I don't know what is causing this slowness, I assumed it was binning.

  • Ignore Twitter. Can't reproduce again. Let's consider it a borderline case. If we can reproduce at any point, I'll open an issue.

  • Are we going to have a "Reset widget" button soon? I think we talked about it at some point. I think it would be nice and it would solve the issue.

@ajdapretnar
Copy link
Contributor

Sorry, sorry, sorry. I have just seen something quite strange.
test.xlsx

Bins are strange for this data. Value results in a very thin column, even though there are two values of 21 and 2 of 4. Cls on the other hand does not even have 2 in the values, but from the histogram it looks like it does. The actual values of cls are 0, 1 and 3, but one cannot see that from the histograms.

Screen Shot 2019-07-12 at 15 36 30

@janezd
Copy link
Contributor Author

janezd commented Jul 12, 2019

  • Selection calls self.items(event.pos()), where event.pos() are mouse coordinates and self.items(...) returns plot items at the given coordinate. When bars extend to the top, they are at the mouse coordinates. When they don't, I'd have to cheat by reporting false boundingRect's of bar items (bad idea - would certainly cause some problems in the future) or by pretending that the mouse is at the bottom of the graph, but one unnecessary fight with my wife later (so the planned cycling with her today is probably cancelled - I don't yet dare to ask) I'm still no closer to mapping graph coordinates to screen coordinates because all transformations in pyqtgraph are identities.

    This widget can move, extend or start selection with keyboard; it shows the user that items can be selected by putting a border around them (and it's a tight border around parallel columns of an item) on hover; it has tooltips on selections that tell you the percentage of selected data ... It's the most advanced selection there is, nema šta nema, so let please let us stop discussing why an item can't be selected by clicking empty space above it.

  • Earlier you mentioned incorrectly printed bin widths. I used variable's repr_val. .0 appeared if the number of decimals was set incorrectly, but this is a problem of, say, Excel reader (and probably appears elsewhere, too). It however didn't work well when the width was smaller than 10 ** -number_of_decimals. Now I use the width that comes directly from binning function and print as %g.

  • Gaussian: as I wrote earlier, "We (at least @BlazZupan and I :) agreed to remove it. I don't strongly oppose it and can easily add it later (or somebody else does it)." The problem is that it's a nice curve, but meaningless. But still, it's trivial to add if somebody decides to.

  • Slowness: I'll try to find this data set and see. Otherwise, determining possible bin widths shouldn't be slow, and counting uses either Table.distributions or Table.contingency for discrete variables (as before), and np.histogram for continuous, which should be way faster than before. Maybe distribution fitting? Did you have it enabled?

  • Reset widget button is similar to report. If widget has the method, the button appears. I doubt it would be of much help here - it's too hidden, and you'd still have to click reset to reset it.

  • Strange bins on test.xlsx. Works as intended, but this feature can look strange on strange data.

    Open the widget preview, with heart disease data and default settings (all check boxes disabled, splittin by class value). Select "major vessels colored". If it looks as it should, consider what had to happen to look as it should. The attribute has just four different values, so we want four bins. Any attribute with no more than 5 values will have bins that match these values. This was done on purpose and it actually complicates the code that defines bins with special cases in several places.

    In the histogram it looks as if cls has value 2. This is because bin boundaries are at values that occur in the data. While one could argue that it would make more sense to put the boundaries at mid-points, remember "major vessels colored" again. Would you like boundaries at 0.5, 1.5 and 2.5?
    They're counting vessels, so having > 2.5 colored vessel (or < 0.5) would look stupid. Similarly, I'd hate to see an animal with less than 0.5 legs.

    I agree that the current solution doesn't look well on artificial data with a few values, but I prefer it too look nice on real data.

@janezd
Copy link
Contributor Author

janezd commented Jul 13, 2019

Outstanding issues:

  • slowness on some data: need to replicate; not inherent in the current solution and can be improved if needed.
  • selecting above bars: can be fixed later; also, non-critical and can be mitigated by using keyboard to select
  • crash on data from Twitter: can't replicate, but probably solved in the latest commit
  • strange bins: works as intended, but can be changed by simply removing an argument in call of decimal_binnings

@janezd janezd merged commit 804b946 into biolab:master Jul 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OWDistributions: legible axis labels Probability distribution fitting widget
4 participants