-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] New widget: Group By #5541
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5541 +/- ##
==========================================
+ Coverage 85.96% 86.01% +0.05%
==========================================
Files 313 315 +2
Lines 65471 65936 +465
==========================================
+ Hits 56280 56714 +434
- Misses 9191 9222 +31 |
When the data set contains TimeVariable, I get this error (try with Banking Crises from Datasets):
|
Also, wouldn't it make sense to put the grouped by variables in metas for the output data table since they are unique? |
Corpus instances are not retained (i.e. a grouped Corpus is not a Corpus on the output). The problem is (possibly) how the |
This PR need #5547 to work properly for TimeVariables |
It is a bug in ListViewSearch. Will be addressed in the separate PR since it is an issue that also appears elsewhere. |
b7274a3
to
75524c7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done.
I wrote some minor comments. I'll have more in person. :)
return " ".join(str(v) for v in x if not pd.isnull(v) and len(str(v)) > 0) | ||
|
||
|
||
AGGREGATIONS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of "Delta" is not obvious, and it's place doesn't help; I guess it would be better to put it after min and max.
Concatenate is easier to guess (at least for me), but maybe "List of values" would be more informative. For my taste, it is closer to sum than to delta.
Consider renaming and reordering the aggreagations, maybe like this:
Mean, Median, Mode, Standard deviation, Variance,
Sum, List of all values, Min. value, Max. value, Span
First value, Last Value, Random Value, Count defined, Count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"List of values" doesn't really sound right when talking about strings I would say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a list of possible values but space-separated values that appear in the column (with repetitions).
I agree "List of values" is ambiguous. Concatenate may be better after all, unless we come up with something better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't have a better idea for now. I Will leave concatenate and we can change later
Orange/widgets/data/owgroupby.py
Outdated
""" | ||
Reset the aggregation values in the table for the attribute | ||
""" | ||
index = self.index(self.domain.index(attribute), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> Orange.data.Table("zoo").domain.index("name")
-1
I'm guilty as charged, but that's what we inherited from Orange 2. :(
Orange/widgets/data/owgroupby.py
Outdated
def rowCount(self, parent=None) -> int: # pylint: disable=unused-argument | ||
return ( | ||
0 | ||
if self.domain is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or parent.isValid()
, I guess.
Orange/widgets/data/owgroupby.py
Outdated
aggs = sorted( | ||
self.parent.aggregations.get(val, []), key=AGGREGATIONS_ORD.index | ||
) | ||
n_more = "" if len(aggs) <= 2 else f" and {len(aggs) - 2} more" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are <= 3, show all; otherwise show the first two and add and ... more
. Showing "This, that and 1 more" takes about the same width as "This, that and that" :)
Orange/widgets/data/owgroupby.py
Outdated
agg = self.text() | ||
selected_attrs = self.parent.get_selected_attributes() | ||
types_ = set(type(attr) for attr in selected_attrs) | ||
can_be_applied_all = (AGGREGATIONS[agg].types & types_) == types_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the same as types_ <= AGGREGATIONS[agg].types
?
Orange/widgets/data/owgroupby.py
Outdated
"""Callback for table selection change; update checkboxes""" | ||
selected_attrs = self.get_selected_attributes() | ||
|
||
types_ = set(type(attr) for attr in selected_attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is interesting that we never write list(type(attr) for attr in selected_attrs)
instead of [type(attr) for attr in selected_attrs]
, but we often write set(type(attr) for attr in selected_attrs)
instead of {type(attr) for attr in selected_attrs}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my case, the reason for this is the fact that {}
produces dict and not set and for an empty set, one must write set()
. Then sometimes I automatically use set(...) in comprehensions instead of {...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's probably the reason for all of us.
Orange/widgets/data/owgroupby.py
Outdated
types_ = set(type(attr) for attr in selected_attrs) | ||
active_aggregations = [self.aggregations[attr] for attr in selected_attrs] | ||
for agg, cb in self.agg_checkboxes.items(): | ||
cb.setDisabled(len(types_ & AGGREGATIONS[agg].types) == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prehaps not types_ & AGGREGATIONS[agg].types
?
Orange/widgets/data/owgroupby.py
Outdated
for agg, cb in self.agg_checkboxes.items(): | ||
cb.setDisabled(len(types_ & AGGREGATIONS[agg].types) == 0) | ||
|
||
activated = [agg in a for a in active_aggregations] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an observation: it would be fun to use a set here, activated = {agg in a for a in active_aggregations}
, because the condition would then be Qt.checked if activated == {True} else (Qt.Unchecked if activated == {False} else Qt.PartiallyChecked)
. :)
self.openContext(self.data) | ||
|
||
# update selections in widgets and re-plot | ||
self.agg_table_model.set_domain(data.domain if data else None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Domain is already set above, before opening the context (as it should be). Is there any reason to repeat this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above we set the domain to gb_attrs_model
(attribute model for the group by list). Here we set the domain to agg_table_model
which is used for the table. It is set after the context is open to plot the table with settings opened by the context (aggreagtions). Initializing it before the context is open would require another call to the table to replot after the context is open.
Orange/data/aggregate.py
Outdated
|
||
class OrangeTableGroupBy: | ||
""" | ||
The class which object is the result of the groupby operation on Orange's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A class representing the result of ...
Thank you for this widget! I needed this functionality too and was about to start trying an implementation until I saw this pull request. I like it very much, but if I may, I would like to add some more ideas:
I work with time series data and when I group and aggregate I also always have to select the time range of the group to aggregate over. So from a group or stage always only e.g. the last 10 values or 10 minutes or similar are selected and aggregated, because first a settling process must be waited for. Do you understand what I mean? I think this is a relatively common approach.
For each of these aggregations, an additional parameter is required. Also, you might need to use two groupby widgets in a row, the first for grouping and cutting the range, the second for a subsequent aggregation like mean. The apply function could perhaps be called "Custom" and allow text input to use other pandas functions. What do you guys think about this? |
A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable. |
Thanks for the hint. |
Here could be a problem: The normal functions return one value per group, mine suggested usually several. What happens to the output table if "Mean" is selected for one signal and "Tail" for another? When selecting a time range (with e.g. last('10s')) the selection of the time channel to be used would also be necessary. |
433ada6
to
b532270
Compare
What unique widget do can be already done with this widget (except the option I think @mw25 had in mind a case where we just group and do not attach any aggregation to it. For example, if we have a table with attributes a, b, c and we group by a and b the table would include only columns a and b (with unique values combinations). Do I understand it right? It cannot be done with the unique widget. I actually think that Group By widget should return this kind of "aggregation" when turning off all aggregations on all attributes. Currently, the error is raised by Pandas. I will fix that. |
I tried hard again to remember what I originally wanted =D |
You can already achieve the second option with Feature Constructor and |
Regarding the other two aggregations:
|
Oh yes that's right. I knew I missed something...
@PrimozGodec Yes, I noticed that afterwards, too. Did you see my comment:
These preprocessing options would then of course apply to all signals and be listed separately in the widget. |
Personally, I'd have a separate widget in Timeseries to handle such use cases. I'd prefer to keep the core Group by widget simple and straightforward. |
I didn't want to comment it, but if you ask me directly, I will: I don't like it. For those who want to code, there are Feature Constructor and Python Script. If we add it here, we'll want to add it to the next widget next week and soon many widgets will have inputs that will only be understood by programmers. |
edcc500
to
cc574b7
Compare
Thank you for your feedback. I think we agree that a custom option is not an option. But what about another box (e.g. in the controlArea under the signal list) with an checkable option "Preprocessing" or "Range selection" or something? There could be a dropdown with the functions tail (n values), head (n values), tail (n seconds), head (n seconds), ... and another input field for n (and maybe another one for the time channel?). Would this still be possible in the groupby widget? Otherwise I think @ajdapretnar idea to implement this in timeseries is good. |
I have extended OrangeTableGroupBy and Table.groupby() a bit to allow preprocessing / range selection for each group. Regardless of whether the preprocessing is placed in the groupby widget or a timeseries widget, or is only available via scripting (or only in a private widget), I would like to propose my extension. How can I commit or propose my changes? |
@mw25 As I think your preprocessing is very much related to timeseries, I would propose you make a PR on the Timeseries repository, where we can review it and merge it. The PR should include at least tests and some documentation. Ideally, this would be a widget for timeseries preprocessing called Preprocess Timeseries. That's how I see it. |
I would propose one additional change (and to demonstrate my commitment, I already made a commit). I would rearrange the columns, that are currently like this: to be like this: which, although visually a bit unusual because the single item is in the last, not the first column, better groups items into categories (statistical moments, sums and extremes, single values and counts). @ajdapretnar (who otherwise visually prefers the original, but agrees that the organization is better in the reformatted version) also commented that this looks better for discrete attributes, because it doesn't disable a randomly scattered collection of options but basically the first two columns (except Concatenate). |
@janezd thank you for the changes made. It was one of the modifications that I wanted to make today. It looked a bit bad to me that in the last column two items were missing, so I had an idea to rearrange such that one is missing in the last two columns. But what you proposed looks actually better (ok it looks a bit off in the beginning but it makes more sense). :) So I agree with changes. |
I fixed the tests. Now it is ready to merge from the implementation part of the widget. Still, the documentation is missing. I will work on it but will probably not have time today. You can merge it if you think it would be good to have a widget in Orange soon and I can then add the documentation in the next PR. If you think we can wait I can add the documentation to this PR in the beginning of next week. |
cd6308a
to
3212dd3
Compare
3212dd3
to
02d205e
Compare
Issue
Orange is missing group by - aggregate widget with similar behaviour than Panda groupby - aggregate
Description of changes
Documentation will be added when we agree on the widget's functionality.
Includes