Speedup utils.concat() with aggregate_function #405
Conversation
@ChristianGeng sorry for requesting the review already. I need to first update the tests and double check that the new behavior is really what we want to have. |
Ok, so it means that if you apply e.g. |
The implementation I propose here makes it much harder to predict what you get back, as if you have two different columns that happen to have the same label it will not apply |
My comment was too pessimistic. The new changes will still apply |
But it still means that |
Here's a simple example:

```python
import numpy as np
import pandas as pd
import audformat

index = audformat.filewise_index(['f1', 'f2'])
y = pd.Series([1, 2], index)
audformat.utils.concat([y, y], aggregate_function=np.sum)
```
Before it returned:
|
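Leaving audformat internals aside, the effect of `np.sum` as an aggregate function over a duplicated index can be modeled in plain pandas (a sketch, not audformat's actual code): the entries sharing an index label are aligned column-wise and the function runs once per row.

```python
import numpy as np
import pandas as pd

# Plain-pandas sketch (not audformat internals) of what np.sum as an
# aggregate function sees: duplicated index entries are aligned
# column-wise and the function runs once per row.
index = pd.Index(["f1", "f2"], name="file")
y = pd.Series([1, 2], index=index)
df = pd.concat([y, y], axis=1)  # columns 0 and 1, both [1, 2]
aggregated = df.apply(np.sum, axis=1)
print(aggregated.tolist())  # [2, 4]
```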
This might not be an issue since |
Yes, so far there was no release, so we can change The behavior currently in |
In principle I agree. I still find it strange that now the following can happen:

```python
index = audformat.filewise_index(['f1', 'f2'])
y1 = pd.Series([1, 1], index)
y2 = pd.Series([1, 2], index)
audformat.utils.concat([y1, y2], aggregate_function=np.sum)
```

but:

```python
audformat.utils.concat([y1[:1], y2[:1]], aggregate_function=np.sum)
```
|
So maybe we need an additional argument to control when the |
I agree that your example looks very counter-intuitive to a user. I'm also not really in favor of adding another argument. Maybe we simply close this pull request.

```python
try:
    y = audformat.utils.concat(ys)
except ValueError:
    y = audformat.utils.concat(ys, aggregate_function=aggregate_function)
```
|
In the worst case when the error is raised late, does it mean the function is executed more or less twice then?
Mhh, ok. To me it seems to make sense; it could have three values:
1. apply always
2. apply on all duplicates
3. apply only on non-matching duplicates. |
Good point, haven't thought about that.
OK, I will have a look into it. |
I have now added the Your third suggestion of |
Sorry, there is still something not behaving as expected, will first take another look. |
I fixed the remaining problem. Now we have the following behavior:

```python
concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 1], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='always',
)
```

returns

```python
concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 1], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='non-matching',
)
```

returns

```python
concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 2], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='non-matching',
)
```

returns

There is a third option missing, but I also think that is too hard to grasp:

```python
concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 1], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='third-option',
)
```

returns

```python
concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 2], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='third-option',
)
```

returns
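For intuition, the `'always'` vs `'non-matching'` behavior described above can be modeled in plain pandas for the two-series, shared-index case. The helper `concat_two` below is hypothetical, not audformat's actual code, and only mimics the described semantics.

```python
import numpy as np
import pandas as pd

def concat_two(y1, y2, aggregate_function, aggregate):
    # Toy model for two series sharing one index; not audformat's code.
    df = pd.concat([y1, y2], axis=1)
    if aggregate == "always":
        # Run the aggregate function on every index entry.
        return df.apply(aggregate_function, axis=1)
    # 'non-matching': keep values that agree, aggregate only conflicts.
    result = y1.copy()
    differs = y1 != y2
    result[differs] = df[differs].apply(aggregate_function, axis=1)
    return result

print(concat_two(pd.Series([1, 1]), pd.Series([1, 1]), np.sum, "always").tolist())        # [2, 2]
print(concat_two(pd.Series([1, 1]), pd.Series([1, 1]), np.sum, "non-matching").tolist())  # [1, 1]
print(concat_two(pd.Series([1, 1]), pd.Series([1, 2]), np.sum, "non-matching").tolist())  # [1, 3]
```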
|
I just realised that the third option is actually what we want to have in |
I think it should be So in the docstring we usually talk about duplicates. So maybe we could rename to |
I.e., we then have:
|
For E.g. for

```python
audformat.utils.concat(
    [
        pd.Series([1, 1], index=pd.Index(['b', 'c'])),
        pd.Series([2, 3, 4], index=pd.Index(['a', 'b', 'c'])),
    ],
    aggregate_function=lambda y: y[0],
    aggregate='always',
)
```

the first solution would return
and the second solution
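The difference between the two candidate behaviors for partially overlapping indices can be sketched in plain pandas (which of the two audformat actually picked is what the discussion above is about, so this is only an illustration of the alternatives):

```python
import pandas as pd

# Sketch of two candidate behaviors for partially overlapping indices.
y1 = pd.Series([1, 1], index=pd.Index(["b", "c"]))
y2 = pd.Series([2, 3, 4], index=pd.Index(["a", "b", "c"]))

union = y1.index.union(y2.index)
df = pd.concat([y1.reindex(union), y2.reindex(union)], axis=1)

# Applying lambda y: y[0] over the whole union hits the gap at 'a',
# where y1 has no value:
over_union = df.apply(lambda row: row.iloc[0], axis=1)  # [NaN, 1.0, 1.0]

# Applying it only where values collide keeps y2's lone entry at 'a':
over_overlap = df.apply(lambda row: row.dropna().iloc[0], axis=1)
print(over_overlap.tolist())  # [2.0, 1.0, 1.0]
```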
The other question is: do we really need the option to apply the aggregate function to non-overlapping entries?
I guess that's the reason why it is not covered by your three cases ;) Let's assume we have one case where all values match and another where one value is different:

Matching case:

```python
objs = [
    pd.Series([1, 1]),
    pd.Series([1, 1]),
]
```

Non-matching case:

```python
objs = [
    pd.Series([1, 1]),
    pd.Series([1, 2]),
]
```

Here is what the three different options for
|
The latter; for segments with overlap we also do not expand, I think.
If it is too complicated to implement we can skip it. But you could use it to count the number of overlaps for every segment for instance.
Mhh, I am still not sure if I understand what
Btw: probably "overlap" would be a better name than "duplicates", as the latter implies that the values also have to match, which is not what we want. |
Exactly. For The alternative would be to always use |
Mhh, ok. But shouldn't we give the user the option to decide which strategy should be used? |
Good idea, then we set the default to |
I think also here we could set the default to |
I have now implemented the suggested changes, but skipped |
Very cool, looks like we are done here. |
This adds the `aggregate_strategy` argument to `audformat.utils.concat()` to specify when `aggregate_function` should be applied. `aggregate_strategy='overlap'` is the old behavior. In addition, `aggregate_strategy='mismatch'` is added, which applies the aggregate function only to index entries whose values cannot be joined. `aggregate_strategy='mismatch'` can be faster by a factor of more than 10x for large databases with lots of matching overlapping data, as we encountered when testing #399.
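The source of the speedup can be sketched in plain pandas (a hedged illustration of the idea, not audformat's actual implementation): one vectorized comparison finds the few disagreeing entries, so the aggregate function only has to run on those rows instead of on every overlapping entry.

```python
import numpy as np
import pandas as pd

# Sketch of the idea behind aggregate_strategy='mismatch': a single
# vectorized comparison finds the few disagreeing entries, and the
# aggregate function only runs there, not on every overlapping row.
rng = np.random.default_rng(0)
n = 100_000
y1 = pd.Series(rng.integers(0, 10, n))
y2 = y1.copy()
y2.iloc[:100] += 1  # only 100 of 100,000 entries actually differ

differs = (y1 != y2).to_numpy()
result = y1.copy()
# np.sum over an aligned pair reduces to elementwise addition here:
result[differs] = y1[differs] + y2[differs]
print(int(differs.sum()))  # 100
```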