
Speed up utils.concat() with aggregate_function #405

Merged: 20 commits merged into main from speed-up-aggregate-function on Nov 21, 2023

Conversation

@hagenw (Member) commented Nov 8, 2023

This adds the aggregate_strategy argument to audformat.utils.concat() to specify when aggregate_function is applied. aggregate_strategy='overlap' keeps the old behavior: the aggregate function is applied to every index entry that is contained in more than one object. The new aggregate_strategy='mismatch' applies the aggregate function only to index entries whose values cannot be joined. For large databases with a lot of overlapping but matching data, as we encountered when testing #399, aggregate_strategy='mismatch' can be faster by more than a factor of 10.
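
A minimal sketch of the difference, based on the behavior described above (results shown as comments):

import numpy as np
import pandas as pd

import audformat

index = audformat.filewise_index(['f1', 'f2'])
y1 = pd.Series([1, 1], index)
y2 = pd.Series([1, 2], index)

# 'overlap' (old behavior): apply np.sum to every index entry
# that is contained in more than one object
audformat.utils.concat(
    [y1, y2],
    aggregate_function=np.sum,
    aggregate_strategy='overlap',
)
# f1 -> sum(1, 1) = 2, f2 -> sum(1, 2) = 3

# 'mismatch': apply np.sum only to entries whose values differ
audformat.utils.concat(
    [y1, y2],
    aggregate_function=np.sum,
    aggregate_strategy='mismatch',
)
# f1 -> 1 (values match), f2 -> sum(1, 2) = 3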

[Image: benchmark results]

hagenw requested a review from ChristianGeng November 8, 2023 15:52
hagenw marked this pull request as draft November 8, 2023 15:56
@hagenw (Member Author) commented Nov 8, 2023

@ChristianGeng sorry for requesting the review already.

I first need to update the tests and double-check that the new behavior is really what we want.
I have marked the pull request as draft until it's ready.

@frankenjoe (Collaborator)

> So, I think in practice we will gain more by applying aggregate_function only to non-matching entries.

Ok, so it means that if you apply e.g. sum() and you have matching entries, the result will now change, right?

@hagenw (Member Author) commented Nov 9, 2023

The implementation I propose here makes it much harder to predict what you get back: if two different columns randomly have the same label, aggregate_function will not be applied. So I would recommend staying with the current approach of applying aggregate_function to all index entries that occur in more than one column.

@hagenw (Member Author) commented Nov 9, 2023

> The implementation I propose here makes it much harder to predict what you get back: if two different columns randomly have the same label, aggregate_function will not be applied. So I would recommend staying with the current approach of applying aggregate_function to all index entries that occur in more than one column.

My comment was too pessimistic. The new changes will still apply aggregate_function to all values of the overlapping columns. It is only skipped if the two columns can be joined completely without the need for aggregate_function.
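
As a rough sketch (not the actual implementation; can_be_joined is a hypothetical helper), "can be joined" means the objects agree on all shared index entries:

import pandas as pd

def can_be_joined(y1: pd.Series, y2: pd.Series) -> bool:
    # values can be joined if they agree
    # on every index entry the two objects share
    shared = y1.index.intersection(y2.index)
    return y1.loc[shared].equals(y2.loc[shared])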

codecov bot commented Nov 9, 2023

Codecov Report

Merging #405 (aa0cc73) into main (fd9c481) will not change coverage. The report is 1 commit behind head on main. The diff coverage is 100.0%.

Impacted files:

File                     Coverage  Δ
audformat/core/utils.py  100.0%    <100.0%> (ø)

hagenw requested a review from frankenjoe November 9, 2023 11:14
hagenw marked this pull request as ready for review November 9, 2023 11:15
@frankenjoe (Collaborator)

> My comment was too pessimistic. The new changes will still apply aggregate_function to all values of the overlapping columns. It is only skipped if the two columns can be joined completely without the need for aggregate_function.

But it still means that sum() will return something different than before, right?

@frankenjoe (Collaborator) commented Nov 9, 2023

Here's a simple example:

import numpy as np
import pandas as pd

import audformat

index = audformat.filewise_index(['f1', 'f2'])
y = pd.Series([1, 2], index)
audformat.utils.concat([y, y], aggregate_function=np.sum)

returns

file
f1    1
f2    2
dtype: Int64

Before it returned:

file
f1    2
f2    4
dtype: Int64

@frankenjoe (Collaborator)

This might not be an issue since aggregate_function is a new feature. I just wonder if it can lead to unexpected behavior.

@hagenw (Member Author) commented Nov 9, 2023

Yes, so far there was no release, so we can change aggregate_function however we want. We just need to decide which behavior we would like to have.

The behavior currently in main applies aggregate_function to all samples that are stored in different columns, whereas the behavior proposed in this pull request applies aggregate_function only to samples that come from columns that cannot be joined. I think the new behavior makes more sense, as otherwise the output of aggregate_function depends on whether labels are stored in several tables or not (e.g. having an unbalanced and a balanced test set where the latter is a subset of the former).
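
As an illustration (a hypothetical sketch, reusing the imports from the example above; the series contents are made up), consider a balanced test set that repeats labels of an unbalanced one:

unbalanced = pd.Series(
    [0, 1, 1],
    audformat.filewise_index(['f1', 'f2', 'f3']),
)
# the balanced set is a subset with identical labels
balanced = pd.Series(
    [0, 1],
    audformat.filewise_index(['f1', 'f2']),
)

# behavior in main: np.sum is applied to f1 and f2,
# although both objects agree on their values
# proposed behavior: the values are joined and np.sum is skipped
audformat.utils.concat([unbalanced, balanced], aggregate_function=np.sum)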

@frankenjoe (Collaborator) commented Nov 10, 2023

In principle I agree. I still find it strange that the following can now happen:

index = audformat.filewise_index(['f1', 'f2'])
y1 = pd.Series([1, 1], index)
y2 = pd.Series([1, 2], index)
audformat.utils.concat([y1, y2], aggregate_function=np.sum)

returns

file
f1    2
f2    3
dtype: Int64

but:

audformat.utils.concat([y1[:1], y2[:1]], aggregate_function=np.sum)

returns

file
f1    1
dtype: Int64

@frankenjoe (Collaborator)

So maybe we need an additional argument to control when the aggregate_function should be applied?

@hagenw (Member Author) commented Nov 15, 2023

I agree that your example looks very counter-intuitive to a user.

I'm also not really in favor of adding another argument. Maybe we should simply close this pull request.
In #399 I solved the issue by first trying to join without an aggregate function and only using it when really needed:

try:
    # first try to join without an aggregate function
    y = audformat.utils.concat(ys)
except ValueError:
    # fall back to the aggregate function
    # for values that cannot be joined
    y = audformat.utils.concat(ys, aggregate_function=aggregate_function)

@frankenjoe (Collaborator)

> by first trying to join without an aggregate function and only using it when really needed:

In the worst case, when the error is raised late, does it mean the function is executed more or less twice?

> I'm also not really in favor of adding another argument.

Mhh, ok. To me it seems to make sense. It could have three values: 1. apply always, 2. apply to all duplicates, 3. apply only to non-matching duplicates.

@hagenw (Member Author) commented Nov 15, 2023

> > by first trying to join without an aggregate function and only using it when really needed:
>
> In the worst case, when the error is raised late, does it mean the function is executed more or less twice?

Good point, I hadn't thought about that.
Yes, if the first columns can be joined without error but the last column raises one, the function will indeed be executed more or less twice.

> > I'm also not really in favor of adding another argument.
>
> Mhh, ok. To me it seems to make sense. It could have three values: 1. apply always, 2. apply to all duplicates, 3. apply only to non-matching duplicates.

OK, I will have a look into it.

hagenw marked this pull request as draft November 15, 2023 11:05
hagenw marked this pull request as ready for review November 15, 2023 12:00
@hagenw (Member Author) commented Nov 15, 2023

I have now added the aggregate argument, which selects between 'always' and 'non-matching' for when to apply aggregate_function.

Your suggestion of 'duplicates' we cannot add, as non-matching labels always have to be adjusted; otherwise we cannot join the columns.

hagenw marked this pull request as draft November 15, 2023 12:07
@hagenw (Member Author) commented Nov 15, 2023

Sorry, there is still something not behaving as expected; I will first take another look.

hagenw marked this pull request as ready for review November 15, 2023 12:26
@hagenw (Member Author) commented Nov 15, 2023

I fixed the remaining problem.

Now we have the following behavior:

concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 1], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='always',
)

returns

0    2
1    2
dtype: Int64

concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 1], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='non-matching',
)

returns

0    1
1    1
dtype: Int64

concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 2], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='non-matching',
)

returns

0    1
1    3
dtype: Int64

There is a third option missing, but I also think it is too hard to grasp:

concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 1], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='third-option',
)

returns

0    1
1    1
dtype: Int64

concat(
    [
        pd.Series([1, 1], index=pd.Index([0, 1])),
        pd.Series([1, 2], index=pd.Index([0, 1])),
    ],
    aggregate_function=np.sum,
    aggregate='third-option',
)

returns

0    2
1    3
dtype: Int64

@hagenw (Member Author) commented Nov 15, 2023

I just realised that the third option is actually what we want to have in audformat.Database.get(). I will add it under the name 'when-non-matching' and rename 'non-matching' to 'only-non-matching'.

@frankenjoe (Collaborator) commented Nov 15, 2023

I think it should be 'not-matching', or maybe better 'mismatch'. But I am also not sure I would understand the difference between 'when-not-matching' and 'only-not-matching'.

In the docstring we usually talk about duplicates. So maybe we could rename the options to 'always', 'duplicates', and 'mismatch'?

@frankenjoe (Collaborator)

I.e., we then have:

  • 'always': apply to every file / segment
  • 'duplicates': apply when files / segments overlap
  • 'mismatch': apply when files / segments overlap and their values do not match

hagenw marked this pull request as draft November 15, 2023 20:21
@hagenw (Member Author) commented Nov 16, 2023

For 'always' I'm not completely sure what to expect. Should it first expand the columns with NaN to make sure they have the same number of entries and then apply the aggregate function, or should it work on the given number of values that are available for each index entry?

E.g. for

audformat.utils.concat(
    [
        pd.Series([1, 1], index=pd.Index(['b', 'c'])),
        pd.Series([2, 3, 4], index=pd.Index(['a', 'b', 'c'])),
    ],
    aggregate_function=lambda y: y[0],
    aggregate='always',
)

the first solution would return

a    NaN
b    1
c    1
dtype: Int64

and the second solution

b    1
c    1
a    2
dtype: Int64

The other question is: do we really need the option to apply the aggregate function to non-overlapping entries?


> But I am also not sure I would understand the difference between 'when-not-matching' and 'only-not-matching'.

I guess that's the reason why it is not covered by your three cases ;)
My current solution in #399 is covered neither by 'duplicates' nor by 'mismatch', but would require 'duplicates-when-mismatch'.

Let's assume we have one case where all values match, and another where one value differs:

Matching case:

objs = [
    pd.Series([1, 1]),
    pd.Series([1, 1]),
]

Non-matching case:

objs = [
    pd.Series([1, 1]),
    pd.Series([1, 2]),
]

Here is what the three different options for aggregate would return with aggregate_function=np.sum:

case          'duplicates'  'mismatch'  'duplicates-when-mismatch'
matching      [2, 2]        [1, 1]      [1, 1]
non-matching  [2, 3]        [1, 3]      [2, 3]
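
To make the table concrete, here is a minimal sketch that reproduces it (using the strategy names from this discussion rather than the final API, and assuming two pandas Series with identical indices):

import numpy as np
import pandas as pd

def aggregate(y1, y2, strategy, func=np.sum):
    stacked = pd.concat([y1, y2], axis=1)
    aggregated = stacked.apply(func, axis=1)  # func over each entry's values
    mismatch = y1 != y2
    if strategy == 'duplicates':
        # apply to every overlapping entry
        return aggregated
    elif strategy == 'mismatch':
        # apply only to entries whose values differ
        return y1.where(~mismatch, aggregated)
    elif strategy == 'duplicates-when-mismatch':
        # apply to all overlapping entries,
        # but only if at least one value differs
        return aggregated if mismatch.any() else y1

For example, aggregate(pd.Series([1, 1]), pd.Series([1, 2]), 'mismatch') returns [1, 3], whereas 'duplicates-when-mismatch' returns [2, 3].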

@frankenjoe (Collaborator)

> For 'always' I'm not completely sure what to expect. Should it first expand the columns with NaN to make sure they have the same number of entries and then apply the aggregate function, or should it work on the given number of values that are available for each index entry?

The latter; for segments with overlap we also do not expand, I think.

> The other question is: do we really need the option to apply the aggregate function to non-overlapping entries?

If it is too complicated to implement we can skip it. But you could use it, for instance, to count the number of overlaps for every segment.

> I guess that's the reason why it is not covered by your three cases ;)
> My current solution in #399 is covered neither by 'duplicates' nor by 'mismatch', but would require 'duplicates-when-mismatch'.

Mhh, I am still not sure I understand what 'duplicates-when-mismatch' does :) Is it that, as soon as there is a value mismatch for one overlapping segment, the aggregate function is applied to all overlapping segments? Why exactly do we need this case?

Btw: 'overlap' would probably be a better name than 'duplicates', as the latter implies that the values also have to match, which is not what we want.

@hagenw (Member Author) commented Nov 17, 2023

> Is it that, as soon as there is a value mismatch for one overlapping segment, the aggregate function is applied to all overlapping segments? Why exactly do we need this case?

Exactly. For audformat.Database.get() we want to apply the aggregate function only if needed, i.e. when a column has at least one value with a mismatch. My main reason for applying it to all values of the column in that case is to better indicate to the user which parts have overlap (e.g. when tuple is chosen as aggregate function), as a label might be the same just by chance, e.g. the same gender for the speakers of the left and right channel. That can still happen, but it is less likely.

The alternative would be to always use 'duplicates', but this would slow down get(), as we have a lot of tables that simply repeat entries found in other tables.
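
For instance (a hypothetical sketch; the expected result is indicated as a comment), choosing tuple as aggregate function makes the overlap directly visible to the user:

audformat.utils.concat(
    [
        pd.Series(['female'], audformat.filewise_index(['f1'])),
        pd.Series(['male'], audformat.filewise_index(['f1'])),
    ],
    aggregate_function=tuple,
)
# f1 would then hold the tuple ('female', 'male'),
# which exposes the overlapping entry and its conflicting labels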

@frankenjoe (Collaborator)

Mhh, ok. But shouldn't we give the user the option to decide which strategy is used?

@hagenw (Member Author) commented Nov 17, 2023

Good idea; then we set the default to 'mismatch' in Database.get() and the user can change it if really needed.
OK, then let's stay with the current three options and rename 'duplicates' to 'overlap' as you suggested.

@frankenjoe (Collaborator)

I think here we could also set the default to 'mismatch', as this is the most common use case.

hagenw marked this pull request as ready for review November 17, 2023 11:44
@hagenw (Member Author) commented Nov 17, 2023

I have now implemented the suggested changes, but skipped 'always', as the desired result was not achievable with the current code. If we still want it, we should open an issue for it.
I selected 'mismatch' as the default behavior.

@frankenjoe (Collaborator)

Very cool, looks like we are done here.

frankenjoe merged commit 9a7b944 into main Nov 21, 2023
10 checks passed
frankenjoe deleted the speed-up-aggregate-function branch November 21, 2023 18:01