
Add audformat.Database.get() #399

Merged
merged 46 commits into from
Nov 24, 2023

Conversation

hagenw
Member

@hagenw hagenw commented Oct 19, 2023

Closes #398

Adds audformat.Database.get() to request a data frame containing columns with labels based on a scheme, limited by selected tables and/or splits and additional schemes drawn from the whole database.

To also support databases that do not use schemes, forgot to assign a scheme to a column, or use dictionaries to provide scheme labels, I added a strict argument, which defaults to False.

For schemes that have a simple mapping, e.g. transcriptions with {'a0': 'some text'}, the label is expanded unless the user explicitly disables this with map=False.

For example:

import audb
import audformat

db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get('emotion', ['gender'])

returns

                   emotion  gender
file                              
wav/03a01Fa.wav  happiness    male
wav/03a01Nc.wav    neutral    male
wav/03a01Wa.wav      anger    male
wav/03a02Fc.wav  happiness    male
wav/03a02Nc.wav    neutral    male
...                    ...     ...
wav/16b10Lb.wav    boredom  female
wav/16b10Tb.wav    sadness  female
wav/16b10Td.wav    sadness  female
wav/16b10Wa.wav      anger  female
wav/16b10Wb.wav      anger  female

[535 rows x 2 columns]



I also updated the docstring example for the audformat.Database class:

[screenshot of the updated docstring example]

@hagenw hagenw marked this pull request as draft October 19, 2023 09:35
@codecov

codecov bot commented Oct 19, 2023

Codecov Report

Merging #399 (3beb8c5) into main (9a7b944) will not change coverage.
The diff coverage is 100.0%.

Additional details and impacted files
Files Coverage Δ
audformat/core/database.py 100.0% <100.0%> (ø)
audformat/core/utils.py 100.0% <ø> (ø)

@hagenw
Member Author

hagenw commented Oct 20, 2023

@frankenjoe I implemented a first version that already provides a bunch of features. But there are also some challenges left, and I don't yet know how best to tackle them:

Loose definition of a scheme

As we did not start out with misc tables and we also allow columns without a scheme,
it makes sense to not only return labels matching a scheme, but also those based on column names and scheme label names.
But we might think about adding a strict argument that enforces returning only labels matched by the scheme.

Handling of different labels for same data point

If the ratings in the example from the description are the same,
they will be grouped together (which is a feature of audformat.utils.concat()), e.g.

import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([0, 1])

db.get('rating')

returns

      rating
file        
f1         0
f2         1

I'm not sure yet if we can always say this is the expected behavior.

A related issue: if you have more than two entries, the algorithm is not always able to group some of the columns together, e.g.

import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])
db['session3'] = audformat.Table(index)
db['session3']['rating'] = audformat.Column(scheme_id='rating')
db['session3']['rating'].set([1, 1])

db.get('rating')

returns

      rating  rating-1  rating-2
file                            
f1         0         1         1
f2         1         1         1

instead of merging rating-1 and rating-2.
Even worse, this means the result now depends on the order of the tables, e.g.

import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([1, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])
db['session3'] = audformat.Table(index)
db['session3']['rating'] = audformat.Column(scheme_id='rating')
db['session3']['rating'].set([0, 1])

db.get('rating')

returns

      rating  rating-1
file                  
f1         1         0
f2         1         1

even though the original ratings were identical, just stored in a different order.
So I think we need a better solution here.

Handling of different data types

Unfortunately, we easily get completely different dtypes for the same scheme when collecting all the labels from the database. There are several reasons for this, e.g. in some cases the scheme is not assigned, or the values are stored as labels in a dict. I already cover some of those cases, but the following still fails, as it produces two categorical dtypes that differ in the dtypes of their categories (int vs. float):

import audformat
import pandas as pd

db = audformat.Database('db')
db.schemes['label'] = audformat.Scheme('int', labels=[0, 1]) 
db['speaker'] = audformat.MiscTable(
    pd.Index(['s1', 's2'], dtype='string', name='speaker')
)   
db['speaker']['label'] = audformat.Column()
db['speaker']['label'].set([1.0, 1.0])
db['files'] = audformat.Table(audformat.filewise_index(['f1', 'f2']))
db['files']['label'] = audformat.Column(scheme_id='label')
db['files']['label'].set([0, 1]) 
db['other'] = audformat.Table(audformat.filewise_index(['f1', 'f2']))
db.schemes['speaker'] = audformat.Scheme('str', labels='speaker')
db['other']['speaker'] = audformat.Column(scheme_id='speaker')
db['other']['speaker'].set(['s1', 's2'])

Then we get:

>>> db['other']['speaker'].get(map='label')
file
f1    1.0
f2    1.0
Name: label, dtype: category
Categories (1, float64): [1.0]

>>> db['files']['label'].get()
file
f1    0
f2    1
Name: label, dtype: category
Categories (2, int64): [0, 1]

>>> db.get('label')
...
ValueError: All categorical data must have the same dtype.
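The failure can be reproduced with plain pandas, independent of audformat; this is just an illustration of the underlying behavior (variable names are mine, not from the code base):

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categoricals whose *categories* have different dtypes
# (float64 vs. int64), mirroring the situation above
a = pd.Categorical([1.0, 1.0])  # categories are float64
b = pd.Categorical([0, 1])      # categories are int64

try:
    union_categoricals([a, b])
    failed = False
except TypeError:  # pandas refuses to union categories of different dtypes
    failed = True

# Casting the categories to a common dtype first makes the union work
b_float = pd.Categorical(pd.Series(b).astype('float64'))
merged = union_categoricals([a, b_float])
```

One conceivable fix inside audformat would therefore be to cast the categories of all matching columns to a common dtype before concatenating, instead of raising.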

@frankenjoe
Collaborator

frankenjoe commented Oct 20, 2023

Handling of different data types

To me it sounds fine to raise an error if we get different data types for the same scheme. Maybe we can explain in the error message how the user should fix the database. Or would you say it's a valid use case to store gender values in one place as ['female', 'male'], in another as ['f', 'm'], and somewhere else as [0, 1]? To me it seems only fair to blame the database in that case.

Loose definition of a scheme

Again, I am not sure we should encourage users to store semantically related values in one place as a plain column (e.g. a column named 'gender' without a scheme) and in another place through a scheme (e.g. in a speaker column that connects to a gender scheme). As before, I would say it's fine to find values either in a matching column or through a scheme, but raise an error if we find both.

Handling of different labels for same data point

This is indeed tricky. The current solution of spreading the values into several columns is not too elegant: once because of the ordering issue you mentioned, but also because it makes the result hard to predict, as we do not know in advance how many columns will be returned. If we go for this solution I would at least use a multi-index for the columns, so that result['scheme'] returns all columns for scheme and not just the first one. Or alternatively, combine the values into tuples.


A completely different approach would be to return an iterator so that the search results are returned one after another. This would leave it to the user how to deal with those issues :)


One general issue I see with the current solution is that it returns values from all splits. But in many use-cases you want the values of a specific split.

@hagenw
Member Author

hagenw commented Oct 21, 2023

One general issue I see with the current solution is that it returns values from all splits. But in many use-cases you want the values of a specific split.

There are the tables and/or splits arguments, which you can use to get only a certain split.
What might be suboptimal at the moment is that the data is first limited to the given split, which means that if you request a label from the test set plus gender information, it will most likely not find any gender information.
So maybe we should change this and limit only the returned results to the requested split.

@hagenw
Member Author

hagenw commented Oct 21, 2023

A completely different approach would be to return an iterator so that the search results are returned one after another. This would leave it to the user how to deal with those issues :)

I see the advantage of this as well, but it also has a big downside.
For me, the main motivation to introduce Database.get() was to be able to write code that gets a particular label
without having to know in which table it is stored, e.g. I know which test set I should use and just want to add gender information as well. On the other hand, we still have the problem that you get back any number of columns.
If we cannot solve this we might indeed switch to not returning a data frame, but it wouldn't be my preferred solution for now.

As before, I would say it's fine to either find values in an according column or through a scheme, but raise an error if we find both.

I think we need both solutions:

  • One that never raises an error, as you might want to parse a huge number of databases that were created by different authors and that you cannot easily fix. For the problem with different datatypes we might see if there is a way to automatically find a data type that can handle all the data (or in the worst case, simply convert everything to object).
  • One that raises errors and recommends fixes. We might limit this to returning schemes only, or we add another independent argument for it.

To summarize: add 1-2 arguments to control how strict we are with parsing.
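The "automatically find a data type that can handle all the data" idea from the first point could be sketched with numpy's type promotion (a sketch only, not the actual implementation; variable names are mine):

```python
import numpy as np
import pandas as pd

# Values for the same scheme stored with different dtypes
a = pd.Series([0, 1], dtype='int64')
b = pd.Series([1.0, 1.0], dtype='float64')

# Let numpy promote to a dtype that can hold both,
# instead of falling back to object
common = np.result_type(a.dtype, b.dtype)  # float64
merged = pd.concat([a.astype(common), b.astype(common)], ignore_index=True)
```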

@frankenjoe
Collaborator

  • For the problem with different datatypes we might see if there is a way to automatically find a data type that can handle all the data

We could allow the user to specify the data type and cast the results we find accordingly. I would prefer that over simply converting everything to object.

There are the tables and/or splits arguments, which you can use to get only a certain split.

Great, I didn't realize there is already an option for it. But there indeed seems to be a problem with the way it is currently implemented, e.g.:

db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get(['emotion', 'gender'])
                   emotion  gender
file                              
wav/03a01Fa.wav  happiness    male
wav/03a01Nc.wav    neutral    male
wav/03a01Wa.wav      anger    male
wav/03a02Fc.wav  happiness    male
wav/03a02Nc.wav    neutral    male
...                    ...     ...
wav/16b10Lb.wav    boredom  female
wav/16b10Tb.wav    sadness  female
wav/16b10Td.wav    sadness  female
wav/16b10Wa.wav      anger  female
wav/16b10Wb.wav      anger  female

but:

db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get(['emotion', 'gender'], splits=['test'])
                   emotion
file                      
wav/12a01Fb.wav  happiness
wav/12a01Lb.wav    boredom
wav/12a01Nb.wav    neutral
wav/12a01Wc.wav      anger
wav/12a02Ac.wav       fear
...                    ...
wav/16b10Lb.wav    boredom
wav/16b10Tb.wav    sadness
wav/16b10Td.wav    sadness
wav/16b10Wa.wav      anger
wav/16b10Wb.wav      anger

@hagenw
Member Author

hagenw commented Oct 24, 2023

Handling of different labels for same data point

I think we cannot simply return a tuple inside a column for entries that have more than one label, as the information where the label comes from is lost, but most likely required for further processing.

One solution might be to use multi-index columns, but I'm not convinced by their usability in our case.

At the moment I would propose to expand the index of the data frame:

  • always return a segmented index
  • add table and column levels that store where the labels are coming from

Example:

import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])

we will get

>>> db.get('rating')
                                 rating
file start  end table    column        
f1   0 days NaT session1 rating       0
f2   0 days NaT session1 rating       1
f1   0 days NaT session2 rating       1
f2   0 days NaT session2 rating       1

The user would then have the possibility to count the number of available labels using len(df), and could check for duplicates with any(df.index.droplevel([3, 4]).duplicated()). If there are duplicates, the user could raise an error or provide code that handles them, e.g. use only data points from the first session:

df = df.reset_index([3, 4])
df = df[df.table == 'session1']

If there are no duplicates the index can be used as it is and should work with audinterface.
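The duplicate check and the filtering can be sketched with plain pandas, using the index values from the example above (not audformat code):

```python
import pandas as pd

# Rebuild the proposed expanded index from the example above
index = pd.MultiIndex.from_tuples(
    [
        ('f1', pd.Timedelta(0), pd.NaT, 'session1', 'rating'),
        ('f2', pd.Timedelta(0), pd.NaT, 'session1', 'rating'),
        ('f1', pd.Timedelta(0), pd.NaT, 'session2', 'rating'),
        ('f2', pd.Timedelta(0), pd.NaT, 'session2', 'rating'),
    ],
    names=['file', 'start', 'end', 'table', 'column'],
)
df = pd.DataFrame({'rating': [0, 1, 1, 1]}, index=index)

# Check for multiple labels per (file, start, end)
has_duplicates = df.index.droplevel([3, 4]).duplicated().any()

# Keep only data points from the first session
df = df.reset_index([3, 4])
df = df[df.table == 'session1']
```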


@frankenjoe I have not yet updated the tests and first wanted to get some feedback on this solution.


Unfortunately, my proposed solution has some unwanted consequences:

>>> db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
>>> db.get(['emotion', 'gender'], splits=['test'])
                                                                            emotion  gender
file            start  end table                                 column                    
wav/12a01Fb.wav 0 days NaT emotion                               emotion  happiness     NaN
                           emotion.categories.test.gold_standard emotion  happiness     NaN
                           files                                 speaker        NaN    male
wav/12a01Lb.wav 0 days NaT emotion                               emotion    boredom     NaN
                           emotion.categories.test.gold_standard emotion    boredom     NaN
...                                                                             ...     ...
wav/16b10Wa.wav 0 days NaT emotion.categories.test.gold_standard emotion      anger     NaN
                           files                                 speaker        NaN  female
wav/16b10Wb.wav 0 days NaT emotion                               emotion      anger     NaN
                           emotion.categories.test.gold_standard emotion      anger     NaN
                           files                                 speaker        NaN  female

[693 rows x 2 columns]

@hagenw
Member Author

hagenw commented Oct 24, 2023

But there indeed seems to be a problem with the way it is currently implemented, e.g.:

I have now updated the handling of the tables and splits arguments and filter only at the very end:

>>> db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
>>> db.get(['emotion', 'gender'], splits=['test'])
                   emotion  gender
file                              
wav/12a01Fb.wav  happiness    male
wav/12a01Lb.wav    boredom    male
wav/12a01Nb.wav    neutral    male
wav/12a01Wc.wav      anger    male
wav/12a02Ac.wav       fear    male
...                    ...     ...
wav/16b10Lb.wav    boredom  female
wav/16b10Tb.wav    sadness  female
wav/16b10Td.wav    sadness  female
wav/16b10Wa.wav      anger  female
wav/16b10Wb.wav      anger  female

[231 rows x 2 columns]

@frankenjoe
Collaborator

At the moment I would propose to expand the index of the data frame

Mhh, probably better than expanding into the columns. But it has the downside that a user will probably have to do some post-processing before it can be used e.g. to train a model. How about introducing an argument where the user can specify an aggregate function that is applied if we get more than one result for a file / segment?
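The proposed aggregate function boils down to a groupby over the index; here is a pure-pandas sketch of the idea (not the final audformat API):

```python
import pandas as pd

# Two ratings per file, collected from different tables
y = pd.Series([0, 1, 1, 1], index=['f1', 'f2', 'f1', 'f2'], name='rating')

# Collapse multiple results per file with a user-supplied aggregate, e.g. mean
aggregated = y.groupby(level=0).mean()
```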

@hagenw
Member Author

hagenw commented Oct 24, 2023

At the moment I would propose to expand the index of the data frame

Mhh, probably better than expanding into the columns. But it has the downside that a user will probably have to do some post-processing before it can be used e.g. to train a model. How about introducing an argument where the user can specify an aggregate function that is applied if we get more than one result for a file / segment?

As a user would need to add post-processing to handle the multiple labels anyway, it seems reasonable to solve the issue by providing the custom code directly as an aggregate function.
Thanks for the suggestion, I will try to implement it.

@hagenw
Member Author

hagenw commented Oct 25, 2023

I have now implemented a first version of the aggregate_function argument and it seems to be very helpful.
You can solve the issue of multiple labels with it,
but also the case where you have non-matching dtypes.
I have added several examples to the tests and to the docstring (see updated screenshot).

Here is a simple example that converts all labels to uppercase and adjusts the dtype accordingly:

import audb
import pandas as pd

def upper(y, db, table_id, column_id):
    data = [v.upper() for v in y.values]
    dtype = pd.CategoricalDtype(categories=['FEMALE', 'MALE'], ordered=False)
    return pd.Series(data, index=y.index, name=y.name, dtype=dtype)

db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get('gender', aggregate_function=upper)

which returns

                 gender
file                   
wav/03a01Fa.wav    MALE
wav/03a01Nc.wav    MALE
wav/03a01Wa.wav    MALE
wav/03a02Fc.wav    MALE
wav/03a02Nc.wav    MALE
...                 ...
wav/16b10Lb.wav  FEMALE
wav/16b10Tb.wav  FEMALE
wav/16b10Td.wav  FEMALE
wav/16b10Wa.wav  FEMALE
wav/16b10Wb.wav  FEMALE

[535 rows x 1 columns]

@frankenjoe: if you agree with the implementation, I would update some of the error messages we raise to point to defining such a function, and maybe look for a shorter name for the argument?

@frankenjoe
Collaborator

Here is a simple example that converts all labels to uppercase and adjusts the dtype accordingly:

Could you also provide a simple example where we have multiple values for the same file or segment? As far as I see, it now calls the aggregate function with the values from a specific column; I was expecting a function that gets called with a list of values found for a specific segment or file. But maybe that is already covered by your solution and I just don't see it :)

@hagenw
Member Author

hagenw commented Oct 25, 2023

Sure, let's start with this database:

import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])

Without an aggregate function you will now get an error (I would update the error message and point to the aggregate_function argument):

>>> db.get('rating')
...
ValueError: Found overlapping data in column 'rating':
      left  right
file             
f1       0      1

There are several possible solutions for this database:

  1. select a particular session
  2. add the table name to the column
  3. calculate mean over the values

1. select a particular session

import pandas as pd

def select_session1(y, db, table_id, column_id):
    if table_id != 'session1':
        index = audformat.filewise_index()
        y = pd.Series(index=index, dtype=y.dtype, name=y.name)
    return y
>>> db.get('rating', aggregate_function=select_session1)
      rating
file        
f1         0
f2         1

2. add the table name to the column

def add_table_name(y, db, table_id, column_id):
    y.name = f'{y.name}-{table_id}'
    return y
>>> db.get('rating', aggregate_function=add_table_name)
      rating-session1  rating-session2
file                                  
f1                  0                1
f2                  1                1

3. calculate mean over the values

This is indeed not so nicely supported.
You can do the following,
which of course requires knowledge of the table names:

def average_sessions(y, db, table_id, column_id):
    if table_id in ['session1', 'session2']:
        name = y.name
        if table_id == 'session1':
            y2 = db['session2'][column_id].get()
        else:
            y2 = db['session1'][column_id].get()
        y = pd.concat([y, y2], axis=1).mean(axis=1)
        y.name = name
    return y

or

def average_sessions(y, db, table_id, column_id):
    name = y.name
    if table_id == 'session1':
        y2 = db['session2'][column_id].get()
        y = pd.concat([y, y2], axis=1).mean(axis=1)
        y.name = name
    else:
        index = audformat.filewise_index()
        y = pd.Series(index=index, name=name, dtype='float')
    return y
>>> db.get('rating', aggregate_function=average_sessions)
      rating
file        
f1       0.5
f2       1.0

Case 3 would be much easier with your suggested approach, but I also find it nice to have the possibility to change all labels, as this allows you to combine things (see the ['gender', 'sex'] example in the tests) or to fix dtypes.

Don't know if there is a way to combine both approaches?

@frankenjoe
Collaborator

frankenjoe commented Oct 25, 2023

Don't know if there is a way to combine both approaches?

I would use the aggregate function only to combine values that belong to the same file / segment. As you said case 3. is rather complicated and this was actually the main motivation to have an aggregate function.

To replace the aggregate function that you propose, we could have an argument that adds column and table IDs to the returned data frame. Example 1 could then be solved with df[df['table_id'] == 'session1'] and Example 2 with df['name'] += '-' + df['table_id'].
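The ID-column idea could look like this in plain pandas (the column names table_id / column_id are assumptions for illustration, not an existing API):

```python
import pandas as pd

# Matching values with their origin attached as regular columns
df = pd.DataFrame(
    {
        'table_id': ['session1', 'session2'],
        'column_id': ['rating', 'rating'],
        'rating': [0, 1],
    },
    index=pd.Index(['f1', 'f1'], name='file'),
)

# Example 1: keep only values from one table
session1 = df[df['table_id'] == 'session1']

# Example 2: encode the origin in the column name instead
wide = df.set_index('table_id', append=True)['rating'].unstack('table_id')
wide.columns = [f'rating-{c}' for c in wide.columns]
```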

@hagenw
Member Author

hagenw commented Oct 25, 2023

Example 1 could then be solved with df[df['table_id'] == 'session1']

This will unfortunately not work, as we cannot create df in the first place: it will already raise an error because we have different labels for the same index. If we want to select a rating based on the table ID, this has to happen inside the aggregate function. This was one of the big hurdles why I did not choose your suggested implementation, but instead took a shortcut that works on single columns.

But I will see if I can manage to create an aggregate function that also has access to all the table and column names of the label values.

@frankenjoe
Collaborator

This will unfortunately not work, as we cannot create df in the first place: it will already raise an error because we have different labels for the same index

Wait a moment, but that is the error solved by the aggregation function, no?

@hagenw
Member Author

hagenw commented Oct 25, 2023

This will unfortunately not work, as we cannot create df in the first place: it will already raise an error because we have different labels for the same index

Wait a moment, but that is the error solved by the aggregation function, no?

Yes, but if we aggregate the values first, e.g. by taking the mean, we cannot later on select the value for the first session.

@frankenjoe
Collaborator

Hehe, I think it's getting too complicated for me :)

@hagenw
Member Author

hagenw commented Oct 25, 2023

One solution might be to combine both approaches:

  • adding aggregate_function as an argument to audformat.utils.concat() and audformat.Database.get() that is not aware of where the values come from, but just combines them into a single value
  • adding modify_function as an argument to audformat.Database.get(), which can modify values read from a column and has access to the column ID, table ID, and database object

Maybe I will start by creating another pull request, independent of this one, that adds aggregate_function to audformat.utils.concat()?

@hagenw
Member Author

hagenw commented Oct 25, 2023

I created #401 to first add aggregate_function to audformat.utils.concat().

@hagenw
Member Author

hagenw commented Oct 27, 2023

I now added the following three arguments:

  • aggregate_function that is simply passed on to audformat.utils.concat() to combine several values (e.g. by averaging them)
  • modify_function that provides a way to select only certain parts of the matching values and/or to modify them, e.g. based on the table ID, column ID, and other entries of the database object
  • strict to disable the matching of column names and scheme label dictionary keys

I also added some examples to the docstring showcasing what you can do with it.


There is another question: with the current implementation we return the name of the column, or several columns if they have different names, for a matching scheme:

>>> db = Database('mydb')
>>> db.schemes['rating'] = Scheme('float')
>>> db['run1'] = Table(filewise_index(['f1', 'f2']))
>>> db['run1']['rater1'] = Column(scheme_id='rating')
>>> db['run1']['rater1'].set([0.0, 0.9])
>>> db['run2'] = Table(filewise_index(['f3']))
>>> db['run2']['rater1'] = Column(scheme_id='rating')
>>> db['run2']['rater1'].set([0.7])
>>> db.get('rating')
      rater1
file        
f1       0.0
f2       0.9
f3       0.7

>>> db = Database('mydb')
>>> db.schemes['rating'] = Scheme('float')
>>> db['run1'] = Table(filewise_index(['f1', 'f2']))
>>> db['run1']['rater1'] = Column(scheme_id='rating')
>>> db['run1']['rater1'].set([0.0, 0.9])
>>> db['run1']['rater2'] = Column(scheme_id='rating')
>>> db['run1']['rater2'].set([0.2, 0.7])
>>> db.get('rating')
      rater1  rater2
file                
f1       0.0     0.2
f2       0.9     0.7

This might not be a good idea, as a user doesn't know how many columns will be returned and how they are named.
So maybe it is better to name the returned columns after the scheme by default, and require an aggregate_function or modify_function to handle the case of different columns?

E.g. to re-enable the current behavior, the user could write a simple modify function:

def return_column_names(y, db, table_id, column_id):
    y.name = column_id
    return y

@frankenjoe
Collaborator

frankenjoe commented Oct 30, 2023

I wonder if we should think more about the actual use-case for the function. At the moment it seems very powerful, but also hard to predict what the result will be.

I guess one of the main use-cases will be to retrieve gold standard labels for a specific split. E.g.

import audformat

db = audformat.Database('mydb')
db.schemes['emotion'] = audformat.Scheme('float')
db.splits['test'] = audformat.Split('test')
db['test.gold'] = audformat.Table(
    audformat.filewise_index(['f1', 'f2']),
    split_id='test',
)
db['test.gold']['emotion'] = audformat.Column(scheme_id='emotion')
db['test.gold']['emotion'].set([1.0, 1.0])
db.get('emotion', splits='test')
      emotion
file         
f1        1.0
f2        1.0

This is really nice, since get() allows us to retrieve the values without exact knowledge of the table and column names.

But let's assume we have also stored self report values somewhere in the database:

import numpy as np

db['self-report'] = audformat.Table(audformat.filewise_index(['f1', 'f2']))
db['self-report']['emotion'] = audformat.Column(scheme_id='emotion')
db['self-report']['emotion'].set([0.0, 0.0])
db.get('emotion', splits='test', aggregate_function=np.mean)
      emotion
file         
f1        0.5
f2        0.5

We can see that those values mess up the result, even though the table with the self-report values is not even assigned to the test split. So I think at the moment it's hard for a user to predict what the function will return.

@frankenjoe
Collaborator

This might be not a good idea as a user doesn't know how many columns will be returned and how they are named.
So maybe it is better as default to name the returned columns by the scheme and require an aggregate_function or modify_function to handle the case with different columns?

Yes, I also find it problematic that you can request rating but it returns a column rater1 like in your first example, or even two columns as in your second example. In the second example my expectation was rather that the user has to provide an aggregate function. And in both cases, the name of the column should be rating, I think.

@frankenjoe
Collaborator

I am also not too convinced about the modify_function yet. It seems that this function only makes sense if you already have knowledge about the table and column names in the database. But to me the main use case of the get() function is that you don't need to know these details. Maybe we should split it up into two functions? One function that implements the original idea of get() as a shortcut for quickly accessing values of a specific split, and a second one that works more like an iterator over matching tables / columns, with the option to modify the returned values?

@hagenw
Member Author

hagenw commented Oct 30, 2023

This might not be a good idea, as a user doesn't know how many columns will be returned and how they are named.
So maybe it is better to name the returned columns after the scheme by default, and require an aggregate_function or modify_function to handle the case of different columns?

Yes, I also find it problematic that you can request rating but it returns a column rater1 like in your first example, or even two columns as in your second example. In the second example my expectation was rather that the user has to provide an aggregate function. And in both cases, the name of the column should be rating, I think.

I changed the behavior and it now always returns the requested schemes as column names.
Two cases are still missing and on my TODO:

  • a returned empty data frame should also contain the column names of the requested schemes
  • if one of the requested schemes is missing, it should be included as an empty column in the data frame and not dropped

@hagenw
Member Author

hagenw commented Oct 30, 2023

I wonder if we should think more about the actual use-case for the function. At the moment it seems very powerful, but also hard to predict what the result will be. [...] So I think at the moment it's hard for a user to predict what the function will return.

Yes, there are several cases where we might run into problems, e.g. the database contains several test splits, or different values are reported under the same scheme (as in your example).
I don't think there is an easy solution for this problem, besides introducing restrictions during creation of the database.

At the moment, I would say that you should use aggregate_function only if you know exactly what will happen. E.g. if I'm only interested in counting the available test set files with emotion labels, your call to the function is still valid. If you want to use the labels for testing, there should be no need to average them.

@hagenw
Member Author

hagenw commented Oct 30, 2023

I am also not too convinced about the modify_function yet. It seems that this function only makes sense if you already have knowledge about the table and column names in the database. But to me the main use case of the get() function is that you don't need to know these details. Maybe we should split it up into two functions? One function that implements the original idea of get() as a shortcut for quickly accessing values of a specific split.

It's true that you need knowledge about the databases and that the modify_function is also very complex.
I think there might still be use cases for it, e.g. if you have a couple of databases that all share a common structure, but still have slightly different table or column names.

And a second one that works more like an iterator over matching tables / columns, with the option to modify the returned values?

As long as we think we will need a modify_function, Database.get() still seems to be the best place to add it, instead of replicating all its behavior again.

@hagenw
Member Author

hagenw commented Nov 22, 2023

I have now integrated the aggregate_strategy argument from #405 and checked that Database.get() still works for all our databases and that the benchmark times stayed the same.

This means it should finally be ready to review/merge.

@frankenjoe
Collaborator

Cool, another great feature in place!

@frankenjoe frankenjoe merged commit 261d781 into main Nov 24, 2023
10 checks passed
@frankenjoe frankenjoe deleted the database-get branch November 24, 2023 08:42
Successfully merging this pull request may close these issues.

Introduce Database.get() to access all datapoints for a scheme