Add audformat.Database.get() #399
@frankenjoe I implemented a first version that already provides a bunch of features. But there are also some challenges left for which I don't know yet how best to tackle them:

**Loose definition of a scheme**

As we did not start with a misc table and we also allow for columns without a scheme,

**Handling of different labels for the same data point**

If the ratings in the example from the description are the same,

returns
I'm not sure yet if we can always say this is the expected behavior. A related issue is that if you have more than two entries, the algorithm is not able to group some of the columns together, e.g.

```python
import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])
db['session3'] = audformat.Table(index)
db['session3']['rating'] = audformat.Column(scheme_id='rating')
db['session3']['rating'].set([1, 1])
db.get('rating')
```

returns
instead of merging

```python
import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([1, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])
db['session3'] = audformat.Table(index)
db['session3']['rating'] = audformat.Column(scheme_id='rating')
db['session3']['rating'].set([0, 1])
db.get('rating')
```

returns
even though the original ratings were identical, just stored in a different order.

**Handling of different data types**

Unfortunately, we easily get completely different dtypes for the same scheme when collecting all the labels from the database. There are several reasons for this, e.g. in some cases we don't assign the scheme, or we store the values as labels in a dict. I already cover some of those cases, but the following still fails as it produces two categorical dtypes that differ in the dtypes of their categories (int vs. float):

```python
import audformat
import pandas as pd

db = audformat.Database('db')
db.schemes['label'] = audformat.Scheme('int', labels=[0, 1])
db['speaker'] = audformat.MiscTable(
    pd.Index(['s1', 's2'], dtype='string', name='speaker')
)
db['speaker']['label'] = audformat.Column()
db['speaker']['label'].set([1.0, 1.0])
db['files'] = audformat.Table(audformat.filewise_index(['f1', 'f2']))
db['files']['label'] = audformat.Column(scheme_id='label')
db['files']['label'].set([0, 1])
db['other'] = audformat.Table(audformat.filewise_index(['f1', 'f2']))
db.schemes['speaker'] = audformat.Scheme('str', labels='speaker')
db['other']['speaker'] = audformat.Column(scheme_id='speaker')
db['other']['speaker'].set(['s1', 's2'])
```

Then we get:

```python
>>> db['other']['speaker'].get(map='label')
file
f1    1.0
f2    1.0
Name: label, dtype: category
Categories (1, float64): [1.0]
>>> db['files']['label'].get()
file
f1    0
f2    1
Name: label, dtype: category
Categories (2, int64): [0, 1]
>>> db.get('label')
...
ValueError: All categorical data must have the same dtype.
```
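The clash can be reproduced and resolved with plain pandas. The following sketch (pandas only, not audformat API; the fix shown is one possible approach, not the one chosen in this pull request) builds two categorical series whose categories differ in dtype and reconciles them by casting both to a shared `CategoricalDtype`:

```python
import pandas as pd

# Two categorical series whose categories differ in dtype
# (float64 vs int64), mirroring the error described above.
y1 = pd.Series([1.0, 1.0], dtype=pd.CategoricalDtype([0.0, 1.0]))  # float categories
y2 = pd.Series([0, 1], dtype=pd.CategoricalDtype([0, 1]))          # int categories

# One possible fix: agree on a common dtype for the categories
# and cast both series before combining them.
common = pd.CategoricalDtype([0, 1])          # int64 categories
y1_cast = y1.astype('int64').astype(common)   # float values -> int -> categorical
y2_cast = y2.astype(common)
combined = pd.concat([y1_cast, y2_cast], ignore_index=True)
print(combined.dtype)  # category
```
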
To me it sounds fine to raise an error if we get different data types for the same scheme. Maybe we can explain in the error message how the user should fix the database. Or would you say it's a valid use-case to store gender values in one place as
Again, I am not sure if we should encourage users to store values that are semantically related in one place as a plain column (e.g. a column named 'gender' without a scheme) and in another place using a scheme (e.g. in a speaker column that connects to a gender scheme). As before, I would say it's fine to find values either in an according column or through a scheme, but raise an error if we find both.
This is indeed tricky. The current solution of spreading into several columns is indeed not too elegant: for one, because of the ordering you mentioned, but it also makes the result hard to predict, as we do not know in advance how many columns will be returned. If we go for this solution I would at least use a multi-index for the column that

A completely different approach would be to return an iterator so that the search results are returned one after another. This would leave it to the user how to deal with those issues :)

One general issue I see with the current solution is that it returns values from all splits. But in many use-cases you want the values of a specific split.
There are the
I see the advantage of this as well, but it also has a big downside.
I think we need both solutions:
To summarize: add 1-2 arguments to handle how strict we are with parsing.
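The strictness idea could look roughly like this (a hypothetical sketch; the function name and signature are illustrative, not the actual audformat implementation): with a strict mode only columns that have the scheme assigned match, while a loose mode additionally matches columns whose name equals the scheme id.

```python
def column_matches(scheme_id, column_id, column_scheme_id, strict=True):
    # Strict match: the column explicitly references the scheme.
    if column_scheme_id == scheme_id:
        return True
    # Loose match: fall back to the column name when not strict.
    return not strict and column_id == scheme_id

# A column named 'gender' without a scheme assigned:
print(column_matches('gender', 'gender', None, strict=True))   # False
print(column_matches('gender', 'gender', None, strict=False))  # True
```
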
We could allow the user to specify the data type and cast the results we find accordingly. I would prefer that over simply converting everything to
Great, I didn't realize there is already an option for it. But there indeed seems to be a problem with the way it is currently implemented, e.g.:

```python
db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get(['emotion', 'gender'])
```

but:

```python
db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get(['emotion', 'gender'], splits=['test'])
```
**Handling of different labels for the same data point**

I think we cannot simply return a tuple inside a column that has more than one label, as the information where the label comes from is lost, but most likely required for further processing. One solution might be to use multi-index columns, but I'm not convinced by their usability in our case. At the moment I would propose to expand the index of the data frame:
Example:

```python
import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])
```

we will get

```python
>>> db.get('rating')
                                  rating
file start   end  table    column
f1   0 days  NaT  session1 rating      0
f2   0 days  NaT  session1 rating      1
f1   0 days  NaT  session2 rating      1
f2   0 days  NaT  session2 rating      1
```

The user would then have the possibility to count the number of available labels by using

```python
df = df.reset_index([3, 4])
df = df[df.table == 'session1']
```

If there are no duplicates, the index can be used as it is and should work with

@frankenjoe I have not yet updated the tests and first wanted to get some feedback on this solution.

Unfortunately, my proposed solution has some unwanted consequences:

```python
>>> db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
>>> db.get(['emotion', 'gender'], splits=['test'])
                                                                              emotion  gender
file            start  end table                                 column
wav/12a01Fb.wav 0 days NaT emotion                               emotion    happiness     NaN
                           emotion.categories.test.gold_standard emotion    happiness     NaN
                           files                                 speaker          NaN    male
wav/12a01Lb.wav 0 days NaT emotion                               emotion      boredom     NaN
                           emotion.categories.test.gold_standard emotion      boredom     NaN
...                                                                               ...     ...
wav/16b10Wa.wav 0 days NaT emotion.categories.test.gold_standard emotion        anger     NaN
                           files                                 speaker          NaN  female
wav/16b10Wb.wav 0 days NaT emotion                               emotion        anger     NaN
                           emotion.categories.test.gold_standard emotion        anger     NaN
                           files                                 speaker          NaN  female

[693 rows x 2 columns]
```
I have now updated the handling of the

```python
>>> db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
>>> db.get(['emotion', 'gender'], splits=['test'])
                   emotion  gender
file
wav/12a01Fb.wav  happiness    male
wav/12a01Lb.wav    boredom    male
wav/12a01Nb.wav    neutral    male
wav/12a01Wc.wav      anger    male
wav/12a02Ac.wav       fear    male
...                    ...     ...
wav/16b10Lb.wav    boredom  female
wav/16b10Tb.wav    sadness  female
wav/16b10Td.wav    sadness  female
wav/16b10Wa.wav      anger  female
wav/16b10Wb.wav      anger  female

[231 rows x 2 columns]
```
Mhh, probably better than expanding into the columns. But it has the downside that a user will probably have to do some post-processing before it can be used, e.g. to train a model. How about introducing an argument where the user can specify an aggregate function that is applied if we get more than one result for a file / segment?
As a user would need to add post-processing to handle the multiple labels anyway, it seems indeed reasonable to solve the issue by providing the custom code as an aggregate function.
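The idea can be sketched with plain pandas (hypothetical data, not the eventual audformat API): the values collected from several tables produce duplicate index entries per file, which are then collapsed with the supplied aggregate function.

```python
import pandas as pd

# Two sessions rated the same files, so stacking their values
# yields two entries per file (hypothetical data).
ratings = pd.Series(
    [0, 1, 1, 1],
    index=pd.Index(['f1', 'f2', 'f1', 'f2'], name='file'),
    name='rating',
)

# Collapse the duplicates with an aggregate function, here the mean.
aggregated = ratings.groupby(level='file').agg('mean')
print(aggregated)  # f1 -> 0.5, f2 -> 1.0
```
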
I have now implemented a first version of the

Here an easy example that converts all labels to uppercase and adjusts the dtype accordingly:

```python
def upper(y, db, table_id, column_id):
    data = [v.upper() for v in y.values]
    dtype = pd.CategoricalDtype(categories=['FEMALE', 'MALE'], ordered=False)
    return pd.Series(data, index=y.index, name=y.name, dtype=dtype)

db = audb.load('emodb', version='1.4.1', only_metadata=True, full_path=False)
db.get('gender', aggregate_function=upper)
```

which returns

@frankenjoe: if you agree with its implementation, I would update some of the error messages we raise and point to defining such a function, and maybe look for a shorter name for the argument?
Could you also provide a simple example where we have multiple values for the same file or segment? As far as I see, it now calls the aggregate function with the values from a specific column; I was expecting a function that gets called with a list of values that it found for a specific segment or file. But maybe that is already covered with your solution and I just don't see it :)
Sure, let's start with this database:

```python
import audformat

db = audformat.Database('db')
db.schemes['rating'] = audformat.Scheme('int')
index = audformat.filewise_index(['f1', 'f2'])
db['session1'] = audformat.Table(index)
db['session1']['rating'] = audformat.Column(scheme_id='rating')
db['session1']['rating'].set([0, 1])
db['session2'] = audformat.Table(index)
db['session2']['rating'] = audformat.Column(scheme_id='rating')
db['session2']['rating'].set([1, 1])
```

Without an aggregate function you will now get an error (I would update the error message and point to the

```python
>>> db.get('rating')
...
ValueError: Found overlapping data in column 'rating':
      left  right
file
f1       0      1
```

There are several possible solutions for this database:

**1. Select a particular session**

```python
def select_session1(y, db, table_id, column_id):
    if table_id != 'session1':
        index = audformat.filewise_index()
        y = pd.Series(index=index, dtype=y.dtype, name=y.name)
    return y
```

```python
>>> db.get('rating', aggregate_function=select_session1)
      rating
file
f1         0
f2         1
```

**2. Add the table name to the column**

```python
def add_table_name(y, db, table_id, column_id):
    y.name = f'{y.name}-{table_id}'
    return y
```

```python
>>> db.get('rating', aggregate_function=add_table_name)
      rating-session1  rating-session2
file
f1                  0                1
f2                  1                1
```

**3. Calculate the mean over the values**

This is indeed not so nicely supported.

```python
def average_sessions(y, db, table_id, column_id):
    if table_id in ['session1', 'session2']:
        name = y.name
        if table_id == 'session1':
            y2 = db['session2'][column_id].get()
        else:
            y2 = db['session1'][column_id].get()
        y = pd.concat([y, y2], axis=1).mean(axis=1)
        y.name = name
    return y
```

or

```python
def average_sessions(y, db, table_id, column_id):
    name = y.name
    if table_id == 'session1':
        y2 = db['session2'][column_id].get()
        y = pd.concat([y, y2], axis=1).mean(axis=1)
        y.name = name
    else:
        index = audformat.filewise_index()
        y = pd.Series(index=index, name=name, dtype='float')
    return y
```

```python
>>> db.get('rating', aggregate_function=average_sessions)
      rating
file
f1       0.5
f2       1.0
```

Case 3 would be much easier with your suggested approach, but I also find it nice to have the possibility to change all labels, as this also allows you to combine stuff (see

Don't know if there is a way to combine both approaches?
I would use the aggregate function only to combine values that belong to the same file / segment. As you said, case 3 is rather complicated, and this was actually the main motivation for having an aggregate function. To replace the aggregate function that you propose, we could have an argument to add
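The behavior described here, where the aggregate function is only called with the values collected for one file or segment, could look roughly like this in plain pandas (hypothetical data and function names, not the actual audformat implementation):

```python
import pandas as pd

# Values found for one scheme across all tables; 'f1' and 'f2'
# each appear twice because two tables rated them.
values = pd.Series(
    [0, 1, 1, 1],
    index=pd.Index(['f1', 'f2', 'f1', 'f2'], name='file'),
    name='rating',
)

def aggregate_function(values_for_file):
    # Called once per file with all values found for it,
    # e.g. a majority vote over the collected labels.
    return values_for_file.mode().iloc[0]

result = values.groupby(level='file').apply(aggregate_function)
print(result.to_dict())
```
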
This will unfortunately not work, as it cannot create

But I will try whether I manage to create an aggregate function that also has access to all the table and column names of the label values.
Wait a moment, but isn't that the error solved by the aggregation function?
Yes, but if we aggregate the values first, e.g. by taking the mean, we cannot later on select the value for the first session.
Hehe, I think it's getting too complicated for me :)
One solution might be to combine both approaches:
Maybe I will start and create another pull request, independent of this one, adding
I created #401 to first add
I now added the following three arguments:
I also added some examples to the docstring showcasing what you can do with it.

There is another question: with the current implementation we return the name of the column, or several columns if they have different names, for a matching scheme:

```python
>>> db = Database('mydb')
>>> db.schemes['rating'] = Scheme('float')
>>> db['run1'] = Table(filewise_index(['f1', 'f2']))
>>> db['run1']['rater1'] = Column(scheme_id='rating')
>>> db['run1']['rater1'].set([0.0, 0.9])
>>> db['run2'] = Table(filewise_index(['f3']))
>>> db['run2']['rater1'] = Column(scheme_id='rating')
>>> db['run2']['rater1'].set([0.7])
>>> db.get('rating')
      rater1
file
f1       0.0
f2       0.9
f3       0.7
```

```python
>>> db = Database('mydb')
>>> db.schemes['rating'] = Scheme('float')
>>> db['run1'] = Table(filewise_index(['f1', 'f2']))
>>> db['run1']['rater1'] = Column(scheme_id='rating')
>>> db['run1']['rater1'].set([0.0, 0.9])
>>> db['run1']['rater2'] = Column(scheme_id='rating')
>>> db['run1']['rater2'].set([0.2, 0.7])
>>> db.get('rating')
      rater1  rater2
file
f1       0.0     0.2
f2       0.9     0.7
```

This might not be a good idea, as a user doesn't know how many columns will be returned and how they are named. E.g. to re-enable the current behavior, the user could write a simple modify function:

```python
def return_column_names(y, db, table_id, column_id):
    y.name = column_id
    return y
```
I wonder if we should think more about the actual use-case for the function. At the moment it seems very powerful, but it is also hard to predict what the result will be. I guess one of the main use-cases will be to retrieve gold standard labels for a specific split, e.g.

```python
import audformat

db = audformat.Database('mydb')
db.schemes['emotion'] = audformat.Scheme('float')
db.splits['test'] = audformat.Split('test')
db['test.gold'] = audformat.Table(
    audformat.filewise_index(['f1', 'f2']),
    split_id='test',
)
db['test.gold']['emotion'] = audformat.Column(scheme_id='emotion')
db['test.gold']['emotion'].set([1.0, 1.0])
db.get('emotion', splits='test')
```

This is really nice, since

But let's assume we have also stored self-report values somewhere in the database:

```python
import numpy as np

db['self-report'] = audformat.Table(audformat.filewise_index(['f1', 'f2']))
db['self-report']['emotion'] = audformat.Column(scheme_id='emotion')
db['self-report']['emotion'].set([0.0, 0.0])
db.get('emotion', splits='test', aggregate_function=np.mean)
```

We can see that those values mess up the result, even though the table with the self-report values is not even assigned to the test split. So I think at the moment it's hard for a user to predict what the function will return.
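One way to make the result predictable would be to only consider tables actually assigned to the requested split. A plain-Python sketch (illustrative data structures and function name, not the audformat API) of that filtering step:

```python
# Illustrative table registry: only 'test.gold' is assigned to the
# 'test' split; 'self-report' has no split, mirroring the example above.
tables = {
    'test.gold': {'split': 'test', 'emotion': [1.0, 1.0]},
    'self-report': {'split': None, 'emotion': [0.0, 0.0]},
}

def tables_for_split(tables, split):
    # Keep only tables whose split matches the requested one,
    # so unassigned tables cannot leak into the result.
    return [
        table_id
        for table_id, table in tables.items()
        if table['split'] == split
    ]

print(tables_for_split(tables, 'test'))  # ['test.gold']
```
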
Yes, I also find it problematic that you can request
I am also not too convinced about the
I changed the behavior and it now always returns the requested schemes as column names.
Yes, there are several cases where we might have problems, e.g. the database contains several test splits, or different values are reported under the same scheme (as in your example). At the moment, I would say that you should use
It's true that you need to have knowledge about the databases and that the
As long as we think we will need a
As it does not work for different index types. This reverts commit 145713f.
I have now integrated the

Which means this should also finally be ready to review/merge.
Cool, another great feature in place!
Closes #398

Adds `audformat.Database.get()` to request a data frame containing columns with labels based on a scheme, limited by selected tables and/or splits and additional schemes drawn from the whole database.

To also support databases that did not use schemes, forgot to assign them to a column, or use dictionaries to provide scheme labels, I added the `strict=False` argument.

For schemes that have a simple mapping, e.g. transcriptions with `{'a0': 'some text'}`, the label is expanded unless the user explicitly disables it with `map=False`.

For example:

returns

I also updated the docstring example for the `audformat.Database` class: