Reduce number of tables by finding a way to reference selected train, dev, test splits #1

hagenw · 2024-01-25T11:25:04Z

@felixbur pointed out that crema-d has a lot of tables, as we encode the train, dev, and test splits in the table names. This has certain advantages, as you just need to load the table and you are done. But it also has the disadvantage that the list of available tables gets more crowded, and that loading the tables requires more disk space and time.

The issue also applies to other datasets.

One solution might be to store the splits in separate tables, e.g. emotion.train, that contain only an index (only unofficially supported by audformat at the moment) or contain another label like speaker ID, even though that label has no real meaning for this table. The actual emotion labels would be stored in a combined table, e.g. emotion.categories.desired, instead of emotion.categories.desired.train, emotion.categories.desired.dev, and emotion.categories.desired.test. To request the dev split, you would need to do something like this:

df = db["emotion.categories.desired"].get(index=db["emotion.dev"].index)
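To illustrate the selection step without relying on audformat internals, here is a minimal sketch in plain pandas. The file names, labels, and the variables `combined` and `dev_index` are made up for illustration; in audformat they would correspond to the combined table and the index of the split table.

```python
import pandas as pd

# Combined table holding the labels of all splits (hypothetical data).
combined = pd.DataFrame(
    {"emotion": ["anger", "happiness", "sadness", "neutral"]},
    index=pd.Index(["a.wav", "b.wav", "c.wav", "d.wav"], name="file"),
)

# Index-only "split table": it lists the files of the dev split, nothing else.
dev_index = pd.Index(["b.wav", "d.wav"], name="file")

# Select the dev rows from the combined table.
dev = combined.loc[dev_index]
print(dev)
```

The labels are stored once, and each split only costs an index.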

I think we had this discussion a few years ago already, but I cannot remember what the conclusion was ;)

A more elegant solution seems to me to extend audformat to have the category of split-tables, or maybe better change how audformat.Split is implemented and store the underlying index as part of the split. A user would then be able to do:

df = db["emotion.categories.desired"].get(split="emotion.dev")
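What such a `split` argument might do internally can be sketched as follows. This is only an assumption about the behavior, written with plain pandas and a dict standing in for the database; `get_split` is a hypothetical helper and not part of audformat.

```python
import pandas as pd


def get_split(tables: dict, table_id: str, split_id: str) -> pd.DataFrame:
    """Return the rows of tables[table_id] listed in tables[split_id].

    Hypothetical helper, not part of audformat.
    """
    labels = tables[table_id]
    split_index = tables[split_id].index
    # Keep only rows whose index entry appears in the split index.
    return labels[labels.index.isin(split_index)]


# A plain dict stands in for the database (hypothetical data).
index = pd.Index(["a.wav", "b.wav", "c.wav"], name="file")
tables = {
    "emotion.categories.desired": pd.DataFrame(
        {"emotion": ["anger", "neutral", "sadness"]}, index=index
    ),
    # Split table that contains only an index.
    "emotion.dev": pd.DataFrame(index=pd.Index(["b.wav"], name="file")),
}

df = get_split(tables, "emotion.categories.desired", "emotion.dev")
print(df)
```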

The problem is of course that we should not change the behavior of audformat.Split() in a way that is not backward compatible.

@ureichel @audeerington @schruefer any opinion on this?


audeerington · 2024-01-25T11:50:46Z

> A more elegant solution seems to me to extend audformat to have the category of split-tables, or maybe better change how audformat.Split is implemented and store the underlying index as part of the split.

I would prefer this solution over the first one, if it's possible to do it in a backward compatible way.

I do think it would be more convenient to store splits this way and thus reduce the number of tables, although I'm not very excited about applying this update to all our datasets. I think it would only make sense to apply this change to all our datasets at more or less the same time, so that code using the databases doesn't need to differentiate between databases that have been updated and those that have not.
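If the databases were migrated one by one, user code would need a branch like the one sketched below. This is a hypothetical helper, not part of audformat or audb; a plain dict stands in for the database, and the naming of the split table (`<scheme>.<split>`) is an assumption taken from the `emotion.dev` example above.

```python
import pandas as pd


def load_split(tables: dict, table_id: str, split: str) -> pd.DataFrame:
    """Return the labels of one split under either table convention.

    Hypothetical helper, not part of audformat or audb.
    """
    per_split_id = f"{table_id}.{split}"
    if per_split_id in tables:
        # Current convention: the split is encoded in the table name.
        return tables[per_split_id]
    # Proposed convention (assumed naming): labels live in a combined table,
    # and "<scheme>.<split>" holds only the index of that split.
    scheme = table_id.split(".")[0]
    split_index = tables[f"{scheme}.{split}"].index
    labels = tables[table_id]
    return labels[labels.index.isin(split_index)]


index = pd.Index(["a.wav", "b.wav", "c.wav"], name="file")
labels = pd.DataFrame({"emotion": ["anger", "neutral", "sadness"]}, index=index)

# Database following the current convention.
old_style = {"emotion.categories.desired.dev": labels.iloc[[1]]}
# Database following the proposed convention.
new_style = {
    "emotion.categories.desired": labels,
    "emotion.dev": pd.DataFrame(index=index[[1]]),
}

old_dev = load_split(old_style, "emotion.categories.desired", "dev")
new_dev = load_split(new_style, "emotion.categories.desired", "dev")
```

Both calls return the same rows, but every training script and test would have to carry this kind of fallback until all databases follow one convention.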

hagenw · 2024-01-25T11:50:54Z

One disadvantage of the split-table approach is that it might not always be obvious which of the existing tables a given split table can be used with.

hagenw · 2024-01-25T11:52:32Z

> > A more elegant solution seems to me to extend audformat to have the category of split-tables, or maybe better change how audformat.Split is implemented and store the underlying index as part of the split.
>
> I would prefer this solution over the first one, if it's possible to do it in a backward compatible way.

Yes, it seems a better solution than index tables. But I would also vote that it needs to be implemented in a backward compatible way, so we don't need to change any existing dataset if we don't want to.

schruefer · 2024-01-25T13:05:36Z

Currently, I would not touch the naming of the tables. Not necessarily because I am convinced that it is the absolute best solution, but because we already use this standard in all our databases and, as @audeerington has already mentioned, we would have to change it in all databases at the same time so that we don't have to use two different conventions in training scripts and tests.

I also find this division into individual tables useful to quickly see which splits a database has, and to spot database-related characteristics, e.g. msppodcast having two test sets or audioset having balanced and unbalanced sets.

hagenw changed the title ~~Reduce number of tables by finding a way to reference selected tran, dev, test splits~~ Reduce number of tables by finding a way to reference selected train, dev, test splits Jan 25, 2024
