This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

Reduce number of tables by finding a way to reference selected train, dev, test splits #1

Open
hagenw opened this issue Jan 25, 2024 · 4 comments

Comments

@hagenw
Member

hagenw commented Jan 25, 2024

@felixbur pointed out that crema-d has a lot of tables because we encode the train, dev, and test splits in the table names. This has certain advantages, as you just need to load the table and you are done, but it also has the disadvantages that the list of available tables becomes crowded and that the tables require more disk space and more time to load.

The issue also applies to other datasets.

One solution might be to store the splits in separate tables, e.g. emotion.train, that contain only an index (only unofficially supported by audformat at the moment) or contain another label like speaker ID, even though that label has no real meaning for this table. The actual emotion labels would be stored in a combined table, e.g. emotion.categories.desired instead of emotion.categories.desired.train, emotion.categories.desired.dev, and emotion.categories.desired.test. To request the dev split, you would need to do something like this:

df = db["emotion.categories.desired"].get(index=db["emotion.dev"].index)
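To illustrate the selection itself, here is a minimal sketch using plain pandas objects as stand-ins for the audformat tables. The file names and labels are made up for illustration, and `labels` and `dev_index` only mimic what `db["emotion.categories.desired"].get()` and `db["emotion.dev"].index` might return.

```python
import pandas as pd

# Stand-in for the combined label table (made-up example data)
labels = pd.DataFrame(
    {"emotion": ["anger", "happiness", "sadness", "neutral"]},
    index=pd.Index(["a.wav", "b.wav", "c.wav", "d.wav"], name="file"),
)

# Stand-in for the index of an index-only split table
dev_index = pd.Index(["b.wav", "d.wav"], name="file")

# Restrict the combined table to the rows listed in the split index
dev_labels = labels.loc[dev_index]
print(dev_labels)
```

The combined table is stored once, and each split only needs to store its index.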

I think we had this discussion a few years ago already, but I cannot remember what the conclusion was ;)

A more elegant solution seems to me to extend audformat to have the category of split-tables, or maybe better change how audformat.Split is implemented and store the underlying index as part of the split. A user would then be able to do:

df = db["emotion.categories.desired"].get(split="emotion.dev")
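Until something like a split argument exists, the same behavior could be approximated on the user side. The helper below is a hypothetical sketch: `get_split` is not part of audformat, and it simply wraps the index-based call from the first proposal.

```python
def get_split(db, table_id, split_table_id):
    """Return the labels of one table restricted to a split table.

    Hypothetical helper, not part of audformat. It assumes that
    db[split_table_id] exposes an ``index`` attribute and that
    db[table_id].get() accepts an ``index`` argument.
    """
    split_index = db[split_table_id].index
    return db[table_id].get(index=split_index)


# Intended usage, assuming the database contains both tables:
# df = get_split(db, "emotion.categories.desired", "emotion.dev")
```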

The problem is, of course, that we should not change the behavior of audformat.Split() in a way that is not backward compatible.

@ureichel @audeerington @schruefer any opinion on this?

@hagenw hagenw changed the title Reduce number of tables by finding a way to reference selected tran, dev, test splits Reduce number of tables by finding a way to reference selected train, dev, test splits Jan 25, 2024
@audeerington
Contributor

A more elegant solution seems to me to extend audformat to have the category of split-tables, or maybe better change how audformat.Split is implemented and store the underlying index as part of the split.

I would prefer this solution over the first one, if it's possible to do it in a backward compatible way.

I do think it would be more convenient to store splits this way and thus reduce the number of tables, although I'm not very excited about applying this update to all our datasets. I think it would only make sense to apply this change to all our datasets at more or less the same time, so that any code using the databases doesn't need to differentiate between databases that have been updated this way and databases that have not yet been updated.

@hagenw
Member Author

hagenw commented Jan 25, 2024

One disadvantage of the split-table approach is that it might not always be obvious which of the existing tables a given split table can be applied to.
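One way a user could check this by hand is to compare the indices. The following is only a sketch with plain pandas and made-up data; `split_applies_to` is a hypothetical helper, not an audformat function.

```python
import pandas as pd


def split_applies_to(table_index: pd.Index, split_index: pd.Index) -> bool:
    """Return True if every entry of the split index exists in the table index."""
    return bool(split_index.isin(table_index).all())


# Made-up indices standing in for two label tables and one split table
emotion_index = pd.Index(["a.wav", "b.wav", "c.wav"], name="file")
age_index = pd.Index(["a.wav", "c.wav"], name="file")
dev_index = pd.Index(["b.wav", "c.wav"], name="file")

emotion_ok = split_applies_to(emotion_index, dev_index)
age_ok = split_applies_to(age_index, dev_index)
print(emotion_ok, age_ok)
```

In this example the dev split covers only files present in the emotion table, but it references a file that is missing from the age table.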

@hagenw
Member Author

hagenw commented Jan 25, 2024

A more elegant solution seems to me to extend audformat to have the category of split-tables, or maybe better change how audformat.Split is implemented and store the underlying index as part of the split.

I would prefer this solution over the first one, if it's possible to do it in a backward compatible way.

Yes, it seems to be a better solution than index tables. But I would also vote that it needs to be implemented in a backward compatible way, so we don't need to change any existing dataset if we don't want to.

@schruefer
Member

Currently, I would not touch the naming of the tables. This is not necessarily because I am convinced that it is the absolute best solution, but because we already use this standard in all our databases and, as @audeerington has already mentioned, we would have to change it in all databases at the same time so that we don't have to use two different conventions in training scripts and tests.

I also find this division into individual tables useful to quickly see which splits a database has and to spot database-related characteristics, e.g. msppodcast having two test sets or audioset having balanced and unbalanced sets.

3 participants