Reduce number of tables by finding a way to reference selected train, dev, test splits #1
Comments
I would prefer this solution over the first one, if it's possible to do it in a backward compatible way. I do think it would be more convenient to store splits this way and thus reduce the number of tables, although I'm not very excited about applying this update to all our datasets. I think it would only make sense to apply this change to all our datasets at more or less the same time, so that any code using the databases doesn't need to differentiate between databases that have been updated this way and databases that have not yet been updated.
One disadvantage of the approach with split tables is that it might not always be obvious which of the existing tables they can be used with.
Yes, it seems a better solution than index tables. But I would also vote for implementing it in a backward compatible way, so we don't need to change any existing dataset if we don't want to.
Currently, I would not touch the naming of the tables, not necessarily because I am convinced that it is the absolute best solution, but because we already use this standard in all our databases and, as @audeerington has already mentioned, we would have to change it in all databases at the same time so that we don't have to use two different conventions in training scripts and tests. I also find this division into individual tables useful to quickly see which splits a database has and to spot database-related characteristics, e.g. msppodcast having two test sets or audioset having balanced and unbalanced sets.
@felixbur pointed out that crema-d has a lot of tables, as we encode the train, dev, and test splits in the name of the table. This has certain advantages, as you just need to load the table and you are done, but it also has the disadvantage that the list of available tables becomes crowded, and that loading the tables requires more disk space and time. The issue also applies to other datasets.
One solution might be that we store the splits in different tables, e.g. emotion.train, that only contain an index (only unofficially supported by audformat at the moment) or contain another label like speaker ID, even though the label does not really have any meaning for this table. The actual emotion labels would be stored in a combined table, e.g. emotion.categories.desired instead of emotion.categories.desired.train, emotion.categories.desired.dev, and emotion.categories.desired.test. If the dev split is requested, you would then need an extra step that uses the index of the split table to look up the labels in the combined table.

I think we had this discussion a few years ago already, but I cannot remember what the conclusion was ;)
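The lookup described above could be sketched as follows. This is a minimal illustration in plain pandas, not audformat code: the data frames only stand in for the tables, and the file names, labels, and the helper `select_split` are made up for the example.

```python
import pandas as pd

# Combined table holding the labels of all splits
# (stand-in for a table such as emotion.categories.desired).
combined = pd.DataFrame(
    {"emotion": ["happy", "sad", "angry", "neutral"]},
    index=pd.Index(["a.wav", "b.wav", "c.wav", "d.wav"], name="file"),
)

# Index-only split tables (stand-ins for emotion.train, emotion.dev, ...).
split_index = {
    "train": pd.Index(["a.wav", "b.wav"], name="file"),
    "dev": pd.Index(["c.wav"], name="file"),
    "test": pd.Index(["d.wav"], name="file"),
}


def select_split(table: pd.DataFrame, index: pd.Index) -> pd.DataFrame:
    """Return the rows of ``table`` whose index entries are in ``index``."""
    # Hypothetical helper: intersect first so that entries of the split
    # that are missing from the table do not raise a KeyError.
    return table.loc[table.index.intersection(index)]


dev = select_split(combined, split_index["dev"])
```

The labels are stored only once, but every consumer has to perform this extra selection step instead of loading a ready-made table.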
A more elegant solution seems to me to extend audformat to have a category of split tables, or maybe better, to change how audformat.Split is implemented and store the underlying index as part of the split. A user would then be able to select the rows of a split directly via the split object. The problem is of course that we should not change the behavior of audformat.Split() in a way that is not backward compatible.

@ureichel @audeerington @schruefer any opinion on this?
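To make the second idea more concrete, here is a rough sketch of a split that carries its own index. The class `SplitWithIndex`, its `select` method, and the sample data are assumptions made for illustration only; they are not part of audformat and say nothing about how audformat.Split is actually implemented. The index is optional, so that a split created without one behaves like a plain split, which is one possible way to stay backward compatible.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class SplitWithIndex:
    """Hypothetical split object that optionally stores its index."""

    type: str
    description: Optional[str] = None
    index: Optional[pd.Index] = None  # None: split without stored index

    def select(self, table: pd.DataFrame) -> pd.DataFrame:
        """Return the rows of ``table`` that belong to this split."""
        if self.index is None:
            raise ValueError(f"Split '{self.type}' does not store an index.")
        return table.loc[table.index.intersection(self.index)]


# Combined table with the labels of all splits (made-up sample data).
combined = pd.DataFrame(
    {"emotion": ["happy", "sad", "angry", "neutral"]},
    index=pd.Index(["a.wav", "b.wav", "c.wav", "d.wav"], name="file"),
)

# A split defined without an index still works as a plain description.
old_style = SplitWithIndex(type="dev")

# A split that stores its index can select its rows from any table.
new_style = SplitWithIndex(
    type="dev",
    index=pd.Index(["c.wav"], name="file"),
)
labels = new_style.select(combined)
```

With this design, existing databases that define splits without an index would not need to change, while updated databases could drop their per-split tables.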