Adjust example on Datacard based on actual mime-type of media file #89

Open
hagenw opened this issue May 2, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@hagenw
Member

hagenw commented May 2, 2024

At the moment, we display all example media files as audio on a datacard, e.g.

image

This also works for video files, but then only the audio is played.
Furthermore, we select the example file to show based on its duration.

I would propose the following improvement:

  • Display video examples using the <video> element instead of the <audio> element, and do not show a waveform for them.
  • In "Add support for TXT files as media files" (audb#392) we introduce support for media files other than audio and video files. For now, we would simply not show an example for such datasets, as duration is always set to 0.0 for those media files. But we might expand the set of file types for which we show an example by changing this behavior.

For both points to work, we need to check the MIME type of the corresponding media file.
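As a rough illustration of the MIME type check, the following sketch guesses the type from the file name with Python's standard `mimetypes` module and picks the matching HTML element. The function name `html_player` is hypothetical and not part of audbcards; guessing from the file extension is an assumption and may be less reliable than inspecting the file content.

```python
import html
import mimetypes
from typing import Optional


def html_player(path: str) -> Optional[str]:
    """Return an HTML player snippet for a media file, or None.

    Hypothetical helper: the MIME type is guessed from the file
    extension only, the file itself is never opened.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        return None
    src = html.escape(path, quote=True)
    if mime_type.startswith("video/"):
        return f'<video controls src="{src}"></video>'
    if mime_type.startswith("audio/"):
        return f'<audio controls src="{src}"></audio>'
    # Other media types (e.g. text) get no example for now
    return None
```

With this approach a datacard would get no example for a text file such as `notes.txt`, which matches the proposed behavior of skipping non-audio/video media for now.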

Another question is how to handle examples for datasets that contain a mixture of different media types. At the moment, we use the dependency table of a dataset to select a meaningful example, but the dependency table stores no information about the MIME type of the included files.
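One possible way to handle mixed datasets would be to pick one example per media type. The sketch below assumes we only have file names and durations available, guesses the media type from the file extension, and uses a hypothetical selection rule (the file whose duration is closest to a target value). The actual selection rule used on datacards is not specified here, and the names `media_kind` and `examples_per_kind` are made up for illustration.

```python
import mimetypes
from collections import defaultdict


def media_kind(path):
    """Return 'audio', 'video', or 'other' based on the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        return "other"
    kind = mime_type.split("/")[0]
    return kind if kind in ("audio", "video") else "other"


def examples_per_kind(durations, target=3.0):
    """Pick one example file per media kind.

    durations: dict mapping file path to duration in seconds.
    target: hypothetical preferred example duration in seconds.
    """
    groups = defaultdict(list)
    for path in durations:
        groups[media_kind(path)].append(path)
    return {
        kind: min(paths, key=lambda p: abs(durations[p] - target))
        for kind, paths in groups.items()
        if kind != "other"  # no example for unsupported media types
    }
```

Called with `{"a.wav": 1.0, "b.wav": 2.5, "c.mp4": 10.0, "d.txt": 0.0}`, this would select `b.wav` as the audio example and `c.mp4` as the video example, and skip the text file.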

@maxschmitt

Having the number of characters/words as a database property would also be a benefit.

@hagenw
Member Author

hagenw commented May 6, 2024

I agree, but I'm afraid that will not be easy to achieve afterwards. We can only easily access information on sampling rate, duration, and other media-related properties because we store them in the dependency table when publishing a dataset. Otherwise, we would need to download every single media file to get those statistics. The number of characters/words falls into the same category: if we wanted to extract it inside audbcards, we would need to download the complete dataset first.

If we really think we need that information (and maybe others about text media files), we will have to extend the dependency table in audb.

@maxschmitt

I see, downloading all files might be cumbersome for the larger text datasets, so, it makes sense to skip this for the moment, unless we have a suitable text container format other than plain text, which supports metadata.

@hagenw
Member Author

hagenw commented May 6, 2024

> unless we have a suitable text container format other than plain text, which supports metadata.

But even then you would have to download all files to collect the metadata over all of them.
In principle, what we need is something that gathers the information when we publish the dataset, as we have to visit every file then anyway to calculate the MD5 sum. As I said, for audio/video files we extract information on sampling rate, channels, bit depth, etc. during that phase and then simply write it to the dependency table, which also tracks the versioning of the files (as we didn't have any better solution). You could also envision a central database, that stores such metadata, but our goal was to be de-central with audb.
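To illustrate the idea of collecting statistics while the file is visited for the checksum anyway, here is a minimal sketch for plain text files. It is not how audb computes its checksums or fills the dependency table; the function name `md5_and_text_stats` and the returned keys are assumptions, and the whole file is read into memory, which is only reasonable for small text files.

```python
import hashlib


def md5_and_text_stats(path, encoding="utf-8"):
    """Compute MD5 sum and simple text statistics in one file visit.

    Hypothetical helper: the file is read once, and the same bytes
    are used for the checksum and for counting characters and words.
    """
    with open(path, "rb") as fp:
        data = fp.read()
    text = data.decode(encoding)
    return {
        "checksum": hashlib.md5(data).hexdigest(),
        "characters": len(text),
        "words": len(text.split()),  # whitespace-separated tokens
    }
```

The resulting values could then be written to additional columns at publication time, so that tools like audbcards would not need to download the media files later.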
@ChristianGeng any thoughts on this?

@ChristianGeng
Member

> You could also envision a central database, that stores such metadata, but our goal was to be de-central with audb.
>
> @ChristianGeng any thoughts on this?

I would have nothing against a central database, but I think it should not become mandatory to use - for a specific backend deployment. But probably then it is hard to implement. What one thinks of first is a kind of hook mechanism.

Artifactory has a webhook mechanism too, but webhooks fire too late in the process chain and require implementing a REST service that handles them. So that would be overkill.

On the client side there are other problems: such things are often implemented as decorators, so it would not be too involved to implement, say, @onpublish decorators. The tricky bit would be making sure that every call to audb.publish on the audeering-internal servers really throws when the deployment-specific @onpublish is not called.
I know that there is .audb.yaml, but this is a user-specific, not a deployment-specific, setting. So in short, we probably cannot configure audb for specific deployments, can we?

@hagenw
Member Author

hagenw commented May 8, 2024

Good point. I think if we want to store additional information during audb.publish(), the easiest solution is to extend what is stored in the dependency table.

For the approach with the database, I could indeed envision that we have one internally that is used when creating the HTML overview pages, and that could maybe also be accessed by individual users to request entries. But I would not fill such a database directly during publication; instead, I would have a cron job running on a compute server that fills up the database. The only downside would be that it would fill up the shared cache on the compute servers with all datasets. But maybe we can also call this a feature?
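As a rough sketch of the database part of this idea, a periodically running job could write one row per dataset version into a small SQLite database. The table layout, the column names, and the functions `open_db` and `store` are assumptions for illustration only; how the statistics are gathered from the datasets is left out.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the metadata database and create the table if needed."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS datasets ("
        "name TEXT, version TEXT, files INTEGER, duration REAL, "
        "PRIMARY KEY (name, version))"
    )
    return con


def store(con, name, version, files, duration):
    """Insert or update the entry for one dataset version."""
    con.execute(
        "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?)",
        (name, version, files, duration),
    )
    con.commit()
```

Because `store` replaces existing rows, the job could be re-run safely without creating duplicate entries for a dataset version.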
