
speeding up create_raw_dataset.py #56

Open
ljj7975 opened this issue Feb 25, 2021 · 4 comments
ljj7975 commented Feb 25, 2021

create_raw_dataset.py takes quite a long time to generate datasets.

I think multi-threading the AudioDatasetMetadataWriter writes would do the job.

Also, this process terminates with a segfault.
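The threading idea above could be sketched roughly like this (all names here are hypothetical stand-ins, not howl's actual AudioDatasetMetadataWriter API): offload the JSON encoding of each metadata entry to a small thread pool, and guard the shared output file with a lock.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadedMetadataWriter:
    """Hypothetical sketch: encode metadata entries on a thread pool
    and serialize the actual file writes with a lock."""

    def __init__(self, path, num_workers=4):
        self.file = open(path, "w")
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=num_workers)

    def write(self, metadata: dict) -> None:
        # Hand the entry off to a worker thread and return immediately
        self.pool.submit(self._write_one, metadata)

    def _write_one(self, metadata: dict) -> None:
        line = json.dumps(metadata)  # encoding happens off the caller's thread
        with self.lock:              # only one thread touches the file at a time
            self.file.write(line + "\n")

    def close(self) -> None:
        self.pool.shutdown(wait=True)  # drain pending writes before closing
        self.file.close()
```

Note the output order is not guaranteed to match submission order; if the dataset format requires ordered entries, the writes would need a sequencing step on top of this.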

@ljj7975 ljj7975 self-assigned this Feb 25, 2021

ljj7975 commented Mar 7, 2021

The segfault was caused by numba:
numba/numba#4323


ljj7975 commented Mar 12, 2021

I spent some time applying one of the multiprocessing packages, but the results weren't that good.
Please refer to https://github.com/castorini/howl/tree/multi_processing_test


ljj7975 commented Apr 20, 2021

When writing a dataset, the process function should also take in the sample (AudioClipExample) and use sample.audio_data when metadata.path does not exist (https://github.com/castorini/howl/blob/master/howl/data/dataset/serialize.py#L67-L72)
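A minimal sketch of that fallback, using a simplified stand-in type (the real AudioClipExample and the signature in serialize.py differ):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AudioClipExample:
    """Simplified stand-in for howl's sample type."""
    metadata_path: Path   # where the audio is expected on disk
    audio_data: bytes     # in-memory audio, used as a fallback


def process(sample: AudioClipExample) -> bytes:
    # Prefer the audio file on disk; fall back to the in-memory
    # audio_data when the path from the metadata does not exist.
    if sample.metadata_path.exists():
        return sample.metadata_path.read_bytes()
    return sample.audio_data
```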

@ColonelThirtyTwo
A simple way to speed this up is to call out to ffmpeg in AudioDatasetWriter rather than doing the conversions in Python (which is slow).
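One way to sketch that (a hypothetical helper, assuming an `ffmpeg` binary on PATH; the flag choices are illustrative defaults, not what howl actually needs):

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path, sample_rate: int = 16000) -> list:
    """Assemble an ffmpeg invocation converting src to mono audio at the
    target sample rate; the output format is inferred from dst's extension."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(src),           # input audio file
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(dst),
    ]


def convert(src: Path, dst: Path, sample_rate: int = 16000) -> None:
    # check=True raises CalledProcessError if ffmpeg exits non-zero
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)
```

Since ffmpeg does the decode/resample in native code, this avoids the slow Python-side conversion entirely; the cost is a subprocess spawn per file, which could be amortized by batching if needed.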

@jacobk52 jacobk52 removed their assignment Sep 1, 2021