Problems when Downloading the Italian Dataset #12

Open
david-gimeno opened this issue Nov 12, 2023 · 2 comments

@david-gimeno

Hi,

I ran the following command to download the Italian dataset from MuAViC:

python get_data.py --root-path ./esperanza/ --src-lang it

However, at some point during the run the script was interrupted. Here is the full error trace:

Traceback (most recent call last):
  File "/home/dgimeno/phd/muavic/utils.py", line 62, in download_file
    wget.download(url, out=str(download_path / filename), bar=custom_bar)
  File "/home/dgimeno/anaconda3/envs/muavic/lib/python3.8/site-packages/wget.py", line 506, in download
    (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
  File "/home/dgimeno/anaconda3/envs/muavic/lib/python3.8/tempfile.py", line 331, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/dgimeno/anaconda3/envs/muavic/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: './esperanza/metadata/it_metadata.tgz88g65ab3.tmp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "get_data.py", line 115, in <module>
    main(args)
  File "get_data.py", line 84, in main
    prepare_mtedx(args)
  File "get_data.py", line 26, in prepare_mtedx
    preprocess_mtedx_video(
  File "/home/dgimeno/phd/muavic/mtedx_utils.py", line 220, in preprocess_mtedx_video
    video_metadata = load_video_metadata(
  File "/home/dgimeno/phd/muavic/utils.py", line 110, in load_video_metadata
    download_extract_file_if_not(
  File "/home/dgimeno/phd/muavic/utils.py", line 89, in download_extract_file_if_not
    download_file(url, download_path)
  File "/home/dgimeno/phd/muavic/utils.py", line 65, in download_file
    raise HTTPError(e.url, e.code, message, e.hdrs, e.fp)
AttributeError: 'FileNotFoundError' object has no attribute 'url'
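
A note on the second traceback: the `AttributeError` appears because the error handler in `download_file` reads `e.url`, which only an `HTTPError` would carry, while the underlying failure here is a `FileNotFoundError` raised by `tempfile.mkstemp`. The sketch below is not the MuAViC code; it only illustrates how the two cases could be kept apart. `safe_download` and its `fetch` parameter are hypothetical names, and creating the target directory before downloading is an assumption about what might avoid the `mkstemp` failure.

```python
from pathlib import Path
from urllib.error import HTTPError
from urllib.request import urlretrieve


def safe_download(url, download_dir, filename, fetch=urlretrieve):
    """Hypothetical helper: download `url` to `download_dir / filename`."""
    download_dir = Path(download_dir)
    # Assumption: a missing target directory is what makes mkstemp fail.
    download_dir.mkdir(parents=True, exist_ok=True)
    target = download_dir / filename
    try:
        fetch(url, str(target))
    except HTTPError:
        # Genuine HTTP failures keep their type and their .code attribute.
        raise
    except OSError as e:
        # Local filesystem errors are reported without touching
        # HTTP-only attributes such as .url.
        raise OSError(f"Could not write {target}: {e}") from e
    return target
```

With a handler of this shape, a filesystem error would surface with its own message instead of being masked by an `AttributeError`.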
@Anwarvic
Contributor

Anwarvic commented Jan 5, 2024

Hi @david-gimeno,

Thank you for raising this issue and so sorry for the late reply!

I couldn't reproduce your error on my machine. However, I would suggest deleting tgz88g65ab3.tmp from your video files. I think this file wasn't fully downloaded, which is why it has the .tmp suffix. Once it is deleted, the script should recognize that the file is missing and try to download it again.
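
For anyone who prefers to do this cleanup from Python, here is a minimal sketch. It assumes that every leftover partial download under the root path ends in `.tmp` and that no other file you need has that suffix; `remove_partial_downloads` is a hypothetical helper, not part of the MuAViC scripts.

```python
from pathlib import Path


def remove_partial_downloads(root, dry_run=True):
    """Return leftover '*.tmp' files under `root`; delete them unless dry_run."""
    # Search the whole tree, since partial files may sit in subfolders.
    leftovers = sorted(p for p in Path(root).rglob("*.tmp") if p.is_file())
    if not dry_run:
        for path in leftovers:
            path.unlink()
    return leftovers
```

Calling `remove_partial_downloads("./esperanza")` first only lists the candidates; passing `dry_run=False` deletes them, after which the download command can be run again.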

Hope this helps!

@david-gimeno
Author

david-gimeno commented Jan 14, 2024

No worries about the late reply :) Thanks, your suggestion worked!

However, I would like to point out that the number of videos available for download is decreasing. Consequently, one day there will not be enough videos for further research to make fair comparisons with previous studies in the audio-visual or visual-only settings. The audio waveforms are not a problem, since they come from the MTEDx corpus.

Although I understand what this means and the infrastructure it may require, I think the database should be shared in a different way, similar to LRS3, rather than depending on the availability of the video clips on YouTube.

Kind regards!
