missing large file in remote storage after pushing #10448

Open
xiaoFine opened this issue Jun 4, 2024 · 7 comments
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push)
fs: oss (Related to the Alibaba Cloud OSS filesystem)

Comments

@xiaoFine
xiaoFine commented Jun 4, 2024

Bug Report

push: large files are missing in remote storage

Description

After dvc push, large files (a single file larger than 20 GB) are missing from the remote storage (Aliyun OSS), while the md5 objects of small files are pushed successfully and can be found under the OSS path.

Reproduce

dvc init -f
dvc remote add myoss oss://mybucket/path -d
dvc remote modify myoss oss_endpoint somepublicendpoint
dvc remote modify myoss oss_key_id xxxx
dvc remote modify myoss oss_key_secret xxxxxxxx

dvc add large-chkpoint.pt

dvc push

Expected

I can find the md5 object for large-chkpoint.pt via the OSS dashboard.
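
When checking the bucket by hand, it helps to know which key to look for. The following is a minimal sketch, assuming the DVC 3.x layout in which an object is stored below the remote path as files/md5/<first two hex digits>/<remaining digits> (the files/md5 prefix also appears in the S3 push log further down this thread). The helper names file_md5 and expected_remote_key are hypothetical and are not part of DVC.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a 20 GB checkpoint is never loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_remote_key(remote_prefix, md5_hex):
    """Build the object key, assuming the DVC 3.x files/md5/<2>/<30> layout."""
    return f"{remote_prefix.rstrip('/')}/files/md5/{md5_hex[:2]}/{md5_hex[2:]}"


# Example with an in-memory payload instead of a real checkpoint file.
print(expected_remote_key("path", hashlib.md5(b"hello").hexdigest()))
```

For the real file, expected_remote_key("path", file_md5("large-chkpoint.pt")) should give the key to search for under mybucket. The same hash should also be recorded in the md5 field of the large-chkpoint.pt.dvc file that dvc add writes.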

Environment information

Output of dvc doctor:

DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0)
Config:
        Global: /home/admins/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: oss
Workspace directory: ext4 on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553

Additional Information (if any):

output of pushing log

> dvc push -vvv 
2024-06-04 16:56:23,537 DEBUG: v3.51.2 (pip), CPython 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
2024-06-04 16:56:23,538 DEBUG: command: /home/admins/miniconda3/envs/dvcenv/bin/dvc push -vvv
2024-06-04 16:56:23,538 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=9, targets=['triton/tensorrt_llm/1/rank0.engine'], remote='oss-qwen', all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=True, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-04 16:56:23,758 TRACE:     1.31 ms in collecting stages from /ws
2024-06-04 16:56:23,758 TRACE:   253.99 mks in collecting stages from /ws
...

2024-06-04 16:56:23,773 DEBUG: Checking if stage 'large-chckpoint.pt' is in 'dvc.yaml'
Collecting                                                                                                                                                 |1.00 [00:00,  135entry/s]
2024-06-04 16:56:23,889 DEBUG: Preparing to transfer data from '/ws/.dvc/cache' to 'oss://mybucket/path'
2024-06-04 16:56:23,889 DEBUG: Preparing to collect status from 'mybucket/path'
2024-06-04 16:56:23,889 DEBUG: Collecting status from 'mybucket/path'
2024-06-04 16:56:23,891 DEBUG: Querying 1 oids via object_exists
2024-06-04 16:56:24,228 DEBUG: Preparing to collect status from '/ws/.dvc/cache'                                                                       
2024-06-04 16:56:24,229 DEBUG: Collecting status from '/ws/.dvc/cache'                                                                                 
Pushing                                                                                                                                                                             /home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/ossfs/async_oss.py:389: RuntimeWarning: coroutine 'resumable_upload' was never awaited     0/1 [00:00<?,     ?file/s]
  await self._call_oss(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Pushing
1 file pushed                                                                                                                                                                        
2024-06-04 16:56:24,292 DEBUG: Analytics is enabled.
2024-06-04 16:56:24,292 TRACE: Saving analytics report to /tmp/tmptx47o8pe
2024-06-04 16:56:24,354 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmptx47o8pe', '-vv']
2024-06-04 16:56:24,361 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmptx47o8pe', '-vv'] with pid 27977
2024-06-04 16:56:24,361 TRACE: Process 27869 exiting with 0

Is dvc-oss no longer maintained?

@xiaoFine changed the title from "missing remote large file after pushing" to "large file missing in remote storage after pushing" on Jun 4, 2024
@xiaoFine changed the title from "large file missing in remote storage after pushing" to "missing large file in remote storage after pushing" on Jun 4, 2024
@dberenbaum added the labels "fs: oss" (Related to the Alibaba Cloud OSS filesystem) and "A: data-sync" (Related to dvc get/fetch/import/pull/push) on Jun 4, 2024
@shcheklein
Member

Is dvc-oss no longer maintained?

It was primarily maintained by @karajan1001. I would appreciate his input here.

As a workaround, could you try the S3-compatible interface? See https://www.alibabacloud.com/help/en/oss/developer-reference/use-amazon-s3-sdks-to-access-oss

https://dvc.org/doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon

output of pushing log

Hmm, I don't see any details in the logs. Do you see any md5s / hashes for the files that are missing remotely? Is this the full log?

Could you try deleting /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553 and running the command again in verbose mode?

@xiaoFine
Author

xiaoFine commented Jun 6, 2024

I created an empty workspace with a large.bin and a small.txt, and deleted all caches under /var/tmp/dvc/repo.
Here is the push log:
[image]

Only the small file can be found in the remote.

P.S. The RuntimeWarning does not appear when pushing only small files.

--
I am still working on the S3 approach and have hit a compatibility problem:
ListObjectsV2 is called regardless of whether listobjects is set to true or false.

(dvcenv) admins@test-Ai-largemodel:/mnt/datadisk1/laien/ws-dvc$ dvc push -r oss-s3 -vvv
2024-06-06 10:16:37,283 DEBUG: v3.51.2 (pip), CPython 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
2024-06-06 10:16:37,283 DEBUG: command: /home/admins/miniconda3/envs/dvcenv/bin/dvc push -r oss-s3 -vvv
2024-06-06 10:16:37,283 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=None, targets=[], remote='oss-s3', all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=True, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-06 10:16:37,519 TRACE:    12.48 ms in collecting stages from /mnt/datadisk1/laien/ws-dvc
Collecting                                                                                                                                                                  |0.00 [00:00,    ?entry/s]
2024-06-06 10:16:37,542 DEBUG: Preparing to transfer data from '/mnt/datadisk1/laien/ws-dvc/.dvc/cache/files/md5' to 's3://[remote-path]/dvc/files/md5'
2024-06-06 10:16:37,542 DEBUG: Preparing to collect status from '[remote-path]/dvc/files/md5'
2024-06-06 10:16:37,542 DEBUG: Collecting status from '[remote-path]/dvc/files/md5'
Pushing          '[remote-path]/dvc/files/md5'|                                                                                                                   |0/? [00:00<?,    ?files/s]
Pushing
2024-06-06 10:16:37,896 ERROR: unexpected error - The specified key does not exist.: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.        
Traceback (most recent call last):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 723, in _lsdir
    async for c in self._iterdir(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 773, in _iterdir
    async for i in it:
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/aiobotocore/paginate.py", line 30, in __anext__
    response = await self._make_request(current_kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 64, in run
    processed_files_count = self.repo.push(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/repo/push.py", line 147, in push
    push_transferred, push_failed = ipush(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/index/push.py", line 76, in push
    result = transfer(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status
    dest_exists, dest_missing = status(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 151, in status
    exists.update(odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback))
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 423, in oids_exist
    remote_size, remote_oids = self._estimate_remote_size(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 305, in _estimate_remote_size
    remote_oids = set(iter_with_pbar(oids))
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 295, in iter_with_pbar
    for oid in oids:
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 262, in _oids_with_limit
    for i, oid in enumerate(self._list_oids(prefixes=prefixes, jobs=jobs), start=1):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 250, in _list_oids
    for path in self._list_prefixes(prefixes=prefixes, jobs=jobs):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 225, in _list_prefixes
    yield from self.fs.find(paths, batch_size=jobs, prefix=prefix)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 816, in find
    yield from self.fs.find(path, prefix=prefix_str)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 848, in _find
    out = await self._lsdir(path, delimiter="", prefix=prefix, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 736, in _lsdir
    raise translate_boto_error(e)
FileNotFoundError: The specified key does not exist.

2024-06-06 10:16:37,943 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/ws-dvc/.dvc/cache/files/md5/.YY-I1sW7eTcDot5sfhN07Q.tmp'
2024-06-06 10:16:37,949 DEBUG: Version info for developers:
DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.5.0, boto3 = 1.34.106)
Config:
        Global: /home/admins/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme1n1
Caches: local
Remotes: oss, s3
Workspace directory: ext4 on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/4f9f0c30c341088cc84e9b8b312f7113

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-06-06 10:16:37,952 DEBUG: Analytics is enabled.
2024-06-06 10:16:37,952 TRACE: Saving analytics report to /tmp/tmphllltv1h
2024-06-06 10:16:37,993 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmphllltv1h', '-vv']
2024-06-06 10:16:38,001 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmphllltv1h', '-vv'] with pid 115727
2024-06-06 10:16:38,002 TRACE: Process 115714 exiting with 255

@Wangsongming

I also have this problem. It seems that large files are uploaded to OSS in parts (multipart upload), but the call ends immediately without waiting for the upload to return, so large files never get uploaded. I hope this can be fixed as soon as possible.
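
The RuntimeWarning in the logs is what Python emits when a coroutine object is created but never awaited. Below is a minimal sketch of that failure mode. It uses a hypothetical stand-in named resumable_upload rather than the real ossfs code, and it is not a claim about where exactly the bug sits in ossfs.

```python
import asyncio

remote = []  # stands in for the bucket contents


async def resumable_upload(name):
    """Hypothetical stand-in for a multipart upload; not the ossfs/oss2 function."""
    await asyncio.sleep(0)  # pretend to send the parts
    remote.append(name)


async def put_file_buggy(name):
    # Calling a coroutine function only creates a coroutine object.
    # Without `await` its body never runs, yet this function returns normally.
    resumable_upload(name)


async def put_file_fixed(name):
    # Awaiting the coroutine runs the upload to completion before returning.
    await resumable_upload(name)


asyncio.run(put_file_buggy("large.bin"))
print(remote)  # [] : nothing was uploaded, but no exception was raised
asyncio.run(put_file_fixed("large.bin"))
print(remote)  # ['large.bin']
```

The buggy variant finishes without an error, which would be consistent with dvc push reporting "1 file pushed" while the object is absent from the bucket.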

@skshetry
Member

skshetry commented Sep 3, 2024

Could be related to fsspec/ossfs#129. Please file a bug upstream.

@Wangsongming

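I have run into this problem too. I want to upload large files to OSS using multipart upload, but the call ends immediately without waiting for the result, so large files never get uploaded. I hope this can be fixed as soon as possible.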

Collecting |2.00 [00:00, 250entry/s]
Pushing
D:\python\lib\site-packages\ossfs\async_oss.py:388: RuntimeWarning: coroutine 'resumable_upload' was never awaited
  await self._call_oss(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Pushing
2 files pushed

@Wangsongming

Member

I think this is a bug in the dvc-oss plugin

@ws1336

ws1336 commented Nov 7, 2024

Has this problem been resolved?
