`zstash create` is way too slow to be practical for km-scale global simulations. The bottleneck seems to be collecting all the files into a tar file before moving it to HPSS. I think zstash also uses hsi under the hood, and OLCF says hsi is much slower than htar, which in turn is much slower than Globus. I ended up having to run this in a single-node batch job. It ran for the maximum allowable 2 days and then timed out, so I restarted it with `zstash update` and had to resubmit it every few days for weeks. I think one of my cases might have finished, but I was never able to verify that all the data got uploaded.
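For concreteness, the workflow looked roughly like the sketch below; the walltime, environment name, and paths are placeholders rather than my actual job script:

```bash
#!/bin/bash
# Hypothetical archiving job; walltime, environment, and paths are placeholders.
#SBATCH --job-name=zstash-archive
#SBATCH --nodes=1
#SBATCH --time=48:00:00

source activate e3sm_unified            # environment providing zstash (assumed)

CASE_DIR=/path/to/km-scale-case-output  # placeholder local output directory
HPSS_DIR=/home/user/archive/km-case     # placeholder HPSS destination

cd "$CASE_DIR"

# First submission creates the archive; after a timeout, resubmit with
# `zstash update`, which skips files already recorded in the index.
zstash create --hpss="$HPSS_DIR" .
# zstash update --hpss="$HPSS_DIR"
```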
`zstash check` is also impractical for these big simulations: it takes weeks to upload the data and even longer to download it all again. Perhaps there could be a "quickcheck" option that just ensures that all the files in the original directory appear in the zstash index and that the total archived size is consistent with the sum of the individual file sizes? I'm also still a bit confused about why we can't checksum the data on HPSS without moving it back to scratch.
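To make the "quickcheck" idea concrete, here is a rough shell-level sketch. It assumes the index is the SQLite database zstash keeps in its local `zstash/` subdirectory and that its `files` table has `name` and `size` columns; the exact schema is an assumption on my part:

```bash
# Hypothetical "quickcheck": compare the local directory against the zstash
# index without pulling any tars back from HPSS. Schema names are assumed.
cd /path/to/km-scale-case-output   # placeholder

# File count and total size recorded in the index
sqlite3 zstash/index.db "SELECT COUNT(*), SUM(size) FROM files;"

# Local files (excluding zstash's own bookkeeping) missing from the index
comm -23 \
  <(find . -path ./zstash -prune -o -type f -printf '%P\n' | sort) \
  <(sqlite3 zstash/index.db "SELECT name FROM files;" | sort)
```

A check like this would catch files that never made it into a tar without ever touching HPSS; verifying the tars themselves would still need checksums.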
None of this is to say that zstash isn't great for what it was made for, and it's fine if we never end up using it for exascale simulations. I'm just reporting my experience as the first exascale zstash user.
Hi @PeterCaldwell Thank you for checking out the tool. I think this is the first application of zstash to exascale simulations, and zstash archiving is not exercised very often at OLCF. Since you also mentioned Globus, I'm wondering what your workflow with zstash is (e.g., archive from OLCF disk to HPSS, OLCF disk to NERSC HPSS, etc.). It would be helpful to include the zstash command line here (with the data paths), so that developers can understand the scale of the data, help diagnose the bottleneck, and try to make zstash useful for exascale simulations!
Understand the reason for the speed difference between a direct transfer to HPSS via Globus and what zstash is doing. If local tar file generation is the bottleneck, we could consider bypassing it and archiving directly to HPSS for large files.
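As a rough illustration of that idea (not existing zstash behavior), files above some size threshold could skip the local tar step and be pushed to HPSS one at a time, e.g. with `hsi put`; the threshold and paths below are placeholders:

```bash
# Hypothetical direct-archive path for very large files; smaller files would
# still go through the normal tar workflow. Run from inside the case directory.
HPSS_DIR=/home/user/archive/km-case   # placeholder HPSS destination

find . -type f -size +100G -printf '%P\n' | while read -r rel; do
    # `hsi put <local> : <hpss>` transfers a single file; create the
    # destination directory tree on HPSS first.
    hsi "mkdir -p $HPSS_DIR/$(dirname "$rel"); put $rel : $HPSS_DIR/$rel"
done
```

Where an HPSS Globus endpoint is available, a `globus transfer` of the same large files could play the same role.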