
zstash is impractical for exascale simulations #249

Open

PeterCaldwell opened this issue Jan 21, 2023 · 2 comments

Comments

@PeterCaldwell

zstash create is way too slow to be practical for km-scale global simulations. The bottleneck seems to be collecting all the files into a local tar file before moving it to HPSS. I think zstash also uses hsi under the hood, and OLCF says hsi is much slower than htar, which in turn is much slower than Globus. I ended up having to run this in a 1-node batch job. It ran for the maximum allowable 2 days and then timed out, so I restarted it with zstash update and had to resubmit it every few days for weeks. I think one of my cases might have finished, but I was never able to verify whether all the data actually got uploaded.

zstash check is also impractical for these big simulations: it takes weeks to upload the data, and downloading it all again for verification takes even longer. Perhaps there could be a "quickcheck" option that just verifies that every file in the original directory appears in the zstash index and that the total archived size matches the sum of the sizes of the individual files that were tarred? I'm also still a bit confused about why we can't checksum the data on HPSS without moving it back to scratch.
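To make the idea concrete, here's a rough (untested) sketch of what I mean, assuming zstash's `index.db` SQLite file lives under `<dir>/zstash/` and has a `files` table with `name` and `size` columns (worth double-checking against the actual schema of the zstash version in use):

```python
#!/usr/bin/env python
"""Rough "quickcheck" sketch: compare the local directory against the zstash
index without pulling any tar files back from HPSS."""
import os
import sqlite3
import sys

top = sys.argv[1]  # the directory that was archived with `zstash create`

# Load (name, size) pairs from the zstash index; `name` is assumed to be the
# path relative to the archived directory. If a file was archived more than
# once (e.g., via `zstash update`), the last entry wins.
con = sqlite3.connect(os.path.join(top, "zstash", "index.db"))
indexed = dict(con.execute("SELECT name, size FROM files"))
con.close()

missing, mismatched = [], []
for root, dirs, files in os.walk(top):
    dirs[:] = [d for d in dirs if d != "zstash"]  # skip zstash's own cache
    for f in files:
        path = os.path.join(root, f)
        rel = os.path.relpath(path, top)
        if rel not in indexed:
            missing.append(rel)
        elif os.path.getsize(path) != indexed[rel]:
            mismatched.append(rel)

print(f"{len(missing)} local files missing from the index")
print(f"{len(mismatched)} files with a size mismatch")
print(f"total size recorded in index: {sum(indexed.values())} bytes")
```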

All of this isn't to say that zstash isn't great for what it was made for, and it's fine if we never end up using it for exascale simulations. I'm just reporting my experience as the first exascale zstash user.

@chengzhuzhang
Collaborator

Hi @PeterCaldwell, thank you for checking out the tool. I think this is the first application of zstash to exascale simulations, and zstash archiving is not exercised very often on OLCF. Since you also mentioned Globus, I'm wondering what your workflow with zstash is (e.g., archiving from OLCF disk to HPSS, from OLCF disk to NERSC HPSS, etc.). It would be helpful to include the zstash command lines here (with the data paths), so that developers can understand the scale of the data, help diagnose the bottleneck, and try to make zstash useful for exascale simulations!

@golaz
Collaborator

golaz commented Jan 31, 2023

@PeterCaldwell: thanks for trying this out and sharing your experience, as well as for detailing your eventual workflow:

E3SM-Project/scream#2131

Initial action items from your experience:

  1. Implement the --include functionality; see Add --include option #199.
  2. Understand the reason for the speed difference between a direct transfer to HPSS via Globus and what zstash is doing. If local tar file generation is the bottleneck, we could consider bypassing it and archiving large files directly to HPSS (a rough sketch of a direct Globus transfer follows below).
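For reference, here is a minimal sketch of what a direct Globus transfer could look like using the Globus Python SDK (globus_sdk). The client ID, endpoint UUIDs, and paths are all placeholders, and whether this actually beats zstash's tar-then-hsi path is exactly what we would need to measure:

```python
import globus_sdk

# Placeholders: substitute your own native-app client ID, endpoint UUIDs,
# and paths for your site (e.g., an OLCF DTN and an HPSS-backed endpoint).
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"

# One-time interactive login to obtain a transfer token.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

tdata = globus_sdk.TransferData(
    tc,
    SRC_ENDPOINT,
    DST_ENDPOINT,
    label="direct archive, no local tar step",
    sync_level="checksum",  # a resubmitted transfer skips files already done
)
# Transfer the run directory recursively, bypassing local tar generation.
tdata.add_item("/scratch/my_run_dir", "/archive/my_run_dir", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```

A side benefit of this route is that Globus transfers are resumable and checksummed by the service, which would also address the multi-day batch-job resubmission pain described above.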
