
zstash is impractical for exascale simulations #249

Open

PeterCaldwell opened this issue Jan 21, 2023 · 2 comments

Comments

@PeterCaldwell

zstash create is way too slow to be practical for km-scale global simulations. The bottleneck seems to be collecting all the files into a local tar file before moving it to HPSS. I think zstash also uses hsi under the hood, and OLCF says hsi is much slower than htar, which in turn is much slower than Globus. I ended up having to run this in a 1-node batch job. It ran for the maximum allowable 2 days and then timed out, so I restarted it with zstash update and had to resubmit it every few days for weeks. I think one of my cases might have finished, but I was never able to verify whether all the data actually got uploaded.

zstash check is also impractical for these big simulations: it takes weeks to upload the data, and downloading it all again for verification takes even longer. Perhaps there could be a "quickcheck" option that just verifies that every file in the original directory appears in the zstash index and that the total archived size matches the sum of the sizes of the individual files that were tarred? I'm also still a bit confused about why we can't checksum the data on HPSS without moving it back to scratch.
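To make the idea concrete, here's a rough (untested) sketch of what I mean, assuming zstash's `index.db` SQLite file lives under `<dir>/zstash/` and has a `files` table with `name` and `size` columns (worth double-checking against the actual schema of the zstash version in use):

```python
#!/usr/bin/env python
"""Rough "quickcheck" sketch: compare the local directory against the zstash
index without pulling any tar files back from HPSS."""
import os
import sqlite3
import sys

top = sys.argv[1]  # the directory that was archived with `zstash create`

# Load (name, size) pairs from the zstash index; `name` is assumed to be the
# path relative to the archived directory. If a file was archived more than
# once (e.g., via `zstash update`), the last entry wins.
con = sqlite3.connect(os.path.join(top, "zstash", "index.db"))
indexed = dict(con.execute("SELECT name, size FROM files"))
con.close()

missing, mismatched = [], []
for root, dirs, files in os.walk(top):
    dirs[:] = [d for d in dirs if d != "zstash"]  # skip zstash's own cache
    for f in files:
        path = os.path.join(root, f)
        rel = os.path.relpath(path, top)
        if rel not in indexed:
            missing.append(rel)
        elif os.path.getsize(path) != indexed[rel]:
            mismatched.append(rel)

print(f"{len(missing)} local files missing from the index")
print(f"{len(mismatched)} files with a size mismatch")
print(f"total size recorded in index: {sum(indexed.values())} bytes")
```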

All of this isn't to say that zstash isn't great for what it was made for, and it's fine if we never end up using it for exascale simulations. I'm just reporting my experience as the first exascale zstash user.

@chengzhuzhang
Collaborator

Hi @PeterCaldwell, thank you for checking out the tool. I think this is the first application of zstash to exascale simulations, and zstash archiving is not exercised very often on OLCF. Since you also mentioned Globus, I'm wondering what your workflow with zstash is (e.g., archiving from OLCF disk to HPSS, from OLCF disk to NERSC HPSS, etc.). It would be helpful to include the zstash command lines here (with the data paths), so that developers can understand the scale of the data, help diagnose the bottleneck, and try to make zstash useful for exascale simulations!

@golaz
Collaborator

golaz commented Jan 31, 2023

@PeterCaldwell: thanks for trying this out and sharing your experience, as well as for detailing your eventual workflow:

E3SM-Project/scream#2131

Initial action items from your experience:

  1. Implement the --include functionality; see Add --include option #199.
  2. Understand the reason for the speed difference between a direct transfer to HPSS via Globus and what zstash is doing. If local tar file generation is the bottleneck, we could consider bypassing it and archiving large files directly to HPSS (a rough sketch of a direct Globus transfer follows below).
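For reference, here is a minimal sketch of what a direct Globus transfer could look like using the Globus Python SDK (globus_sdk). The client ID, endpoint UUIDs, and paths are all placeholders, and whether this actually beats zstash's tar-then-hsi path is exactly what we would need to measure:

```python
import globus_sdk

# Placeholders: substitute your own native-app client ID, endpoint UUIDs,
# and paths for your site (e.g., an OLCF DTN and an HPSS-backed endpoint).
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"

# One-time interactive login to obtain a transfer token.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

tdata = globus_sdk.TransferData(
    tc,
    SRC_ENDPOINT,
    DST_ENDPOINT,
    label="direct archive, no local tar step",
    sync_level="checksum",  # a resubmitted transfer skips files already done
)
# Transfer the run directory recursively, bypassing local tar generation.
tdata.add_item("/scratch/my_run_dir", "/archive/my_run_dir", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```

A side benefit of this route is that Globus transfers are resumable and checksummed by the service, which would also address the multi-day batch-job resubmission pain described above.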
