Stampede2 bundled submissions #437
Comments
It's possible that this is just the |
When I do the actual submit without --pretend I get the following error:
When I switch back to flow 0.8 submit works normally... I doubt this is connected though. |
(Quoting comment from #438, which I should have written on this issue: #438 (comment)) @DomFijan I resolved the issue with submission and I'm going to merge this PR immediately to prevent the problems from occurring for other testers. I expect that the issue with Tagging @b-butler @vyasr since I think you discussed & fixed this for Stampede2 in #250, #298. |
The submit is now fixed and I've submitted as you suggested. I extracted the submitted script via
The job starts normally but is very, very slow. I re-checked this with a simulation I had run previously: I reran it with the newest flow version and it shows the exact same slowdown. The flow version in the container is 0.11. |
Having read #298 and #250, the produced script is in accordance with the fixes implemented there. However, I find it very unintuitive that the line that gets executed looks nothing like what one would expect. Perhaps adding a line that says "the above is equivalent to the below commented line" would help? Something like:
|
@bdice yes, the offsets don't get correctly incremented in pretend mode but they should when actually submitting. I don't remember the exact reason for this, but it's basically because of the different way that environment classes are used during run vs submit. The Stampede2Environment maintains an internal offset counter, which is always tacked onto the value of the environment variable and incremented within a given run loop. Since submit calls run and run forks (when using MPI), the environment variable allows communication of the current offset between the parent and child processes. During a pretend submission the internal offset variable calculation won't actually see the new environment variable. I think (this is the detail I don't 100% remember but could look into if needed) that this happens because the environment variable is only loaded when the class itself is created (or equivalently, when the module is loaded), which makes sense since these classes are not designed to be instantiated, but during pretend submission since there is no forking happening the class is only created once and so it never sees the environment variable. IIRC fixing this issue would require substantially more convoluted logic than is worth implementing.

@DomFijan If I understand you correctly, you find it confusing that the command run at submission time is different from the operation you submitted, whereas the actual operation you want to submit is shown below the "Eligible to run" header. Assuming I understand you correctly, your suggestion is unfortunately not really accurate. However, I'll try to explain what's happening and maybe you can suggest alternative ways to improve our current output.

This distinction is not just a shortcoming in our current code, but a consequence of a core feature of our execution model. Say you have a sequence of operations A->B->C that are all part of a group G, and initially only A is eligible.
If you submit a group G, the submission operation will be something like
A priori, there is no way to know whether running the first operation (A) will actually lead to B becoming eligible, so we simply print our best guess for what happens. As a result, the comment you suggest adding wouldn't really be accurate, because the two aren't equivalent. The "Eligible to run" section is the set of things that we know will run, but the "Operations with unmet preconditions" section (and the "Operations with all postconditions met", which could have postconditions invalidated by running other operations), represent operations that might run, but there's no way to know for sure at submit time because they will be reevaluated after the "Eligible to run" operations are completed. Does that make sense? |
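To make the point about re-evaluation concrete, here is a minimal sketch of the A->B->C example. It is not signac-flow code: the names run_group, eligible_at_submit, and make_ops are hypothetical, and the "state" dictionary is only an assumed stand-in for whatever the operations' conditions would really inspect.

```python
# Hypothetical toy model of a group G = [A, B, C]; not signac-flow's API.
# Each operation is (name, precondition, effect) acting on a shared state dict.


def make_ops(a_succeeds):
    """Build the chain A->B->C; whether A satisfies B's precondition is a parameter."""
    return [
        ("A", lambda s: True, lambda s: s.update(a_done=a_succeeds)),
        ("B", lambda s: s.get("a_done", False), lambda s: s.update(b_done=True)),
        ("C", lambda s: s.get("b_done", False), lambda s: s.update(c_done=True)),
    ]


def eligible_at_submit(operations, state):
    """What can be known at submit time: operations whose preconditions hold now."""
    return [name for name, precondition, _ in operations if precondition(state)]


def run_group(operations, state):
    """Run the group in order, re-checking each precondition right before running."""
    executed = []
    for name, precondition, effect in operations:
        if precondition(state):
            effect(state)
            executed.append(name)
    return executed
```

At submit time only A is known to be eligible in this model, yet running the group may execute A, B, and C, or only A, depending on what A actually does. That is why the submitted command and the "Eligible to run" list cannot be declared equivalent in advance.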
@bdice I agree with what Vyas just posted. Having now looked at this, I don't think it is worth the effort necessary to fix the pretend output to have more accurate offsets. It is somewhat confusing, but given our execution model it is somewhat unavoidable in this case. @DomFijan, the line you proposed adding to the submission script is also not quite accurate. The submission is not equivalent to the lines below. As Vyas said, it depends on preconditions and postconditions and, in cases like this, on the use of environment variables, so that the offsets are generated correctly at actual submission even though the pretend output suggests otherwise. |
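The offset behaviour discussed in this thread can be illustrated with a deliberately simplified sketch. It does not reproduce signac-flow's Stampede2Environment: the variable name SKETCH_OFFSET, the class SketchEnvironment, and both helper functions are hypothetical, the internal per-run counter is left out, and 16 tasks per operation is an assumption inferred from the expected offsets 0, 16, 32. The only mechanism it models is a value that is read once, when the class is created.

```python
ENV_VAR = "SKETCH_OFFSET"  # hypothetical name, not the variable flow actually uses


def make_environment_class(environ):
    """Create a class that reads the offset once, at class-creation time."""

    class SketchEnvironment:
        # Evaluated exactly once, when this class body runs.
        base_offset = int(environ.get(ENV_VAR, "0"))

        @classmethod
        def launch_prefix(cls, ntasks):
            return f"ibrun -n {ntasks} -o {cls.base_offset}"

    return SketchEnvironment


def prefixes_with_fresh_class(environ, task_counts):
    """Model a real run: every operation is handled by a newly created class."""
    prefixes = []
    for ntasks in task_counts:
        env_cls = make_environment_class(environ)  # stands in for the forked child
        prefixes.append(env_cls.launch_prefix(ntasks))
        # The parent publishes the next free offset for the following child.
        environ[ENV_VAR] = str(env_cls.base_offset + ntasks)
    return prefixes


def prefixes_with_single_class(environ, task_counts):
    """Model a pretend run: the class is created once and never re-reads the value."""
    env_cls = make_environment_class(environ)
    prefixes = []
    for ntasks in task_counts:
        prefixes.append(env_cls.launch_prefix(ntasks))
        # The variable still advances, but the existing class never sees it.
        environ[ENV_VAR] = str(int(environ.get(ENV_VAR, "0")) + ntasks)
    return prefixes
```

With three operations of 16 tasks each, the fresh-class variant yields offsets 0, 16, 32, while the single-class variant yields 0 every time, which is the symptom reported for the pretend output. A plain dictionary is passed in place of os.environ so the sketch has no side effects on the real process environment.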
Thanks for explaining @vyasr ! That does indeed make sense. I got tunnel visioned on the particular script I use. I forgot to take the grander context of flow groups into account. You are right. I will try and do some additional benchmarking on slow-down of execution of the code on the nodes when submitted with newest version vs. 0.8 version later today. |
I have tested the Stampede2 submission with the following versions:
Stampede has 48 cores per node. Another confusing issue is that, when submitting jobs with --parallel, the following warning is issued for both the new and the old versions:
|
Description
Submission of bundles on Stampede2 produces an incorrect submission script. This might be a template issue.
To reproduce
python3 project.py submit -n 3 -b 3 -w 0.2 --parallel --force --pretend
Error output
On Stampede2, bundling is done through ibrun, and if bundles are submitted an offset must be provided with -o. The old version of signac-flow (0.8) produced the correct command:
while the new version (pulled from master on 01/22/2021) produces:
The -o argument of ibrun is always 0 in the newest version, while it should be 0, 16, 32.
System configuration
[GCC Intel(R) C++ gcc 6.3 mode]
Both signac and signac-flow were pulled from master on 01/22/2021 and installed with pip.