-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checkpointing Example #164
Conversation
4d29c95
to
33627fe
Compare
Waiting for merge of #161. |
0b75ac4
to
51ee228
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not completely done reviewing, moving to #163 before I complete this.
.github/workflows/tests.yml
Outdated
run: python3 -m pip install -r docs/requirements.txt | ||
|
||
- name: Run files generation tests | ||
run: pre-commit run --all-files && [[ -z "$(git status -s)" ]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's && [[ -z "$(git status -s)" ]]
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making sure that nothing changed :
&&
: run the next command only if the previous succeededgit status -s
: only print the files that differs (including new and deleted)[[ -z ... ]]
: checks if the length of the output or variable is zero and returns true if it is zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self, we should do pre-commit run --all-files preprocess-py
to only run this hook in particular.
(I locally added other hooks in my pre-commit config. If we ever add some other hooks for writing the examples (e.g. ruff, black, etc, then we should probably only run the generate hook here)
776bcd7
to
ad75d92
Compare
Rebased on #182 which should be merged before this one |
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
This reverts commit 3b383ad.
Signed-off-by: Fabrice Normandin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(NOTE: J'approve ce qui est devenu ma propre PR, il me faudrait 1-2 autres reviews svp)
Signed-off-by: Fabrice Normandin <[email protected]>
docs/examples/checkpointing/job.sh
Outdated
# trap the signal to the main BATCH script here. | ||
sig_handler() | ||
{ | ||
echo "BATCH interrupted" | ||
wait # wait for all children, this is important! | ||
} | ||
|
||
trap 'sig_handler' SIGINT SIGTERM SIGCONT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useless after exec python main.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you're right! Thanks @obilaniu. Should we encourage users to do srun python (...)
(and trap?) or exec python (...)
in your view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This being an example only, I would adopt one of two strategies - either minimize the delta from the foundation mini-example to this one, or minimize the complexity of this one in the absolute.
I would prefer that you drop the trap
business, and use the Python equivalent (the signal
package).
However, I have no faith in asynchronous checkpointing at all and recommend only synchronous checkpointing. I would use the signal
package only to demonstrate in the mini-example that Python can install signal handlers and execute Python code as a result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, here's my take: 779413b
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not able to review the python code, but that changes looks good to me overall.
Overall this is fine, however I do not like the strategy for checkpointing. It's not safe. Your implementation:
You vulnerability is that if you are killed in step 2 during the first checkpoint's write, the reload will detect no My preferred implementation:
|
Signed-off-by: Fabrice Normandin <[email protected]>
No description provided.