PT: keep the same checkpoint behavior as TF #3191

Merged (3 commits) on Jan 28, 2024

@njzjz njzjz commented Jan 27, 2024

Set the default save_ckpt prefix to model.ckpt. When a checkpoint is saved at, say, step 100, the file model.ckpt-100.pt is written and model.ckpt.pt is symlinked to it. A separate file named checkpoint records the name of the latest checkpoint (here, model.ckpt-100.pt).

This matches the behavior of the TF backend. With the PT backend, one can now run the following, just as with the TF backend:

dp --pt train input.json
# one can cancel the training before it finishes
dp --pt freeze
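
The file layout described above can be pictured with a small standalone sketch. This is not the deepmd-kit implementation: the helper name `save_checkpoint`, the byte payload standing in for the serialized model, and the one-line format of the `checkpoint` file are all assumptions made for illustration. It also assumes a platform where symlinks are available.

```python
import os
import tempfile
from pathlib import Path


def save_checkpoint(workdir: Path, prefix: str, step: int, payload: bytes) -> Path:
    """Write <prefix>-<step>.pt, point <prefix>.pt at it, and record its name.

    Hypothetical helper for illustration only.
    """
    target = workdir / f"{prefix}-{step}.pt"
    target.write_bytes(payload)  # stand-in for the real model serialization

    latest = workdir / f"{prefix}.pt"
    # Drop the link left by a previous save; on the first save there is none.
    latest.unlink(missing_ok=True)
    # Link by file name so the directory can be moved as a whole.
    os.symlink(target.name, latest)

    # Assumed format: the record file holds just the newest file name.
    (workdir / "checkpoint").write_text(target.name)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    save_checkpoint(workdir, "model.ckpt", 100, b"weights at step 100")
    save_checkpoint(workdir, "model.ckpt", 200, b"weights at step 200")

    latest_target = os.readlink(workdir / "model.ckpt.pt")
    recorded = (workdir / "checkpoint").read_text()
    latest_bytes = (workdir / "model.ckpt.pt").read_bytes()
```

Because the unsuffixed name always resolves to the newest numbered file, a later step such as freezing can look for model.ckpt.pt without knowing at which step training stopped.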

Set the default save_ckpt to `model.ckpt` as the prefix. When saving checkpoints, `model.ckpt-100.pt` will be saved, and `model.ckpt.pt` will be symlinked to `model.ckpt-100.pt`. A `checkpoint` file will be saved to record `model.ckpt-100.pt`.

This keeps the same behavior as the TF backend.

Signed-off-by: Jinzhe Zeng <[email protected]>
try:
    # remove old one
    os.remove(new_ff)
except OSError:
    pass

Check notice (Code scanning / CodeQL)

Empty except (Note): 'except' clause does nothing but pass and there is no explanatory comment.
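
One common way to address this kind of notice is sketched below. It is not necessarily what this PR ended up doing, and the helper name `remove_if_present` is hypothetical. Note that it narrows the handled error from `OSError` to `FileNotFoundError`, which is a deliberate change from the snippet above: other failures, such as permission errors, would now propagate.

```python
import contextlib
import os
import tempfile


def remove_if_present(path: str) -> None:
    """Delete `path`, ignoring the case where it does not exist."""
    # suppress() states the intent directly: a missing file is expected
    # here, so no bare `pass` is left for a linter to flag.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


with tempfile.TemporaryDirectory() as tmp:
    stale = os.path.join(tmp, "model.ckpt.pt")
    remove_if_present(stale)  # no file yet: silently does nothing
    open(stale, "w").close()  # create an empty file
    remove_if_present(stale)  # file exists: it is removed
    still_there = os.path.exists(stale)
```

Keeping the original `except OSError:` and adding a short explanatory comment next to `pass` would also satisfy the check while preserving the broader error handling.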
Signed-off-by: Jinzhe Zeng <[email protected]>

codecov bot commented Jan 27, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (3e4715f) 74.27% compared to head (968ae48) 74.27%.

| Files | Patch % | Lines |
|---|---|---|
| deepmd/pt/entrypoints/main.py | 0.00% | 3 Missing ⚠️ |
| deepmd/common.py | 93.33% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #3191      +/-   ##
==========================================
- Coverage   74.27%   74.27%   -0.01%     
==========================================
  Files         343      343              
  Lines       31629    31634       +5     
  Branches     1592     1592              
==========================================
+ Hits        23494    23497       +3     
- Misses       7210     7212       +2     
  Partials      925      925              


Signed-off-by: Jinzhe Zeng <[email protected]>
@wanghan-iapcm wanghan-iapcm merged commit a8168b5 into deepmodeling:devel Jan 28, 2024
45 checks passed
@njzjz njzjz mentioned this pull request Apr 2, 2024

thangckt commented May 3, 2024

Hi @njzjz,

Could you explain why the PyTorch backend uses two different file extensions, .pth and .pt?

The *.pt files are generated by running

dp --pt train input.json

and the *.pth file by running

dp --pt freeze

Could we use just one of these extensions, to make it easier to collect files in dpgen?


njzjz commented May 3, 2024

No control flow is saved in the checkpoint file.
