
Deepmd in Paddle for example, just 'water_se_a' model #529

Merged: 9 commits merged on Apr 19, 2021


@zhwesky2010 commented Apr 18, 2021

Deepmd in Paddle for example, just the water_se_a model.

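This is example code showing how Paddle supports the water_se_a model in deepmd:

At present it covers only the single water_se_a.json model (for bio-computing): training, prediction/evaluation, and support for 3 custom OPs, all implemented with Paddle dynamic graph. No other model code is changed.

The results obtained with this code on Deepmd water_se_a.json are recorded below: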

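Model accuracy

1. Model training

Note: The network is built with Paddle dynamic graph, and its structure is kept fully identical to the TensorFlow static graph. The only difference is parameter initialization: all parameters use a Normal initializer with the same mean, variance and seed, but because the underlying implementations differ, the initial values in Paddle and TF are not the same.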
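The effect described in the note (same mean, variance and seed, yet different initial values) can be reproduced without either framework. In this sketch, `init_framework_a` and `init_framework_b` are hypothetical stand-ins, not Paddle or TF code; they differ only in the normal-sampling algorithm they use (`random.gauss` versus `random.normalvariate` from Python's standard library), both seeded identically.

```python
import random

# Hypothetical "framework A": draws N(mean, std) with random.gauss.
def init_framework_a(n, mean, std, seed):
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]

# Hypothetical "framework B": same distribution and seed, but a different
# sampling algorithm (random.normalvariate).
def init_framework_b(n, mean, std, seed):
    rng = random.Random(seed)
    return [rng.normalvariate(mean, std) for _ in range(n)]

MEAN, STD, SEED = 0.0, 1.0, 1

a = init_framework_a(5, MEAN, STD, SEED)
b = init_framework_b(5, MEAN, STD, SEED)

print("framework A:", [round(x, 4) for x in a])
print("framework B:", [round(x, 4) for x in b])
# Same mean, std and seed, but the values differ.
print("identical:", a == b)
# Each implementation is still reproducible on its own.
print("A reproducible:", a == init_framework_a(5, MEAN, STD, SEED))
```

Because each function is individually reproducible, the mismatch comes only from the sampling algorithm, which is why bitwise-identical initialization across frameworks should not be expected.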

PaddlePaddle

| batch ID | rmse_test | rmse_train | rmse_e_test | rmse_e_train | rmse_f_test | rmse_f_train |
|---|---|---|---|---|---|---|
| 0 | 2.83e+01 | 2.80e+01 | 4.81e+00 | 4.80e+00 | 8.43e-01 | 8.33e-01 |
| 10000 | 2.80e+00 | 2.76e+00 | 1.62e-02 | 1.77e-02 | 9.32e-02 | 9.19e-02 |
| 100000 | 7.60e-01 | 6.90e-01 | 3.08e-03 | 2.92e-03 | 4.00e-02 | 3.64e-02 |
| 1000000 (end) | 3.61e-02 | 3.62e-02 | 2.47e-04 | 3.61e-04 | 3.54e-02 | 3.53e-02 |

TensorFlow

| batch ID | rmse_test | rmse_train | rmse_e_test | rmse_e_train | rmse_f_test | rmse_f_train |
|---|---|---|---|---|---|---|
| 0 | 3.34e+01 | 3.31e+01 | 1.03e+01 | 1.03e+01 | 8.42e-01 | 8.30e-01 |
| 10000 | 3.44e+00 | 3.88e+00 | 2.30e-02 | 2.46e-02 | 1.15e-01 | 1.29e-01 |
| 100000 | 9.76e-01 | 9.15e-01 | 2.22e-03 | 2.71e-03 | 5.15e-02 | 4.83e-02 |
| 1000000 (end) | 3.88e-02 | 3.61e-02 | 2.45e-04 | 3.03e-04 | 3.80e-02 | 3.52e-02 |

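Conclusion: final results after 1,000,000 training batches:

  • Force loss: Paddle is slightly better than TensorFlow (3.54e-02 (pd) vs. 3.80e-02 (tf))
  • Energy loss: Paddle is slightly worse than TensorFlow (2.47e-04 (pd) vs. 2.45e-04 (tf))
  • Total loss: Paddle is slightly better than TensorFlow (3.61e-02 (pd) vs. 3.88e-02 (tf))
  • Accuracy is aligned, with only very small differences; the sensitivity of the loss to force and energy may need some fine-tuning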
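The conclusions can be re-derived from the last rows of the two training tables. This is a small sketch; the `compare` helper is hypothetical and simply treats a lower RMSE as better, reporting Paddle's relative gap to TF.

```python
# Final-batch (1,000,000) test values copied from the two training tables.
final = {
    "total (rmse_test)": {"pd": 3.61e-02, "tf": 3.88e-02},
    "energy (rmse_e_test)": {"pd": 2.47e-04, "tf": 2.45e-04},
    "force (rmse_f_test)": {"pd": 3.54e-02, "tf": 3.80e-02},
}

def compare(pd_value, tf_value):
    """Verdict for Paddle (lower RMSE is better) and its relative gap to TF."""
    rel = (pd_value - tf_value) / tf_value
    if pd_value < tf_value:
        verdict = "better"
    elif pd_value > tf_value:
        verdict = "worse"
    else:
        verdict = "equal"
    return verdict, rel

results = {name: compare(v["pd"], v["tf"]) for name, v in final.items()}
for name, (verdict, rel) in results.items():
    print(f"{name}: Paddle {verdict} ({rel:+.1%} vs. TF)")
```

The relative gaps make the size of the differences explicit: a few percent on force and total loss, under one percent on energy.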

2. Model evaluation

Note: After 1,000,000 training batches, the model is saved, reloaded and evaluated on 30 frames of test data.

[Image: PaddlePaddle evaluation output]

[Image: TensorFlow evaluation output]

| Metric | PaddlePaddle-GPU develop (dynamic graph) | TensorFlow-GPU 2.20 (static graph) |
|---|---|---|
| Energy RMSE | 7.003462e-02 | 6.748153e-02 |
| Energy RMSE/Natoms | 3.647637e-04 | 3.514663e-04 |
| Force RMSE | 3.499594e-02 | 3.713263e-02 |
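As a consistency check on the evaluation table, the per-atom column should equal the energy RMSE divided by the number of atoms per frame. The sketch assumes the usual RMSE definition (square root of the mean squared difference); the `rmse` helper and its toy inputs are illustrative, not deepmd-kit's own test code. Dividing the two reported columns gives about 192 for both frameworks, so the columns agree with each other.

```python
import math

def rmse(pred, ref):
    """Root-mean-square error: sqrt(mean((pred - ref)^2))."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("pred and ref must be non-empty and the same length")
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

# Hypothetical per-frame energies, only to show the definition in use.
toy = rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.0])

# Values copied from the evaluation table.
energy_rmse = {"paddle": 7.003462e-02, "tf": 6.748153e-02}
energy_rmse_per_atom = {"paddle": 3.647637e-04, "tf": 3.514663e-04}

# Energy RMSE/Natoms = Energy RMSE / Natoms, so the ratio recovers Natoms.
implied_natoms = {fw: energy_rmse[fw] / energy_rmse_per_atom[fw] for fw in energy_rmse}

print(f"toy RMSE = {toy:.5f}")
for fw, n in implied_natoms.items():
    print(f"{fw}: implied Natoms = {n:.3f}")
```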

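Evaluation conclusions:

  • Energy RMSE: essentially the same; Paddle is slightly higher than TensorFlow
  • Force RMSE: essentially the same; Paddle is slightly lower than TensorFlow
  • Because Paddle and TF produce completely different Normal parameter initializations, and the loss responds somewhat differently to force and energy, the final results differ very slightly: Paddle is marginally better on one metric and marginally worse on the other.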

Model performance

| Training speed | PaddlePaddle-GPU develop (dynamic graph) | TensorFlow-GPU 2.20 (static graph) |
|---|---|---|
| GPU V100 + CUDA 10.1 | 4.05 s / 100 batches | 2.87 s / 100 batches |
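The per-100-batch timings translate directly into total training time for 1,000,000 batches. A minimal arithmetic sketch, counting only the measured step time and ignoring evaluation or checkpointing overhead:

```python
# Seconds per 100 batches, copied from the table.
sec_per_100 = {"paddle": 4.05, "tf": 2.87}
TOTAL_BATCHES = 1_000_000

def total_hours(sec_per_100_batches, batches):
    """Convert a per-100-batch time into hours for the whole run."""
    return sec_per_100_batches / 100 * batches / 3600

hours = {fw: total_hours(s, TOTAL_BATCHES) for fw, s in sec_per_100.items()}
slowdown = sec_per_100["paddle"] / sec_per_100["tf"]

for fw, h in hours.items():
    print(f"{fw}: {h:.2f} h for {TOTAL_BATCHES:,} batches")
print(f"Paddle / TF time ratio: {slowdown:.2f}x")
```

The Paddle figure (11.25 h) sits at the lower end of the 11 to 12 h estimate in the performance conclusions, and the ratio of about 1.41 quantifies how much slower the dynamic-graph run is on this model.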

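Performance conclusions:

  • On the water_se_a model, Paddle dynamic graph is somewhat slower than TF static graph; training 1,000,000 batches takes roughly 11 to 12 h;
  • On mainstream models, Paddle dynamic graph performs no worse than TensorFlow static graph. In dynamic-graph mode, however, Python and C++ interact very frequently, so the way the model's Python code is written has a large impact on performance. There may still be room to optimize the code, which needs a detailed investigation with a timeline profiler;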
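The second point suggests profiling to find Python-side overhead. The PR mentions a timeline tool; as a stand-in, this sketch uses Python's built-in `cProfile`, a different, function-level profiler. `small_op`, `step_many_calls` and `step_one_call` are hypothetical functions that compute the same result in two styles, so the profile shows how per-call Python dispatch adds up.

```python
import cProfile
import pstats

# Hypothetical stand-ins for two ways of writing the same step.
def small_op(x):
    return x * 1.0001

def step_many_calls(values):
    # One Python function call per element: heavy Python-level dispatch.
    return [small_op(v) for v in values]

def step_one_call(values):
    # Same arithmetic inline: no per-element function call.
    return [v * 1.0001 for v in values]

values = list(range(10_000))
profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    step_many_calls(values)
    step_one_call(values)
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (filename, lineno, funcname) -> (cc, nc, tt, ct, callers);
# nc is the total number of calls.
calls = {key[2]: value[1] for key, value in stats.stats.items()}
print("small_op calls:", calls.get("small_op"))
stats.sort_stats("cumulative").print_stats(5)
```

The 200,000 calls to `small_op` stand out immediately in the report; the same idea, applied with a framework-aware timeline, is what locating dygraph overhead amounts to.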

Custom OP accuracy

This PR supports the 3 custom OPs used by the water_se_a.json model, prod_env_mat_a, prod_force_se_a and prod_virial_se_a; the accuracy of each has been aligned with TF.
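Aligning a custom OP with TF generally means running both implementations on the same inputs and comparing outputs element-wise within a tolerance. The sketch below is a framework-free version of such a check; `outputs_aligned`, the tolerances and the sample force values are hypothetical, not numbers or code from this PR.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length outputs."""
    return max(abs(x - y) for x, y in zip(a, b))

def outputs_aligned(a, b, rtol=1e-6, atol=1e-8):
    """True if lengths match and every pair is within rtol/atol (math.isclose rules)."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b))

# Hypothetical flattened force outputs of the same OP from two frameworks.
paddle_out = [0.12345670, -0.98765430, 0.50000000]
tf_out = [0.12345671, -0.98765430, 0.50000000]
drifted = [0.12350000, -0.98765430, 0.50000000]

print("aligned:", outputs_aligned(paddle_out, tf_out),
      "max diff:", max_abs_diff(paddle_out, tf_out))
print("aligned:", outputs_aligned(paddle_out, drifted),
      "max diff:", max_abs_diff(paddle_out, drifted))
```

Reporting the maximum absolute difference alongside the pass/fail result makes it easier to judge how close two implementations are, rather than only whether they cleared an arbitrary threshold.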

@zhwesky2010 changed the title from "Deepmd in Paddle fo example, just 'water_se_a' model" to "Deepmd in Paddle for example, just 'water_se_a' model" on Apr 18, 2021
@JiabinYang (Collaborator) left a comment:


LGTM. This PR depends on PaddlePaddle version >= 2.1.0.
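A minimal guard for the version requirement, assuming plain numeric version strings, compares versions as integer tuples rather than as strings, so that "2.10.0" correctly ranks above "2.9.0". The helpers `parse_version` and `meets_minimum` are hypothetical names for this sketch.

```python
def parse_version(text):
    """Parse a plain numeric version like '2.1.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def meets_minimum(installed, minimum="2.1.0"):
    """True if the installed version is at least the required minimum."""
    inst = parse_version(installed)
    req = parse_version(minimum)
    # Pad the shorter tuple with zeros so '2.1' compares equal to '2.1.0'.
    width = max(len(inst), len(req))
    inst += (0,) * (width - len(inst))
    req += (0,) * (width - len(req))
    return inst >= req

for candidate in ["2.0.2", "2.1.0", "2.1", "2.10.0"]:
    print(candidate, meets_minimum(candidate))
```

In a real environment the installed version would come from `paddle.__version__`; pre-release or other non-numeric version strings would need extra parsing, which this sketch does not handle.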

@amcadmus changed the base branch from api to paddle April 19, 2021 07:05
@amcadmus merged commit ddcb9d7 into deepmodeling:paddle Apr 19, 2021
njzjz pushed a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023
For the VASP INCAR parameter NSW, both NSW = 0 and NSW = 1 produce 1 converged SCF, i.e. 1 valid frame of labeled data.
NSW = 0 (effectively a single-point calculation) was previously not supported in the "md_incar".
Although the previous limitation may have been reasonable semantically, this update supports this practical VASP setting and removes the annoying exception for users.

3 participants