freezing with mpirun will obtain a wrong model #2272

Closed
njzjz opened this issue Jan 24, 2023 Discussed in #1289 · 1 comment · Fixed by #2937
Labels
bug reproduced This bug has been reproduced by developers

Comments

@njzjz
Member

njzjz commented Jan 24, 2023

Discussed in #1289

Originally posted by TinacciL November 17, 2021
I installed the GPU version of DeePMD-kit (ghcr.io/deepmodeling/deepmd-kit:2.0.3_cuda10.1_gpu) via Docker. I tested it and it works fine with the provided example.

I then started modeling water in a cluster configuration (nopbc). I created a database of about 20000 frames (energies and forces for H2O clusters of 1 to 200 molecules at different temperatures).

I trained the model with almost the same input as the one provided in example/water/se_e2_a:

{
    "_comment": " model parameters",
    "model": {
	"type_map":	["O", "H"],
	"descriptor" :{
	    "type":		"se_e2_a",
	    "sel":		[70, 140],
	    "rcut_smth":	0.50,
	    "rcut":		6.00,
	    "neuron":		[25, 50, 100],
	    "resnet_dt":	false,
	    "axis_neuron":	16,
	    "seed":		1,
	    "_comment":		" that's all"
	},
	"fitting_net" : {
	    "neuron":		[340, 340, 340],
	    "resnet_dt":	true,
	    "seed":		1,
	    "_comment":		" that's all"
	},
	"_comment":	" that's all"
    },

    "learning_rate" :{
	"type":		"exp",
	"decay_steps":	5000,
	"start_lr":	0.001,	
	"stop_lr":	3.51e-8,
	"_comment":	"that's all"
    },

    "loss" :{
	"type":		"ener",
	"start_pref_e":	0.02,
	"limit_pref_e":	1,
	"start_pref_f":	1000,
	"limit_pref_f":	1,
	"start_pref_v":	0,
	"limit_pref_v":	0,
	"_comment":	" that's all"
    },

    "training" : {
	"training_data": {
	    "systems":		["../data_gfn2/train_1WM/", "../data_gfn2/train_2WM/", "../data_gfn2/train_10WM/", "../data_gfn2/train_60WM/", "../data_gfn2/train_100WM/", "../data_gfn2/train_200WM/"],
	    "batch_size":	"auto",
	    "_comment":		"that's all"
	},
	"validation_data":{
	    "systems":		["../data_gfn2/test_1WM/", "../data_gfn2/test_2WM/", "../data_gfn2/test_10WM/", "../data_gfn2/test_60WM/", "../data_gfn2/test_100WM/", "../data_gfn2/test_200WM/"],
	    "batch_size":	1,
	    "numb_btch":	3,
	    "_comment":		"that's all"
	},
	"numb_steps":	1000000,
	"seed":		10,
	"disp_file":	"lcurve.out",
	"disp_freq":	100,
	"save_freq":	1000,
	"_comment":	"that's all"
    },    

    "_comment":		"that's all"
}
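As a sanity check on the learning_rate block, the schedule can be sketched in a few lines. This assumes a staircase exponential decay whose rate is chosen so that the learning rate reaches stop_lr at numb_steps. That form is an assumption made here, not something taken from the DeePMD-kit source, and the function name lr_at is hypothetical.

```python
import math

# Values copied from the input above.
START_LR = 1e-3
STOP_LR = 3.51e-8
DECAY_STEPS = 5000
NUMB_STEPS = 1_000_000


def lr_at(step: int) -> float:
    """Learning rate at `step`, ASSUMING staircase exponential decay."""
    n_decays = NUMB_STEPS / DECAY_STEPS
    # Per-decay factor chosen so that lr_at(NUMB_STEPS) equals STOP_LR.
    decay_rate = math.exp(math.log(STOP_LR / START_LR) / n_decays)
    # Integer division gives the staircase: the rate only changes
    # every DECAY_STEPS steps.
    return START_LR * decay_rate ** (step // DECAY_STEPS)


for step in (0, 999_700, 1_000_000):
    print(f"{step:>8d}  {lr_at(step):.1e}")
```

Under these assumptions the values at steps 999700 and 1000000 round to 3.7e-08 and 3.5e-08, which agrees with the lr column of the lcurve.out excerpt quoted in this post, so the schedule itself does not look like the source of the problem.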

At the end of training, the lcurve.out file contains:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
 999700      2.86e-02    2.12e-02      2.15e-04    1.64e-04      2.76e-02    2.08e-02    3.7e-08
 999800      2.30e-02    2.31e-02      1.26e-04    2.77e-04      2.25e-02    2.25e-02    3.7e-08
 999900      2.20e-02    1.87e-02      6.70e-04    4.38e-04      2.10e-02    1.81e-02    3.7e-08
1000000      2.40e-02    1.95e-02      4.62e-04    4.25e-04      2.32e-02    1.89e-02    3.5e-08
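For comparing these numbers with other output, it can help to read the rows into named fields. The following is a minimal sketch that parses the four rows quoted above; the helper name parse_lcurve is hypothetical, and the parser assumes the whitespace-separated layout with a single "#" header line shown here.

```python
# The rows quoted above, pasted verbatim (header line included).
LCURVE_TAIL = """\
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
 999700      2.86e-02    2.12e-02      2.15e-04    1.64e-04      2.76e-02    2.08e-02    3.7e-08
 999800      2.30e-02    2.31e-02      1.26e-04    2.77e-04      2.25e-02    2.25e-02    3.7e-08
 999900      2.20e-02    1.87e-02      6.70e-04    4.38e-04      2.10e-02    1.81e-02    3.7e-08
1000000      2.40e-02    1.95e-02      4.62e-04    4.25e-04      2.32e-02    1.89e-02    3.5e-08
"""


def parse_lcurve(text: str) -> list[dict]:
    """Parse lcurve-style text into one dict per row, keyed by column name."""
    lines = text.strip().splitlines()
    # Column names come from the header line, minus the leading '#'.
    header = lines[0].lstrip("#").split()
    rows = []
    for line in lines[1:]:
        row = {name: float(value) for name, value in zip(header, line.split())}
        row["step"] = int(row["step"])
        rows.append(row)
    return rows


rows = parse_lcurve(LCURVE_TAIL)
last = rows[-1]
print(f"step {last['step']}: "
      f"rmse_e_val={last['rmse_e_val']:.2e}, rmse_f_val={last['rmse_f_val']:.2e}")
```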

After freezing the model, I ran the dp test command on some of the validation data and got these results:

DEEPMD INFO    # number of test data : 10 
DEEPMD INFO    Energy RMSE               : 6.584139e+03 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.097356e+02 eV
DEEPMD INFO    Force  RMSE                : 3.281405e-01 eV/A
DEEPMD INFO    Virial RMSE                  : 2.225164e+00 eV
DEEPMD INFO    Virial RMSE/Natoms    : 3.708607e-02 eV

I also ran it on the training data, to check whether this was an overfitting problem, and got:

DEEPMD INFO    # number of test data : 10 
DEEPMD INFO    Energy RMSE               : 6.584253e+03 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.097375e+02 eV
DEEPMD INFO    Force  RMSE                : 3.373093e-01 eV/A
DEEPMD INFO    Virial RMSE                  : 3.646905e+00 eV
DEEPMD INFO    Virial RMSE/Natoms    : 6.078175e-02 eV
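The size of the mismatch can be put in numbers. The sketch below uses only values quoted in this post. It assumes that the rmse_e columns of lcurve.out are normalized per atom, so they are compared against Energy RMSE/Natoms rather than Energy RMSE. The variable names are illustrative.

```python
# dp test on the validation data (values quoted above).
test_e_rmse = 6.584139e+03           # Energy RMSE, eV
test_e_rmse_per_atom = 1.097356e+02  # Energy RMSE/Natoms, eV
test_f_rmse = 3.281405e-01           # Force RMSE, eV/A

# Last row of lcurve.out (step 1000000).
# ASSUMPTION: rmse_e_val is already a per-atom quantity.
lcurve_rmse_e_val = 4.62e-04
lcurve_rmse_f_val = 2.32e-02

# Atoms per frame implied by the two energy lines of dp test.
n_atoms = test_e_rmse / test_e_rmse_per_atom

# How much larger the dp test errors are than the on-the-fly ones.
energy_ratio = test_e_rmse_per_atom / lcurve_rmse_e_val
force_ratio = test_f_rmse / lcurve_rmse_f_val

print(f"atoms per frame implied by dp test: {n_atoms:.1f}")
print(f"energy error, dp test vs lcurve:    {energy_ratio:.1e}x")
print(f"force error, dp test vs lcurve:     {force_ratio:.1f}x")
```

With these inputs the per-atom energy error from dp test is about five orders of magnitude larger than the on-the-fly value, while the force error is larger by a factor of about 14. The validation and training results are nearly identical, which argues against overfitting as the explanation.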

Why does the test command not give the same results as the on-the-fly testing during training?
Is this a problem with nopbc, or just my inexperience?

Thanks

@njzjz njzjz added the bug label Jan 24, 2023
@njzjz njzjz added the reproduced This bug has been reproduced by developers label Aug 3, 2023
@njzjz
Member Author

njzjz commented Aug 3, 2023

I reproduced the behavior once but have not been able to reproduce it since. The cause is still unclear, but I think this issue is low priority.

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 18, 2023
@njzjz njzjz linked a pull request Oct 18, 2023 that will close this issue
wanghan-iapcm pushed a commit that referenced this issue Oct 19, 2023
Fix #2272.

Signed-off-by: Jinzhe Zeng <[email protected]>
@njzjz njzjz closed this as completed Oct 19, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Bugfixes for DeePMD-kit Oct 19, 2023