Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

train espaloma with MD simulation #1

Merged
merged 60 commits into from
Mar 12, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
60 commits
Select commit Hold shift + click to select a range
ef04200
use pint.UnitRegistry to convert units
kntkb Feb 9, 2024
b63147f
change units.py
kntkb Feb 12, 2024
3176da2
add implementation error for other systems in experiment.py
kntkb Feb 12, 2024
12a749e
add draft for train.py
kntkb Feb 12, 2024
47f00ce
change logging info in espfit.utils.graphs.drop_and_merge_duplicates
kntkb Feb 15, 2024
f62bfd4
first commit espfit/utils/sampler
kntkb Feb 15, 2024
cd23fff
use pint.UnitRegistry to define units
kntkb Feb 15, 2024
91ac118
remove espaloma train settings and add sampler settings to config.toml
kntkb Feb 15, 2024
08f4319
use local variables instead of instance attributes
kntkb Feb 15, 2024
cef6497
add train_weight method
kntkb Feb 15, 2024
d72a531
first commit test_app_train_sampler.py
kntkb Feb 15, 2024
ab0a251
fix python to 3.11 and add pytest-cov in test_env.yaml
kntkb Feb 16, 2024
ed42b0c
add space to indent lines
kntkb Feb 16, 2024
2918568
Merge branch 'main' into feature/train_with_reweight
kntkb Feb 16, 2024
f546b7f
add space to indent line
kntkb Feb 16, 2024
628440f
convert unit string into openmm unit
kntkb Feb 19, 2024
245ab2c
create sampler system from toml file
kntkb Feb 19, 2024
e4ecdf4
allow creating multiple sampler systems
kntkb Feb 19, 2024
04145ff
rename pdbfixer_min.pdb to target.pdb for nucleoside systems
kntkb Feb 19, 2024
1ada151
add reference fig, table information
kntkb Feb 19, 2024
9d9bbe2
change output file name from pdbfixer_min.pdb to target.pdb
kntkb Feb 19, 2024
f337f55
replace module.py with reweight.py
kntkb Feb 21, 2024
585a685
rename experiment.py to analysis.py
kntkb Feb 21, 2024
1e5b510
add simulation settings to config
kntkb Feb 21, 2024
cf39f68
remove constraint and nonbonded method from instance variable
kntkb Feb 21, 2024
eb234c0
improve running sampler during espaloma training
kntkb Feb 21, 2024
56570cb
update usage in README.md
kntkb Feb 21, 2024
7db4dba
create new samplers during espaloma training
kntkb Feb 21, 2024
f7dbd7a
remove old comment
kntkb Feb 21, 2024
c6b0c3e
rename test_app_experiment.py to test_app_analysis.py
kntkb Feb 22, 2024
7d67a3b
fix minor bug to pass all tests
kntkb Feb 22, 2024
cbcd343
run sampler using on-the-fly espaloma model created during training
kntkb Feb 23, 2024
ff04682
export loss per epoch to reporter.log
kntkb Feb 26, 2024
43f7e99
use j-coupling names for keyname
kntkb Feb 28, 2024
bdb1433
add support to compute sampler loss
kntkb Feb 28, 2024
095c873
refactor reweight.py
kntkb Mar 4, 2024
09565a9
fix SetupSampler.from_toml when args is not defined
kntkb Mar 5, 2024
f70140f
add pseudo code for get_effective_sampler_size
kntkb Mar 5, 2024
52d32b5
clean up
kntkb Mar 5, 2024
6a0f43d
fix docstring
kntkb Mar 6, 2024
fd052d5
remove old comment
kntkb Mar 6, 2024
3ad0aa9
uncomment version
kntkb Mar 6, 2024
19f442f
support minima conformer for each chunk when reshaping graphs
kntkb Mar 6, 2024
252012e
fix report_loss to properly append loss data at each epoch
kntkb Mar 6, 2024
5bf8717
update docstring for reshape_conformation_size
kntkb Mar 6, 2024
8d2871f
update docstring in CustomGraphDataset
kntkb Mar 6, 2024
ee1503e
fix bug in training when restarting from existing checkpoint file
kntkb Mar 6, 2024
c757c73
check output files after running test in test_train_sampler
kntkb Mar 6, 2024
72946c4
minor update
kntkb Mar 6, 2024
9b9b963
remove mockcode.ipynb
kntkb Mar 6, 2024
58036dd
average q_ref if different isomeric smiles are found in merging nonis…
kntkb Mar 7, 2024
5be247c
add testsystems/nucleoside/target.pdb
kntkb Mar 7, 2024
d53595a
compare sorted number of conformers to prevent github CI failure
kntkb Mar 8, 2024
41555f2
add try-except to prevent UnassignedProperTorsionParameterException
kntkb Mar 8, 2024
7164253
deprecate isomeric=False in drop_and_merge_duplicates
kntkb Mar 8, 2024
7490db5
reflect changes in drop_duplicates in graphs.py
kntkb Mar 9, 2024
1c84410
use nonisomeric smiles to detect unique molecules
kntkb Mar 9, 2024
cf5a36d
fix bug in averaging partial charges for duplicate entries
kntkb Mar 10, 2024
0963d0f
fix drop_and_merge_duplicates to drop_duplicates
kntkb Mar 12, 2024
fb97040
fix drop_and_merge_duplicates to drop_duplicates
kntkb Mar 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 28 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,10 +28,26 @@ Infrastruture to train espaloma with experimental observables


### Quick Usage
```python
from espfit.utils.graphs import CustomGraphDataset
path = 'espfit/data/qcdata/openff-toolkit-0.10.6/dgl2/protein-torsion-sm/'
ds = CustomGraphDataset.load(path)
ds.reshape_conformation_size(n_confs=50, include_min_energy_conf=True)
ds.compute_relative_energy()
# Create esplama model
from espfit.app.train import EspalomaModel
filename = 'espfit/data/config/config.toml'
# Override training settings in config.toml
kwargs = {'output_directory_path': 'checkpoints', 'epochs': 100}
model = EspalomaModel.from_toml(filename, **kwargs)
model.dataset_train = ds
# Set sampler settings
model.train_sampler(sampler_patience=800, neff_threshold=0.2, sampler_weight=1)
```

### Standalone Usage
#### Change logging
```python
# load dgl graph data
from espfit.utils import logging
logging.get_logging_level()
#>'INFO'
Expand All @@ -51,12 +67,11 @@ from espfit.app.train import EspalomaModel
filename = 'espfit/data/config/config.toml'
model = EspalomaModel.from_toml(filename)
model.dataset_train = ds
# Train
model.train(output_directory_path='path/to/output')
# To extend training, update the `epoch` in config.toml
# Alternatively, do the following:
model.config['espaloma']['train']['epochs'] = 50
model.train(output_directory_path='path/to/output')
# Change default training settings
model.epochs = 100
model.output_directory_path = 'path/to/output'
# Train (default output directory is current path)
model.train()
```

#### Standard MD (default: espaloma-0.3.2 force field for solute molecules)
Expand All @@ -66,8 +81,10 @@ from espfit.app.sampler import SetupSampler
c = SetupSampler()
filename = 'espfit/data/target/testsystems/nucleoside/pdbfixer_min.pdb'
c.create_system(biopolymer_file=filename)
c.minimize(maxIterations=10)
c.run(nsteps=10, output_directory_path='path/to/output')
c.minimize()
# Change default settings
c.nsteps = 1000
c.run()
# Export to XML
c.export_xml(exportSystem=True, exportState=True, exportIntegrator=True, output_directory_path='path/to/output')
```
Expand All @@ -76,7 +93,8 @@ c.export_xml(exportSystem=True, exportState=True, exportIntegrator=True, output_
```python
from espfit.app.sampler import SetupSampler
c = SetupSampler.from_xml(input_directory_path='path/to/input')
c.run(nsteps=10, output_directory_path='path/to/output')
c.nsteps = 1000
c.run()
```

#### Compute RNA J-couplings from MD trajectory
Expand Down
4 changes: 2 additions & 2 deletions devtools/conda-envs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@

>conda activate espfit
>conda env export --from-history > test_env.yaml
>conda env create -f test_env.yaml -n test_env
>conda env create -f test_env.yaml -n test_env
>#uninstall openff-toolkit and install a customized version to support dgl graphs created using openff-toolkit=0.10.6
>conda uninstall --force openff-toolkit
>pip install git+https://github.com/kntkb/openff-toolkit.git@7e9d0225782ef723083407a1cbf1f4f70631f934
>#uninstall openmmforcefields if < 0.12.0
>#uninstall openmmforcefields if < 0.12.0
>#use pip instead of mamba to avoid dependency issues with ambertools and python
>conda uninstall --force openmmforcefields
>pip install git+https://github.com/openmm/[email protected]
Expand Down
2 changes: 1 addition & 1 deletion espfit/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,4 +25,4 @@
#from .espfit import *


#from ._version import __version__
from ._version import __version__
103 changes: 69 additions & 34 deletions espfit/app/experiment.py → espfit/app/analysis.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,5 @@
"""
Compute experimental observables from MD simulations.

Notes
-----

TODO
----
"""
import os
import numpy as np
Expand All @@ -17,20 +11,19 @@
class BaseDataLoader(object):
"""Base class for data loader.

TODO
----
* Add more methods to check trajectory information (e.g. number of frames, number of atoms, etc.)

Methods
-------
load_traj(reference_pdb='solvated.pdb', trajectory_netcdf='traj.nc', atom_indices=None, stride=1):
Load MD trajectory.
"""
def __init__(self, input_directory_path=None, output_directory_path=None):
def __init__(self, atomSubset='solute', input_directory_path=None, output_directory_path=None):
"""Initialize base data loader object.

Parameters
----------
atomSubset : str, default='solute'
Subset of atoms to save. Default is 'solute'. Other options 'all' and 'not water'.

input_directory_path : str, optional
Input directory path. Default is None.
If None, the current working directory will be used.
Expand All @@ -39,6 +32,10 @@ def __init__(self, input_directory_path=None, output_directory_path=None):
Output directory path. Default is None.
If None, the current working directory will be used.
"""
self.atomSubset = atomSubset
if self.atomSubset not in ['solute', 'all', 'not water']:
raise ValueError(f"Invalid atomSubset: {self.atomSubset}. Expected 'solute', 'all', or 'not water'.")

if input_directory_path is None:
input_directory_path = os.getcwd()
if output_directory_path is None:
Expand All @@ -61,8 +58,7 @@ def output_directory_path(self, value):
os.makedirs(value, exist_ok=True)


# Should this be a classmethod?
def load_traj(self, reference_pdb='solvated.pdb', trajectory_netcdf='traj.nc', atom_indices=None, stride=1, input_directory_path=None):
def load_traj(self, reference_pdb='solvated.pdb', trajectory_netcdf='traj.nc', stride=1, input_directory_path=None):
"""Load MD trajectory.

Parameters
Expand All @@ -73,16 +69,16 @@ def load_traj(self, reference_pdb='solvated.pdb', trajectory_netcdf='traj.nc', a
trajectory_netcdf : str, optional
Trajectory netcdf file name. Default is 'traj.nc'.

atom_indices : list, optional
List of atom indices to load from trajectory. Default is None.
If None, all atoms will be loaded.

stride : int, optional
Stride to load the trajectory. Default is 1.

input_directory_path : str, optional
Input directory path. Default is None.
If None, the current working directory will be used.

Returns
-------
None
"""
import mdtraj

Expand All @@ -92,22 +88,31 @@ def load_traj(self, reference_pdb='solvated.pdb', trajectory_netcdf='traj.nc', a
# Load reference pdb (solvated system)
pdb = os.path.join(self.input_directory_path, reference_pdb)
ref_traj = mdtraj.load(pdb)

# Select atoms to load from trajectory
if atom_indices is None:
if self.atomSubset == 'all':
self.atom_indices = None
self.ref_traj = ref_traj
else:
self.atom_indices = []
mdtop = ref_traj.topology
res = [ r for r in mdtop.residues if r.name not in ('HOH', 'NA', 'CL', 'K') ]
if self.atomSubset == 'solute':
res = [ r for r in mdtop.residues if r.name not in ('HOH', 'NA', 'CL', 'K') ]
elif self.atomSubset == 'not water':
res = [ r for r in mdtop.residues if r.name not in ('HOH') ]
# Get atom indices
for r in res:
for a in r.atoms:
self.atom_indices.append(a.index)
else:
self.atom_indices = atom_indices
self.ref_traj = ref_traj.atom_slice(self.atom_indices)

self.ref_traj = ref_traj.atom_slice(self.atom_indices)

# Load trajectory
netcdf = os.path.join(self.input_directory_path, trajectory_netcdf)
traj = mdtraj.load(netcdf, top=self.ref_traj.topology, stride=stride)
self.traj = traj.atom_slice(self.atom_indices)
if self.atom_indices:
self.traj = traj.atom_slice(self.atom_indices)
else:
self.traj = traj


class RNASystem(BaseDataLoader):
Expand Down Expand Up @@ -170,15 +175,14 @@ def radian_to_degree(self, a):
return a


def compute_jcouplings(self, couplings=None, residues=None):
def compute_jcouplings(self, weights=None, couplings=None, residues=None):
"""Compute J-couplings from MD trajectory.

TODO
----
* Compute confidence interval.

Parameters
----------
weights : numpy.ndarray, optional
Weights to compute the J-couplings. Default is None.

couplings : str, optional
Name of the couplings to compute. Default is None.
If a list of couplings to be chosen from [H1H2,H2H3,H3H4,1H5P,2H5P,C4Pb,1H5H4,2H5H4,H3P,C4Pe,H1C2/4,H1C6/8]
Expand All @@ -204,7 +208,7 @@ def compute_jcouplings(self, couplings=None, residues=None):
"""
import barnaba as bb

_logger.info("Computing J-couplings from MD trajectory...")
_logger.info("Compute J-couplings from MD trajectory")

if couplings is not None:
# Check if the provided coupling names are valid
Expand All @@ -216,15 +220,32 @@ def compute_jcouplings(self, couplings=None, residues=None):
# residue_list: list of M nucleobases
values, resname_list = bb.jcouplings_traj(self.traj, couplings=couplings, residues=residues)

# Convert numpy.float to float to avoid serialization issues
replace_nan_with_none = lambda x: None if np.isscalar(x) and np.isnan(x) else x.item()

# Loop over residues and couplings to store the computed values
coupling_dict = dict()
for i, resname in enumerate(resname_list):
_values = values[:,i,:] # Coupling values of i-th residue
values_by_names = dict()
for j, coupling_name in enumerate(couplings):
avg = _values[:,j].mean() # Mean value of H1H2 coupling of i-th residue
std = _values[:,j].std() # Standard deviation of H1H2 coupling of i-th residue
values_by_names[coupling_name] = {'avg': avg, 'std': std}
avg_raw = np.round(_values[:,j].mean(), 5) # e.g. mean value of H1H2 coupling of i-th residue
std_raw = np.round(_values[:,j].std(), 5) # e.g. standard deviation of H1H2 coupling of i-th residue
avg_raw = replace_nan_with_none(avg_raw)
std_raw = replace_nan_with_none(std_raw)
if weights is not None:
arr = _values[:,j] * weights
#_logger.info(f'non-weighted: {_values[:,j]}')
#_logger.info(f'weights: {weights}')
#_logger.info(f'weighted: {arr}')
avg = np.round(arr.mean(), 5)
std = np.round(arr.std(), 5)
avg = replace_nan_with_none(avg)
std = replace_nan_with_none(std)
else:
avg = avg_raw
std = std_raw
values_by_names[coupling_name] = {'avg': avg, 'std': std, 'avg_raw': avg_raw, 'std_raw': std_raw}
coupling_dict[resname] = values_by_names

return coupling_dict
Expand All @@ -242,4 +263,18 @@ def get_available_couplings(self):
import barnaba as bb
available_coupling_names = list(bb.definitions.couplings_idx.keys())
return available_coupling_names



#
# Future work?
#
class ProteinSystem(BaseDataLoader):
def __init__(self, **kwargs):
super(ProteinSystem, self).__init__(**kwargs)
raise NotImplementedError("ProteinSystem class is not implemented yet.")


class ProteinLigandSystem(BaseDataLoader):
def __init__(self, **kwargs):
super(ProteinLigandSystem, self).__init__(**kwargs)
raise NotImplementedError("ProteinLigandSystem class is not implemented yet.")
Loading
Loading