
Graphs_integration #161

Open · wants to merge 245 commits into base: main
Conversation

@EnricoTrizio (Collaborator) commented on Nov 13, 2024

General description

Add the code for CVs based on GNNs in the most organic way possible.
This largely inherits from Jintu's work (kudos! @jintuzhang), where all the code was based on a "library within the library".
Some functions were quite different from the rest of the code (e.g., all the code for GNNs and the GraphDataModule), others were mostly redundant (e.g., GraphDataset, CV base and specific classes).

It would be wise to reduce the code duplicates and redundancies and make the whole library more organic, while still including all the new functionalities.
SPOILER: this requires some thinking and some modifications here and there.

(We could also split the story into multiple PRs, in case.)


Point-by-point description

Data handling

Affecting --> mlcolvar.data, mlcolvar.utils.io

Overview

So far, we have a `DictDataset` (based on `torch.utils.data.Dataset`) and the corresponding `DictModule` (based on `lightning.LightningDataModule`).

For GNNs, there was a `GraphDataset` (based on lists) and the corresponding `GraphDataModule` (based on `lightning.LightningDataModule`).
Here, the data are handled using PyTorch Geometric for convenience.
There are also a bunch of auxiliary functions for neighborhoods and the handling of atom types, plus some utils to initialize the dataset easily from files.
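As a rough, self-contained illustration of what such a neighbor-list builder does (the actual code relies on matscipy and is certainly more efficient), a brute-force version might look like:

```python
from itertools import combinations

def neighbor_list(positions, cutoff):
    """Return directed (i, j) edge pairs for atoms closer than `cutoff`."""
    edges = []
    for i, j in combinations(range(len(positions)), 2):
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
        if d2 < cutoff ** 2:
            # graph datasets usually store both edge directions
            edges.extend([(i, j), (j, i)])
    return edges
```

For example, three collinear atoms at 0, 1, and 5 Å with a 2 Å cutoff yield only the edge between the first two atoms (in both directions).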

Implemented solution

The two things are merged:

  1. A single `DictDataset` that can handle both types of data.
  • It also has a `metadata` attribute that stores general properties in a dict (e.g., `cutoff` and `atom_types`).
  • In the `__init__`, the user can specify the `data_type` (either `descriptors` (default) or `graphs`). This is then stored in `metadata` and is used in the `DictModule` to handle the data the right way (see below).
  • New utils have been added in `mlcolvar.data.utils`: `save_dataset`, `load_dataset`, and `save_dataset_configurations_as_extyz`.
  2. A single `DictModule` that can handle both types of data. Depending on the `metadata['data_type']` of the incoming dataset, it uses either our `DictLoader` or the `torch_geometric.DataLoader`.
  3. A new submodule `data.graph` containing:
  • `atomic.py` for the handling of atomic quantities, based on the data class `Configuration`
  • `neighborhood.py` for building neighbor lists using matscipy
  • `utils.py` to frame `Configuration`s into datasets and one-hot embeddings. It also contains `create_test_graph_input`, as creating inputs for testing here requires several lines of code.
  4. A new `create_dataset_from_trajectories` util in `mlcolvar.utils.io` that allows creating a dataset directly from trajectory files, providing topology files and using mdtraj.
  5. A single `create_timelagged_dataset` that can also create the time-lagged dataset starting from a `DictDataset` with `data_type == 'graphs'`.

NB For the graph datasets, the keys are the original ones:

  • `data_list`: all the graph data, e.g., edge src and dst, batch index... (this goes in `DictDataset`)
  • `z_table`: atomic numbers map (this goes in `DictDataset.metadata`)
  • `cutoff`: cutoff used in the graph (this goes in `DictDataset.metadata`)
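To make the metadata-based dispatch concrete, here is a minimal stand-in sketch (plain Python, not the actual mlcolvar classes) of how a `DictModule`-like object could pick a loader from `metadata['data_type']`:

```python
class DictDataset:
    """Toy stand-in: data plus a metadata dict holding general properties."""
    def __init__(self, data, data_type="descriptors", metadata=None):
        if data_type not in ("descriptors", "graphs"):
            raise ValueError(f"unknown data_type: {data_type!r}")
        self.data = data
        # cutoff, z_table, etc. travel with the dataset in `metadata`
        self.metadata = {"data_type": data_type, **(metadata or {})}

def choose_loader(dataset):
    """Toy stand-in for the DictModule dispatch on metadata['data_type']."""
    if dataset.metadata["data_type"] == "graphs":
        return "torch_geometric.DataLoader"
    return "DictLoader"
```

This way the user-facing API stays a single dataset/module pair, and the graph-specific machinery is an internal detail.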

Questions

  • Shall we keep these names for the keys?
  • Do we like the metadata thing?
  • Single DataModule?
  • Maybe make the overall structure smoother? i.e., not too many `utils.py` here and there and not too many submodules?

GNN models

Affecting --> mlcolvar.core.nn

Overview

Of course, they need to be implemented 😄 but we can inherit most of the code from Jintu.
As an overview, there is a BaseGNN parent class that implements the common features, and then each model (e.g., SchNet or GVP) is implemented on top of that.
There is also a radial.py that implements a bunch of tools for radial embeddings.
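As a generic illustration of the idea behind radial embeddings (not the code in `radial.py`), a Gaussian basis expansion of an interatomic distance can be sketched as:

```python
import math

def gaussian_rbf(r, cutoff, n_basis=8, sigma=None):
    """Embed a scalar distance r into n_basis Gaussian radial features."""
    # centers are spread uniformly from 0 to the cutoff
    centers = [i * cutoff / (n_basis - 1) for i in range(n_basis)]
    sigma = sigma if sigma is not None else cutoff / (n_basis - 1)
    return [math.exp(-((r - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
```

Each distance becomes a smooth, fixed-length feature vector that the message-passing layers can consume.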

Implemented solution

The GNN code is now implemented in mlcolvar.core.nn.graph.

  1. There is a `BaseGNN` class that serves as a template for the architecture-specific code. For example, it already has the methods for embedding edges and setting some common properties.
  2. The `Radial` module implements the tools for radial embeddings.
  3. The `SchNetModel` and `GVPModel` are implemented on top of `BaseGNN`.
  4. In `utils.py`, there is a function that creates data for the tests of this module. This should be replaced with the very similar function `mlcolvar.data.graph.utils.create_test_graph_input`, which is more general and also used for other things.
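The template pattern in point 1 can be sketched as follows (class and method names here are illustrative, not the actual `BaseGNN` API):

```python
from abc import ABC, abstractmethod

class BaseGNN(ABC):
    """Shared template: common properties and edge embedding live here."""
    def __init__(self, n_out, cutoff):
        self.n_out = n_out
        self.cutoff = cutoff  # common property every architecture needs

    def embed_edge(self, r):
        # common edge embedding, here a simple linear cutoff weight
        return max(0.0, 1.0 - r / self.cutoff)

    @abstractmethod
    def forward(self, r):
        """Architecture-specific part, implemented by each model."""

class ToyModel(BaseGNN):
    def forward(self, r):
        # builds on the shared embedding, as SchNetModel/GVPModel would
        return self.n_out * self.embed_edge(r)
```

Each concrete architecture only has to fill in its message-passing logic, while cutoffs and embeddings are inherited.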

CV models

Affecting --> mlcolvar.cvs, mlcolvar.core.loss

Overview

In Jintu's implementation, all the CV classes we tested were re-implemented, still using the original loss function code.
The point there is that the initialization of the underlying ML model (also in the current version of the library) is performed within the CV class.
We did it this way to keep things simple, and indeed it is simple for feed-forward networks, as they have very few things to set (i.e., layers, nodes, activations), and also because there were no alternatives at the time.
For GNNs, however, the initialization can vary a lot (i.e., different architectures and many parameters one could set).

I am afraid we can't cut corners here if we want to include everything; somewhere we need to add an extra layer of complexity, either to the workflow or to the CV models.

Implemented solution

We keep everything similar to what it used to be in the library, except for:

  1. We rename the `layers` keyword to the more general `model` in the init of the CV classes, which can accept:
  • A list of integers, as before. It works like the old `layers` keyword and initializes a `FeedForward` with it and all the `DEFAULT_BLOCKS` (see point 2), e.g., for `DeepLDA`: `['norm_in', 'nn', 'lda']`.
  • A `mlcolvar.core.nn.FeedForward` or `mlcolvar.core.nn.graph.BaseGNN` model initialized outside the CV class. This overrides the old default: one provides an external model, and the `MODEL_BLOCKS` are used, e.g., for `DeepLDA`: `['nn', 'lda']`. For example, the initialization can look like this:
```python
# for GNN-based
gnn_model = SchNet(...)
model = DeepTDA(..., model=gnn_model, ...)

# for FFNN-based, alternative 1
model = DeepTDA(..., model=[2, 3], ...)

# for FFNN-based, alternative 2
ff_model = FeedForward(layers=[2, 3])
model = DeepTDA(..., model=ff_model, ...)
```
  2. The `BLOCKS` of each CV model are duplicated into `DEFAULT_BLOCKS` and `MODEL_BLOCKS` to account for the different behaviors. This was a simple way to initialize everything in all cases (maybe not the best one, see questions).
  3. In the training step, the only change is a different setup of the data depending on the type of ML model being used; the rest is basically the same as before.
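A minimal sketch of the dispatch implied by points 1–2 (the `BLOCKS` contents shown are only an example, borrowed from the `DeepLDA` case above):

```python
# BLOCKS contents are only illustrative (taken from the DeepLDA example)
DEFAULT_BLOCKS = ["norm_in", "nn", "lda"]  # `model` given as a list of ints
MODEL_BLOCKS = ["nn", "lda"]               # `model` given as a pre-built net

def resolve_blocks(model):
    """Pick the blocks depending on how `model` was passed to the CV init."""
    if isinstance(model, list) and all(isinstance(n, int) for n in model):
        # old-style `layers` list: a FeedForward is built internally
        return DEFAULT_BLOCKS
    # external FeedForward / BaseGNN instance: skip the default preprocessing
    return MODEL_BLOCKS
```

This keeps backward compatibility for the list-of-integers case while letting external models bypass blocks they handle themselves.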

Things to note

  1. All the loss functions are untouched! Except for the `CommittorLoss`, as it does not depend only on the output space but also on the derivatives w.r.t. the input/positions.
  2. When an external GNN model is provided, checkpoints and logs still do not work. I left these things for the very end of the PR, focusing on making things work first.
  3. Autoencoder-based CVs only raise a `NotImplementedError`, as we do not have a stable GNN-AE for now. As a consequence, the `MultiTaskCV` also does not support GNN models, since, in the way we intend it, it wouldn't make much sense without a GNN-based AE.

Questions

  • What shall we do with the BLOCKS? Is it worth it to keep this thing?

TODOs

  • Make logger and checkpointing work with graph models 🗡️
  • Add autoencoders (in the future)

Explain module

Affecting --> mlcolvar.explain

Overview

There are some new explain functions for GNN models that we should add to the explain module.

Possible solution

Include the GNN code as it is, possibly with some revision, into a `mlcolvar.explain.graph` module, or even without the submodule, since there are no overlaps here, I think.

Questions

  • Do we need to create a submodule of explain?

General todos

  • Check everything 😄
  • Fix dependencies
  • Fix and clean imports
  • Fix init files
  • Remove commented lines
  • Tests !!!
  • DOCS !!!
  • Multitask tests!
  • Change the CV-from-scratch tutorial; currently it is patched badly just to make the tests pass

General questions

  • How many new dependencies do we want to keep? Can we make some of them optional?

Status

  • Ready to go

Labels: enhancement, help wanted