rewrite the guide for contributing OPs
Han Wang committed Feb 27, 2022
1 parent db3f12a commit 77328c6
Showing 2 changed files with 200 additions and 45 deletions.
15 changes: 7 additions & 8 deletions docs/developer.md
@@ -33,27 +33,27 @@ The workflow of DPGEN2 is illustrated in the following figure

![dpgen flowchart](./figs/dpgen-flowchart.jpg)

In the center is the `block` operator, which is a super-OP (an OP composed of several OPs) for one DP-GEN iteration, i.e. the super-OP of the training, exploration, selection, and labeling steps. The inputs of the `block` OP are `lmp_task_group`, `conf_selector` and `dataset`.
- `lmp_task_group`: definition of a group of LAMMPS tasks that explore the configuration space.
- `conf_selector`: defines the rule by which the configurations are selected for labeling.
- `dataset`: the training dataset.

The outputs of the `block` OP are
- `exploration_report`: a report recording the result of the exploration, for example, how many configurations are accurate enough and how many are selected as candidates for labeling.
- `dataset_incr`: the increment of the training dataset.

The `dataset_incr` is added to the training `dataset`.

The `exploration_report` is passed to the `exploration_strategy` OP. The `exploration_strategy` implements the strategy of exploration. It reads the `exploration_report` generated by each iteration (`block`), then tells whether the iteration has converged. If not, it generates a group of LAMMPS tasks (`lmp_task_group`) and the criteria for selecting configurations (`conf_selector`). The `lmp_task_group` and `conf_selector` are then used by the `block` of the next iteration. This closes the iteration.
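
For orientation, the data flow described above can be summarized by the following minimal, conceptual sketch; all names here are hypothetical placeholders rather than the actual DPGEN2 API.

```python
# Conceptual sketch of the DP-GEN iteration loop described above.
# All names are hypothetical placeholders, not the DPGEN2 API.
def dpgen_loop(exploration_strategy, dataset):
    lmp_task_group, conf_selector = exploration_strategy.initial_plan()
    while True:
        # one DP-GEN iteration: train, explore, select, label
        exploration_report, dataset_incr = block(lmp_task_group, conf_selector, dataset)
        dataset = dataset + dataset_incr
        if exploration_strategy.converged(exploration_report):
            break
        # plan the exploration of the next iteration
        lmp_task_group, conf_selector = exploration_strategy.plan_next(exploration_report)
    return dataset
```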

### Inside the `block` operator

The inside of the super-OP `block` is displayed on the right-hand side of the figure. It contains the following steps to finish one DP-GEN iteration (a conceptual sketch follows the list):
- `prep_run_dp_train`: prepares training tasks of DP models and runs them.
- `prep_run_lmp`: prepares the LAMMPS exploration tasks and runs them.
- `select_confs`: selects configurations for labeling from the explored configurations.
- `prep_run_fp`: prepares and runs first-principles tasks.
- `collect_data`: collects the `dataset_incr` and adds it to the `dataset`.
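
As mentioned above, one pass through `block` chains these five steps; a minimal conceptual sketch (hypothetical function names, see `dpgen2/flow/block.py` for the real implementation) is:

```python
# Conceptual sketch of one DP-GEN iteration inside `block`.
# Hypothetical function names; the real sub-workflow lives in dpgen2/flow/block.py.
def block(lmp_task_group, conf_selector, dataset):
    models = prep_run_dp_train(dataset)                      # train DP models
    trajs = prep_run_lmp(lmp_task_group, models)             # explore with LAMMPS
    report, candidates = select_confs(conf_selector, trajs)  # pick configurations to label
    labeled_data = prep_run_fp(candidates)                   # first-principles labeling
    dataset_incr = collect_data(labeled_data)                # increment of the training dataset
    return report, dataset_incr
```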


### The exploration strategy
@@ -74,8 +74,7 @@ Some concepts are explained below:

Anyone interested in the DPGEN2 project may contribute in two aspects: operators and workflows.

- To contribute OPs, one may check the [guide on writing operators](./operator.md)

- To contribute workflows, one may take the DP-GEN workflow as an example. It is implemented in [dpgen2/flow/loop.py](https://github.com/wanghan-iapcm/dpgen2/blob/master/dpgen2/flow/loop.py) and tested with all operators mocked in [tests/test_loop.py](https://github.com/wanghan-iapcm/dpgen2/blob/master/tests/test_loop.py)

The sub-workflow in `block` is implemented in [dpgen2/flow/block.py](https://github.com/wanghan-iapcm/dpgen2/blob/master/dpgen2/flow/block.py) and tested with all operators mocked in [tests/test_block_cl.py](https://github.com/wanghan-iapcm/dpgen2/blob/master/tests/test_block_cl.py)
230 changes: 193 additions & 37 deletions docs/operator.md
@@ -1,69 +1,225 @@

# Operators

The operators are the building blocks of the workflow.
There are two types of OPs in DPGEN2:

- [OP](#the-op-rundptrain). An execution unit of the workflow. It can be roughly viewed as a piece of Python script that takes some inputs and gives some outputs. An OP cannot be used in `dflow` until it is embedded in a super-OP.
- [Super-OP](#the-super-op-preprundptrain). An execution unit that is composed of one or more OPs and/or super-OPs.

Technically, an OP is a Python class derived from [`dflow.python.OP`](https://github.com/dptech-corp/dflow/blob/master/README.md#13--interface-layer). It serves as the `PythonOPTemplate` of a `dflow.Step`.

A super-OP is a Python class derived from `dflow.Steps`. It contains `dflow.Step`s as building blocks, and can be used as an OP template to generate a `dflow.Step`. For an explanation of the concepts `dflow.Step` and `dflow.Steps`, one may refer to the [manual of dflow](https://github.com/dptech-corp/dflow/blob/master/README.md#123--workflow).

## The super-OP `PrepRunDPTrain`

In the following we will take the `PrepRunDPTrain` super-OP as an example to illustrate how to write OPs in DPGEN2.

`PrepRunDPTrain` is a super-OP that prepares several DeePMD-kit training tasks and submits all of them. This super-OP is composed of two `dflow.Step`s built from the `dflow.python.OP`s `PrepDPTrain` and `RunDPTrain`.

```python
from dflow import (
    Step,
    Steps,
)
from dflow.python import (
    PythonOPTemplate,
    OP,
    Slices,
)

class PrepRunDPTrain(Steps):
    def __init__(
            self,
            name : str,
            prep_train_op : OP,
            run_train_op : OP,
            prep_train_image : str = "dflow:v1.0",
            run_train_image : str = "dflow:v1.0",
    ):
        ...
        self = _prep_run_dp_train(
            self,
            self.step_keys,
            prep_train_op,
            run_train_op,
            prep_train_image = prep_train_image,
            run_train_image = run_train_image,
        )
```
The constructor of `PrepRunDPTrain` takes the prepare-training `OP`, the run-training `OP` and their docker images as input, and is implemented in the internal method `_prep_run_dp_train`.
```python
def _prep_run_dp_train(
        train_steps,
        step_keys,
        prep_train_op : OP = PrepDPTrain,
        run_train_op : OP = RunDPTrain,
        prep_train_image : str = "dflow:v1.0",
        run_train_image : str = "dflow:v1.0",
):
    prep_train = Step(
        ...
        template=PythonOPTemplate(
            prep_train_op,
            image=prep_train_image,
            ...
        ),
        ...
    )
    train_steps.add(prep_train)

    run_train = Step(
        ...
        template=PythonOPTemplate(
            run_train_op,
            image=run_train_image,
            ...
        ),
        ...
    )
    train_steps.add(run_train)

    train_steps.outputs.artifacts["scripts"]._from = run_train.outputs.artifacts["script"]
    train_steps.outputs.artifacts["models"]._from = run_train.outputs.artifacts["model"]
    train_steps.outputs.artifacts["logs"]._from = run_train.outputs.artifacts["log"]
    train_steps.outputs.artifacts["lcurves"]._from = run_train.outputs.artifacts["lcurve"]

    return train_steps
```

In `_prep_run_dp_train`, two instances of `dflow.Step`, i.e. `prep_train` and `run_train`, generated from `prep_train_op` and `run_train_op`, respectively, are added to `train_steps`. Both `prep_train_op` and `run_train_op` are OPs (Python classes derived from `dflow.python.OP`) that will be illustrated later. `train_steps` is an instance of `dflow.Steps`. The outputs of the second OP `run_train` are assigned to the outputs of `train_steps`.

The `prep_train` prepares a list of paths, each of which contains all necessary files to start a DeePMD-kit training task.

The `run_train` slices the list of paths, and assigns each item in the list to a DeePMD-kit task. The task is executed by `run_train_op`. This is a very nice feature of `dflow`: the developer only needs to implement how one DeePMD-kit task is executed, and then all the items in the task list will be executed [in parallel](https://github.com/dptech-corp/dflow/blob/master/README.md#315-produce-parallel-steps-using-loop). The following code shows how it works:
```python
run_train = Step(
    'run-train',
    template=PythonOPTemplate(
        run_train_op,
        image=run_train_image,
        slices = Slices(
            "int('{{item}}')",
            input_parameter = ["task_name"],
            input_artifact = ["task_path", "init_model"],
            output_artifact = ["model", "lcurve", "log", "script"],
        ),
    ),
    parameters={
        "config" : train_steps.inputs.parameters["train_config"],
        "task_name" : prep_train.outputs.parameters["task_names"],
    },
    artifacts={
        'task_path' : prep_train.outputs.artifacts['task_paths'],
        "init_model" : train_steps.inputs.artifacts['init_models'],
        "init_data": train_steps.inputs.artifacts['init_data'],
        "iter_data": train_steps.inputs.artifacts['iter_data'],
    },
    with_sequence=argo_sequence(argo_len(prep_train.outputs.parameters["task_names"]), format=train_index_pattern),
    key = step_keys['run-train'],
)
```
The input parameter `"task_names"` and artifacts `"task_paths"` and `"init_model"` are sliced and supplied to each DeePMD-kit task. The output artifacts of the tasks (`"model"`, `"lcurve"`, `"log"` and `"script"`) are stacked in the same order as the input lists. These lists are assigned as the outputs of `train_steps` by
```python
train_steps.outputs.artifacts["scripts"]._from = run_train.outputs.artifacts["script"]
train_steps.outputs.artifacts["models"]._from = run_train.outputs.artifacts["model"]
train_steps.outputs.artifacts["logs"]._from = run_train.outputs.artifacts["log"]
train_steps.outputs.artifacts["lcurves"]._from = run_train.outputs.artifacts["lcurve"]
```
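
To close the loop on how a super-OP is used: since `PrepRunDPTrain` is a `dflow.Steps`, it can itself serve as the template of a `dflow.Step` in a larger workflow. The following is a hedged usage sketch; the file paths and the surrounding workflow are hypothetical, and the actual inputs of `PrepRunDPTrain` may be richer than the parameters and artifacts shown here.

```python
from dflow import Step, Workflow, upload_artifact
# PrepDPTrain, RunDPTrain and PrepRunDPTrain are the classes discussed in this guide;
# their imports from dpgen2 are elided here.

# Hedged usage sketch: embed the super-OP as the template of a Step.
prep_run_dp_train = PrepRunDPTrain(
    name = "prep-run-dp-train",
    prep_train_op = PrepDPTrain,
    run_train_op = RunDPTrain,
    prep_train_image = "dflow:v1.0",
    run_train_image = "dflow:v1.0",
)

train_step = Step(
    "dp-train",
    template = prep_run_dp_train,
    parameters = {
        "train_config" : {},                                    # hypothetical configuration
    },
    artifacts = {
        "init_models" : upload_artifact(["init_model.000.pb"]), # hypothetical files
        "init_data" : upload_artifact(["init_data"]),
        "iter_data" : upload_artifact(["iter_data"]),
    },
)

wf = Workflow(name = "dp-train-example")
wf.add(train_step)
# wf.submit()
```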


## The OP `RunDPTrain`

We will take `RunDPTrain` as an example to illustrate how to implement an OP in DPGEN2.
The source code of this OP can be found [here](https://github.com/wanghan-iapcm/dpgen2/blob/master/dpgen2/op/run_dp_train.py).

First of all, an OP should be implemented as a derived class of `dflow.python.OP`.

The `dflow.python.OP` requires static type definitions for the input and output variables, i.e. the signatures of an OP. The input and output signatures of a `dflow.python.OP` are given by the `classmethod`s `get_input_sign` and `get_output_sign`.

```python
from dflow.python import (
    OP,
    OPIO,
    OPIOSign,
    Artifact,
)

class RunDPTrain(OP):
    @classmethod
    def get_input_sign(cls):
        return OPIOSign({
            "config" : dict,
            "task_name" : str,
            "task_path" : Artifact(Path),
            "init_model" : Artifact(Path),
            "init_data" : Artifact(List[Path]),
            "iter_data" : Artifact(List[Path]),
        })

    @classmethod
    def get_output_sign(cls):
        return OPIOSign({
            "script" : Artifact(Path),
            "model" : Artifact(Path),
            "lcurve" : Artifact(Path),
            "log" : Artifact(Path),
        })
```

All items not defined as `Artifact` are treated as parameters of the `OP`. The concepts of parameter and artifact are explained in the [dflow document](https://github.com/dptech-corp/dflow/blob/master/README.md#Parametersandartifacts). In short, an artifact can be a `pathlib.Path` or a list of `pathlib.Path`, and artifacts are passed via the file system. Other data structures are treated as parameters; they are passed as variables encoded in `str`. Therefore, any large amount of information should be stored in artifacts, while small pieces of control data can be passed as parameters.
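
As a hedged illustration of this rule of thumb, the following made-up OP (not part of DPGEN2) keeps small control data in parameters and file-like data in artifacts:

```python
from pathlib import Path
from typing import List

from dflow.python import OP, OPIO, OPIOSign, Artifact

# A made-up OP, only to illustrate the parameter/artifact split.
class ExampleOP(OP):
    @classmethod
    def get_input_sign(cls):
        return OPIOSign({
            "config" : dict,                      # small control data -> parameter
            "task_name" : str,                    # small control data -> parameter
            "data_files" : Artifact(List[Path]),  # bulky files -> artifact
        })

    @classmethod
    def get_output_sign(cls):
        return OPIOSign({
            "summary" : str,                      # parameter
            "result" : Artifact(Path),            # artifact, passed via the file system
        })

    @OP.exec_sign_check
    def execute(self, ip: OPIO) -> OPIO:
        out_file = Path("result.txt")
        out_file.write_text(ip["task_name"])
        return OPIO({
            "summary": f"processed {len(ip['data_files'])} files",
            "result": out_file,
        })
```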

The operation of the `OP` is implemented in the method `execute`, which is run inside a docker container. Again, take the `execute` method of `RunDPTrain` as an example:

```python
@OP.exec_sign_check
def execute(
        self,
        ip : OPIO,
) -> OPIO:
    ...
    task_name = ip['task_name']
    task_path = ip['task_path']
    init_model = ip['init_model']
    init_data = ip['init_data']
    iter_data = ip['iter_data']

    work_dir = Path(task_name)
    ...
    # here copy all files in task_path to work_dir
    ...
    with set_directory(work_dir):
        fplog = open('train.log', 'w')
        def clean_before_quit():
            fplog.close()
        # train model
        command = ['dp', 'train', train_script_name]
        ret, out, err = run_command(command)
        if ret != 0:
            clean_before_quit()
            raise FatalError('dp train failed')
        fplog.write(out)
        # freeze model
        ret, out, err = run_command(['dp', 'freeze', '-o', 'frozen_model.pb'])
        if ret != 0:
            clean_before_quit()
            raise FatalError('dp freeze failed')
        fplog.write(out)
        clean_before_quit()

    return OPIO({
        "script" : work_dir / train_script_name,
        "model" : work_dir / "frozen_model.pb",
        "lcurve" : work_dir / "lcurve.out",
        "log" : work_dir / "train.log",
    })
```

The input and output variables are recorded in the data structure `dflow.python.OPIO`, which is initialized by a Python `dict`. The keys in the input/output `dict` and the types of the input/output variables are checked against their signatures by the decorator `OP.exec_sign_check`. If any key or type does not match, an exception is raised.
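
For instance, the input `OPIO` fed to `RunDPTrain.execute` would look like the following sketch (the paths are hypothetical); a missing key or a mismatched type would trigger an error from `OP.exec_sign_check`. This also makes it straightforward to exercise an OP directly in unit tests, as is done with the mocked operators mentioned in the developer guide.

```python
from pathlib import Path
from dflow.python import OPIO

# Hedged sketch: an input OPIO matching RunDPTrain's input signature.
# The paths are hypothetical placeholders.
op_in = OPIO({
    "config" : {},
    "task_name" : "task.000000",
    "task_path" : Path("task.000000"),
    "init_model" : Path("init_model.pb"),
    "init_data" : [Path("init_data")],
    "iter_data" : [Path("iter_data")],
})
```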

Note that all input artifacts of the `OP` are read-only. Therefore, the first step of `RunDPTrain.execute` is to copy all necessary input files from the directory `task_path`, prepared by `PrepDPTrain`, to the working directory `work_dir`.
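
The copy itself is elided in the excerpt above; under the assumption that `task_path` is a plain directory tree, a minimal way to do it would be:

```python
import shutil
from pathlib import Path

# Hedged sketch of the elided copy step (hypothetical paths): mirror the
# read-only task_path into a writable working directory.
task_name = "task.000000"
task_path = Path("prepared/task.000000")
work_dir = Path(task_name)
shutil.copytree(task_path, work_dir, dirs_exist_ok=True)
```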

The `set_directory` context manager creates the `work_dir` and switches to it before the execution, then exits the directory when the task finishes or an error is raised.
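
Such a context manager can be written in a few lines; the following is a hedged sketch of what `set_directory` roughly does, not the actual DPGEN2 implementation:

```python
import os
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def set_directory(path: Path):
    """Create `path` if needed, chdir into it, and always chdir back on exit."""
    cwd = Path.cwd()
    path.mkdir(parents=True, exist_ok=True)
    os.chdir(path)
    try:
        yield path
    finally:
        os.chdir(cwd)
```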

In what follows, the training and model freezing commands are executed consecutively. The return code is checked, and a `FatalError` is raised if a non-zero return code is detected.
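
Here `run_command` is assumed to be a thin wrapper around `subprocess` that returns the exit code together with the captured output; a hedged sketch of such a helper:

```python
import subprocess
from typing import List, Tuple

# Hedged sketch of a run_command-style helper returning (return_code, stdout, stderr).
# The actual DPGEN2 utility may differ in its details.
def run_command(cmd: List[str]) -> Tuple[int, str, str]:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr
```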

Finally, the trained model file, the input script, the learning curve file and the log file are recorded in a `dflow.python.OPIO` and returned.
