From 7b2df5d43b4cec9bd630ace9ee6aee97a4a139b0 Mon Sep 17 00:00:00 2001
From: Jinzhe Zeng
Date: Tue, 14 Sep 2021 03:12:06 -0400
Subject: [PATCH] move horovod installation to the installation part

Now that the easy installation also contains Horovod, we may want to
discuss it separately.
---
 doc/install/easy-install.md        |  6 +++---
 doc/install/install-from-source.md | 16 ++++++++++++++++
 doc/train/parallel-training.md     | 13 -------------
 3 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/doc/install/easy-install.md b/doc/install/easy-install.md
index cb529acda3..55720b59e4 100644
--- a/doc/install/easy-install.md
+++ b/doc/install/easy-install.md
@@ -2,7 +2,7 @@

 There are various easy methods to install DeePMD-kit. Choose the one that you prefer. If you want to build it yourself, jump to the next two sections.

-After your easy installation, DeePMD-kit (`dp`) and LAMMPS (`lmp`) will be available to execute. You can try `dp -h` and `lmp -h` to see the help. `mpirun` is also available considering you may want to run LAMMPS in parallel.
+After your easy installation, DeePMD-kit (`dp`) and LAMMPS (`lmp`) will be available to execute. You can try `dp -h` and `lmp -h` to see the help. `mpirun` is also available considering you may want to train models or run LAMMPS in parallel.

 - [Install off-line packages](#install-off-line-packages)
 - [Install with conda](#install-with-conda)
@@ -27,13 +27,13 @@ conda create -n deepmd deepmd-kit=*=*cpu libdeepmd=*=*cpu lammps-dp -c https://c

 Or one may want to create a GPU environment containing [CUDA Toolkit](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver):
 ```bash
-conda create -n deepmd deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=11.3 -c https://conda.deepmodeling.org
+conda create -n deepmd deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=11.3 horovod -c https://conda.deepmodeling.org
 ```

 One could change the CUDA Toolkit version to `10.1` or `11.3`.
 One may specify the DeePMD-kit version, such as `2.0.0`, using
 ```bash
-conda create -n deepmd deepmd-kit=2.0.0=*cpu libdeepmd=2.0.0=*cpu lammps-dp=2.0.0 -c https://conda.deepmodeling.org
+conda create -n deepmd deepmd-kit=2.0.0=*cpu libdeepmd=2.0.0=*cpu lammps-dp=2.0.0 horovod -c https://conda.deepmodeling.org
 ```

 One may enable the environment using
diff --git a/doc/install/install-from-source.md b/doc/install/install-from-source.md
index b0e6f468b1..7f69427517 100644
--- a/doc/install/install-from-source.md
+++ b/doc/install/install-from-source.md
@@ -92,6 +92,22 @@ Valid subcommands:
     test        test the model
 ```

+### Install Horovod and mpi4py
+
+[Horovod](https://github.com/horovod/horovod) and [mpi4py](https://github.com/mpi4py/mpi4py) are used for parallel training. For better performance on GPU, please follow the tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
+```bash
+# With GPU, prefer NCCL as the communicator.
+HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install horovod mpi4py
+```
+
+If you work in a CPU environment, install as below:
+```bash
+# By default, MPI is used as the communicator.
+HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install horovod mpi4py
+```
+
+If Horovod is not installed, DeePMD-kit will fall back to serial mode.
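+
+As a quick check (assuming `horovodrun` is on your `PATH`), one may run Horovod's built-in diagnostic to confirm the build picked up TensorFlow and the intended communicator:
+```bash
+# Lists the frameworks (e.g. TensorFlow), controllers (MPI, Gloo), and
+# tensor operations (e.g. NCCL, MPI) compiled into this Horovod build.
+horovodrun --check-build
+```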
+

 ## Install the C++ interface

 If one does not need to use DeePMD-kit with LAMMPS or i-PI, then the Python interface installed in the previous section does everything, and one can safely skip this section.
diff --git a/doc/train/parallel-training.md b/doc/train/parallel-training.md
index d619569c8d..5609468a76 100644
--- a/doc/train/parallel-training.md
+++ b/doc/train/parallel-training.md
@@ -10,21 +10,8 @@ Testing `examples/water/se_e2_a` on a 8-GPU host, linear acceleration can be obs
 | 4 | 1.7635 | 56.71*4 | 3.29 |
 | 8 | 1.7267 | 57.91*8 | 6.72 |

-To experience this powerful feature, please intall Horovod and [mpi4py](https://github.com/mpi4py/mpi4py) first. For better performance on GPU, please follow tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
-```bash
-# With GPU, prefer NCCL as communicator.
-HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip3 install horovod mpi4py
-```
-
-If your work in CPU environment, please prepare runtime as below:
-```bash
-# By default, MPI is used as communicator.
-HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install horovod mpi4py
-```
-
 Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you launch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. Technical details of such a heuristic rule are discussed in [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).

-With dependencies installed, have a quick try!
 ```bash
 # Launch 4 processes on the same host
 CUDA_VISIBLE_DEVICES=4,5,6,7 horovodrun -np 4 \