Added RMSD optimalAlignment GPU implementation #859

Open
wants to merge 4 commits into base: master

Conversation

@zhang954527 commented Sep 6, 2022

Description

Added RMSD optimalAlignment ArrayFire-GPU implementation:

Preliminary implementation of the RMSD distance calculation, using ArrayFire to run the time-consuming parts on the GPU.

Code modifications

The following two time-consuming parts of the optimalAlignment core calculation module in the RMSD class are now computed on the GPU:

  1. compute centers
  2. compute second moments wrt centers

The reference coords, align, displace, and rr11 are transferred to the GPU and computed in advance, to avoid transfer overhead inside the calculation loop.

In addition, the following compulsory keyword and option are added to the RMSD CV action in plumed.dat:

  1. DEVICEID (default=0): Identifier of the GPU to be used.
  2. GPU (default=off): Calculate RMSD using ARRAYFIRE on an accelerator device.

An ArrayFire GPU implementation is also added to doCoreCalc and getDistance in the RMSDCoreData class; it is currently disabled by default. The parts specifically changed to compute on the GPU are:

  1. compute centers
  2. compute second moments wrt centers
  3. rotation matrix derivatives and components distances

The use of ArrayFire follows the implementation in the SAXS module. Code involving ArrayFire is guarded with #ifdef __PLUMED_HAS_ARRAYFIRE; a minimal sketch of the idea follows below.
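For illustration, the two GPU-evaluated steps above reduce to a weighted reduction and a small matrix product. The sketch below shows that idea with ArrayFire; it is not the PR's actual code: the function and variable names are invented, it assumes the align weights are already normalized, and it assumes that align and the centered reference coordinates were uploaded to the device once, outside the iteration loop.

```cpp
#ifdef __PLUMED_HAS_ARRAYFIRE
#include <arrayfire.h>
#include <vector>

// Illustrative only: per-step GPU evaluation of the two hot spots.
// align_gpu (n x 1 weights) and ref_centered_gpu (n x 3, reference already
// centered) are assumed to be resident on the device, uploaded once.
void rmsd_core_gpu(const std::vector<double>& pos_host,   // 3*n doubles, xyz per atom
                   const af::array& align_gpu,
                   const af::array& ref_centered_gpu,
                   double& rr00, double rr01[9]) {
  const dim_t n = align_gpu.dims(0);

  // host -> device: only the current positions move every step
  af::array pos3 = af::array(3, n, pos_host.data());                          // 3 x n

  // 1) compute centers: cpos = sum_i align_i * pos_i
  af::array cpos  = af::matmul(pos3, align_gpu);                              // 3 x 1
  af::array pos_c = pos3.T() - af::tile(cpos.T(), static_cast<unsigned>(n));  // n x 3, centered

  // 2) second moments w.r.t. the centers:
  //    rr00      = sum_i align_i * |pos_c_i|^2
  //    rr01[a,b] = sum_i align_i * ref_c_i[a] * pos_c_i[b]
  af::array w_pos_c = pos_c * af::tile(align_gpu, 1, 3);        // weight each row by align_i
  rr00 = af::sum<double>(pos_c * w_pos_c);                      // device -> host (1 double)
  af::matmul(ref_centered_gpu.T(), w_pos_c).host(rr01);         // device -> host (9 doubles, column-major)
}
#endif
```

Per step, only the positions are uploaded and only rr00 and rr01 are read back, which is the transfer pattern discussed later in this thread.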

Test case and results

The RMSD ArrayFire GPU calculation has been tested with 57,258 atoms, and the results agree with the CPU calculation. The RMSD line in plumed.dat for the test case is:

RMSD REFERENCE=structure.pdb TYPE=OPTIMAL LABEL=rmsd GPU=on DEVICEID=0

The calculation time summary from the log is:

|                                           | RMSD-OPTIMAL-CPU | RMSD-OPTIMAL-GPU |
| ----------------------------------------- | ---------------- | ---------------- |
| Plumed total time cost                    | 6.562            | 7.280            |
| RMSD calculation (forward loop) time cost | 4.462            | 5.186            |

In the current version, with the number of atoms on the order of ten thousand, the GPU shows no clear speed advantage over the CPU. Although the pure compute part achieves some speedup in unit tests, each iteration requires data transfer between host and device, which erodes the GPU's advantage.

In addition, the GPU-over-CPU speedup depends on the size of the system and on the host-device transfers. When the number of atoms increases by an order of magnitude, the GPU speedup improves further.

Target release

I would like my code to appear in release master

Type of contribution
  • changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
  • changes to a module not authored by you
  • new module contribution or edit of a module authored by you
Copyright
  • I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.
  • the module I added or modified contains a COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.
Tests
  • I added a new regtest or modified an existing regtest to validate my changes.
  • I verified that all regtests are passed successfully on GitHub Actions.

@GiovanniBussi (Member)

Thanks @zhang954527 !

Perhaps @carlocamilloni can better judge since he has experience with arrayfire.

What is not clear to me is the advantage of having a GPU implementation that is overall slower than the CPU one. Probably, as you pointed out, the reason is the cost of transferring data to the GPU. However, (a) I naively expect this cost to be linear in the number of atoms and (b) the cost of the calculation is also linear. So, I don't expect advantages even if you increase the number of atoms.

What do you think?

I am asking this because this PR further complicates an already complex C++ class, and I wanted to make sure that the extra price we will pay in maintaining it will be worth it...

Thanks!

@zhang954527 (Author)

Thanks @GiovanniBussi for your comment!

We understand your concern: the cost of transferring data and the cost of the calculation do both increase linearly with the number of atoms.
However, the acceleration benefit of the GPU over the CPU comes from the difference between GPU and CPU computing speed.
When the number of atoms increases, (a) the CPU time grows faster than the GPU time, so the gap between them widens and does not grow linearly; the GPU speedup ratio therefore increases. For the combined dot-product, matrix-multiplication, and reduction operations in RMSD, we ran a compute-only test with the following results:

| Atoms number | CPU time [s] | AF-GPU computing time without transfer [s] | Speedup ratio |
| ------------ | ------------ | ------------------------------------------ | ------------- |
| 4,000        | 3.1e-5       | 1.82e-4                                     | 0.17 x        |
| 40,000       | 3.1e-4       | 1.90e-4                                     | 1.63 x        |
| 57,258       | 4.1e-4       | 1.99e-4                                     | 2.06 x        |
| 400,000      | 3.09e-3      | 7.59e-4                                     | 4.07 x        |
| 4,000,000    | 3.12e-2      | 6.55e-3                                     | 4.76 x        |
| 40,000,000   | 3.94e-1      | 6.44e-2                                     | 6.12 x        |

(b) The cost of transferring data grows linearly. (c) Therefore, once the compute-time gap between CPU and GPU exceeds the cost of transferring data, the GPU version comes out ahead overall.

In addition, ArrayFire simplifies GPU programming and opens the possibility of acceleration. In future work, the variables involved in the PLUMED calculation could be read directly from the GPU memory of the MD engine, eliminating the per-iteration host-device transfers and further exposing the GPU's advantage. The current PR provides an interface for this part of the GPU implementation.

We would also welcome development suggestions on the GPU implementation from your side!

Thanks!

@GiovanniBussi (Member)

@zhang954527 thanks for the quick reply!

Could you please provide a table with the speedup including data transfer? I mean:

  • forward loop time with CPU
  • forward loop time with GPU, including transfer

This is what would actually matter. Thanks!

@GiovanniBussi (Member)

@zhang954527 I actually discussed briefly with Carlo about this. When you report timing, it is important that you discount the initial part related to arrayfire initialization.

@zhang954527 (Author) commented Sep 7, 2022

@GiovanniBussi Also thanks for your quick reply and for discussing with @carlocamilloni!
Sorry for the slow reply this time; it was night here.

Here is the table with the speedup including data transfer, for all optimalAlignment RMSD forward-loop operations as in this PR, measured in a unit test.
In this test, loop=200, and the AF-GPU data transfer is included in every iteration:

  • host->device: positions, 3*AtomNum*sizeof(double)
  • device->host: rr00, 1*sizeof(double); rr01, 9*sizeof(double)

Timing is done with the af::timeit() function, which provides accurate and reliable estimates for both CPU and GPU code (a sketch of such a benchmark is shown after the table).

| Atoms number | CPU time [s] | AF-GPU time with transfer (double) [s] | Speedup ratio (double) | AF-GPU time with transfer (float) [s] | Speedup ratio (float) |
| ------------ | ------------ | -------------------------------------- | ---------------------- | ------------------------------------- | --------------------- |
| 4,000        | 0.010        | 0.040                                   | 0.25 x                 | 0.045                                  | 0.22 x                |
| 40,000       | 0.095        | 0.098                                   | 0.97 x                 | 0.092                                  | 1.03 x                |
| 57,258       | 0.137        | 0.136                                   | 1.01 x                 | 0.118                                  | 1.16 x                |
| 400,000      | 1.074        | 0.867                                   | 1.25 x                 | 0.650                                  | 1.65 x                |
| 4,000,000    | 11.186       | 8.388                                   | 1.33 x                 | 6.620                                  | 1.69 x                |
| 40,000,000   | 113.915      | 85.669                                  | 1.33 x                 | 66.679                                 | 1.71 x                |
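For reference, a standalone benchmark in the spirit of this test can be built around af::timeit(), which runs the supplied function several times and returns the average time per call. The sketch below is only an illustration: the atom count, array shapes, and the body of the timed step are inferred from the description in this thread, not taken from the actual test code.

```cpp
#include <arrayfire.h>
#include <cstdio>
#include <vector>

// Assumed setup: align weights and centered reference resident on the device;
// only the positions are uploaded each step, and only rr00 (1 double) and
// rr01 (9 doubles) are read back, so the transfer cost is inside the timing.
static const int kAtoms = 57258;
static std::vector<double> g_pos(3 * kAtoms, 1.0);
static af::array g_align;
static af::array g_ref_c;

static void forward_loop_once() {
  af::array pos3  = af::array(3, kAtoms, g_pos.data());              // host -> device
  af::array cpos  = af::matmul(pos3, g_align);                       // 3 x 1 center
  af::array pos_c = pos3.T() - af::tile(cpos.T(), kAtoms);           // n x 3, centered
  af::array w     = pos_c * af::tile(g_align, 1, 3);                 // align-weighted rows
  double rr00     = af::sum<double>(pos_c * w);                      // device -> host
  double rr01[9];
  af::matmul(g_ref_c.T(), w).host(rr01);                             // device -> host
  (void)rr00;
}

int main() {
  af::setDevice(0);                                   // DEVICEID
  g_align = af::constant(1.0 / kAtoms, kAtoms, f64);
  g_ref_c = af::randu(kAtoms, 3, f64);
  double t = af::timeit(forward_loop_once);           // average seconds per call
  std::printf("forward loop with transfer: %g s\n", t);
  return 0;
}
```

Because the upload of the positions and the readback of rr00/rr01 sit inside the timed function, the reported time includes the per-step transfer, as in the table above.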

In terms of performance, avoiding the per-step data transfer would give a more significant speedup. This requires PLUMED to read directly from the GPU memory of the MD engine, which may require major changes.

We are not quite sure what you mean by ArrayFire initialization. We include only the necessary data transfers and ArrayFire computation in the timing, and exclude the definition and allocation of the ArrayFire variables and the initial transfer of some of them.
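For what it is worth, a common way to keep device initialization and kernel JIT compilation out of such measurements is an untimed warm-up call before the clock starts. A minimal sketch of that idea with ArrayFire's timer (not code from this PR):

```cpp
#include <arrayfire.h>

// Time a workload while discounting ArrayFire/device initialization:
// one untimed warm-up call lets device setup and JIT compilation happen
// before the measurement begins.
double time_without_init(void (*fn)(), int loops) {
  fn();                               // warm-up: initialization and JIT happen here
  af::sync();
  af::timer clock = af::timer::start();
  for (int i = 0; i < loops; ++i) fn();
  af::sync();                         // wait for queued GPU work before stopping the clock
  return af::timer::stop(clock) / loops;
}
```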

@GiovanniBussi (Member) commented Sep 7, 2022

OK I see, not dramatic but at least >1. I am still a bit skeptical of the practical utility of RMSD calculations with >400k aligned atoms, but it might be worth it.

For completeness could you please also report:

  • which CPU model, and how many threads you used in PLUMED (should be reported in the log file)
  • which GPU model

Then I will wait for @carlocamilloni to check the code, because I have no experience at all with this library.

Thanks again for your contribution!

@zhang954527 (Author)

Yes, that is the current conclusion. Thanks for the discussion and for the follow-up code review!

Here is the relevant information:

  • CPU model: AMD EPYC 7742 64-Core Processor
  • CPU threads: 20
  • MPI process: 1
  • GPU model: NVIDIA A100-SXM4-40GB

The relevant setup logs are as follows:

PLUMED: Version: 2.9.0-dev (git: 07a9a90ee-dirty) compiled on Sep  5 2022 at 18:43:38
PLUMED: Please cite these papers when using PLUMED [1][2]
PLUMED: For further information see the PLUMED web page at http://www.plumed.org
PLUMED: Root: /usr/local/lib/plumed
PLUMED: For installed feature, see /usr/local/lib/plumed/src/config/config.txt
PLUMED: Molecular dynamics engine: gromacs
PLUMED: Precision of reals: 4
PLUMED: Running over 1 node
PLUMED: Number of threads: 20
PLUMED: Cache line size: 512
PLUMED: Number of atoms: 714287
PLUMED: File suffix:
PLUMED: FILE: plumed2.dat
PLUMED: Action DEBUG
PLUMED:   with label @0
PLUMED:   with stride 1
PLUMED:   Detailed timing on
PLUMED:   on plumed log file
PLUMED: Action MOLINFO
PLUMED:   with label @1
PLUMED:   pdb file named structure.pdb contains 1 chains
PLUMED:   chain named   contains residues 27 to 1226 and atoms 1 to 57258
PLUMED: Action RMSD
PLUMED:   with label rmsd
PLUMED:   reference from file structure.pdb
PLUMED:   which contains 57258 atoms
PLUMED:   with indices :
PLUMED: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
PLUMED: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
...
PLUMED: 57251 57252 57253 57254 57255 57256 57257 57258
PLUMED:   method for alignment : OPTIMAL
PLUMED:   using periodic boundary conditions
PLUMED: Action PRINT
PLUMED:   with label @3
PLUMED:   with stride 1
PLUMED:   with arguments rmsd
PLUMED:   on file COLVAR
PLUMED:   with format  %f
PLUMED: END FILE: plumed2.dat
PLUMED: Timestep: 0.001000
PLUMED: KbT has not been set by the MD engine
PLUMED: It should be set by hand where needed
PLUMED: Relevant bibliography:
PLUMED:   [1] The PLUMED consortium, Nat. Methods 16, 670 (2019)
PLUMED:   [2] Tribello, Bonomi, Branduardi, Camilloni, and Bussi, Comput. Phys. Commun. 185, 604 (2014)
PLUMED: Please read and cite where appropriate!
PLUMED: Finished setup

@HanatoK (Contributor) commented Sep 8, 2022

@zhang954527
I have a question:
Since the data transfer between host memory and device memory (calling cudaMemcpy or similar functions) is also time-consuming, is it possible to avoid it as much as possible? I think one way would be to compute the eigenvectors and eigenvalues entirely with ArrayFire, without resorting to diagMatSym.

@zhang954527 (Author)

@HanatoK Thanks for your comment!

Yes, we should avoid data transfer as much as possible to reduce the extra overhead. We also tried computing the following parts on the GPU, but they still need to transfer some data, in some cases a larger amount.

For example, if we calculate the eigenvectors and eigenvalues on the GPU, we need to transfer q (4*sizeof(double)) back to the CPU. This part involves little computation, because m is only a 4x4 matrix, and it accounts for about 0.6% of the overall RMSD optimalAlignment time, so moving it to the GPU would not bring a significant benefit.

If we also calculate the rotation matrix on the GPU, we need to transfer rotation (9*sizeof(double)) back to the CPU.

If we go further and calculate the derivatives and dist on the GPU, this stage does take a lot of calculation time, and the GPU can save considerable time there. However, when the function ends, the derivatives matrix (3*n*sizeof(double)) needs to be transferred back for use by subsequent code. This is a large transfer, especially compared with rr00 (1*sizeof(double)) and rr01 (9*sizeof(double)), and it ultimately slows down the calculation. That is what we are worried about. In fact, we tested this case, and it was not as fast as the current PR version.

Therefore, the current submission mainly aims to (a) port the large part of the calculation to the GPU and (b) minimize the amount of host-device transfer; the two are a trade-off.

Those are our considerations, thank you!

@carlocamilloni (Member)

@zhang954527 thanks! This is a very nice contribution. We have been thinking about developing GPU versions of heavy parts of the code and this pull request is timely for us to start discussing what is the best way to approach the problem (@GiovanniBussi @maxbonomi @gtribello).

The first point is that we are not sure whether ArrayFire is the best solution: @maxbonomi has implemented a cryo-EM CV using both arrayfire and libtorch, and libtorch may be the better library; furthermore, @luigibonati has implemented a module that allows including CVs defined with pytorch.

A second point is numerical accuracy: I think that for the GPU we should aim to use only FLOAT, so that the code can still be used on any hardware; this would also speed up your code considerably.

A third point is how to implement an interface to the MD codes that allows reading GPU memory directly, so as to avoid some of the data transfer; here @GiovanniBussi and @tonigi are also thinking about alternative approaches, possibly based solely on Python.

At this point I think that

  1. you may test what happens using FLOAT
  2. I would not merge the pull request yet, but we can keep it open as WIP so that other people can also test it or get ideas.
  3. open an ISSUE to discuss how to move forward

@HanatoK (Contributor) commented Sep 9, 2022

> (quoting @zhang954527's reply above)

Thank you for your reply!

@tonigi (Contributor) commented Sep 9, 2022

The only thing I can comment is that my hunch is that it is the number of transfers, not their size, that is the bottleneck.

@GiovanniBussi added the "wip Do not merge" label on Sep 9, 2022
@zhang954527 (Author)

@carlocamilloni Thanks for your comment!

We are glad to provide a reference point for GPU versions of PLUMED. In fact, we also wrote a native CUDA kernel version and compared it with ArrayFire; the results show that the native CUDA kernels have higher acceleration potential than ArrayFire, for both small and large numbers of atoms. Hence selecting a better library might be a good direction.

We have just tested the GPU path with FLOAT precision and achieved faster speeds for most atom counts; the results have been added to the table above.
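As a rough illustration of what the FLOAT test changes (a sketch with invented names, not the PR's code): the per-step upload is done in single precision, halving the transferred data, and arrays that stay on the device can be converted once with as(f32).

```cpp
#include <arrayfire.h>
#include <vector>

// Hypothetical FLOAT path: the MD engine hands PLUMED double-precision
// positions, but the device-side arrays are kept in f32.
af::array upload_positions_f32(const std::vector<double>& pos_host, int n) {
  std::vector<float> pos_f(pos_host.begin(), pos_host.end());   // double -> float on the host
  return af::array(3, n, pos_f.data());                         // uploaded as an f32 array
}

// Arrays uploaded once (align weights, reference) can instead be converted
// in place on the device, e.g. align_gpu = align_gpu.as(f32);
```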

Thank you for the suggestions on next steps, and for keeping this pull request open as WIP; hopefully our discussion will give others some insight.

Later we will consider opening an ISSUE linked to this pull request to help the discussion. We also hope to take part in further discussions on the PLUMED GPU version in future work.

Thank you!

@zhang954527 (Author)

@tonigi Thanks for your ideas.

It is true that the number of transfers affects the time, because each transfer call adds API latency. At the same time, we think the size of the transferred data still affects the acceleration to some extent, especially when the number of atoms is large.

Thanks!
