Added RMSD optimalAlignment GPU implementation #859
base: master
Conversation
Thanks @zhang954527! Perhaps @carlocamilloni can judge better, since he has experience with ArrayFire. What is not clear to me is the advantage of having a GPU implementation that is overall slower than the CPU one. Probably, as you pointed out, the reason is the cost of transferring data to the GPU. However, (a) I naively expect this cost to be linear in the number of atoms, and (b) the cost of the calculation is also linear. So I don't expect advantages even if you increase the number of atoms. What do you think? I am asking because this PR further complicates an already complex C++ class, and I wanted to make sure that the extra price we will pay in maintaining it will be worth it. Thanks!
Thanks @GiovanniBussi for your comment! We understand your concerns: both the cost of transferring data and the cost of the calculation increase linearly with the number of atoms. (a) The GPU processes the calculation part with much higher throughput than the CPU. (b) The cost of transferring data grows linearly. (c) Therefore, when the performance benefit from the GPU/CPU difference in the calculation exceeds the cost of transferring data, the GPU gives an overall speedup. In addition, ArrayFire reduces the complexity of GPU programming and makes this acceleration practical. In future work, the variables involved in the PLUMED calculation could be read directly from the GPU memory of the MD engine, removing the host-device transfer in each iteration and further exposing the GPU's advantage. The current PR provides an interface for this part of the GPU implementation. We would also be glad to receive development suggestions on the GPU implementation from your side! Thanks!
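To make the linearity argument concrete, here is a back-of-envelope sketch (our illustration, not measured data). With constant per-atom costs,

$$T_{\mathrm{CPU}}(N)\approx c_{\mathrm{cpu}}N,\qquad T_{\mathrm{GPU}}(N)\approx(c_{\mathrm{xfer}}+c_{\mathrm{gpu}})N\quad\Longrightarrow\quad\frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}\approx\frac{c_{\mathrm{cpu}}}{c_{\mathrm{xfer}}+c_{\mathrm{gpu}}},$$

so under strictly linear costs the speedup tends to a constant, independent of $N$; the GPU wins only when $c_{\mathrm{xfer}}+c_{\mathrm{gpu}}<c_{\mathrm{cpu}}$. In practice $c_{\mathrm{gpu}}$ effectively shrinks as $N$ grows, because a small system cannot saturate the device, which is why the measured speedup improves with the atom count.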
@zhang954527 thanks for the quick reply! Could you please provide a table with the speed-up including data transfer? I mean:

This is what would actually matter. Thanks!
@zhang954527 I actually discussed briefly with Carlo about this. When you report timing, it is important that you discount the initial part related to ArrayFire initialization.
@GiovanniBussi Also thanks for your quick reply and for discussing with @carlocamilloni! This is the table with the speed-up including data transfer for all optimalAlignment RMSD forward-loop operations, the same as in this PR, in a unit test.
Timing is done using
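The original snippet is elided above. As an illustration only (not the code used in this PR), timing ArrayFire work typically needs a warm-up pass, so that one-time initialization and JIT compilation are excluded as suggested, and an explicit `af::sync()` before stopping the clock:

```cpp
// Illustrative timing harness, not the PR's code. The workload is a stand-in.
#include <arrayfire.h>
#include <cstdio>

int main() {
  af::array x = af::randu(400000, 3, f32);   // placeholder coordinates
  af::array w = af::randu(400000, f32);      // placeholder weights

  // Warm-up: the first evaluation triggers context setup and kernel JIT.
  af::array warm = af::sum(x * x, 1) * w;
  warm.eval();
  af::sync();

  af::timer t = af::timer::start();
  af::array r = af::sum(x * x, 1) * w;       // representative device work
  r.eval();
  af::sync();                                // wait for the GPU to finish
  std::printf("elapsed: %f s\n", af::timer::stop(t));
  return 0;
}
```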
In terms of improving performance, if the data transfer in each step is avoided, we will get a more significant speedup. This requires that PLUMED can read directly from the GPU memory of the MD engine, which may require major changes. We are not quite sure what you mean by ArrayFire initialization: we only include the necessary data transfers and the ArrayFire computation in the timing, and exclude ArrayFire variable definition, allocation, and the initial transfer of some variables.
OK I see, not dramatic but at least >1. I am still a bit skeptical of the practical utility of RMSD calculations with >400k aligned atoms, but it might be worth it. For completeness, could you please also report:
Then I will wait for @carlocamilloni to check the code, because I have no experience at all with this library. Thanks again for your contribution!
Yes, that is the current conclusion. Thanks for the discussion and the follow-up code checking! Here is the relevant information:
The relevant setup logs are as follows:
@zhang954527 |
@HanatoK Thanks for your comment! Yes, we should avoid data transfer as much as possible to reduce the extra cost. We also tried to compute the following parts on the GPU, but they still need some data to be transferred, sometimes an even larger amount. For example, if we calculate the … If we continue to calculate the … If we continue to calculate the … Therefore, the current submission mainly considers: (a) porting the largest part of the calculation to the GPU, and (b) minimizing the amount of transfer between the device and the host, which is a trade-off. The above is our consideration, thank you!
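To illustrate the trade-off in (a)/(b), a minimal sketch (our example, not the PR's code; it shows only the transfer pattern, not the optimal-alignment algorithm itself): reference data stays resident on the device, one upload of positions happens per step, and only a scalar comes back.

```cpp
// Sketch of the transfer pattern only: one host-to-device copy per step,
// a device-side reduction, and a single scalar copied back to the host.
#include <arrayfire.h>
#include <cmath>
#include <vector>

double rmsdNoFitStep(const std::vector<float>& positions, // 3*N host coords,
                                                          // column-major (x..., y..., z...)
                     const af::array& reference,          // N x 3, GPU-resident
                     const af::array& align) {            // N weights, GPU-resident
  const dim_t n = reference.dims(0);
  af::array pos(n, 3, positions.data());                  // the per-step upload
  af::array diff = pos - reference;
  af::array perAtom = af::sum(diff * diff, 1) * align;    // reduced on the device
  // Only one scalar crosses back to the host per step.
  return std::sqrt(af::sum<float>(perAtom) / af::sum<float>(align));
}
```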
@zhang954527 thanks! This is a very nice contribution. We have been thinking about developing GPU versions of the heavy parts of the code, and this pull request is timely for us to start discussing the best way to approach the problem (@GiovanniBussi @maxbonomi @gtribello). The first point is that we are not sure whether ArrayFire is the best solution: @maxbonomi has implemented a cryo-EM CV using both ArrayFire and libtorch, and libtorch is possibly the better library; furthermore, @luigibonati has implemented a module that allows including CVs written with PyTorch. A second point is numerical accuracy: I think that for the GPU we should aim to use only FLOAT, so that the code can still be used on any hardware; this would also considerably speed up your code. A third point is how to implement an interface to the MD codes that allows reading the GPU memory, so as to avoid some of the data transfer; here @GiovanniBussi @tonigi are also thinking about alternative approaches, possibly based solely on Python. At this point I think that
Thank you for your reply! |
The only thing I can comment is that my hunch is that it's the number of transfers, not their size, that is the bottleneck.
@carlocamilloni Thanks for your comment! We are glad to provide some reference points for GPU versions of PLUMED. In fact, we also wrote a plain CUDA kernel version and compared it with ArrayFire; the results show that plain CUDA kernels have higher acceleration potential than ArrayFire, both for small and large numbers of atoms. Hence selecting a better library might be a good direction. We have just tested the GPU path using FLOAT precision and achieved faster speeds for most atom counts, and we have added the results to the table above. Thank you for your suggestions on the next steps, and for keeping this pull request open as WIP; hopefully our discussion will give others some insight. Later we will consider opening an issue linked to this pull request to help with better discussions. We also hope to participate in more discussions on the PLUMED GPU version in later work. Thank you!
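As a concrete illustration of the FLOAT point (a sketch, not code from this PR): ArrayFire makes the precision explicit in the array type, and double support can be queried before requesting `f64`.

```cpp
// Illustrative: single precision (f32) runs on any GPU; query the device
// before promoting to f64, which may be slow or absent on consumer cards.
#include <arrayfire.h>
#include <cstdio>

int main() {
  const int device = 0;                        // matches the PR's DEVICEID default
  af::setDevice(device);
  af::array coords = af::randu(57258, 3, f32); // atom count from the test case
  if (af::isDoubleAvailable(device)) {
    af::array coords64 = coords.as(f64);       // promote only when supported
    coords64.eval();
    std::printf("f64 supported on device %d\n", device);
  }
  af::sync();
  return 0;
}
```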
@tonigi Thanks for your ideas. It is true that the number of transfers affects the time, because each transfer call carries a fixed API latency. At the same time, we think the size of the transferred data still affects the speedup to a certain extent, especially when the number of atoms is large. Thanks!
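A small sketch of the point about transfer counts (our illustration, with hypothetical names): several small uploads can be packed into one buffer, so each step pays the per-call latency once and the device array is sliced afterwards.

```cpp
// Illustrative: pack two host buffers into one upload to pay the
// host-to-device call latency once instead of twice (names hypothetical).
#include <arrayfire.h>
#include <vector>

af::array uploadPacked(const std::vector<float>& pos,      // 3*N coordinates
                       const std::vector<float>& align) {  // N weights
  std::vector<float> packed;
  packed.reserve(pos.size() + align.size());
  packed.insert(packed.end(), pos.begin(), pos.end());
  packed.insert(packed.end(), align.begin(), align.end());
  // One transfer; the caller slices the device array with af::seq afterwards.
  return af::array(static_cast<dim_t>(packed.size()), packed.data());
}
```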
Description
Added RMSD optimalAlignment ArrayFire-GPU implementation:
Preliminary implementation of the RMSD distance calculation, using ArrayFire to compute the time-consuming parts on the GPU.
Code modifications
Two time-consuming parts of the `optimalAlignment` core calculation module in the `RMSD` class are calculated on the GPU.

In the calculation, the reference coordinates, `align`, `displace`, and `rr11` are transferred to the GPU and computed in advance, to avoid transfer overhead in each calculation iteration.

At the same time, compulsory keywords and an option are added to the RMSD CV action in `plumed.dat`:

- `DEVICEID` (default=0): identifier of the GPU to be used.
- `GPU` (default=off): calculate RMSD using ARRAYFIRE on an accelerator device.
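For reference, the SAXS module (which this PR follows for its ArrayFire usage) registers its GPU options through PLUMED's `Keywords` API; a hedged sketch of what the registration for these two keywords could look like (the PR's actual code may differ):

```cpp
// Hypothetical sketch of keyword registration, modeled on the SAXS module;
// the keyword names and defaults are the ones described in this PR.
#include "tools/Keywords.h"   // assumed PLUMED header path

void registerGpuKeywords(PLMD::Keywords& keys) {
  keys.addFlag("GPU", false,
               "calculate RMSD using ARRAYFIRE on an accelerator device");
  keys.add("compulsory", "DEVICEID", "0",
           "Identifier of the GPU to be used");
}
```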
The ArrayFire GPU implementation is also added to `doCoreCalc` and `getDistance` in the `RMSDCoreData` class. It is currently disabled by default. The modules that are specifically changed to calculate on the GPU are:

The use of ArrayFire follows the implementation of the SAXS module. Parts involving ArrayFire are guarded with `#ifdef __PLUMED_HAS_ARRAYFIRE`.
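The guard pattern means the code still builds without ArrayFire; a minimal sketch of the idea (our illustration, not the PR's code):

```cpp
// Illustrative conditional-compilation pattern: the ArrayFire path exists
// only when PLUMED is configured with ArrayFire support.
#ifdef __PLUMED_HAS_ARRAYFIRE
#include <arrayfire.h>
#endif

double sumSquares(const float* x, int n, bool gpu) {  // hypothetical helper
#ifdef __PLUMED_HAS_ARRAYFIRE
  if (gpu) {
    af::array d(n, x);                 // upload to the device
    return af::sum<float>(d * d);      // reduce on the GPU
  }
#endif
  double s = 0.0;                      // CPU fallback, always available
  for (int i = 0; i < n; ++i) s += static_cast<double>(x[i]) * x[i];
  return s;
}
```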
.Test case and results
The RMSD ArrayFire GPU calculation has been tested with 57,258 atoms, and its accuracy has been verified against the CPU result. The RMSD configuration in `plumed.dat` for the test case is:
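The actual input file is not reproduced above; a minimal hypothetical example using the keywords added by this PR (`ref.pdb` is a placeholder reference structure):

```
rmsd: RMSD REFERENCE=ref.pdb TYPE=OPTIMAL GPU DEVICEID=0
PRINT ARG=rmsd FILE=COLVAR STRIDE=1
```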
for test case is:The calculation time cost summary in log is:
In the current version, with the number of atoms on the order of ten thousand, the GPU shows no clear speed advantage over the CPU. The reason is that, although the pure computing part on the GPU achieves a certain speedup in unit tests, each iteration requires data transfer between host and device, which erodes the GPU's computing advantage.
In addition, the speedup of the GPU over the CPU depends on the size of the computing system and on the transfer between device and host. When the number of atoms increases by an order of magnitude, the GPU speedup improves further.
Target release
I would like my code to appear in release master
Type of contribution
Copyright
`COPYRIGHT` file with the correct license information. Code should be released under an open source license. I also used the command `cd src && ./header.sh mymodulename` in order to make sure the headers of the module are correct.

Tests