RFC: Further optimize the optimal rotation code path #560

HanatoK · 2023-07-25T16:06:39Z

I think the optimal rotation code path can still be optimized further as follows:

Use rotation matrix instead of quaternion to rotate the vectors. This could make the rotation operation 1.5~3 times faster on my computer. A distilled test example can be seen in https://godbolt.org/z/qnKv8E6jT;
Use fixed size array for S, S_eigvec and S_backup. cvm::matrix2d allocates the memory dynamically but we already know that the matrices are all 4x4 at compile time;

Any ideas?

Update: for 1, it should be 0.5~2 times faster.

The text was updated successfully, but these errors were encountered:

jhenin · 2023-07-25T16:30:07Z

Thanks, those seem valid to me.

I think this is promising because calling rotate() many times for the same quaternion does redundant work, especially compared with caching the rotation matrix and applying it directly to many positions.
I would expect only minor gains from this, as the resizing is done once in calc_optimal_rotation_impl and then becomes trivial because these are persistent member data, so there is no reallocation on subsequent calls: it's really allocated once and for all. That said, I would rather have the resize() happen in the constructor for a cleaner code. Then the run-time should be nearly indistinguishable from fixed arrays: we don't construct very many rotation objects per second (at least I can't think of a workflow that would).

giacomofiorin · 2023-07-25T16:34:04Z

I agree on both points.

HanatoK · 2023-07-25T18:14:27Z

Thanks @jhenin and @giacomofiorin ! I will try the optimizations if I have time.

HanatoK · 2023-07-26T10:00:54Z

@giacomofiorin
Is it also possible to parallelize the loops over atoms? I am not familiar with the SMP but it seems it only parallelizes the computation of CVCs and biases, but in this way, the threads may have imbalance workloads. Let's say we have one distance CVC and one RMSD CVC, which are computed by two threads via SMP, then the thread calculating distance CVC may finish faster than the other. Another similar case is running metadynamics with a harmonic wall restraint. The harmonic wall restraint is easier to compute than the metadynamics bias, since the latter needs to iterate over either the grid to project a new hill, or iterate all history hills to calculate the biasing force at a certain position. Is it possible to utilize SMP in a fine-grained manner, such as looping over all atoms?

giacomofiorin · 2023-09-20T23:06:02Z

Answering your last point (before the issue gets closed when #585 is merged) parallelization over atoms is desirable, but will take some refactoring.

Remember that we are dealing with a broad range of use cases between (1) a single expensive rotation (e.g. large protein) and (2) many rotations/RMSDs with respect to multiple points (e.g. path CV). Also, when going to large numbers of threads good SMP parallelization requires a clever data structure layout (ask those with more experience than either of us). This means yet more refactoring because the current layout is nowhere near clever! :-)

Continuous testing has helped tremendously, especially considering that we cannot plan around the release cycles of major packages. I feel that we're getting closer to making that kind of refactoring manageable, but it remains more than one or two PRs away.

A practical suggestion: would you be open to consider taking some of the tests that you already wrote in the namd/tests/library folder and making them agnostic from NAMD (i.e. top-level tests folder)? If most of the functionality could be run with using just the library, it would make it much easier to troubleshoot any errors. Running a full-blown MD engine is important for benchmarking, but makes debugging harder.

HanatoK · 2023-09-26T12:43:06Z

@giacomofiorin

I will try to make some of the tests MD-engine agnostic.

Regarding the threading, is it possible to use a thread pool? I think it might be possible to pull a thread from the pool to do either (1) or (2) that you mentioned, but do not know if there are other performance overheads.

HanatoK added optimization non-urgent Long-term improvement labels Jul 25, 2023

HanatoK mentioned this issue Sep 16, 2023

Improve rotation #585

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Further optimize the optimal rotation code path #560

RFC: Further optimize the optimal rotation code path #560

HanatoK commented Jul 25, 2023 •

edited

Loading

jhenin commented Jul 25, 2023

giacomofiorin commented Jul 25, 2023

HanatoK commented Jul 25, 2023

HanatoK commented Jul 26, 2023

giacomofiorin commented Sep 20, 2023 •

edited

Loading

HanatoK commented Sep 26, 2023

RFC: Further optimize the optimal rotation code path #560

RFC: Further optimize the optimal rotation code path #560

Comments

HanatoK commented Jul 25, 2023 • edited Loading

jhenin commented Jul 25, 2023

giacomofiorin commented Jul 25, 2023

HanatoK commented Jul 25, 2023

HanatoK commented Jul 26, 2023

giacomofiorin commented Sep 20, 2023 • edited Loading

HanatoK commented Sep 26, 2023

HanatoK commented Jul 25, 2023 •

edited

Loading

giacomofiorin commented Sep 20, 2023 •

edited

Loading