
[RFC] Work towards reusable CVCs #700

Draft · wants to merge 5 commits into base: reusable-cvcs

Conversation

@HanatoK (Member) commented Jun 29, 2024

Well, I believe that no one would really like this PR. This PR uses a lot of hacks and workarounds in the backend to achieve reusable CVCs in SMP with as few limitations as possible (although I am not a big fan of distributing CVCs over threads), including:

  1. Build a toy AST and determine the parallelization scheme by the depth of each node. At first glance, colvardeps looks like an AST, but after playing with it I feel it is not a real one: it really just checks the dependencies of features, and there is no true AST. It would be better if Colvars could be redesigned with a true AST and a dependency checker for it. The dependency checker should not own the AST;
  2. Bypass the colvar class and take the cvc objects out to build the AST. I think that the colvar class should be completely removed;
  3. I don't know why smp_lock and smp_unlock in colvarproxy_namd are implemented as creating and destroying locks, so I have changed them;
  4. Implement the chain rule in a dirty manner (see colvar::cvc::modify_children_cvcs_atom_gradients and propagate_colvar_force). When calling calc_gradients and apply_force of a CVC consisting of sub-CVCs, it now propagates the gradients and forces to all its sub-CVCs (see the sketch at the end of this description);
  5. To avoid race conditions when propagating the atom gradients while reusing CVCs, I have to use smp_lock. However, it is very coarse-grained, so I expect an additional performance penalty. I thought there should be a lock tied to each atom group but found none.

In summary, I think that Colvars should be fundamentally changed to achieve better support of reusable components and parallelization.

This PR tries to solve #232, extends #644, and completes the following:

  • Reusing the computation of the individual "nodes" in a pair of path CVs ("s" and "z").
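
To illustrate item 4 of the description, here is a minimal sketch of chain-rule force propagation through sub-CVCs (all names invented; this is not the actual modify_children_cvcs_atom_gradients / propagate_colvar_force code):

#include <cstddef>
#include <vector>

// Hypothetical composite CVC: apply_force() walks down to the sub-CVCs,
// scaling the force by d(parent value)/d(child value) at each edge.
struct cvc_node {
  std::vector<cvc_node *> sub_cvcs;   // sub-CVCs in the toy AST
  std::vector<double> sub_gradients;  // d(value)/d(sub-CVC value), one per sub-CVC
  virtual ~cvc_node() = default;

  virtual void apply_force(double f) {
    if (sub_cvcs.empty()) {
      apply_force_to_atoms(f);  // leaf CVC: forward the force to its atom groups
      return;
    }
    for (std::size_t i = 0; i < sub_cvcs.size(); i++) {
      sub_cvcs[i]->apply_force(f * sub_gradients[i]);  // chain rule
    }
  }
  virtual void apply_force_to_atoms(double /* f */) {}
};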

@HanatoK HanatoK requested review from giacomofiorin and jhenin June 29, 2024 22:52
@HanatoK (Member, Author) commented Jun 30, 2024

After some thought, I feel there is no way to make explicit gradients work correctly with reusable components, so I will disable them in this PR.

This does not work as Colvars was not designed with automatic differentiation in mind.
@HanatoK (Member, Author) commented Jul 1, 2024

The problem of calc_gradients()

Colvars was not designed with automatic differentiation (AD) in mind. At first glance, it seems to perform forward AD, because for each CVC calc_gradients() is executed just after calc_value(); however, the colvar class, which is supposedly a function of its CVCs, does not have explicit gradients with respect to the atoms. Conversely, colvar::communicate_forces just computes the gradients with respect to the CVCs on the fly, which looks like a backward AD implementation. Furthermore, the colvarvalue class has no gradient field, unlike many other implementations such as torch.Tensor and PLMD::Value, and the CVC class does not store its gradients either. All these factors make either forward or backward AD using calc_gradients() for CVCs composed of sub-CVCs difficult. The apply_force() code path is less broken, as it looks consistent with backward AD and the backward propagation in PLUMED.
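
For contrast, here is a minimal sketch (invented names, not the Colvars API) of a value type that carries its own gradient in the spirit of torch.Tensor and PLMD::Value, so that a parent CVC could apply the chain rule without querying its children again:

#include <cstddef>
#include <vector>

// Hypothetical forward-AD value: the gradient with respect to the (flattened)
// atom coordinates travels together with the value.
struct ad_value {
  double value = 0.0;
  std::vector<double> grad;  // d(value)/d(x_i)

  // Example composition rule for h = f * g:
  friend ad_value operator*(const ad_value &f, const ad_value &g) {
    ad_value h;
    h.value = f.value * g.value;
    h.grad.resize(f.grad.size());
    for (std::size_t i = 0; i < f.grad.size(); i++) {
      h.grad[i] = f.grad[i] * g.value + f.value * g.grad[i];  // product rule
    }
    return h;
  }
};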

The problem of colvardeps

From the perspective of a compiler or interpreter, when constructing the CVC object in colvar::cvc::cvc, Colvars is still in the stage of syntax analysis: it is parsing the config file and trying to build a tree of colvardeps, and it ought to be noted that the syntax analysis is not done before cvc::init has run. The weird design is that init_dependencies is called after the constructor of the CVC and before cvc::init, and checks the dependencies. In general, feature dependencies are a semantic thing, which means that Colvars interleaves the syntax analysis with the semantic analysis, which, in my opinion, is a bad design.

@HanatoK HanatoK changed the title More band-aid fixes to Colvars [RFC] More band-aid fixes to Colvars Jul 1, 2024
@HanatoK HanatoK marked this pull request as ready for review July 1, 2024 19:25
AST

Giacomo said that variables_active_smp is used to make the colvar object appear in the loop multiple times for the original parallelization scheme. As I understand it, this means that I can use variables_active directly to build the AST instead of the duplicated items in variables_active_smp.
if (it_find == cvc_info_map.end()) {
  cvc_info_map.insert({parent, cvc_info{
    // TODO: Here calc_cvc_values calls cvcs[i]->is_enabled(), which is is_enabled(int f = f_cv_active).
    // I know both f_cv_active and f_cvc_active are 0, but are they the same option??

Member:

The fact that f_cv_active and f_cvc_active have the same numerical value has no consequence: they should never be used in the same context, because they are respectively only meaningful in a colvar or a cvc object. The colvardeps data of these two classes are non-overlapping. The relationship between them is a vertical (parent-child) dependency.

If we were to merge the two levels, then these two features would merge.

Member Author:

The issue is that in

if (!cvcs[i]->is_enabled()) continue;

is_enabled() is called with no argument, so I think it would call

inline bool is_enabled(int f = f_cv_active) const {

which checks f_cv_active instead of f_cvc_active. My code just follows what calc_cvc_values originally calls, and I am not sure whether I should follow it the same way.

Member:

You're right, sorry. That is somewhat sloppy writing that I didn't remember well. It does rely on the "active" property being number 0.

// NOTE that all feature enums should start with f_*_active

But a class inheriting from colvardeps could also override is_enabled() and change this convention.

Member:

How much effort do you think it would be to remove the default argument for this virtual function? Remember that this is one of the "issues" that clang-tidy was complaining about.

Member Author:

That would be very little effort. I'm happy to do that if that helps in any way.
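
For reference, a minimal illustration of the clang-tidy concern (invented names, not the actual colvardeps code): a default argument on a virtual function is bound at the call site from the static type, so an override cannot change it. Removing the default and keeping a non-virtual convenience wrapper avoids the trap:

#include <array>

struct deps_base {
  std::array<bool, 8> feature_states{};
  virtual ~deps_base() = default;
  // No default argument: every caller must name the feature explicitly.
  virtual bool is_enabled(int f) const { return feature_states[f]; }
  // Convenience wrapper relying on the convention that "active" is feature 0.
  bool is_active() const { return is_enabled(0); }
};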

it != cv->variables_active_smp()->end(); ++it) {
// TODO: Bad design! What will happen if CVC a is in a "colvar" block
// that does not support total_force_calc, but is then reused in
// another block that requires total_force_calc even if it supports Jacobian itself???

Member:

I don't see the problem here. Assuming a cvc can have several parents: the colvar that does require a total force can enable it in its children cvcs, even if other parents don't require (or support) it.

Member Author:

You are right. I just thought that if a colvar does not require the total force, then it would disable the corresponding feature of its children, but it seems the code does not check the dependency in that way.

Member:

It's the other way around: disabled by default, and enabled on request; then disabled again if the refcount falls to zero.
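
A minimal sketch of that refcounting convention (invented names, not the actual colvardeps implementation):

// Features start disabled; each enable() bumps a reference count, and the
// feature switches off again only when the last user releases it.
struct refcounted_feature {
  int ref_count = 0;
  bool enabled = false;
  void enable()  { if (ref_count++ == 0) enabled = true; }
  void disable() { if (ref_count > 0 && --ref_count == 0) enabled = false; }
};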

lambda_fn, NULL, CKLOOP_NONE, NULL);
}
cvm::decrease_depth();
return cvm::get_error();

Member Author:

Parallelization and checking the error

When I wrote and tested the new code, I found that NAMD may not exit cleanly (some other threads were still running) if one thread exited. I think the implementation of colvarproxy_namd::error, or at least its use in the parallel region, needs to be revised to set an error "bit" in a std::atomic<bool> instead of calling NAMD_die directly. Then, after all threads are finished and joined, we can use cvm::get_error() to check the error bit and exit if there is an error.
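
A minimal sketch of this proposal, assuming a plain std::atomic<bool> error bit (the function names here are hypothetical):

#include <atomic>

std::atomic<bool> colvars_error_bit{false};

void worker_body(bool failed) {
  // ... per-thread CVC calculation; on error, set the bit instead of
  // calling NAMD_die() from inside the parallel region:
  if (failed) colvars_error_bit.store(true, std::memory_order_relaxed);
}

void after_all_threads_joined() {
  if (colvars_error_bit.load()) {
    // Single-threaded here: safe to report the error and exit cleanly.
  }
}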

@jhenin (Member) commented Jul 4, 2024

The problem of colvardeps

From the perspective of a compiler or interpreter, when constructing the CVC object in colvar::cvc::cvc, Colvars is still in the stage of syntax analysis: it is parsing the config file and trying to build a tree of colvardeps, and it ought to be noted that the syntax analysis is not done before cvc::init has run. The weird design is that init_dependencies is called after the constructor of the CVC and before cvc::init, and checks the dependencies. In general, feature dependencies are a semantic thing, which means that Colvars interleaves the syntax analysis with the semantic analysis, which, in my opinion, is a bad design.

There seems to be a misunderstanding here. To clarify, the spirit of init_dependencies is not that it "checks dependencies" - rather, it initializes the static dependency tree between features. The proper dependency checking happens with calls to enable(), which happen either during the semantic analysis of the input, or throughout the run in case of dynamic dependencies.

@HanatoK (Member, Author) commented Jul 4, 2024

The problem of colvardeps

From the perspective of a compiler or interpreter, when constructing the CVC object in colvar::cvc::cvc, Colvars is still in the stage of syntax analysis: it is parsing the config file and trying to build a tree of colvardeps, and it ought to be noted that the syntax analysis is not done before cvc::init has run. The weird design is that init_dependencies is called after the constructor of the CVC and before cvc::init, and checks the dependencies. In general, feature dependencies are a semantic thing, which means that Colvars interleaves the syntax analysis with the semantic analysis, which, in my opinion, is a bad design.

There seems to be a misunderstanding here. To clarify, the spirit of init_dependencies is not that it "checks dependencies" - rather, it initializes the static dependency tree between features. The proper dependency checking happens with calls to enable(), which happen either during the semantic analysis of the input, or throughout the run in case of dynamic dependencies.

Thanks for the clarifications. You are right that init_dependencies only declares the dependencies. However, I think the problem is still that enable tries to check the dependencies while the children CVCs are still being initialized, so for a CVC of sub-CVCs, I cannot do the checking here:

// TODO: I don't know why I cannot check this
// if (is_enabled(f_cvc_gradient))
sub_cv->enable(f_cvc_gradient);

Also, I cannot use add_child in colvardeps to declare that a sub-CVC is a child of another, as add_child also does dependency checking, so I think that it is a bad design to check dependencies while calling the initialization function. Calling init means that Colvars is still parsing options and the abstract syntax tree is not completely built, but the dependencies are semantic things that should be checked after the AST is completely built.

In my opinion, the following structure should be separated from colvardeps,

colvars/src/colvardeps.h

Lines 155 to 162 in ff35f9c

/// pointers to objects this object depends on
/// list should be maintained by any code that modifies the object
/// this could be secured by making lists of colvars / cvcs / atom groups private and modified through accessor functions
std::vector<colvardeps *> children;
/// pointers to objects that depend on this object
/// the size of this array is in effect a reference counter
std::vector<colvardeps *> parents;

to form the AST; colvardeps should only be a feature-dependency checker and should not own the AST. Adding new children to the AST should not trigger the checker. In other words, it would be better if colvardeps acted somewhat like an LLVM pass.
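
A minimal sketch of this separation (invented names; only the children/parents vectors mirror colvardeps):

#include <vector>

// The tree is plain data: add_child never triggers any checking.
struct ast_node {
  std::vector<ast_node *> children;
  std::vector<ast_node *> parents;  // its size acts as a reference counter
  void add_child(ast_node *c) {
    children.push_back(c);
    c->parents.push_back(this);
  }
};

// Dependency checking is a separate pass over the fully built tree, in the
// spirit of an LLVM pass: it visits the nodes but does not own them.
struct dependency_checker {
  virtual ~dependency_checker() = default;
  bool run(ast_node *root) {
    for (ast_node *c : root->children) {
      if (!run(c)) return false;  // check the children first
    }
    return check_features(root);  // then this node's feature dependencies
  }
  virtual bool check_features(ast_node *) { return true; }
};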

@jhenin (Member) commented Jul 10, 2024

Thanks @HanatoK , I now understand your point better and I fully agree!

To move the AST out of colvardeps and allow same-level dependencies, we need to deal with the fact that the child objects of a CVC can be CVCs or atom groups, so those should be described by different parts of the feature tree.

@jhenin (Member) commented Aug 26, 2024

@HanatoK - this is a large and disruptive change set. I think there are items in there that we can agree on. Having an AST alongside the colvardeps class is one.

Your initial point 2 was: "Bypass the colvar class and take the cvc objects out to build the AST. I think that the colvar class should be completely removed."
The colvar class is the main workhorse at the moment, but I suppose you mean merging the cvc and colvar classes into one. I generally agree with that idea, although that will give a large class, where only a small subset of features will be used by most instances.

@giacomofiorin (Member):

@jhenin's comment is good. There are multiple small fixes here that are definitely worth integrating, but they are mixed with others that pertain to broader design issues.

I also agree that the AST could be the most immediate goal. We had discussed in the past the possibility of adding such functionality to colvardeps, but it also looked like that would amount to feature creep for that class.

If you are aware of suitable classes in the C++ standard library, can you suggest them?

@HanatoK (Member, Author) commented Sep 25, 2024

@jhenin's comment is good. There are multiple small fixes here that are definitely worth integrating, but they are mixed with others that pertain to broader design issues.

I also agree that the AST could be the most immediate goal. We had discussed in the past the possibility of adding such functionality to colvardeps, but it also looked like that would amount to feature creep for that class.

If you are aware of suitable classes in the C++ standard library, can you suggest them?

As far as I understand, the AST would be constructed in the same way as colvardeps, which has pointers to children and parent nodes. colvardeps itself is actually an AST, but the issue is that it mixes up two goals, namely (i) determining the order of calculations of components and biases (the dependencies between colvardeps and its derived objects), and (ii) checking the feature-level compatibilities/dependencies. In other words, I think it is better to

  • Hold a root node of colvardeps in colvarmodule instead of the separate std::vector<colvar *> and std::vector<colvarbias *>. The order of calculation can then be determined by traversing the tree from the root node, calculating the child objects first and then the parent objects (see the traversal sketch after this list);
  • Make colvardeps focus on goal (i). colvardeps::add_child should just modify children and child->parents and not check the feature-level dependencies, and the same goes for remove_child. A tree of colvardeps* is constructed when reading the config file, but we skip checking the feature-level dependencies;
  • Make an independent feature-level dependency checker that takes a colvardeps* as a parameter. After a tree of colvardeps* is constructed or modified, this checker is expected to traverse the tree and check both the "static" and "dynamic" feature-level dependencies between children and parents;
  • For the feature-level dependencies, it might be better to avoid enable and disable, and derived classes of colvarcomp ought to use provide and require. The dynamic feature dependencies could be determined by the above dependency checker or another separate function that can traverse the tree of colvardeps*.
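
The traversal sketch referenced in the first item above (invented names; only goal (i) is shown): a post-order walk computes child objects before their parents, and a visited flag keeps a reused CVC from being computed twice per step.

#include <vector>

struct dep_node {
  std::vector<dep_node *> children;
  bool visited = false;
  virtual ~dep_node() = default;
  virtual void calc() {}
};

void calc_post_order(dep_node *n) {
  if (n->visited) return;
  n->visited = true;
  for (dep_node *c : n->children) calc_post_order(c);  // children first
  n->calc();                                           // then the parent
}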

I am marking the PR as draft as it seems too disruptive and might need further discussion.

@HanatoK HanatoK marked this pull request as draft September 25, 2024 15:31
@jhenin (Member) commented Sep 27, 2024

@HanatoK I've now looked at this in more detail. This has a lot of merit. What it would need to be merge-ready is at least:

  1. Fewer TODOs and self-described hacks;
  2. passing regression tests (now I see some failures in the NAMD-based library tests);
  3. documentation of the user-facing changes.

One question: can this be done without implementing the AST design you described just above? If yes, I would rather separate these issues as much as possible and merge code that works to avoid letting this branch diverge too much.

Side note: I think you should be able to set your GH account so that the CI actions run in your fork, which would give us CI for this PR.

@HanatoK (Member, Author) commented Sep 27, 2024

1. Fewer TODOs and self-described hacks;

Well, actually this PR was not intended to be merged into the master branch directly. Instead, I developed the code based on #644, generalized the idea of reusable CVCs and tried my best to solve the SMP issue.

2. passing regression tests (now I see some failures in the NAMD-based library tests);

This PR is based on #644, so I don't want it to diverge from the reusable-cvcs branch. The NAMD tests fail because there is no devel branch in NAMD's repository. reusable-cvcs should be rebased first, and then I will rebase this PR.

3. documentation of the user-facing changes.

I tagged this PR with "request for comments" because I didn't think there would be a consensus about how to reuse CVCs, and I opened this PR mainly as a proof of concept to show how to achieve reusable CVCs while retaining the idea of "distributing the CVCs over SMP threads". I will add the documentation once we have a consensus about reusability and parallelization.

One question: can this be done without implementing the AST design you described just above? If yes, I would rather separate these issues as much as possible and merge code that works to avoid letting this branch diverge too much.

I don't think so. It seems to me that @giacomofiorin likes the idea of moving some "special" CVCs out of the SMP loops and computing them serially. However, I think there will be more and more "special" CVCs relying on other CVCs and biases based on other biases in the future, so it is better to take a unified approach. As I have said in #709 (comment), there could be two approaches to parallelization, namely (i) distributing CVCs and biases over SMP threads, and (ii) using SMP threads to parallelize fine-grained loops like projecting hills, rotating atoms, calculating COMs and more. To achieve general reusability with (i), Colvars has to use an AST or something similar to determine the order of calculation of the CVCs. Approach (ii) could be simpler, as it only requires that the CVCs be calculated in the order in which their definitions appear in the config file, which is what PLUMED does, although I suspect that building an AST could still be more useful.

Side note: I think you should be able to set your GH account so that the CI actions run in your fork, which would give us CI for this PR.

It does run in my fork (see https://github.com/HanatoK/colvars/actions/runs/9766071847/job/26958200729).

@giacomofiorin (Member):

I don't think so. It seems to me that @giacomofiorin likes the idea of moving some "special" CVCs out of the SMP loops and computing them serially. However, I think there will be more and more "special" CVCs relying on other CVCs and biases based on other biases in the future, so it is better to take a unified approach.

I totally agree with having a unified approach, which would be a win-win for users and developers alike. But that would take significant effort, during which we would need to keep master release-ready.

Although GROMACS and LAMMPS have somewhat predictable release schedules, they are also not in sync: one in the Winter, one in Summer. I guess this may be related to the times of the year when people prefer to be inside working in Stockholm vs. Albuquerque :-D And NAMD and VMD do not follow a regular schedule.

I do not see the two approaches as antithetical, either: (i) was just easier to implement than (ii), and there was also less pressure to implement (ii) in NAMD, which already provided at least centers of mass computed using domain decomposition.

@HanatoK (Member, Author) commented Sep 27, 2024

I do not see the two approaches as antithetical, either: (i) was just easier to implement than (ii), and there was also less pressure to implement (ii) in NAMD, which already provided at least centers of mass computed using domain decomposition.

My impression is that parallelizing both the calculations of CVs and their inner loops (COM, COG, and rotating positions) could make the code much more complicated. Approach (i) only works well with CPUs, while approach (ii) could be easily ported to GPUs or similar accelerators. Calculating two CVs simultaneously on two "GPU threads" of the same GPU device could kill the performance.

@HanatoK (Member, Author) commented Oct 24, 2024

This reminds me that if we want to compute CVCs in parallel, we may need to lock get_group_force_object() in

colvars/src/colvaratoms.cpp

Lines 1487 to 1489 in 2c6c712

cvm::atom_group::group_force_object cvm::atom_group::get_group_force_object() {
  return cvm::atom_group::group_force_object(this);
}

This is because CVCs A and B may rely on C, and in C the atom group has a fitting group. In such a case, it would be better to implement a lock attached to the atom group: the first CVC (either A or B) accessing the apply_force() of C should obtain the lock, while the second should wait for it.
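
A minimal sketch of such a per-atom-group lock (invented names; not the actual cvm::atom_group API):

#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

// A mutex embedded in each atom group serializes concurrent force
// applications from CVCs that share the group (here, A and B through C).
struct atom_group_like {
  std::mutex force_mutex;
  std::vector<std::array<double, 3>> applied_forces;

  void apply_force(std::size_t i, const std::array<double, 3> &f) {
    std::lock_guard<std::mutex> lock(force_mutex);  // the second caller waits here
    for (int d = 0; d < 3; d++) applied_forces[i][d] += f[d];
  }
};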

@jhenin jhenin changed the title [RFC] More band-aid fixes to Colvars [RFC] Work towards reusable CVCs Oct 24, 2024
@jhenin (Member) commented Dec 11, 2024

How do we move forward on this?
We can either:

* merge this into the reusable-cvcs branch and continue work on [Allow reusing computation and data between CVC objects #644](https://github.com/Colvars/colvars/pull/644)

* merge [Allow reusing computation and data between CVC objects #644](https://github.com/Colvars/colvars/pull/644) and rebase this on master

Either way, this remains a somewhat unwieldy branch but I think parts of it can be extracted and merged separately to make this one easier to follow.

@HanatoK (Member, Author) commented Dec 11, 2024

How do we move forward on this? We can either:

* merge this into the reusable-cvcs branch and continue work on [Allow reusing computation and data between CVC objects #644](https://github.com/Colvars/colvars/pull/644)

* merge [Allow reusing computation and data between CVC objects #644](https://github.com/Colvars/colvars/pull/644) and rebase this on master

Either way, this remains a somewhat unwieldy branch but I think parts of it can be extracted and merged separately to make this one easier to follow.

In my opinion, it really depends on how you and @giacomofiorin want Colvars to use the SMP threads. If you want to keep the current parallelization scheme that distributes CVCs among threads, then after possibly merging #644 I will continue to work on this. If you want to use the threads to parallelize inner loops (COM, COG, and rotating positions), then I could greatly simplify what has to be introduced in this PR, such as the AST and potential locks, and probably abandon this one and work on a new PR. Although I prefer the latter parallelization scheme, it is ultimately up to you and @giacomofiorin to make a final decision. Until I know a concrete answer about the future of SMP threads in Colvars, it is difficult for me to move this PR forward.

@giacomofiorin (Member):

@HanatoK I do not understand why it has to be either one scheme, or the other.

@HanatoK (Member, Author) commented Dec 11, 2024

@HanatoK I do not understand why it has to be either one scheme, or the other.

@giacomofiorin I have commented on this issue (see my comment above or #700 (comment)). For example, if there are 16 threads available, a mixture of both schemes could use 4 threads to parallelize the CVC objects, with each of them using another 4 threads to parallelize the inner loops. As you can imagine, this is complicated and makes load balancing difficult.

@jhenin (Member) commented Dec 11, 2024

You mean doing load balancing internally to Colvars? So far we've let charm++/openMP do the load balancing, and that hasn't been so bad.

@HanatoK (Member, Author) commented Dec 11, 2024

You mean doing load balancing internally to Colvars? So far we've let charm++/openMP do the load balancing, and that hasn't been so bad.

@jhenin Sorry, this might not be a load-balancing problem. Let's say you distribute two threads over two CVCs, for example an RMSD and a distance between two atoms. As you know, the former is computationally expensive, while the latter is cheap. If you simply use CkLoop_Parallelize to distribute the threads, then the thread with the cheap CVC has to wait for the expensive one to complete. In other words, that thread idles for a long CPU time.

@giacomofiorin (Member):

@jhenin Sorry, this might not be a load-balancing problem. Let's say you distribute two threads over two CVCs, for example an RMSD and a distance between two atoms. As you know, the former is computationally expensive, while the latter is cheap. If you simply use CkLoop_Parallelize to distribute the threads, then the thread with the cheap CVC has to wait for the expensive one to complete. In other words, that thread idles for a long CPU time.

This is an excellent point in favor of adding atom-level parallelism, which we currently do not have. However, distributing the CVCs would remain the more efficient approach when there are many of them, such as in path variables or linear combinations.

I appreciate that it would take additional work to combine an existing feature with a new feature, but it's also difficult to accept the argument that we must remove the existing feature to implement the new one.

@HanatoK (Member, Author) commented Dec 11, 2024

@jhenin Sorry, this might not be a load-balancing problem. Let's say you distribute two threads over two CVCs, for example an RMSD and a distance between two atoms. As you know, the former is computationally expensive, while the latter is cheap. If you simply use CkLoop_Parallelize to distribute the threads, then the thread with the cheap CVC has to wait for the expensive one to complete. In other words, that thread idles for a long CPU time.

This is an excellent point in favor of adding atom-level parallelism, which we currently do not have. However, distributing the CVCs would remain the more efficient approach when there are many of them, such as in path variables or linear combinations.

I appreciate that it would take additional work to combine an existing feature with a new feature, but it's also difficult to accept the argument that we must remove the existing feature to implement the new one.

@giacomofiorin In the case of multiple cheap CVs, like hundreds of distances, I think it is possible to implement a vector CVC distances and parallelize the pairwise distances inside it. If you really want to combine both parallelization schemes, a thread pool may be the best approach, but I don't know how that is supposed to work with Charm++. Anyway, could I assume that you will continue with the current parallelization scheme?
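
A minimal sketch of such a vector CVC (hypothetical, not a real Colvars component), with the pairwise loop parallelized internally, here using OpenMP for illustration:

#include <array>
#include <cmath>
#include <vector>

// One "distances" CVC holding many atom pairs: the threads parallelize the
// inner loop instead of being distributed over many scalar CVCs.
struct distances_cvc {
  std::vector<std::array<double, 3>> pos_a, pos_b;  // endpoints of each pair
  std::vector<double> values;

  void calc_value() {
    values.resize(pos_a.size());
#pragma omp parallel for
    for (long i = 0; i < static_cast<long>(pos_a.size()); i++) {
      const double dx = pos_b[i][0] - pos_a[i][0];
      const double dy = pos_b[i][1] - pos_a[i][1];
      const double dz = pos_b[i][2] - pos_a[i][2];
      values[i] = std::sqrt(dx * dx + dy * dy + dz * dz);
    }
  }
};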

@giacomofiorin (Member):

@giacomofiorin In the case of multiple cheap CVs, like hundreds of distances, I think it is possible to implement a vector CVC distances and parallelize the pairwise distances inside it.

Do you mean parallelizing distancePairs or something else?

If you really want to combine both parallelization schemes, a thread pool may be the best approach, but I don't know how that is supposed to work with Charm++.

AFAIK Charm++ should give you support for that. If not, using CKloop_parallelize() on some higher-level function should work.

Anyway, could I assume that you will continue with the current parallelization scheme?

I would only propose not to remove that scheme, because that would result in a regression for several users. But it'd be totally okay to implement atom-level schemes for some specific CVCs, and if we can't make the two schemes work together immediately, have a flag to select one vs. the other?

@HanatoK (Member, Author) commented Dec 11, 2024

Do you mean parallelizing distancePairs or something else?

Yes. For dihedrals we could have something similar, like dihedralTuples.

AFAIK Charm++ should give you support for that. If not, using CKloop_parallelize() on some higher-level function should work.

The problem with CkLoop_Parallelize is that if you use it to run a loop in parallel, I don't know what will happen if an inner function calls CkLoop_Parallelize again.

I would only propose not to remove that scheme, because that would result in a regression for several users. But it'd be totally okay to implement atom-level schemes for some specific CVCs, and if we can't make the two schemes work together immediately, have a flag to select one vs. the other?

OK. I can try my best to see how to adapt this PR to the most complicated case.

@giacomofiorin (Member):

Yes. For dihedrals we could have something similar, like dihedralTuples.

Sounds good. Could perhaps share some code with dihedralPCA?

The problem with CkLoop_Parallelize is that if you use it to run a loop in parallel, I don't know what will happen if an inner function calls CkLoop_Parallelize again.

Sorry, I was imprecise: I mean that we should call CKloop_parallelize only at one level (no recursions).

OK. I can try my best to see how to adapt this PR to the most complicated case.

Feel free to disable the existing SMP if it conflicts with a new improvement, as long as the user can override the default behavior as needed.

Because Colvars has to comply with competing release schedules, we need to implement things incrementally.

@jhenin (Member) commented Dec 11, 2024

Taking a step back, since we are purely talking about optimization here, I'd like to start from use cases. What expensive CVs would users really like to use? I can think of path CVs with many RMSDs (although the fit is now faster thanks to your optimizations @HanatoK ), or generally any CV that is often used with large atom groups. What real-world cases do you have in mind where scaling is limiting?

@HanatoK (Member, Author) commented Dec 12, 2024

Taking a step back, since we are purely talking about optimization here, I'd like to start from use cases. What expensive CVs would users really like to use? I can think of path CVs with many RMSDs (although the fit is now faster thanks to your optimizations @HanatoK ), or generally any CV that is often used with large atom groups. What real-world cases do you have in mind where scaling is limiting?

@jhenin My experience is that one of the bad cases is a mixture of computationally slow and fast CVs. For example, in the BFEE calculation we have to restrain various Euler and positional angles with fitting groups, which are slow, and then sample along the distance, which is fast. Another example is restraining the unfolding direction of GB1, and computing the 2D PMF along radius of gyration and hbonds.

In my opinion, distributing CVCs among threads only benefits cases that use many computationally cheap CVCs. If you want to accelerate the case of multiple RMSDs, I would still suggest parallelizing the inner loops, due to CPU cache locality.

@jhenin (Member) commented Dec 12, 2024

In the cases you mention, it seems there are fewer CVCs in total than cores on a typical CPU, so with our current scheme, the time to complete a Colvars update is the computation time for the most expensive CVC, which is the best we can do until we accelerate individual CVCs.

What would be a clear waste of time is for an expensive CVC to wait for another expensive CVC to return while there are idle threads because cheap CVCs are finished. Isn't avoiding that precisely the job of OpenMP and charm++?

Edit: answering my own question - CkLoop clearly does things in a dynamic way. It seems OpenMP parallel for uses static scheduling by default; however, we could switch to dynamic scheduling to help with this.
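
For illustration, what dynamic scheduling might look like on a per-CVC loop (a sketch with invented names, not the current Colvars code):

#include <vector>

struct cvc_stub { virtual ~cvc_stub() = default; virtual void calc_value() {} };

void calc_all(std::vector<cvc_stub *> &cvcs) {
  // schedule(dynamic, 1): an idle thread grabs the next CVC as soon as it
  // finishes, instead of being pinned by the default static partition.
#pragma omp parallel for schedule(dynamic, 1)
  for (long i = 0; i < static_cast<long>(cvcs.size()); i++) {
    cvcs[i]->calc_value();
  }
}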

@HanatoK (Member, Author) commented Dec 12, 2024

In the cases you mention, it seems there are fewer CVCs in total than cores on a typical CPU, so with our current scheme, the time to complete a Colvars update is the computation time for the most expensive CVC, which is the best we can do until we accelerate individual CVCs.

What would be a clear waste of time is for an expensive CVC to wait for another expensive CVC to return while there are idle threads because cheap CVCs are finished. Isn't avoiding that precisely the job of OpenMP and charm++?

Edit: answering my own question - CkLoop clearly does things in a dynamic way. It seems OpenMP parallel for uses static scheduling by default; however, we could switch to dynamic scheduling to help with this.

I am not quite sure about your situation. If you refer to the case, for example, where there are 2 threads working on 3 CVs, one fast and the other two slow, then I suppose that one of the threads would continue working on a slow one after finishing the fast one, without waiting for the other thread to finish. This should be the case for both OpenMP (there are nice figures for OpenMP in https://ppc.cs.aalto.fi/ch3/for/) and CkLoop.

I think the case where there are more CPU threads than CVs is common, as is the case where there are more CPU threads than biases. It is also better to utilize the CPU threads as much as possible to accelerate the calculation of the biases, for example, projecting new hills in metadynamics.

@HanatoK (Member, Author) commented Dec 12, 2024

I want to emphasize that even in the case of calculating two slow CVs (RMSDs, for example) with two CPU threads, it would be better to parallelize the inner loops instead of distributing the two CPU threads over the CVs separately. In general, CPU cores have a shared L3 cache. Running lstopo on my laptop shows:
[lstopo output: the CPU cores share a single L3 cache]
If you parallelize, for example, the loop calculating the rotated positions from the unrotated frame, then the CPU can guess that you need the whole array of atom positions and fetch it from a block of memory into the L3 cache. However, if you use two threads to compute two CVC objects separately, then there are possibly more cache misses (see also https://stackoverflow.com/a/6331459).

@jhenin (Member) commented Dec 12, 2024

Good point about cache memory. Data from several CVCs is less contiguous and might give more cache misses than data from a single parallel CVC. However, depending on data sizes that are out of our control, all the data for several CVCs might fit into the cache at once, or, on the contrary, the data for a single CVC might not fit. This is very hard to assess without precise benchmarks; I'm beginning to think we need to adopt a set of standard benchmarks.

In the case you describe, with a small number of slow CVCs, parallelizing each CVC over atoms is indeed the way to go. But that has to be implemented separately for each CVC. Arguably, it's not needed for the atom-group-based CVCs, which in NAMD are already taken care of, and which could benefit from similar MPI-parallel mechanisms in other engines. We could establish a list of potentially expensive CVCs that would need this.
