
Poor parallel scaling efficiency due to MPI_gather_all #4

Open
erlendd opened this issue Dec 16, 2016 · 3 comments

@erlendd

erlendd commented Dec 16, 2016

Following an e-mail conversation with Gabor, I'm posting about this issue here. In short, the parallel scaling of pymatnest is relatively poor: performance drops off at a fairly low number of CPU cores.

Taking Archer as an example (Archer is a Cray machine very similar to Titan in the US): with 1152 walkers I see a drop-off in parallel scaling after just 12 cores (half a node), while with 11520 walkers I can only scale up to 48 cores. From looking at the code, it seems very likely that the problem lies with over-use of the MPI_Allgather routine, as this causes a lot of congestion between nodes (on Archer each node has 24 cores). Gabor informed me that he had trouble going beyond 96 cores (4 nodes).

I've posted brief results from my tests on Archer here, with some discussion of the cause (see the pure allgather test towards the end):
https://gist.github.com/erlendd/c236f393ed597187c612599cb472cd4b
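
For context, a minimal mpi4py sketch of what such a pure-allgather timing test looks like is below (an illustration of the pattern only, not the code from the gist):

```python
# Illustrative microbenchmark of the allgather pattern (not the gist's code):
# each rank contributes one double, every rank receives the full vector,
# which is the per-iteration communication pattern discussed above.
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_iter = 10000
send = np.array([float(rank)])            # one "energy" per rank
recv = np.empty(size, dtype=np.float64)   # gathered values on every rank

comm.Barrier()
t0 = time.time()
for _ in range(n_iter):
    comm.Allgather(send, recv)
comm.Barrier()
t1 = time.time()

if rank == 0:
    print(f"{size} ranks: {1e6 * (t1 - t0) / n_iter:.1f} us per allgather")
```

Running something like this on one node and then across several nodes shows the jump in per-call time once the collective has to cross the interconnect, which is the congestion effect described above.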

@noambernstein

Your writeup seems to focus on the number of walkers, but that is irrelevant unless it's less than the number of MPI tasks. More walkers just means more iterations, not more work per iteration, and it's the latter that's parallelized. The main effect on parallel efficiency is that the length of the walk each MPI task does at each iteration is
actual_walk_length =~ number_of_model_calls_expected * n_cull / n_mpi_tasks,
and it must be >= 20 for decent parallel efficiency; otherwise the cost per walk is too small and various other overheads dominate. Also, the shorter the walk, the bigger an issue load balancing becomes.
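
As a back-of-the-envelope check, the rule of thumb above can be written out as follows (a sketch only; the variable names mirror the formula, not pymatnest internals):

```python
# Rule-of-thumb estimate of the per-task walk length (illustrative only;
# names mirror the formula above, not pymatnest internals).
def actual_walk_length(n_model_calls_expected, n_cull, n_mpi_tasks,
                       min_traj_length=8):
    """Approximate walk length each MPI task performs per iteration,
    floored at the length of the shortest move (atomic trajectory)."""
    return max(min_traj_length,
               n_model_calls_expected * n_cull // n_mpi_tasks)

# Example: 600 expected model calls, n_cull = 1, on 24, 48 and 96 tasks.
for n_tasks in (24, 48, 96):
    length = actual_walk_length(600, 1, n_tasks)
    verdict = "OK" if length >= 20 else "too short, overhead dominates"
    print(f"{n_tasks:3d} tasks: walk length ~{length:3d}  ({verdict})")
```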

I couldn't find in your writeup what your number_of_model_calls_expected was, so I can't tell if your parallel scaling is expected or not.

Other points:

  1. There really is a load-balance issue, which I have a patch to address. Note that you can never get below actual_walk_length = the length of the shortest type of move, usually the atomic trajectories, which are typically (in my runs anyway) length 8, so you can never parallelize over more than n_mpi = n_model_calls_expected/8 tasks.
  2. There's no obvious way to avoid doing a collective operation, because you need to get the new maximum energy from all the nodes. There are ways of reducing the amount of data involved (either not applying the allgather to all the energies, just to the relevant ones, or doing a gather to root followed by a broadcast of just the highest energies), but not of getting rid of the operation, at least as far as I've been able to come up with so far. If you have any ideas, I'd be happy to consider them.
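
For concreteness, here is a minimal mpi4py sketch of the two patterns in point 2 (illustrative only, simplified to a single maximum energy; the variable names are mine, not pymatnest's):

```python
# Illustrative comparison of the two collective patterns (not pymatnest code):
# (a) allgather every rank's energies, (b) gather only local maxima to root
# and broadcast the overall maximum back out.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank holds the energies of its local walkers (dummy values here).
local_energies = np.random.rand(10) + rank

# Pattern (a): allgather all energies, every rank finds the maximum itself.
all_energies = np.empty(size * local_energies.size, dtype=np.float64)
comm.Allgather(local_energies, all_energies)
emax_a = all_energies.max()

# Pattern (b): gather only the per-rank maxima to root, broadcast the winner.
local_max = np.array([local_energies.max()])
maxima = np.empty(size, dtype=np.float64) if rank == 0 else None
comm.Gather(local_max, maxima, root=0)
emax_b = comm.bcast(maxima.max() if rank == 0 else None, root=0)

assert np.isclose(emax_a, emax_b)
```

Pattern (b) moves far less data through the network, but it is still a collective every iteration; if only the single maximum were ever needed, an Allreduce with MPI.MAX would be the leanest variant, though as noted above the relevant quantity may be the several highest energies rather than just one.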

@erlendd
Author

erlendd commented Dec 16, 2016

Thanks for the fast reply, again.

Ah, OK, I'd assumed that the parallelism was over walkers, since the number of walkers needs to be an integer multiple of the number of cores. But thinking about the nested sampling algorithm, it makes sense that the walkers are not parallelised, as only one is considered at a time. Why, then, am I seeing improved scaling as the number of walkers is increased in the write-up I posted?

A gather to root followed by a broadcast of the energies would probably improve things. There are also places in the code where an allgather is done on one variable and then on the next; this will usually be slower than merging the data and making a single allgather call.
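
A minimal mpi4py illustration of that merging point (the quantities and names here are hypothetical, not taken from the pymatnest source):

```python
# Illustrative only: packing two per-rank quantities into one buffer so a
# single Allgather replaces two separate collective calls.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

energy = np.array([1.23])   # dummy per-rank values
volume = np.array([4.56])

# Two separate collectives: pays the latency cost twice per iteration.
energies = np.empty(size, dtype=np.float64)
volumes = np.empty(size, dtype=np.float64)
comm.Allgather(energy, energies)
comm.Allgather(volume, volumes)

# Merged version: one packed buffer, one collective, then unpack.
packed = np.concatenate([energy, volume])
gathered = np.empty(size * packed.size, dtype=np.float64)
comm.Allgather(packed, gathered)
energies2, volumes2 = gathered.reshape(size, 2).T
```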

I'll check my input file when I'm back at a computer.

@noambernstein

I'm only aware of a single allgather that happens every iteration, and that's the maximum energy calculation. There are sendrecvs that are used to send cloned configurations around, but I think that's unavoidable given the current architecture (the alternative would be lots of many-to-one between nodes and some sort of root process, which isn't likely to be efficient either). There are also allgathers associated with infrequent things like saving snapshots or trajectories, but unless you prove otherwise I'm going to claim that they don't happen often enough to matter.
