This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

MPAS speedup of model initialisation #1350

Open
wants to merge 3 commits into develop

Conversation

climbfuji (Contributor)

The model initialisation for MPAS can take a long time for large meshes and large numbers of MPI tasks. A detailed profiling exercise has shown that most of this time is spent in two places. The first is the reading of the METIS graph decomposition file by the master MPI task and the scattering of this information to all MPI tasks; this will be dealt with in a separate PR. The second is the setup of the blocks and halos, more precisely the calls to mpas_dmpar_get_exch_list, which perform a large number of calls to mpas_binary_search.
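
To make the hotspot concrete, the following is a minimal sketch, not taken from the MPAS source, of the kind of lookup that mpas_binary_search performs on a sorted integer list; the routine name and arguments are purely illustrative.

```fortran
! A minimal sketch (not the MPAS source) of the kind of lookup that
! mpas_binary_search performs: find "key" in a sorted integer array and
! return its index, or -1 if it is not present. All names are illustrative.
integer function binary_search_sketch(array, n, key) result(idx)
   implicit none
   integer, intent(in) :: n
   integer, intent(in) :: array(n)
   integer, intent(in) :: key
   integer :: lo, hi, mid

   idx = -1
   lo  = 1
   hi  = n
   do while (lo <= hi)
      mid = lo + (hi - lo) / 2
      if (array(mid) == key) then
         idx = mid
         return
      else if (array(mid) < key) then
         lo = mid + 1
      else
         hi = mid - 1
      end if
   end do
end function binary_search_sketch
```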

Adding threading support to the loops calling mpas_binary_search, together with a simple modification of mpas_binary_search itself, can greatly reduce the model initialisation times; this is what this PR addresses, and a hedged sketch of such a threaded lookup loop is shown below. For full details, see the attached PDF document: report_mpas_heinzeller.pdf
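
The sketch below illustrates the general approach, reusing the hypothetical binary_search_sketch from above; it is not the actual mpas_dmpar_get_exch_list code, and all subroutine and variable names are assumptions made for illustration. Since each lookup only reads shared data and writes its own result, the loop over lookups can be distributed across OpenMP threads.

```fortran
! A hedged sketch of threading a loop of independent lookups with OpenMP.
! The subroutine and variable names are illustrative, not the actual
! mpas_dmpar_get_exch_list code.
subroutine lookup_all_sketch(sorted_list, list_size, keys, num_keys, found_index)
   implicit none
   integer, intent(in)  :: list_size, num_keys
   integer, intent(in)  :: sorted_list(list_size), keys(num_keys)
   integer, intent(out) :: found_index(num_keys)
   integer :: i
   integer, external :: binary_search_sketch

   ! Each iteration writes only its own element of found_index, so the
   ! iterations are independent and can safely run in parallel.
   !$omp parallel do default(shared) private(i) schedule(static)
   do i = 1, num_keys
      found_index(i) = binary_search_sketch(sorted_list, list_size, keys(i))
   end do
   !$omp end parallel do
end subroutine lookup_all_sketch
```

Such a loop only runs threaded when the code is built with OpenMP enabled (e.g. -fopenmp with gfortran or -qopenmp with recent Intel compilers); without it, the directives are ignored and the loop runs serially as before.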

Since threading is handled differently across the MPAS cores, it would be great if the maintainers of the different cores could check whether this PR breaks any of their functionality or adversely impacts their runtimes.

@climbfuji (Contributor, Author)

Following up on today's telecon: as described in full detail in the PDF attached to the PR description, the measurements were taken on the SuperMUC HPC system at the Leibniz Supercomputing Centre (LRZ). The nodes are Intel Sandy Bridge nodes with 16 physical cores (32 virtual cores with hyperthreading) and 16 GB of memory per node.

Timings were obtained for different meshes and different numbers of nodes, specifically for the section of the bootstrapping process encompassing the lines from "call mpas_block_creator_build_cell_halos" to "call mpas_block_creator_build_edge_halos", labelled with the MPAS timer "setup blocks and halos" in the attached PDF.

(1) Uniform 2 km mesh with 147 million grid cells on 2048 nodes, with 16 MPI tasks per node and 2 OpenMP threads per MPI task (to make use of hyperthreading): the time for the relevant section of the bootstrapping code decreases from 187 s to 118 s.
(2) Uniform 120 km mesh on a single node, with 16 MPI tasks per node and 2 OpenMP threads per MPI task (to make use of hyperthreading): the time for the relevant section of the bootstrapping code increases from 1 s to 1.5 s.

Generally, the larger the mesh and the higher the degree of parallelisation, the greater the benefit from this PR.

matthewhoffman (Member) left a comment:

We don't use threading in the landice core, so I'm simply approving with the assumption that the cores that do use threading will review it.
