This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

MPAS speedup of model initialisation #1350

Open
wants to merge 3 commits into develop

Conversation

climbfuji (Contributor)

The model initialisation for MPAS can take a long time for large meshes and large numbers of MPI tasks. A detailed profiling exercise has shown that most of this time is spent in two places. The first is the reading of the METIS graph decomposition file by the master MPI task and the scattering of this information to all MPI tasks; this will be dealt with in a separate PR. The second is the setup of the blocks and halos, more precisely the calls to mpas_dmpar_get_exch_list, which perform a large number of calls to mpas_binary_search.
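
To make the hotspot concrete, the following is a minimal sketch, not taken from the MPAS source, of the kind of lookup that mpas_binary_search performs on a sorted integer list; the routine name and arguments are purely illustrative.

```fortran
! A minimal sketch (not the MPAS source) of the kind of lookup that
! mpas_binary_search performs: find "key" in a sorted integer array and
! return its index, or -1 if it is not present. All names are illustrative.
integer function binary_search_sketch(array, n, key) result(idx)
   implicit none
   integer, intent(in) :: n
   integer, intent(in) :: array(n)
   integer, intent(in) :: key
   integer :: lo, hi, mid

   idx = -1
   lo  = 1
   hi  = n
   do while (lo <= hi)
      mid = lo + (hi - lo) / 2
      if (array(mid) == key) then
         idx = mid
         return
      else if (array(mid) < key) then
         lo = mid + 1
      else
         hi = mid - 1
      end if
   end do
end function binary_search_sketch
```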

Adding threading support to the loops calling mpas_binary_search, together with a simple modification of mpas_binary_search itself, can greatly reduce the model initialisation times; this is what this PR addresses, and a hedged sketch of such a threaded lookup loop is shown below. For full details, see the attached PDF document: report_mpas_heinzeller.pdf
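
The sketch below illustrates the general approach, reusing the hypothetical binary_search_sketch from above; it is not the actual mpas_dmpar_get_exch_list code, and all subroutine and variable names are assumptions made for illustration. Since each lookup only reads shared data and writes its own result, the loop over lookups can be distributed across OpenMP threads.

```fortran
! A hedged sketch of threading a loop of independent lookups with OpenMP.
! The subroutine and variable names are illustrative, not the actual
! mpas_dmpar_get_exch_list code.
subroutine lookup_all_sketch(sorted_list, list_size, keys, num_keys, found_index)
   implicit none
   integer, intent(in)  :: list_size, num_keys
   integer, intent(in)  :: sorted_list(list_size), keys(num_keys)
   integer, intent(out) :: found_index(num_keys)
   integer :: i
   integer, external :: binary_search_sketch

   ! Each iteration writes only its own element of found_index, so the
   ! iterations are independent and can safely run in parallel.
   !$omp parallel do default(shared) private(i) schedule(static)
   do i = 1, num_keys
      found_index(i) = binary_search_sketch(sorted_list, list_size, keys(i))
   end do
   !$omp end parallel do
end subroutine lookup_all_sketch
```

Such a loop only runs threaded when the code is built with OpenMP enabled (e.g. -fopenmp with gfortran or -qopenmp with recent Intel compilers); without it, the directives are ignored and the loop runs serially as before.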

Since threading is handled differently across the MPAS cores, it would be great if the maintainers of the different cores could check whether this PR breaks any of their functionality or adversely impacts their runtimes.

@climbfuji (Contributor, Author)

Following up on today's telecon: as described in full detail in the PDF attached to the PR description, the measurements were taken on the SuperMUC HPC system at the Leibniz Supercomputing Centre (LRZ). The nodes are Intel Sandy Bridge nodes with 16 physical cores (32 virtual cores with hyperthreading) and 16 GB of memory per node.

Timings were obtained for different meshes and different numbers of nodes, specifically for the section of the bootstrapping process encompassing the lines from "call mpas_block_creator_build_cell_halos" to "call mpas_block_creator_build_edge_halos", labelled with the MPAS timer "setup blocks and halos" in the attached PDF.

(1) Uniform 2 km mesh with 147 million grid cells on 2048 nodes, with 16 MPI tasks per node and 2 OpenMP threads per MPI task (to make use of hyperthreading): the time for the relevant section of the bootstrapping code decreases from 187 s to 118 s.
(2) Uniform 120 km mesh on a single node, with 16 MPI tasks per node and 2 OpenMP threads per MPI task (to make use of hyperthreading): the time for the relevant section of the bootstrapping code increases from 1 s to 1.5 s.

Generally, the larger the mesh and the higher the degree of parallelisation, the greater the benefit from this PR.

matthewhoffman (Member) left a comment:

We don't use threading in the landice core, so I'm simply approving with the assumption that the cores that do use threading will review it.
