Using Dask distributed may slow down preprocessor steps running before concatenate #2073
Maybe your compute cluster doesn't have Infiniband? Could you have a look at the network interfaces you have available on the compute nodes? You can list them with something like the snippet below.
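A minimal sketch of one way to list the interfaces (the original command is not preserved in this thread; using psutil is an assumption, and running `ip a` directly on the node works just as well):

```python
# Sketch: list network interfaces and whether they are up, using psutil (assumed available).
import psutil

addrs = psutil.net_if_addrs()
for name, stats in psutil.net_if_stats().items():
    addresses = [a.address for a in addrs.get(name, [])]
    print(f"{name}: up={stats.isup}, speed={stats.speed} Mb/s, addresses={addresses}")
```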
It should have Infiniband according to the technical specs, and it shows up in the network interfaces:
But setting either configuration (though ib1 seems to be down) leads to this error:
I guess it would be best to ask the sysadmins. In any case, whether or not the connection between nodes is faster using Infiniband, I think the two-by-two concatenation in the `concatenate` preprocessor step is also part of the problem.
Ethernet networks are a lot slower than Infiniband, so communicating the data will be slower and could explain the performance issue reported here. My recommendation would be to try and get Infiniband configured. Are you running ESMValCore on the head node or on a compute node? If you're running it on the head node, it is possible that it has a different network setup from the compute nodes. In that case, you can configure the network interface separately for the head node which is running the scheduler and for the workers as in this example: dask/dask-jobqueue#382 (comment).
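A minimal sketch of that split set-up with dask-jobqueue, along the lines of the linked comment (the interface names `eth0` and `ib0` and the job sizes are assumptions):

```python
# Sketch: use different network interfaces for the scheduler (on the head/login node)
# and the workers (on the compute nodes). Interface names and sizes are assumptions.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=16,
    memory="64GiB",
    interface="ib0",                          # workers talk to each other over Infiniband
    scheduler_options={"interface": "eth0"},  # scheduler listens on the head node's Ethernet
)
cluster.scale(jobs=2)
```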
not to be laughed at please - but isn't the network card more important than the actual cable/connection type? 😁 Also, we should not assume any Infini or Efini or Fini types of connection in our suggested settings (Efini is a Mazda, not a connection, OK 😁)
Well, I managed to use Infiniband by running the recipe on a compute node instead of on the login node, but there was not much improvement either. In any case, I think there are many issues in the concatenation. The graph in the dashboard looks quite bad when the concatenation starts. For instance, on our side we are realising the data if the cubes overlap (see `esmvalcore/preprocessor/_io.py`, line 533 at commit ac98c69).
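For context, a minimal sketch of lazy versus realised data in Iris (the file name is a placeholder); realising a high-resolution cube pulls the whole array into memory at once, which is what makes this expensive:

```python
# Sketch: lazy vs. realised cube data in Iris. The file path is a placeholder.
import iris

cube = iris.load_cube("tos_Omon_EC-Earth3P-HR_control-1950.nc")
print(cube.has_lazy_data())   # True: the data is still a dask array, nothing is in memory

lazy = cube.core_data()       # returns the dask array without computing it
realised = cube.data          # triggers computation and loads the full array into memory
print(cube.has_lazy_data())   # False: the cube now holds a realised NumPy array
```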
I think it would be worth including the concatenation in the list of performance issues that are pending to be solved.
that's an overlap-gated cube @sloosvel - not the biggest problem you have on your hands, it's usually a few years long at most 😁
Indeed!
In our case we have a cube with 18000 timesteps getting realised...
oh boy, that's one chunky overlap; you guys using millisecond-mean data? 🤣
That is something that would need to be fixed first.
As discussed in past meetings, the main issue with the concatenation of many cubes, especially if they contain HR data, comes from the fact that auxiliary coordinate array values need to be compared to ensure they are equal. This comparison happens sequentially in iris and requires realising the coordinate arrays. I will open a PR with the changes.
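A minimal sketch of the kind of lazy, batched coordinate comparison meant here (the coordinate names and the use of `dask.compute` are assumptions about the approach, not the actual implementation in the PR):

```python
# Sketch: compare auxiliary coordinate arrays of two cubes lazily, so the expensive
# comparisons can be evaluated in one batch (and in parallel on a distributed cluster)
# instead of being realised and compared one by one.
import dask
import dask.array as da

def lazy_coord_checks(cube_a, cube_b, coord_names=("latitude", "longitude")):
    checks = {}
    for name in coord_names:
        points_a = cube_a.coord(name).core_points()  # stays lazy if the coord is lazy
        points_b = cube_b.coord(name).core_points()
        checks[name] = da.allclose(points_a, points_b)
    # A single compute() evaluates all comparisons together instead of sequentially.
    return dask.compute(checks)[0]
```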
Did you also open the iris issue? I couldn't find it yet.
Iris issue opened: SciTools/iris#5750
@sloosvel I'm working on parallel coordinate comparison in Iris in SciTools/iris#5926, would you have time to try it out and provide me with some feedback?
With the great new possibility of configuring Dask distributed introduced in #2049, I tried to run a test recipe using our model's high-res configuration. The model files are split in chunks of one year.
High-res data at the same resolution is also available at DKRZ, for example here:
/work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ/HighResMIP/EC-Earth-Consortium/EC-Earth3P-HR/control-1950/r1i1p2f1/Omon/tos/gn/v20181119
So the preprocessor steps `load`, `check_metadata` and `concatenate` (`fix_file` can be ignored) have to be applied to multiple iris cubes before all the data is gathered in a single cube after concatenation. High-res data on irregular grids also has the extra challenge that the coordinate and bound arrays are quite heavy, and if they get realised multiple times it can make your run hang due to memory issues (see SciTools/iris#5115).

Running a recipe with two variables (2 tasks, without a dask.yml file), 92 years, on a single default node in our infrastructure tends to take in these steps:
Whereas on a SLURMCluster with this configuration (not sure if it's an optimal configuration), using two regular nodes:
And using even more resources, with 4 nodes but keeping the other parameters the same (again, not sure if optimal):
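The configurations used for these runs are not reproduced in this thread; purely as an illustration of the kind of dask-jobqueue set-up being varied here (every number below is an assumption, not the configuration from the test), a 4-node cluster could be started like this:

```python
# Sketch: a SLURMCluster spread over several nodes. All numbers are illustrative only;
# in ESMValCore the equivalent settings would go into its Dask configuration file.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=128,            # cores per SLURM job (one job per node, assuming exclusive nodes)
    processes=16,         # dask worker processes per job
    memory="256GiB",      # memory per job
    walltime="02:00:00",
    interface="ib0",      # assumed Infiniband interface name
)
cluster.scale(jobs=4)     # 4 jobs, i.e. 4 nodes in this sketch
client = Client(cluster)
```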
Our VHR data (which is not available on ESGF) behaves even worse because the files are split in chunks of one month for monthly variables. So you can get stuck concatenating files for 30 min. My guess is that the loop over the cubes maybe does not scale well? All examples are run on nodes that were requested exclusively for the jobs. But I also don't know if the cluster configuration is just plain bad. I tried many other configurations (less memory, more cores, more processes, more nodes, a combination of more everything) and none seemed to do better, though.
I also had to run this with the changes in SciTools/iris#5142, otherwise it would not have been possible to use our default nodes. Requesting higher-memory nodes is not always a straightforward solution because it may leave your jobs in the queue for several days.