Parallelization on linux leads to crash #55

wenzmo · 2024-04-12T12:09:29Z

One of my colleagues works mainly with two functions: ctmm.guess and ctmm.fit.
Because he has a lot of individuals, he tried to parallelize the whole process like this:

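library(ctmm)

# Cluster with 10 workers, registered as the foreach backend
cl <- parallel::makeCluster(10)
doParallel::registerDoParallel(cl)

# Fit a movement model to the i-th individual in `data`
model_fit <- function(i)
{
  # Initial parameter guess (non-interactive), with the error model enabled
  GUESS <- ctmm.guess(data[[i]],
                      CTMM = ctmm(error = TRUE),
                      interactive = FALSE)
  ctmm.select(data[[i]], GUESS) # this function has a built-in parallelization argument
}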

After a certain time, our Linux server with 196 threads crashes because of the heavy core usage.
We found out that makeCluster(10) does not limit the processes to 10 cores.

The problem seems to be hidden in the ctmm.guess function. Inside it is the variogram function, which has the argument 'fast=TRUE'.
I found 'fast=TRUE' in the parallel.R script, which seems to be the script managing the parallelization for the whole package.
This argument says that on a Linux system all logical cores should be used (on Windows the number of cores is set to 1).

There are several problems with this:

I learned never to use all your cores by default! (This advice may be outdated.)
It's not mentioned in the vignettes/readme/description. You have to search for it.
It has a huge impact on how fast your code runs, depending on the operating system.
It is impossible to parallelize with the ctmm.guess function because the system can be overloaded.

What I suggest is to mention this somewhere in the vignettes/readme/description, or to add an argument that lets users select how many cores they want to use. But definitely change the default to 'detectCores() - 1' (or leave even more cores free).

chfleming · 2024-04-24T01:04:04Z

@wenzmo, the fast argument in variogram() is for FFT usage and is unrelated to the fast argument in the parallelization code, which chooses between low-overhead fork and high-overhead socket parallelization. I don't believe this argument can be passed from the former functions to the latter functions.

The detectCores() function is to determine the maximum possible number of cores, for limiting the user's choice and for interpreting negative arguments like cores=-1, which means "all cores but 1".
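As a rough illustration of that convention, here is a minimal sketch of how a cores value could be resolved against the detected maximum. The helper resolve_cores is hypothetical and is not ctmm's actual code; in particular, mapping cores=0 to a single core is an assumption made only for this example.

```r
# Hypothetical helper (not part of ctmm): turn a user-supplied `cores`
# value into a concrete worker count.
resolve_cores <- function(cores = 1, max_cores = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(max_cores) || max_cores < 1) max_cores <- 1
  # Negative values count down from the maximum: -1 means "all cores but 1".
  if (cores < 0) cores <- max_cores + cores
  # Clamp to [1, max_cores] so the request never exceeds the machine.
  # Assumption for this sketch: cores = 0 ends up as 1.
  as.integer(max(1, min(cores, max_cores)))
}

# Example on an 8-core machine (max_cores passed explicitly for clarity)
resolve_cores(-1, max_cores = 8)
resolve_cores(100, max_cores = 8)
```

The maximum is used only as an upper bound here; the default of cores=1 still means no parallelization unless the user asks for it.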

It should be the case that in all ctmm functions, the user must select the number of cores. In parallelized ctmm functions, the default is cores=1, which means no parallelization. The code you have quoted doesn't activate any parallelization in ctmm functions. ctmm.guess is not parallelized. ctmm.select is somewhat parallelized IIRC, but you have to set the cores argument to something other than 1 to activate that. However, if you have multiple datasets to run, then it's better for you, the user, to parallelize at the level of datasets and not within ctmm functions (which is the default usage, as you have quoted).
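A minimal sketch of dataset-level parallelization along those lines, assuming a socket cluster from the parallel package. The wrapper fit_each and its workers argument are hypothetical names introduced here; model_fit mirrors the function quoted in the original post, with cores = 1 written out so that ctmm.select itself stays serial.

```r
library(parallel)

# Hypothetical wrapper: apply `fit_one` to every dataset using exactly
# `workers` R processes, and shut the cluster down afterwards.
fit_each <- function(datasets, fit_one, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, datasets, fit_one)
}

# Per-dataset fit. ctmm is referenced by namespace so that each worker
# can find it without an explicit library() call.
model_fit <- function(d) {
  GUESS <- ctmm::ctmm.guess(d,
                            CTMM = ctmm::ctmm(error = TRUE),
                            interactive = FALSE)
  ctmm::ctmm.select(d, GUESS, cores = 1) # no parallelization inside ctmm
}

# Usage, assuming `data` is a list of telemetry objects:
# FITS <- fit_each(data, model_fit, workers = 10)
```

With this layout the total number of busy processes is bounded by workers, because each worker runs one serial fit at a time.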
