
Add pipeline manager #181

Merged 8 commits into master on Jan 8, 2025
Conversation

xgarrido (Collaborator) commented Jan 2, 2025

This PR adds a pspipe-run binary to ease the sequential execution of the different Python scripts needed when analysing data (such as ACT DR6). The configuration of the pipeline is done via a YAML file. Several examples for ACT DR6 data are provided within the data_analysis/yaml directory.

There are several options in the pipeline.yml file that relate to the Slurm configuration at NERSC. By default, the Python scripts are looked up where pspipe has been installed, but you can set a different path to them with the script_base_dir variable. You can also set the location of the global.dict file and the directory where all the pipeline products will be stored; both parameters can also be set via the command line.
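As a rough illustration, the top of a pipeline.yml could look like the sketch below. script_base_dir and product_dir are the names discussed in this PR; dict_file is an assumed key name for the global.dict location, given here only for illustration.

  # Sketch of the global pipeline.yml options, assuming the key names below.
  script_base_dir: /path/to/my/scripts   # defaults to the pspipe installation location
  dict_file: global.dict                 # assumed key name; can also be set on the command line
  product_dir: products/dr6              # where all pipeline products are stored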

The variables block (see yaml/pipeline_dr6.yml) allows you to overload values from the original dict file without changing the content of this file. This way the dict file always remains the same and the values are only changed at the time of the pipeline execution. The current dict file used by the run is in any case stored within the product_dir directory.
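For example, overriding a couple of dict entries at pipeline execution time could look like the following sketch (lmax and binning_file are assumed dict entries, used purely for illustration):

  variables:
    # Assumed dict entries, for illustration only: each value below overrides
    # the corresponding entry of global.dict for this run, without modifying
    # the dict file itself.
    lmax: 5000
    binning_file: data/binning_dr6.dat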

Finally, you have to define a pipeline section with the needed Python modules. For each module, you can also set different options such as the number of tasks ntasks and the number of CPUs per task cpus_per_task. The script checks whether the module has already been run and, if so, skips it. You can force the re-execution of a module by adding the option force: true at the module level. You can also ask for a minimal amount of time needed to run the module: if the remaining allocation time at NERSC is not enough, the program will tell you to re-allocate time. Here is an example of such a block and its options:

  get_covariance_blocks:
    force: true
    slurm:
      nodes: 2
      ntasks: 8
      cpus_per_task: 64
      minimal_needed_time: 03:00:00

The yaml/pipeline_dust.yml file also shows how to handle different options for the same module name (using a matrix block, in a similar way to what GitHub does for CI).
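As a sketch of the idea only (the module name, the matrix layout and the variable names below are assumptions; the actual syntax is the one used in yaml/pipeline_dust.yml), a matrix could enumerate several variable sets for one module:

  # Illustrative sketch only; see yaml/pipeline_dust.yml for the real syntax.
  get_dust_spectra:                # assumed module name
    matrix:
      - variables:
          dust_amplitude: 0.1      # assumed dict entry
      - variables:
          dust_amplitude: 0.2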

@xgarrido added the enhancement (New feature or request) label on Jan 2, 2025
@thibautlouis merged commit 2827b72 into master on Jan 8, 2025 (3 checks passed)