speeding up/parallelizing make_chains.py with Slurm: NOTE: Error submitting process 'execute_jobs (##)' for execution -- Execution is retried #58

Open
SomePersonSomeWhereInTheWorld opened this issue Apr 30, 2024 · 3 comments

Comments

@SomePersonSomeWhereInTheWorld

I'm trying to help a researcher speed up make_chains.py runs for a mammal. Taking this closed issue on parallelization as inspiration, we'd like to speed up the process on our Slurm cluster, which runs RHEL 8. I started by requesting a node in an interactive srun session with 16 CPUs, using --ntasks and -c. Using --executor local, as suggested in the closed thread, was painfully slow. A user there mentioned --cluster_parameters, but that results in: make_chains.py: error: unrecognized arguments: --cluster_parameters cpus=16

./make_chains.py MesAur_chr_folded mm10  /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit /path/to/me/make_lastz_chains/mm10.2bit --pd test_out_1 -f --chaining_memory 16   --cluster_executor slurm 
# Make Lastz Chains #
Version 2.0.8
Commit: 187e313afc10382fe44c96e47f27c4466d63e114
Branch: main

* found run_lastz.py at /path/to/me/make_lastz_chains/standalone_scripts/run_lastz.py
* found run_lastz_intermediate_layer.py at /path/to/me/make_lastz_chains/standalone_scripts/run_lastz_intermediate_layer.py
* found chain_gap_filler.py at /path/to/me/make_lastz_chains/standalone_scripts/chain_gap_filler.py
* found faToTwoBit at /cluster/opt/lastz/1.04.15/faToTwoBit
* found twoBitToFa at /cluster/opt/lastz/1.04.15/twoBitToFa
* found pslSortAcc at /cluster/opt/lastz/1.04.15/pslSortAcc
* found axtChain at /cluster/opt/lastz/1.04.15/axtChain
* found axtToPsl at /cluster/opt/lastz/1.04.15/axtToPsl
* found chainAntiRepeat at /cluster/opt/lastz/1.04.15/chainAntiRepeat
* found chainMergeSort at /cluster/opt/lastz/1.04.15/chainMergeSort
* found chainCleaner at /cluster/opt/lastz/1.04.15/chainCleaner
* found chainSort at /cluster/opt/lastz/1.04.15/chainSort
* found chainScore at /cluster/opt/lastz/1.04.15/chainScore
* found chainNet at /cluster/opt/lastz/1.04.15/chainNet
* found chainFilter at /cluster/opt/lastz/1.04.15/chainFilter
* found lastz at /cluster/opt/lastz/1.04.15/lastz
* found nextflow at /cluster/opt/nextflow/23.10.1/nextflow
All necessary executables found.
Making chains for /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit and /path/to/me/make_lastz_chains/mm10.2bit files, saving results to /path/to/me/make_lastz_chains/test_out_1
Pipeline started at 2024-04-30 11:24:17.231861
* Setting up genome sequences for target
genomeID: MesAur_chr_folded
input sequence file: /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit
is 2bit: True
planned genome dir location: /path/to/me/make_lastz_chains/test_out_1/target.2bit
Created symlink from /path/to/me/make_lastz_chains/MesAur_chr_folded.2bit to /path/to/me/make_lastz_chains/test_out_1/target.2bit
For MesAur_chr_folded (target) sequence file: /path/to/me/make_lastz_chains/test_out_1/target.2bit; chrom sizes saved to: /path/to/me/make_lastz_chains/test_out_1/target.chrom.sizes
* Setting up genome sequences for query
genomeID: mm10
input sequence file: /path/to/me/make_lastz_chains/mm10.2bit
is 2bit: True
planned genome dir location: /path/to/me/make_lastz_chains/test_out_1/query.2bit
Created symlink from /path/to/me/make_lastz_chains/mm10.2bit to /path/to/me/make_lastz_chains/test_out_1/query.2bit
For mm10 (query) sequence file: /path/to/me/make_lastz_chains/test_out_1/query.2bit; chrom sizes saved to: /path/to/me/make_lastz_chains/test_out_1/query.chrom.sizes

### Partition Step ###

# Partitioning for target
Saving partitions and creating 238 buckets for lastz output
In particular, 19 partitions for bigger chromosomes
And 219 buckets for smaller scaffolds
Saving target partitions to: /path/to/me/make_lastz_chains/test_out_1/target_partitions.txt
# Partitioning for query
Saving partitions and creating 65 buckets for lastz output
In particular, 64 partitions for bigger chromosomes
And 1 buckets for smaller scaffolds
Saving query partitions to: /path/to/me/make_lastz_chains/test_out_1/query_partitions.txt
Num. target partitions: 19
Num. query partitions: 64
Num. lastz jobs: 1216

### Lastz Alignment Step ###

LASTZ: making jobs
LASTZ: saved 15470 jobs to /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_joblist.txt
Parallel manager: pushing job /cluster/opt/nextflow/23.10.1/nextflow /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf --joblist /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_joblist.txt -c /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
N E X T F L O W  ~  version 23.10.1
Launching `/path/to/me/make_lastz_chains/parallelization/execute_joblist.nf` [maniac_thompson] DSL2 - revision: 0483b29723
[84/955b71] process > execute_jobs (27) [  0%] 28 of 3913, failed: 28, retries: 28
[c5/32a7bd] NOTE: Error submitting process 'execute_jobs (18)' for execution -- Execution is retried (1)
[26/dd5dc9] NOTE: Error submitting process 'execute_jobs (4)' for execution -- Execution is retried (1)

May I request assistance here to get the correct syntax?

P.S. I can confirm that the shebang fix suggested in this thread also works to start the sample jobs.
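As a quick sanity check on the partition step in the output above, the reported "Num. lastz jobs" value appears to be the product of the target and query partition counts. The sketch below only re-derives that number from the three log lines. The helper name read_counts is made up for illustration. The log does not say how the 15470 joblist entries or the 3913 Nextflow tasks relate to it, so those are left out.

```python
import re

# Lines copied verbatim from the pipeline output above.
LOG = """\
Num. target partitions: 19
Num. query partitions: 64
Num. lastz jobs: 1216
"""


def read_counts(text):
    """Return {label: int} for every 'Num. <label>: <n>' line (hypothetical helper)."""
    pattern = re.compile(r"^Num\. (.+?): (\d+)$", re.MULTILINE)
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(text)}


counts = read_counts(LOG)

# Assumption: every target partition is paired with every query partition.
expected = counts["target partitions"] * counts["query partitions"]

print(expected, expected == counts["lastz jobs"])  # prints: 1216 True
```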

@MichaelHiller
Collaborator

@kirilenkobm Could you pls have a look if the --cluster_parameters is a retired parameter?
Thx

@SomePersonSomeWhereInTheWorld
Author

Here is the top part of the .nextflow.log. Is there another option I need to use?

Apr-30 12:32:43.670 [main] DEBUG nextflow.cli.Launcher - $> nextflow /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf --joblist /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_joblist.txt -c /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
Apr-30 12:32:43.723 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 23.10.1
Apr-30 12:32:43.740 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/cluster/home/me/.nextflow/plugins; core-plugins: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
Apr-30 12:32:43.749 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Apr-30 12:32:43.750 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Apr-30 12:32:43.752 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Apr-30 12:32:43.766 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Apr-30 12:32:43.784 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
Apr-30 12:32:43.785 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/lastz_config.nf
Apr-30 12:32:43.804 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-30 12:32:44.203 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 from script declararion
Apr-30 12:32:44.218 [main] INFO  nextflow.cli.CmdRun - Launching `/path/to/me/make_lastz_chains/parallelization/execute_joblist.nf` [ridiculous_mcnulty] DSL2 - revision: 0483b29723
Apr-30 12:32:44.219 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Apr-30 12:32:44.219 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[]
Apr-30 12:32:44.227 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /cluster/home/me/.nextflow/secrets/store.json
Apr-30 12:32:44.230 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@10f7c76] - activable => nextflow.secret.LocalSecretsProvider@10f7c76
Apr-30 12:32:44.275 [main] DEBUG nextflow.Session - Session UUID: 5fae7dbe-8c74-4805-926b-aa6223f5ae87
Apr-30 12:32:44.275 [main] DEBUG nextflow.Session - Run name: ridiculous_mcnulty
Apr-30 12:32:44.276 [main] DEBUG nextflow.Session - Executor pool size: 24
Apr-30 12:32:44.282 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
Apr-30 12:32:44.285 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=72; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Apr-30 12:32:44.382 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 23.10.1 build 5891
  Created: 12-01-2024 22:01 UTC (17:01 EDT)
  System: Linux 4.18.0-193.el8.x86_64
  Runtime: Groovy 3.0.19 on Java HotSpot(TM) 64-Bit Server VM 20.0.1+9-29
  Encoding: UTF-8 (UTF-8)
  Process: 322176@g261 [10.197.17.16]
  CPUs: 24 - Mem: 50 GB (47.8 GB) - Swap: 0 (0)
Apr-30 12:32:44.424 [main] DEBUG nextflow.Session - Work-dir: /path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work [lustre]
Apr-30 12:32:44.424 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /path/to/me/make_lastz_chains/parallelization/bin
Apr-30 12:32:44.434 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[]
Apr-30 12:32:44.442 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Apr-30 12:32:44.458 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Apr-30 12:32:44.468 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 25; maxThreads: 1000
Apr-30 12:32:44.551 [main] DEBUG nextflow.Session - Session start
Apr-30 12:32:44.692 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-30 12:32:44.801 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Apr-30 12:32:44.802 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Apr-30 12:32:44.809 [main] DEBUG nextflow.executor.Executor - [warm up] executor > slurm
Apr-30 12:32:44.814 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'slurm' > capacity: 1000; pollInterval: 5s; dumpInterval: 5m 
Apr-30 12:32:44.816 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: slurm)
Apr-30 12:32:44.817 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
Apr-30 12:32:44.869 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: execute_jobs
Apr-30 12:32:44.869 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Apr-30 12:32:44.874 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > execute_jobs
Apr-30 12:32:44.874 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_f6c411a586096bcb: /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf
Apr-30 12:32:44.874 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination 
Apr-30 12:32:44.874 [main] DEBUG nextflow.Session - Session await
Apr-30 12:32:45.049 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=execute_jobs (5); work-dir=/path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work/4e/e79fd76c40079431a8db2ed4875930
  error [nextflow.exception.ProcessFailedException]: Error submitting process 'execute_jobs (5)' for execution
Apr-30 12:32:45.057 [Task submitter] INFO  nextflow.processor.TaskProcessor - [4e/e79fd7] NOTE: Error submitting process 'execute_jobs (5)' for execution -- Execution is retried (1)
Apr-30 12:32:45.091 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=execute_jobs (1); work-dir=/path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work/3d/0dd6dde2634f3b7abf79184db82243
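The excerpt stops before any underlying cause is shown, so one way to dig further may be to pull the work directory of each failing task out of the log and look at what Nextflow wrote there. The sketch below is only that: it parses the "task: name=...; work-dir=..." lines shown above and lists files that Nextflow normally places in a task directory (.command.run, .command.sh, .command.log). Whether those files exist when submission itself fails is an assumption. The function name failed_tasks is hypothetical.

```python
import re

# Record copied from the .nextflow.log excerpt above.
LOG = """\
Apr-30 12:32:45.049 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=execute_jobs (5); work-dir=/path/to/me/make_lastz_chains/test_out_1/temp_lastz_run/work/4e/e79fd76c40079431a8db2ed4875930
  error [nextflow.exception.ProcessFailedException]: Error submitting process 'execute_jobs (5)' for execution
"""

TASK_RE = re.compile(r"task: name=(?P<name>.+?); work-dir=(?P<workdir>\S+)")


def failed_tasks(log_text):
    """Return (task name, work dir) pairs found in the log text (hypothetical helper)."""
    return [(m.group("name"), m.group("workdir")) for m in TASK_RE.finditer(log_text)]


for name, workdir in failed_tasks(LOG):
    print(name)
    # Files Nextflow usually writes per task; they may be missing
    # if the failure happened before the task directory was populated.
    for fname in (".command.run", ".command.sh", ".command.log"):
        print("  inspect:", f"{workdir}/{fname}")
```

For a real run, the LOG string would be replaced by the contents of the .nextflow.log file in the directory from which Nextflow was launched.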

@SomePersonSomeWhereInTheWorld
Author

@MichaelHiller am I understanding the documentation correctly?

To run the pipeline on a Slurm cluster, for instance, add the --executor slurm option. Refer to the Nextflow documentation for a list of supported executors.

The Nextflow Slurm page says:

To enable the SLURM executor, set process.executor = 'slurm' in the nextflow.config file.
Resource requests and other job characteristics can be controlled via the following process directives:
clusterOptions
cpus
memory
queue
time

I know --cluster_executor slurm works. So, within an interactive or non-interactive (i.e., sbatch) job where --ntasks is specified, does make_chains.py honor the Slurm options described in the Nextflow docs?

FWIW, I do not see --cluster_parameters in the "Full list of the pipeline CLI parameters".
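To make the question concrete, here is a rough sketch of what the directives quoted from the Nextflow docs would look like in a config file, generated from Python. All values (1 CPU, 16 GB, 24h, the queue name "general") are placeholders, and slurm_process_config is a hypothetical helper. Nothing in this thread confirms that make_chains.py merges such settings into the lastz_config.nf it generates. Note also that with the slurm executor Nextflow submits each task as its own job, so the --ntasks or -c values of the surrounding srun/sbatch session presumably do not apply to those jobs.

```python
def slurm_process_config(cpus, memory, time, queue=None, cluster_options=None):
    """Render a Nextflow `process { ... }` config scope for the SLURM executor.

    Hypothetical helper: it only builds text using the directive names
    quoted from the Nextflow docs above.
    """
    lines = [
        "process {",
        "    executor = 'slurm'",
        f"    cpus = {cpus}",
        f"    memory = '{memory}'",
        f"    time = '{time}'",
    ]
    if queue:
        lines.append(f"    queue = '{queue}'")
    if cluster_options:
        lines.append(f"    clusterOptions = '{cluster_options}'")
    lines.append("}")
    return "\n".join(lines)


# Placeholder values, not recommendations for this pipeline.
config_text = slurm_process_config(cpus=1, memory="16 GB", time="24h", queue="general")
print(config_text)
```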
