
The number of tasks submitted by SLURM exceeded the limit #64

Open
aaannaw opened this issue Jul 11, 2024 · 16 comments

@aaannaw

aaannaw commented Jul 11, 2024

Hello, professor
I was running the pipeline to align my genome assembly against the mm10 genome via SLURM: ./make_chains.py target query mm10.fasta Bsu.softmask.fasta --pd mm-Bsu -f --chaining_memory 30 --cluster_queue pNormal --executor slurm --nextflow_executable /data/01/user157/software/bin/nextflow and I encountered an error a few minutes after running the command:

[fe/5bafab] NOTE: Error submitting process 'execute_jobs (206)' for execution -- Execution is retried (3)
[ff/d8223b] NOTE: Error submitting process 'execute_jobs (212)' for execution -- Execution is retried (3)
[4a/34ad45] NOTE: Error submitting process 'execute_jobs (209)' for execution -- Execution is retried (3)
ERROR ~ Error executing process > 'execute_jobs (91)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:
  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: QOSMaxSubmitJobPerUserLimit
  sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Work dir:
  /data/01/p1/user157/software/make_lastz_chains/mm-Bsu/temp_lastz_run/work/23/a09dba9e82d536f1f39b26de92d7d0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

The error occurs because our server limits the maximum number of submitted jobs per user to 100, and I found that the default chunk size generates 1955 jobs, which is well over that limit.
Thus, I attempted to increase the chunk size like this: ./make_chains.py target query mm10.fasta Bsu.softmask.fasta --pd mm-Bsu -f --chaining_memory 30 --cluster_queue pNormal --executor slurm --nextflow_executable /data/01/user157/software/bin/nextflow --seq1_chunk 500000000 --seq2_chunk 500000000. However, this still generated 270 jobs, as shown below.
This is surprising. I checked and found that when there are many scaffolds, at most 100 scaffolds are put into one chunk for comparison, even if they do not add up to the chunk size. I do not know what is going on here.
In any case, I think there should be a way, without increasing the chunk size (as I understand it, a larger chunk size would increase the runtime), to run multiple command lines per submitted task, so that the 1955 commands could be completed with fewer than 100 submitted jobs.
Looking forward to your suggestions!
Best wishes!
Na Wan

@MichaelHiller
Collaborator

100 jobs per user is very very restrictive. I typically submit a few thousand jobs.

To get the number down to less than 100, you likely have to further increase the chunk size AND the seq limit parameter (the number of scaffolds that can be bundled into one job). Hope that helps.
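
For illustration, something along these lines might bring the job count below 100 (just a sketch; the --seq1_limit/--seq2_limit flag names are assumptions here, so please check ./make_chains.py --help for the exact names in your version):

# sketch only: larger chunks plus a higher per-chunk scaffold limit
# --seq1_limit/--seq2_limit are assumed flag names -- verify with --help
./make_chains.py target query mm10.fasta Bsu.softmask.fasta \
    --pd mm-Bsu -f --chaining_memory 30 \
    --cluster_queue pNormal --executor slurm \
    --seq1_chunk 500000000 --seq2_chunk 500000000 \
    --seq1_limit 5000 --seq2_limit 5000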

@ohdongha

Pardon me for piggybacking on this issue.

100 jobs per user is very very restrictive. I typically submit a few thousand jobs.

@MichaelHiller In the legacy version (v1.0.0), there was a parameter EXECUTOR_QUEUESIZE, which I believe could limit the number of jobs submitted at once:

  --executor_queuesize EXECUTOR_QUEUESIZE
                        Controls NextFlow queueSize parameter: maximal number of
                        jobs in the queue (default 2000)

I realize that v.2.0.8 does not have this parameter. Was there a reason to remove this parameter?

It would be convenient to have a parameter to limit the number of jobs submitted at once. We could create several thousand jobs and let them run 100 at a time. It would take time, for sure, but we would not need to worry about the job-submission limit.

@ohdongha

ohdongha commented Jul 11, 2024

One thing we could try is to add a generic NextFlow config file with executor.queueSize set to 100 (for slurm in this case).
https://www.nextflow.io/docs/latest/config.html#configuration-file

Perhaps the parameter can be added to $HOME/.nextflow/config so that all NextFlow processes can use it (unless overridden by another config file or arguments with higher priority).
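
A minimal sketch of what that file could contain (assuming the per-executor $slurm scope from the Nextflow config docs; adjust the value to your site's limit):

// $HOME/.nextflow/config -- caps how many jobs Nextflow keeps submitted to SLURM at once
executor {
    $slurm {
        queueSize = 100
    }
}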

@aaannaw
Author

aaannaw commented Jul 11, 2024

One thing we could try is to add a generic NextFlow config file with executor.queueSize set to 100 (for slurm in this case). https://www.nextflow.io/docs/latest/config.html#configuration-file

Perhaps the parameter can be added to $HOME/.nextflow/config so that all NextFlow processes can use it (unless overridden by another config file or arguments with higher priority).

Some genomes (e.g. GCF_001194135.2) have large numbers of very small scaffolds. It could help to remove all scaffolds <2Kb or <1.5Kb since I am not sure if we will get anything useful from aligning them.

Hello, @ohdongha
Is this advice meant for me? However, I cannot find a config file in the .nextflow directory, so I created one and edited it as follows:

executor {
  name = 'slurm'
  queueSize = 100  // Set your desired queue size here
}

However, I still get the same error (sbatch: error: QOSMaxSubmitJobPerUserLimit), even though I have run "source ~/.zshrc".
Could you give me any suggestions?

@MichaelHiller
Collaborator

Sorry, I am not so familiar with NextFlow, but the queueSize parameter could be a good idea.
@kirilenkobm Could you please comment on why this was removed? Maybe it is no longer compatible with the newer NextFlow version that we had to update to?

@aaannaw
Author

aaannaw commented Jul 11, 2024

I attempted to run the older version (v1.0.0) to solve the problem with ./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def --chaining_memory 70 --executor_partition pNormal --executor slurm --executor_queuesize 100. However, I again got an error:

N E X T F L O W  ~  version 23.10.1
Nextflow DSL1 is no longer supported — Update your script to DSL2, or use Nextflow 22.10.x or earlier
/data/00/user/user157/miniconda3/lib/python3.9/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline lastz_targetquery failed! Execute function returns 1.
  warnings.warn(msg)
Uncaught exception from user code:
        Command failed:
        /data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/TEMP_run.lastz/doClusterRun.sh                                              
        HgAutomate::run("/data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/T"...) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/HgRemoteScript.pm line 117
        HgRemoteScript::execute(HgRemoteScript=HASH(0x55b2dac4aa10)) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 423
        main::doLastzClusterRun() called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/HgStepManager.pm line 169        
        HgStepManager::execute(HgStepManager=HASH(0x55b2dad4d188)) called at /data/01/p1/user157/software/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 877
Error!!! Output file /data/01/p1/user157/software/make_lastz_chains-1.0.0/mm-Hgl/target.query.allfilled.chain.gz not found!                     
The pipeline crashed. Please contact developers by creating an issue at:                                                                        
https://github.com/hillerlab/make_lastz_chains

@ohdongha

ohdongha commented Jul 11, 2024

N E X T F L O W  ~  version 23.10.1
Nextflow DSL1 is no longer supported — Update your script to DSL2, or use Nextflow 22.10.x or earlier

@aaannaw For this, one workaround that worked for me was to set this (as a global environment variable on the node where you run nextflow) when running make_lastz_chains v1:

export NXF_VER=22.10.0 

Note: I am not sure if this worked for me because I installed an older version of nextflow first and then updated it with nextflow self-update. After updating, I had the same error as you when running the legacy make_lastz_chains. Setting the variable above solved the problem for me.

Note 2: it may work as long as the node can download the jar for the older nextflow version: nextflow-io/nextflow#1613

@aaannaw
Author

aaannaw commented Jul 12, 2024

@ohdongha
I installed nextflow v22.10.8, and the --executor_queuesize parameter is available.

./make_chains.py -h                                                                                                                           
usage: make_chains.py [-h] [--project_dir PROJECT_DIR] [--DEF DEF] [--force_def] [--continue_arg CONTINUE_ARG] [--executor EXECUTOR]
                      [--executor_queuesize EXECUTOR_QUEUESIZE] [--executor_partition EXECUTOR_PARTITION]
                      [--cluster_parameters CLUSTER_PARAMETERS] [--lastz LASTZ] [--seq1_chunk SEQ1_CHUNK] [--seq2_chunk SEQ2_CHUNK]
                      [--blastz_h BLASTZ_H] [--blastz_y BLASTZ_Y] [--blastz_l BLASTZ_L] [--blastz_k BLASTZ_K]                                   
                      [--fill_prepare_memory FILL_PREPARE_MEMORY] [--chaining_memory CHAINING_MEMORY] [--chain_clean_memory CHAIN_CLEAN_MEMORY]
                      target_name query_name target_genome query_genome

Now I run the pipeline with the command:
./make_chains.py target query mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --force_def --chaining_memory 70 --executor_partition pNormal --executor slurm --executor_queuesize 100
However, I get a similar error:

executor >  slurm (100)
[87/924a24] process > execute_jobs (100) [  0%] 0 of 1496

executor >  slurm (100)
[87/924a24] process > execute_jobs (100) [  0%] 0 of 1496
WARN: [SLURM] queue (pNormal) status cannot be fetched
- cmd executed: squeue --noheader -o %i %t -t all -p pNormal -u user157
- exit status : 1
- output      :
  slurm_load_jobs error: Unexpected message received


executor >  slurm (100)
[87/924a24] process > execute_jobs (100) [  0%] 0 of 1496
WARN: [SLURM] queue (pNormal) status cannot be fetched
- cmd executed: squeue --noheader -o %i %t -t all -p pNormal -u user157
- exit status : 1
- output      :
- slurm_load_jobs error: Unexpected message received

It displays "[SLURM] queue (pNormal) status cannot be fetched", but the pNormal partition name is correct:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
pLiu         up   infinite      1  down* lz17
pLiu         up   infinite     11    mix lz[00,02-09,18-19]
pLiu         up   infinite      1  alloc lz20
pNormal*     up   infinite      3   drng lz[32,35,40]
pNormal*     up   infinite     11    mix lz[25-28,30,33-34,36-39]
pNormal*     up   infinite      1   down lz29
pBig         up   infinite      2    mix lz[10,31]
pBig         up   infinite      1   idle lz11

The log file is attached. Could you give me any suggestions?

1.make_chains.log

@MichaelHiller
Collaborator

This is likely an issue with your cluster. Can you test submitting any other jobs via Nextflow?
The error message is unfortunately completely useless.
Maybe @kirilenkobm can have a look?

@aaannaw
Author

aaannaw commented Jul 12, 2024

@MichaelHiller
As the log file exceeded the size limit, I only included its first 1000 lines above. At the end of the log file, the error shown in the screenshot below is reported; I doubt that this parameter actually limits the number of submitted tasks to 100.
[screenshot]

This is likely an issue with your cluster. Can you test submitting any other jobs via Nextflow? The error message is unfortunately completely useless. Maybe @kirilenkobm can have look?

@ohdongha

@ohdongha I installed nextflow v22.10.8, and the --executor_queuesize parameter is available.

@aaannaw If --executor slurm does not work, perhaps you could try this (see also #60 (comment)): just submit the entire run to a single computing node with many CPU cores (threads) and set --executor local --executor_queuesize N where N is the number of CPU cores in that single node. I typically add --chaining_memory 200000 (for larger genomes) and ask for a node with >=32 cores and >=200GB RAM. It could take a day or two in wall clock time for larger genomes.
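
As a rough sketch of what that looks like on SLURM (the resource numbers are only placeholders, and --executor_queuesize is the legacy v1.0.0 flag name, so check --help for the version you actually run):

# sketch: reserve one whole node, then let Nextflow's local executor use its cores
# (N = 40 here, matching the node's core count)
sbatch -p pNormal -c 40 --mem=200G --wrap "./make_chains.py target query \
  mm10.fasta 1.Hgl.softmask.fasta --pd mm-Hgl --chaining_memory 70 \
  --executor local --executor_queuesize 40"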

I also plan to try running v.2.0.8 on multiple computing nodes on our HPC system, submitting jobs from the login (head) node that has permission to do so. I will see how it goes.

@aaannaw
Author

aaannaw commented Jul 14, 2024

@ohdongha
Sorry for the delayed response. I have submitted the entire run to a single computing node with 40 CPUs. On our server, 40 is the number of CPU cores per node.
[screenshot]

However, it seems that the run is not using all CPUs in parallel.
[screenshot]

After running for 39 hours, only 14% of the process has finished, as shown in make_chains.log.
[screenshot]

@ohdongha

ohdongha commented Jul 14, 2024

@aaannaw

However, it seems that the run is not using all CPUs in parallel.
After running for 39 hours, only 14% of the process has finished, as shown in make_chains.log.

For the parallel run, you may need to check the wall time and CPU time if your system reports them after the job is done.

In my case, a recent alignment of human vs. Chinese hamster, for example, took 21.7 hours in wall time and 506.0 hours in CPU time, which means (506/21.7=) 23.3 CPU cores have been used on average. I asked for a node with 32 CPUs for this run. I guess the ratio was not closer to 32 because, after the first lastz step, other steps may have run as fewer parallel jobs or even a single job (e.g., the cleanChain step).

You may want to check this ratio first, perhaps using a smaller genome pair that creates fewer lastz jobs (but more than 40).
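
If SLURM accounting is enabled on your cluster, a generic sacct call like the one below (nothing specific to this pipeline) reports both numbers for a finished job:

# Elapsed = wall time, TotalCPU = CPU time summed over cores;
# TotalCPU / Elapsed approximates the average number of cores in use
sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,AllocCPUS,State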

If the run is slow, you may also want to check whether the two genomes have been masked enough. Michael always emphasizes using RepeatModeler + RepeatMasker. Masking further with windowmasker may also help. Repeats that have escaped the masking step will increase the runtime and generate a lot of short and useless "pile-up" alignments.

@aaannaw
Author

aaannaw commented Jul 15, 2024

@ohdongha
I am sure that our genomes are masked with RepeatMasker, RepeatModeler, TRF, and LTR annotation. I am trying to determine whether the provided mm10 genome needs additional repeat masking.
[screenshot]

@aaannaw
Author

aaannaw commented Jul 15, 2024

@ohdongha
Perhaps the required input is a hard-masked file, but my input is a soft-masked file.

@ohdongha

ohdongha commented Jul 15, 2024

@aaannaw

Perhaps the required input is a hard-masked file, but my input is a soft-masked file.

Soft-masked fasta files should be fine (and soft-masking is perhaps needed for the fillChain step). I use soft-masked genomes, and I see a substantial reduction in runtime and in the number of chains (and very often only a slight reduction in alignment coverage of CDS, etc.) when I apply more aggressive masking (with windowmasker).

I checked the UCSC mm10 (fasta), and it has ~43.9% of all nucleotides soft-masked. That is close to what I have previously used for mouse GRCm38 (~44.5% masked by windowmasker with the -t_thres parameter set to the equivalent of 97.5%).
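
For reference, the windowmasker run I mean is the standard two-pass NCBI invocation, roughly along these lines (a sketch only; see windowmasker -help for where a -t_thres override applies):

# pass 1: collect genome-specific k-mer counts
windowmasker -mk_counts -in genome.softmask.fa -out genome.counts
# pass 2: soft-mask using those counts (plus DUST for low-complexity regions);
# the -t_thres threshold mentioned above can be tuned here if needed
windowmasker -ustat genome.counts -in genome.softmask.fa \
    -outfmt fasta -dust true -out genome.wm.fa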

It is hard to know if the slow progress is due to repeats or the SLURM node not firing up all gears. I guess some tests, e.g., aligning a smaller genome pair, may help.
