This repository has been archived by the owner on May 22, 2023. It is now read-only.
Generally useful.
https://www.nextflow.io/docs/latest/index.html
Nextflow documentation
https://gitter.im/nextflow-io/nextflow
a chat channel with a lot of Nextflow developers, most importantly Paolo Di Tommaso.
https://github.com/nextflow-io/patterns
Nextflow patterns
http://scratchy.internal.sanger.ac.uk/index.php/IRODS_for_Sequencing_Users
IRODS information
http://mediawiki.internal.sanger.ac.uk/index.php/Using_the_Farm
Using the farm
TODO:
once the pipeline is more stable, we should stop pulling a branch and
pull a tag instead, i.e. a frozen moment in time. This will help with
reproducibility.
Steps to run pipeline.
- get interactive farm node.
- start screen session.
    screen -S name
- start conda environment
    source activate rnaseq1.6    (conda)
- make appropriate directory in /lustre/scratch117/cellgen/cellgeni
    mkdir stijn/test/foo    # for testing
    mkdir tic-99            # for non-datasharing ticket
  For datasharing use the datasharing toplevel directory
    mkdir datasharing/tic-74
  We will keep adding toplevel names
- The main requirement at the moment: a samples file AND a study ID,
so --samplefile FNAME --studyid STUDYID. IRODS is queried using both the
sample names in the samples file FNAME and the study ID. The
pipeline can also be run on a set of fastq files using --fastqdir DIRNAME;
in that case a sample file still needs to be given, and for each SNAME in
the sample file the pipeline expects files SNAME_1.fastq.gz and
SNAME_2.fastq.gz to be present in the fastq directory. With the parameter
--singleend true, the pipeline expects SNAME.fastq.gz to be present.
We can get sample names either from the user, or from a study ID. For the
latter, use myrods-study.sh in our 'utils' repository; this may become a
submodule of some of our other repositories (outstanding task).
myrods-study.sh <STUDYID>
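The fastq naming convention above can be checked before launching anything.
This is a minimal sketch, not part of the pipeline: check_fastqs is a
hypothetical helper, and it assumes the sample file holds one sample name per
line.

```shell
# check_fastqs SAMPLEFILE FASTQDIR [SINGLEEND]
# Hypothetical helper. Assumes SAMPLEFILE has one sample name per line.
# Returns 0 if every expected fastq file exists, non-zero otherwise.
check_fastqs() {
    local samplefile="$1" fastqdir="$2" singleend="${3:-false}"
    local suffixes="_1.fastq.gz _2.fastq.gz" missing=0 sname suffix
    if [ "$singleend" = "true" ]; then
        suffixes=".fastq.gz"          # mirrors --singleend true
    fi
    while IFS= read -r sname || [ -n "$sname" ]; do
        [ -z "$sname" ] && continue   # skip blank lines
        for suffix in $suffixes; do
            if [ ! -f "$fastqdir/$sname$suffix" ]; then
                echo "missing: $fastqdir/$sname$suffix" >&2
                missing=$((missing + 1))
            fi
        done
    done < "$samplefile"
    [ "$missing" -eq 0 ]
}
```

Usage would be e.g. 'check_fastqs samples.txt fastq' for paired-end data, or
'check_fastqs samples.txt fastq true' for single-end data.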
=====================================================================
I recommend storing the nextflow invocation in a file that
is sourced or is a bash script to be run. I customarily call this file RESUME.
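A RESUME file could look roughly like the sketch below. All values are
placeholders: the repository name is guessed from the checkout path used
elsewhere in these notes, the branch and study ID are hypothetical, and
-resume is the standard 'nextflow run' option for reusing cached results
(whether every RESUME file should carry it is an assumption). By default the
sketch only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/bash
# RESUME: keeps the full Nextflow invocation for this run directory in one place.
# All values are placeholders (assumptions), not taken from a real run.
REPO=cellgeni/rnaseq-noqc     # assumed repository name
BRANCH=master                 # hypothetical branch
SAMPLEFILE=samples.txt        # hypothetical samples file
STUDYID=1234                  # hypothetical study ID

run_pipeline() {
    # Nextflow options take a single dash, pipeline parameters a double dash.
    set -- nextflow run "$REPO" -r "$BRANCH" -resume \
        --samplefile "$SAMPLEFILE" --studyid "$STUDYID"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"             # dry run: show the command only
    else
        "$@"                  # real run
    fi
}

run_pipeline
```

Keeping the invocation in a file like this means the exact parameters of a run
sit next to its results.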
Anything that is set either in the nextflow script as a member of the
toplevel 'params' variable, or loaded from a config file, can be
overridden.
For simple toplevel parameters (e.g. params.somevalue = "foo") this
can be done on the command line.
params.somevalue = "foo" # e.g. in main.nf
--somevalue foo # command line.
For more elaborate config settings determining 'params' or 'process' this
can be done by loading a config file on the command line using -c
MYCONFIGFILE (the file uses Nextflow's own config syntax, not JSON). I would
describe the process as similar to a cascading style sheet; Nextflow merges
all its config files, but has a hierarchy of precedence. This is the
documentation:
https://www.nextflow.io/docs/latest/config.html
The hierarchy is this (not clearly stated in the documentation - need
to verify this):
    -c <config file>                            overrides
    nextflow.config in current directory        overrides
    nextflow.config in script base directory    overrides
    $HOME/.nextflow/config
Use -C <config file> to only use that config file and no other. In our
current set-up this is unlikely to be of use.
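As an illustration of an override file passed with -c, the sketch below writes
a small file in Nextflow's config syntax. The file name my.config, the queue
name and the memory value are assumptions chosen for the example;
params.somevalue is the toy parameter used above.

```shell
# Write a small override config. All values are illustrative assumptions.
cat > my.config <<'EOF'
params {
    somevalue = "bar"       // overrides params.somevalue = "foo" from main.nf
}
process {
    queue  = 'normal'       // assumed LSF queue name
    memory = '8 GB'         // assumed default memory request per job
}
EOF

# It would then be passed on the command line, e.g.
#   nextflow run <repo> -r <branch> -c my.config
```

A value given directly on the command line (--somevalue baz) would still take
precedence over the one in my.config.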
- nextflow pull <repo> -r <branch>
nextflow keeps its files in $HOME/.nextflow/assets. To update code you can
push it to github and then pull it, or you can run a file directly:
nextflow run /nfs/users/nfs_s/svd/git/cellgeni/rnaseq-noqc/main.nf
Do not use the -r <branch> option when doing this.
- nextflow run <repo> -r <branch> nextflow-params pipeline-params
- Nextflow params have single dash
- Pipeline parameters have double dash (always followed by an argument)
To switch on a boolean option, pass the value explicitly, e.g. '--scratch true' (if scratch is a pipeline option).
- After the pipeline has finished:
nextflow log # shows names of all nextflow runs in this directory.
nextflow log agitated_mcnulty -f script > out.script.txt
Where agitated_mcnulty is the name of the run. This creates a list
of all the commands that were run.
Then, to rsync to warehouse, run from the parent folder
copy-to-wh.sh tic-38
(from our utils repository). This will copy everything except the
directories in a hardcoded exception list in the script. This exception list
currently consists of
work/ donotcopy/ .nextflow/
To keep something without copying it to warehouse, simply move
it to donotcopy. The target directory will have the same name; if it
already exists the script will exit, but the rsync can be forced with the -F
option. Use -D (dry run) to see what command would be run without actually
running it.
- Useful tools to inspect the pipeline state or debug software errors:
- console output
- results/trace.txt
This has information about jobs, including their tags. From there you can
go to the directory where a job was executed, e.g.
cd work/a5/82c6d0499783f2fc67c15838e3ac44/
ls -a # shows Nextflow admin and scripts.
- bjobs
- bjobs -l
- customised bjobs: e.g. this bash function:
bj () {
    bjobs -o 'jobid queue stat exec_host run_time job_name memlimit delimiter=" "' | perl -pe 's/second(\(s\))?//' | column -t
}
- .nextflow directory and .nextflow.log
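The step from results/trace.txt to a job's work directory can be scripted. The
sketch below is a hypothetical helper (failed_tasks is not part of our utils
repository). It assumes the trace file is tab-separated with a header row that
names the columns 'hash', 'name' and 'status', and that the hash column holds a
shortened form (e.g. a5/82c6d0) of the directory name under work/.

```shell
# failed_tasks [TRACEFILE] [WORKDIR]
# Prints "<work directory><TAB><task name>" for every task with status FAILED.
# Assumes a tab-separated trace with header columns 'hash', 'name', 'status'.
failed_tasks() {
    local trace="${1:-results/trace.txt}" work="${2:-work}"
    local tab hash name dir
    tab=$(printf '\t')
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["status"]) == "FAILED" { print $(col["hash"]) "\t" $(col["name"]) }
    ' "$trace" |
    while IFS="$tab" read -r hash name; do
        # Expand the shortened hash to the full directory name.
        for dir in "$work/$hash"*; do
            if [ -d "$dir" ]; then
                printf '%s\t%s\n' "$dir" "$name"
            fi
        done
    done
}
```

Running 'failed_tasks' in the run directory then lists the directories worth
a 'cd' and 'ls -a'.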
=====================================================================
Running nextflow on k8s
Creating a node that has a nextflow docker image.
-