How to run
These are the different inputs needed to run the Bifrost pipeline:
- The Bifrost nextflow scripts
- A computer-specific profile file (should be linked to in nextflow.conf, see conf directory)
- An input file describing program options and the input data set (see conf directory for templates)
Sample configuration files can be found in the conf directory.
Your input config file should either be in the same directory as the nextflow script, or in the directory from which you are running the script.
You need to specify the data you want to run on. This is done in your template config file, by specifying a 'pattern' that your file names fit. To figure out how to do this, think of your file names as having two parts:
- prefix: this is the first part of the file name, and is the part that is shared between all of the fastq files for each sample.
- common file ending: this is the part that comes after the prefix, that is, after the part that is shared by the fastq files for each sample.
Let's look at an example: "../testdata/short/*L00{1,2}_R{1,2}_001.short.fastq.gz"
The fastq files that these match are:
../testdata/short/Angen-bacDNA2-78-2013-01-4718_S29_L001_R1_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-78-2013-01-4718_S29_L001_R2_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-78-2013-01-4718_S29_L002_R1_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-78-2013-01-4718_S29_L002_R2_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-79-2013-01-4835_S30_L001_R1_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-79-2013-01-4835_S30_L001_R2_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-79-2013-01-4835_S30_L002_R1_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-79-2013-01-4835_S30_L002_R2_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-92-2013-01-5057_S44_L001_R1_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-92-2013-01-5057_S44_L001_R2_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-92-2013-01-5057_S44_L002_R1_001.short.fastq.gz ../testdata/short/Angen-bacDNA2-92-2013-01-5057_S44_L002_R2_001.short.fastq.gz
First: this is a relative path. That is, the path to these fastq files is specified in relation to where I am in the filesystem right now. Second: the star stands in for the prefix, the part of the file name that the fastq files of each sample have in common. Third: the part after the star is the remainder of the file name. I have to be able to describe which parts of this can vary, and how. I do that by putting those parts in curly brackets. Note that in this case I have four files per sample, and thus two parts of the file name that can vary: the lane (L001 or L002) and the read (R1 or R2). Thus I have two sets of curly brackets.
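To see how the curly brackets multiply out, you can let the shell expand them. This is a small sketch that assumes you are using bash (brace expansion is not available in every shell) and that uses only the file ending from the example pattern above. Nextflow does its own pattern matching, so this illustrates the idea rather than what the pipeline does internally.

```shell
# Each {a,b} set doubles the number of names, so two sets give four endings.
suffixes=$(printf '%s\n' L00{1,2}_R{1,2}_001.short.fastq.gz)
echo "$suffixes"
# L001_R1_001.short.fastq.gz
# L001_R2_001.short.fastq.gz
# L002_R1_001.short.fastq.gz
# L002_R2_001.short.fastq.gz
```

Four endings per prefix matches the four fastq files per sample in the example.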
Please note that below the specification of the pattern, you must also specify the sample set size. This is how many fastq files there are for each sample. In this case there are four, so the correct value is 4. If there are two files per sample, there should only be one set of curly brackets, and the set size should be set to 2.
In case you want to run on just a couple of your samples, the easiest way is to create a new directory somewhere, and then create symbolic links to the files you want to run.
mkdir name_of_directory
cd name_of_directory
#then, per fastq file that you want to run on, do
ln -s path/to/file/fastq_filename .
Please note the dot (period) at the end of the last command.
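If a sample has several fastq files, the linking step can be wrapped in a loop. The function below is only a sketch: the name link_sample and its two arguments are made up for this example, and it assumes the files end in .fastq.gz and start with the sample prefix. Run it from inside the new directory.

```shell
# Hypothetical helper: link every fastq file of one sample from a source
# directory into the current directory.
link_sample() {
    src_dir=$1    # directory holding the original fastq files
    sample=$2     # sample prefix, i.e. the start of the file names
    for f in "$src_dir"/"$sample"*.fastq.gz; do
        [ -e "$f" ] || continue    # skip if the pattern matched nothing
        ln -s "$f" .
    done
}
```

For example, link_sample /path/to/fastq_files Angen-bacDNA2-78 would link the four files of the first sample listed above. An absolute path for the source directory is the safer choice, since the links then keep working if the directory is moved.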
Run ls template_input_pattern. If the pattern is correct, you should see a list of your files.
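The same listing can also be used to check the sample set size. The helper below is a sketch, not part of Bifrost: the name count_per_sample is made up, and it assumes that the sample prefix is everything before "_L00", as in the example file names above. Adjust the sed expression if your files are named differently.

```shell
# Hypothetical helper: reads file names on stdin and prints how many
# files share each sample prefix (everything before "_L00").
count_per_sample() {
    sed 's/_L00.*//' | sort | uniq -c
}

# Usage, from the directory where the pattern is valid:
# ls ../testdata/short/*L00{1,2}_R{1,2}_001.short.fastq.gz | count_per_sample
```

Every sample should show the same count, and that count is the value to use for the sample set size.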
Location: you should be in the directory where you want the output produced.
File name aliases used for example command line:
- track_file.nf: the Bifrost nextflow file that should be run.
- input.config: this file is a copy of one of the track template files found in the conf directory. This file has been edited by the user to include the path to the input files, as well as any software option changes. This file needs to be modified for each new data set the pipeline is run on.
- profile: this is the computer system specific executor file. The full name of the file is profile.config, but in the command line below, only the part before .config should be given. This file contains information about where the software is located, how much CPU and memory to use, and so on. Users should only need to edit this file once, when they start working on a new computer system.
- path_to_bifrost_software: location of where the Bifrost software lives (i.e. the last directory name should be Bifrost).
- output_directory_name: name of the output directory that will be created.
path_to_bifrost_software/run_track.sh track_file.nf input.config profile output_directory_name
The output will be found in the output directory as specified in the command line. To find out more about what output is produced per track, see the Current tracks page.