Remove input read indices, make reference index optional and combine same sample together #59

Merged: 27 commits, Nov 3, 2023

Changes from all commits
3f89203
Remove index file from the samplesheet and update checking script
gq1 Oct 5, 2023
cef527b
Remove indexfile from Input_check workflow
gq1 Oct 5, 2023
67826e3
add sample name to the meta data
gq1 Oct 5, 2023
2d30d82
nf-core modules install samtools/merge
gq1 Oct 5, 2023
953658f
Add input merge sub workflow
gq1 Oct 5, 2023
0d0d06e
comments
gq1 Oct 5, 2023
224c7f2
patch samtools_merge module to allow using fasta.gz file with gzi ind…
gq1 Oct 5, 2023
36f4c67
use original sample name for id if just 1, otherwise add _combined.
gq1 Oct 6, 2023
a808cf2
Patch samtools_merge module again add indexing, emit crai index file a…
gq1 Oct 6, 2023
c646b8b
emit crai files as well
gq1 Oct 6, 2023
c4a762f
nf-core modules install samtools/sort
gq1 Oct 6, 2023
9fa3696
Add an option to sort input if not sorted.
gq1 Oct 6, 2023
5cbea5b
combine merged bam/cram together, add with their index files as well.
gq1 Oct 6, 2023
3ac73e9
add filtered to distinguish the samtools input and output name
gq1 Oct 6, 2023
bc6e5a0
use the merged read for the rest of pipeline
gq1 Oct 6, 2023
a7131a8
convert all input files into channels, and make reference fasta index …
gq1 Oct 6, 2023
0265cbf
use the first for the reference fasta channel
gq1 Oct 6, 2023
d4d38cc
move write-index flag to the config file
gq1 Oct 6, 2023
ea046b9
make sure work file when no interval file given
gq1 Oct 7, 2023
7506e81
formatting and documents
gq1 Oct 7, 2023
4dacfa8
[automated] Fix linting with Prettier
nf-core-bot Oct 7, 2023
ebaa7e4
Update conf/test.config
gq1 Oct 13, 2023
1b2c45e
nf-core modules update samtools/merge
gq1 Oct 27, 2023
a76298f
Update samtools/merge module and remove its patch.
gq1 Nov 1, 2023
3cf953d
remove sort_input params. Always sort the input before merging.
gq1 Nov 1, 2023
d6fa00f
only validate the sample sheet not transform the sample names
gq1 Nov 2, 2023
3e8705c
update fai file for the full test and formatting
gq1 Nov 2, 2023
3 changes: 2 additions & 1 deletion README.md
@@ -18,11 +18,12 @@ On release, automated continuous integration tests run the pipeline on a full-si

## Pipeline summary

-The pipleline takes aligned PacBio sample reads (CRAM/BAM files and their index files) from a CSV file and the reference file in FASTA format, and then uses DeepVariant tool to make variant calling.
+The pipeline takes aligned PacBio sample reads (CRAM/BAM files) from a CSV file and the reference file in FASTA format, and then uses DeepVariant tool to make variant calling.

Steps involved:

- Split fasta file into smaller files, normally one sequence per file unless the sequences are too small.
+- Merge input BAM/CRAM files together if they have the same sample names.
- Filter out reads using the `-F 0x900` option to only retain the primary alignments.
- Run DeepVariant using filtered BAM/CRAM files against each of split fasta files.
- Merge all VCF and GVCF files generated by DeepVariant by sample together for each input BAM/CRAM file.
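The read-filtering step above relies on SAM flag arithmetic, which can be illustrated with a short sketch. This is illustrative only, not pipeline code; the bit values are the standard SAM flags for secondary (0x100) and supplementary (0x800) alignments, which together form the `0x900` mask that `samtools view -F` excludes:

```python
# Standard SAM flag bits; `-F 0x900` drops any record with either bit
# set, leaving only primary alignments.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800
FILTER_MASK = SECONDARY | SUPPLEMENTARY  # == 0x900

def keep_read(flag: int) -> bool:
    """Return True if a record with this SAM flag survives `-F 0x900`."""
    return flag & FILTER_MASK == 0

print(keep_read(0x0))    # primary alignment -> True
print(keep_read(0x100))  # secondary alignment -> False
print(keep_read(0x800))  # supplementary alignment -> False
```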
9 changes: 5 additions & 4 deletions assets/samplesheet.csv
@@ -1,4 +1,5 @@
-sample,datatype,datafile,indexfile
-sample1,pacbio,/path/to/data/file/file1.bam,/path/to/index/file/file1.bam.bai
-sample2,pacbio,/path/to/data/file/file2.cram,/path/to/index/file/file2.cram.crai
-sample3,pacbio,/path/to/data/file/file3.bam,/path/to/index/file/file3.bam.csi
+sample,datatype,datafile
+sample1,pacbio,/path/to/data/file/file1.bam
+sample2,pacbio,/path/to/data/file/file2.cram
+sample3,pacbio,/path/to/data/file/file3-1.bam
+sample3,pacbio,/path/to/data/file/file3-2.cram
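The structural change to the sheet is that a `sample` value may now repeat across rows, with the files merged downstream. The grouping the merge step relies on can be sketched like this — a hypothetical helper for illustration, not the pipeline's own code, with made-up paths:

```python
import csv
from collections import defaultdict
from io import StringIO

SHEET = """\
sample,datatype,datafile
sample1,pacbio,/data/file1.bam
sample3,pacbio,/data/file3-1.bam
sample3,pacbio,/data/file3-2.cram
"""

def group_by_sample(handle):
    """Collect the data files belonging to each sample name."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["sample"]].append(row["datafile"])
    return dict(groups)

groups = group_by_sample(StringIO(SHEET))
print(groups["sample3"])  # ['/data/file3-1.bam', '/data/file3-2.cram']
```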
9 changes: 5 additions & 4 deletions assets/samplesheet_test.csv
@@ -1,4 +1,5 @@
-sample,datatype,datafile,indexfile
-icCanRufa1_crai,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.cram,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.cram.crai
-icCanRufa1_bai,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bam,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bam.bai
-icCanRufa1_csi,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bam,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bam.csi
+sample,datatype,datafile
+icCanRufa1_cram,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.cram
+icCanRufa1_bam,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bam
+icCanRufa1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.cram
+icCanRufa1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bam
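Commit 36f4c67 states the id rule for merged output: keep the original sample name when only one file was given, otherwise append `_combined`. A hypothetical sketch of that rule (not the pipeline's actual implementation):

```python
def merged_id(sample: str, n_files: int) -> str:
    """Id naming rule described in commit 36f4c67 (sketch, not pipeline code)."""
    return sample if n_files == 1 else f"{sample}_combined"

print(merged_id("icCanRufa1_bam", 1))  # icCanRufa1_bam
print(merged_id("icCanRufa1", 2))      # icCanRufa1_combined
```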
4 changes: 2 additions & 2 deletions assets/samplesheet_test_full.csv
@@ -1,2 +1,2 @@
-sample,datatype,datafile,indexfile
-icCanRufa1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1.cram,/lustre/scratch123/tol/resources/nextflow/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1.cram.crai
+sample,datatype,datafile
+icCanRufa1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1.cram
7 changes: 1 addition & 6 deletions assets/schema_input.json
@@ -21,13 +21,8 @@
"type": "string",
"pattern": "^\\S+\\.(bam|cram)$",
"errorMessage": "Data file for reads cannot contain spaces and must have extension 'cram' or 'bam'"
-},
-"indexfile": {
-"type": "string",
-"pattern": "^\\S+\\.(bai|csi|crai)$",
-"errorMessage": "Data index file for reads cannot contain spaces and must have extension 'bai', 'csi' or 'crai'"
}
},
-"required": ["sample", "datatype", "datafile", "indexfile"]
+"required": ["sample", "datatype", "datafile"]
}
}
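The surviving `datafile` pattern can be exercised directly. A quick sketch using Python's `re` with the same regex as the schema (the example paths are invented):

```python
import re

# Same pattern as assets/schema_input.json: no whitespace, .bam or .cram only.
DATAFILE_RE = re.compile(r"^\S+\.(bam|cram)$")

print(bool(DATAFILE_RE.match("/path/to/file1.bam")))  # True
print(bool(DATAFILE_RE.match("file2.cram.crai")))     # False: index extensions no longer pass
print(bool(DATAFILE_RE.match("my file.bam")))         # False: contains a space
```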
51 changes: 14 additions & 37 deletions bin/check_samplesheet.py
@@ -1,6 +1,6 @@
#!/usr/bin/env python

"""Provide a command line tool to validate and transform tabular samplesheets."""
"""Provide a command line tool to validate tabular samplesheets."""


import argparse
@@ -35,7 +35,6 @@ def __init__(
sample_col="sample",
type_col="datatype",
file_col="datafile",
index_col="indexfile",
**kwargs,
):
"""
@@ -48,20 +47,17 @@ def __init__(
the read data (default "datatype").
file_col (str): The name of the column that contains the file path for
the read data (default "datafile").
-index_col (str): The name of the column that contains the index file
-for the data (default "indexfile").

"""
super().__init__(**kwargs)

self._sample_col = sample_col
self._type_col = type_col
self._file_col = file_col
-self._index_col = index_col
self._seen = set()
-self.modified = []
+self.validated = []

-def validate_and_transform(self, row):
+def validate(self, row):
"""
Perform all validations on the given row.

@@ -73,9 +69,8 @@ def validate_and_transform(self, row):
self._validate_sample(row)
self._validate_type(row)
self._validate_data_file(row)
-self._validate_index_file(row)
self._seen.add((row[self._sample_col], row[self._file_col]))
-self.modified.append(row)
+self.validated.append(row)

def _validate_sample(self, row):
"""Assert that the sample name exists and convert spaces to underscores."""
@@ -98,17 +93,6 @@ def _validate_data_file(self, row):
raise AssertionError("Data file is required.")
self._validate_data_format(row[self._file_col])

-def _validate_index_file(self, row):
-"""Assert that the indexfile is non-empty and has the right format."""
-if len(row[self._index_col]) <= 0:
-raise AssertionError("Data index file is required.")
-if row[self._file_col].endswith("bam") and not (
-row[self._index_col].endswith("bai") or row[self._index_col].endswith("csi")
-):
-raise AssertionError("bai or csi index file should be given for bam file.")
-if row[self._file_col].endswith("cram") and not row[self._index_col].endswith("crai"):
-raise AssertionError("crai index file shuld be given for cram file.")

def _validate_data_format(self, filename):
"""Assert that a given filename has one of the expected read data file extensions."""
if not any(filename.endswith(extension) for extension in self.DATA_VALID_FORMATS):
@@ -121,17 +105,9 @@ def validate_unique_samples(self):
"""
Assert that the combination of sample name and data filename is unique.

-In addition to the validation, also rename all samples to have a suffix of _T{n}, where n is the
-number of times the same sample exist, but with different files, e.g., multiple runs per experiment.

"""
-if len(self._seen) != len(self.modified):
+if len(self._seen) != len(self.validated):
raise AssertionError("The combination of sample name and data file must be unique.")
-seen = Counter()
-for row in self.modified:
-sample = row[self._sample_col]
-seen[sample] += 1
-row[self._sample_col] = f"{sample}_T{seen[sample]}"

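The simplified uniqueness check — set size against row count, with the old `_T{n}` renaming gone — can be sketched standalone. This is a hypothetical mirror of the diff, not an import of the real script:

```python
def validate_unique_samples(rows):
    """Each (sample, datafile) combination must be unique; rows are left unmodified."""
    seen = {(row["sample"], row["datafile"]) for row in rows}
    if len(seen) != len(rows):
        raise AssertionError("The combination of sample name and data file must be unique.")

rows = [
    {"sample": "s1", "datafile": "a.bam"},
    {"sample": "s1", "datafile": "b.bam"},  # same sample, different file: allowed
]
validate_unique_samples(rows)  # no exception raised

rows.append({"sample": "s1", "datafile": "a.bam"})  # exact duplicate
try:
    validate_unique_samples(rows)
except AssertionError as error:
    print(error)
```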

def read_head(handle, num_lines=10):
@@ -162,7 +138,7 @@ def sniff_format(handle):
peek = read_head(handle)
handle.seek(0)
sniffer = csv.Sniffer()
-# same input file could retrun random true or false
+# same input file could return random true or false
# disable it now
# the following validation should be enough
# if not sniffer.has_header(peek):
@@ -188,16 +164,17 @@ def check_samplesheet(file_in, file_out):
This function checks that the samplesheet follows the following structure,
see also the `variantcalling samplesheet`_::

-sample,datatype,datafile,indexfile
-sample1,pacbio,/path/to/data/file/file1.bam,/path/to/index/file/file1.bam.bai
-sample2,pacbio,/path/to/data/file/file2.cram,/path/to/index/file/file2.cram.crai
-sample3,pacbio,/path/to/data/file/file3.bam,/path/to/index/file/file3.bam.csi
+sample,datatype,datafile
+sample1,pacbio,/path/to/data/file/file1.bam
+sample2,pacbio,/path/to/data/file/file2.cram
+sample3,pacbio,/path/to/data/file/file3-1.bam
+sample3,pacbio,/path/to/data/file/file3-2.cram

.. _variantcalling samplesheet:
https://raw.githubusercontent.com/sanger-tol/variantcalling/main/assets/samplesheet.csv

"""
required_columns = {"sample", "datatype", "datafile", "indexfile"}
required_columns = {"sample", "datatype", "datafile"}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
@@ -210,7 +187,7 @@ def check_samplesheet(file_in, file_out):
checker = RowChecker()
for i, row in enumerate(reader):
try:
-checker.validate_and_transform(row)
+checker.validate(row)
except AssertionError as error:
logger.critical(f"{str(error)} On line {i + 2}.")
sys.exit(1)
@@ -220,7 +197,7 @@ def check_samplesheet(file_in, file_out):
with file_out.open(mode="w", newline="") as out_handle:
writer = csv.DictWriter(out_handle, header, delimiter=",")
writer.writeheader()
-for row in checker.modified:
+for row in checker.validated:
writer.writerow(row)


5 changes: 5 additions & 0 deletions conf/modules.config
@@ -22,6 +22,11 @@ process {

withName: '.*:INPUT_FILTER_SPLIT:SAMTOOLS_VIEW' {
ext.args = '--output-fmt cram --write-index -F 0x900'
+ext.prefix = { "${meta.id}_filtered" }
}

+withName: '.*:INPUT_MERGE:SAMTOOLS_MERGE' {
+ext.args = '--write-index'
+}

withName: '.*:DEEPVARIANT_CALLER:DEEPVARIANT' {
6 changes: 3 additions & 3 deletions conf/test.config
@@ -25,9 +25,9 @@ params {
// Fasta references
fasta = 'https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/assembly/GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.gz'

-// Reference index file
-fai = 'https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/assembly/GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.gz.fai'
-gzi = 'https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/assembly/GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.gz.gzi'
+// Reference index file (optional)
+// fai = 'https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/assembly/GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.gz.fai'
+// fai = 'https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/assembly/GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.gz.gzi'

// Interval bed file
interval = 'https://tolit.cog.sanger.ac.uk/test-data/Cantharis_rufa/analysis/icCanRufa1/read_mapping/pacbio/GCA_947369205.1.unmasked.pacbio.icCanRufa1_0_3.bed'
3 changes: 1 addition & 2 deletions conf/test_full.config
@@ -23,6 +23,5 @@ params {
fasta = '/lustre/scratch124/tol/projects/darwin/data/insects/Cantharis_rufa/assembly/release/icCanRufa1.1/insdc/GCA_947369205.1.fasta.gz'

// Reference index file
-fai = '/lustre/scratch124/tol/projects/darwin/data/insects/Cantharis_rufa/assembly/release/icCanRufa1.1/insdc/GCA_947369205.1.fasta.gz.fai'
-gzi = '/lustre/scratch124/tol/projects/darwin/data/insects/Cantharis_rufa/assembly/release/icCanRufa1.1/insdc/GCA_947369205.1.fasta.gz.gzi'
+fai = '/lustre/scratch124/tol/projects/darwin/data/insects/Cantharis_rufa/assembly/release/icCanRufa1.1/insdc/GCA_947369205.1.fasta.gz.gzi'
}
33 changes: 16 additions & 17 deletions docs/usage.md
@@ -2,11 +2,11 @@

## Introduction

-The pipleline takes aligned sample reads (CRAM/BAM files and their index files) from a CSV file and a reference file in FASTA format, and then use DeepVariant to call variants.
+The pipeline takes aligned sample reads (CRAM/BAM files) from a CSV file and a reference file in FASTA format, and then use DeepVariant to call variants.

## Samplesheet input

-You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use the `input` parameter to specify the samplesheet location. It has to be a comma-separated file with at least 4 columns, and a header row as shown in the examples below.
+You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use the `input` parameter to specify the samplesheet location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.

```bash
--input '[path to samplesheet file]'
@@ -17,29 +17,28 @@ You will need to create a samplesheet with information about the samples you wou
The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. Below is an example for the same sample sequenced across 3 lanes:

```console
-sample,datatype,datafile,indexfile
-sample1,pacbio,sample1_1.cram,sample1_1.cram.crai
-sample1,pacbio,sample1_2.cram,sample1_3.cram.crai
-sample1,pacbio,sample1_3.cram,sample1_3.cram.crai
+sample,datatype,datafile
+sample1,pacbio,sample1_1.cram
+sample1,pacbio,sample1_2.cram
+sample1,pacbio,sample1_3.cram
```

### Full samplesheet

A final samplesheet file consisting of both BAM or CRAM will look like this. Currently this pipeline only supports Pacbio aligned data.

```console
-sample,datatype,datafile,indexfile
-sample1,pacbio,/path/to/data/file/file1.bam,/path/to/index/file/file1.bam.bai
-sample2,pacbio,/path/to/data/file/file2.cram,/path/to/index/file/file2.cram.crai
-sample3,pacbio,/path/to/data/file/file3.bam,/path/to/index/file/file3.bam.csi
+sample,datatype,datafile
+sample1,pacbio,/path/to/data/file/file1.bam
+sample2,pacbio,/path/to/data/file/file2.cram
+sample3,pacbio,/path/to/data/file/file3.bam
```

-| Column | Description |
-| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
-| `datatype` | Sequencing data type. Must be `pacbio`. |
-| `datafile` | The location for either BAM or CRAM file. |
-| `indexfile` | The location for BAM or CRAM index file – BAI, CSI or CRAI. |
+| Column | Description |
+| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
+| `datatype` | Sequencing data type. Must be `pacbio`. |
+| `datafile` | The location for either BAM or CRAM file. |

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

Expand All @@ -62,7 +61,7 @@ work # Directory containing the nextflow working files
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
```

-The pipeline will split the intput fasta file into smaller files to run DeepVariant parallel. You can set the minimum split fasta file size from the command line. For example to set the minimum size as 10K using `--split_fasta_cutoff 10000`.
+The pipeline will split the input fasta file into smaller files to run DeepVariant parallel. You can set the minimum split fasta file size from the command line. For example to set the minimum size as 10K using `--split_fasta_cutoff 10000`.
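The cutoff behaviour described here — one sequence per split file unless sequences are small — can be sketched as greedy binning over (name, length) pairs. This is a hypothetical illustration of the idea, not the pipeline's actual split module:

```python
def split_fasta(sequences, cutoff=10_000):
    """Bin sequences for parallel DeepVariant runs: a bin is closed as soon
    as it reaches `cutoff` bases, so large sequences go one per file and
    consecutive small ones are grouped together."""
    bins, current, size = [], [], 0
    for name, length in sequences:
        current.append(name)
        size += length
        if size >= cutoff:
            bins.append(current)
            current, size = [], 0
    if current:  # leftover small sequences form a final bin
        bins.append(current)
    return bins

seqs = [("chr1", 50_000), ("scaf1", 3_000), ("scaf2", 4_000), ("scaf3", 8_000)]
print(split_fasta(seqs, cutoff=10_000))  # [['chr1'], ['scaf1', 'scaf2', 'scaf3']]
```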

### Updating the pipeline

11 changes: 11 additions & 0 deletions modules.json
@@ -30,6 +30,11 @@
"git_sha": "fd742419940e01ba1c5ecb172c3e32ec840662fe",
"installed_by": ["modules"]
},
"samtools/merge": {
"branch": "master",
"git_sha": "e7ce60acc8a33fa17429e966364657a63016e870",
"installed_by": ["modules"],
"patch": "modules/nf-core/samtools/merge/samtools-merge.diff"
},
"samtools/sort": {
"branch": "master",
"git_sha": "a0f7be95788366c1923171e358da7d049eb440f9",
"installed_by": ["modules"]
},
"samtools/view": {
"branch": "master",
"git_sha": "3ffae3598260a99e8db3207dead9f73f87f90d1f",
6 changes: 6 additions & 0 deletions modules/nf-core/samtools/merge/environment.yml
