update to 1.2
pdimens committed Jul 5, 2024
1 parent 337f50b commit 1e084d1
Showing 18 changed files with 159 additions and 81 deletions.
2 changes: 1 addition & 1 deletion Modules/Align/index.yml
@@ -1,2 +1,2 @@
icon: quote
order: 5
order: 11
2 changes: 1 addition & 1 deletion Modules/SV/index.yml
@@ -1,2 +1,2 @@
icon: project-roadmap
order: 4
order: 1
2 changes: 1 addition & 1 deletion Modules/SV/leviathan.md
@@ -66,9 +66,9 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
| argument | short name | type | default | required | description |
|:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) |
| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes |
| `--genome` | `-g` | file path | | yes | Genome assembly that was used to create alignments |
| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group |
| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes |

### Single-sample variant calling
When **not** using a population grouping file via `--populations`, variants will be called per-sample.
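
The `--populations` file is described in the table above as tab-delimited `sample<tab>group`. As a rough illustration of that layout only, the sketch below groups samples by population. The function name and sample names are hypothetical, and this is not how Harpy itself reads the file.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle):
    """Map each group name to its samples from sample<tab>group lines."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected sample<tab>group, got {row!r}")
        sample, group = row
        groups[group].append(sample)
    return dict(groups)


# Hypothetical file contents: three samples in two groups
example = io.StringIO("sample1\tpop1\nsample2\tpop1\nsample3\tpop2\n")
print(read_groups(example))
```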
4 changes: 2 additions & 2 deletions Modules/SV/naibr.md
@@ -66,11 +66,11 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
| argument | short name | type | default | required | description |
|:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) |
| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes |
| `--genome` | `-g` | file path | | **yes** | Genome assembly for phasing bam files |
| `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files |
| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules |
| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group |
| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes |
| `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files |

### Molecule distance
The `--molecule-distance` option is used to let the program determine how far apart alignments on a contig with the same
2 changes: 1 addition & 1 deletion Modules/Simulate/index.yml
@@ -1,2 +1,2 @@
icon: flame
order: 5
order: 3
10 changes: 5 additions & 5 deletions Modules/Simulate/simulate-linkedreads.md
@@ -48,14 +48,14 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
|:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------|
| `HAP1_GENOME` | | file path | | **yes** | Haplotype 1 of the diploid genome to simulate reads |
| `HAP2_GENOME` | | file path | | **yes** | Haplotype 2 of the diploid genome to simulate reads |
| `--outer-distance` | `-d` | integer | 350 | | Outer distance between paired-end reads (bp) |
| `--distance-sd` | `-i` | integer | 15 | | Standard deviation of read-pair distance |
| `--barcodes` | `-b` | file path | [10X barcodes](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt) | | File of linked-read barcodes to add to reads |
| `--read-pairs` | `-n` | number | 600 | | Number (in millions) of read pairs to simulate |
| `--mutation-rate` | `-r` | number | 0.001 | | Random mutation rate for simulating reads (0 - 1.0) |
| `--distance-sd` | `-s` | integer | 15 | | Standard deviation of read-pair distance |
| `--molecule-length` | `-l` | integer | 100 | | Mean molecule length (kbp) |
| `--partitions` | `-p` | integer | 1500 | | Number (in thousands) of partitions/beads to generate |
| `--molecules-per` | `-m` | integer | 10 | | Average number of molecules per partition |
| `--mutation-rate` | `-r` | number | 0.001 | | Random mutation rate for simulating reads (0 - 1.0) |
| `--outer-distance` | `-d` | integer | 350 | | Outer distance between paired-end reads (bp) |
| `--partitions` | `-p` | integer | 1500 | | Number (in thousands) of partitions/beads to generate |
| `--read-pairs` | `-n` | number | 600 | | Number (in millions) of read pairs to simulate |

## Mutation Rate
The read simulation is two-part: first `dwgsim` generates forward and reverse FASTQ files from the provided genome haplotypes
60 changes: 30 additions & 30 deletions Modules/Simulate/simulate-variants.md
@@ -49,11 +49,11 @@ specific variants to simulate. There are also these unifying options among the d
| argument | short name | type | description |
| :-----|:-----|:-----|:-----|
| `INPUT_GENOME` | | file path | The haploid genome to simulate variants onto. **REQUIRED** |
| `--prefix` | | string | Naming prefix for output files (default: `sim.{module_name}`)|
| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line |
| `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid |
| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line |
| `--genes` | `-g` | file path | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
| `--heterozygosity` | `-z` | float between [0,1] | [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) |
| `--prefix` | | string | Naming prefix for output files (default: `sim.{module_name}`)|
| `--randomseed` | | integer | Random seed for simulation |

==- 🟣 snps and indels
@@ -64,28 +64,29 @@ Given software limitations, simulating many SNPs (>10,000) will be noticeably sl

A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)).
An indel is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)).
The snp and indel variants are combined in this module because `simuG` allows simulating them together. The
ratio parameters control different things for snp and indel variants and have special meanings when setting
the value to either `9999` or `0` :
- `--titv-ratio`
- `9999`: transitions only
- `0`: transversions only
- `--indel-ratio`
- `9999`: insertions only
- `0`: deletions only
The snp and indel variants are combined in this module because `simuG` allows simulating them together.

{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------|
| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate |
| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate |
| `--snp-count` | `-n` | integer | 0 | Number of random snps to simulate |
| `--indel-count` | `-m` | integer | 0 | Number of random indels to simulate |
| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps |
| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate |
| `--indel-ratio` | `-d` | float | 1 | Insertion/Deletion ratio for indels |
| `--indel-size-alpha` | `-a` | float | 2.0 | Exponent Alpha for power-law-fitted indel size distribution|
| `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
| `--snp-count` | `-n` | integer | 0 | Number of random snps to simulate |
| `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate |
| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps |

The ratio parameters for snp and indel variants have special meanings when set
to either `0` or `9999`:

{.compact}
| ratio | `0` meaning | `9999` meaning |
|:---- |:---|:---|
| `--indel-ratio` | deletions only | insertions only |
| `--titv-ratio` | transversions only | transitions only |
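
One way to read these ratio values is as an expected share of each variant type. The sketch below assumes each ratio is taken literally as first-type/second-type (for example transitions/transversions) and treats `9999` as the "first type only" sentinel from the table. The function name is hypothetical, and how `simuG` samples variants internally is not shown here.

```python
ONLY_FIRST = 9999  # sentinel from the table: only the first-named type


def first_type_fraction(ratio: float) -> float:
    """Expected share of the first-named type for a first/second ratio."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio == ONLY_FIRST:
        return 1.0
    # A ratio of r means r of the first type for every 1 of the second
    return ratio / (1 + ratio)


for flag, value in [("--titv-ratio", 0.5), ("--indel-ratio", 1),
                    ("--indel-ratio", 0), ("--titv-ratio", 9999)]:
    print(f"{flag} {value}: {first_type_fraction(value):.3f} first-type share")
```

Under this reading, the default `--titv-ratio` of `0.5` corresponds to about one transition for every two transversions, and the default `--indel-ratio` of `1` to an even split between insertions and deletions.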

==- 🔵 inversions
### inversion
@@ -94,35 +95,34 @@ Inversions are when a section of a chromosome appears in the reverse orientation
{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:----------------|
| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate |
| `--count`| `-n` | integer | 0 | Number of random inversions to simulate |
| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
| `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) |
| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate |

==- 🟢 copy number variants
### cnv
A copy number variation (CNV) is when the number of copies of a particular gene varies
between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation))
The ratio parameters control different things and have special meanings when setting
the value to either `9999` or `0` :
- `--dup-ratio`
- `9999`: tandem duplications only
- `0`: dispersed duplications only
- `--gain-ratio`
- `9999`: gain only
- `0`: loss only
between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)).

{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:----------------|
| `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate |
| `--count` | `-n` | integer | 0 | Number of random cnv to simulate |
| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) |
| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) |
| `--max-copy` | `-y` | integer | 10 | Maximum number of copies |
| `--dup-ratio` | `-d` | float | 1 | Tandem/Dispersed duplication ratio |
| `--gain-ratio` |`-l` | float | 1 | Relative ratio of DNA gain over DNA loss |
| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) |
| `--max-copy` | `-y` | integer | 10 | Maximum number of copies |
| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) |

The ratio parameters have special meanings when set to either `0` or `9999`:

{.compact}
| ratio | `0` meaning | `9999` meaning |
|:---- |:---|:---|
| `--dup-ratio` | dispersed duplications only | tandem duplications only |
| `--gain-ratio` | loss only | gain only |

==- 🟡 translocations
### translocation
@@ -131,8 +131,8 @@ A translocation occurs when a chromosome breaks and the fragmented pieces re-att
{.compact}
| argument | short name | type | default | description |
|:------------------|:----------:|:-----------|:-------:|:----------------|
| `--vcf` | `-v` | file path | | VCF file of known translocations to simulate |
| `--count`| `-n` | integer | 0 | Number of random translocations to simulate |
| `--vcf` | `-v` | file path | | VCF file of known translocations to simulate |

===

66 changes: 66 additions & 0 deletions Modules/deconvolve.md
@@ -0,0 +1,66 @@
---
label: Deconvolve
description: Resolve clashing barcodes from different molecules
icon: tag
order: 10
---

# :icon-tag: Resolve clashing barcodes from different molecules

=== :icon-checklist: You will need
- paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"]
- **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] or [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"]
- **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] or [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"]
- **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"]
===



Running [!badge corners="pill" text="deconvolve"] is **optional**. In the alignment
workflows ([!badge corners="pill" text="align bwa"](Align/bwa.md)
[!badge corners="pill" text="align strobe"](Align/strobe.md)), Harpy already uses a distance-based approach to
deconvolve barcodes and assign `MI` tags (Molecular Identifier), whereas the
[!badge corners="pill" text="align ema"](Align/ema.md) workflow has the
deconvolution occur within the `ema` aligner itself. This workflow uses a reference-free method,
[QuickDeconvolution](https://github.com/RolandFaure/QuickDeconvolution), which uses k-mers to look at "read clouds" (all reads with the same linked-read barcode)
and decide which ones likely originate from different molecules. Regardless of whether you run
this workflow or not, [!badge corners="pill" text="harpy align"](Align/Align.md) will still perform its own deconvolution.

!!!danger Won't work with EMA
Reads with deconvolved barcodes will not work with [!badge corners="pill" text="align ema"](Align/ema.md),
since EMA expects barcodes to have a specific, un-hyphenated format. If deconvolving, use either
[!badge corners="pill" text="align bwa"](Align/bwa.md) or [!badge corners="pill" text="align strobe"](Align/strobe.md)
for sequence alignment.
!!!


!!! Also in harpy qc
This method of deconvolution is also available as an option in the [!badge corners="pill" text="qc"](qc.md) workflow.
!!!

```bash usage
harpy deconvolve OPTIONS... INPUTS...
```

## :icon-terminal: Running Options
{.compact}
| argument | short name | type | default | required | description |
|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------|
| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) |
| `--density` | `-d` | integer | 3 | | On average, $\frac{1}{2^d}$ kmers are indexed |
| `--dropout` | `-a` | integer | 0 | | Minimum cloud size to deconvolve |
| `--kmer-length` | `-k` | integer | 21 | | Size of k-mers to search for similarities |
| `--window-size` | `-w` | integer | 40 | | Size of window guaranteed to contain at least one kmer |
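
To get a feel for `--density`, the $\frac{1}{2^d}$ fraction from the table can be tabulated for a few values of `d`. This is only arithmetic on that formula. The function name is made up for illustration, and nothing here calls Harpy or QuickDeconvolution.

```python
def indexed_fraction(density: int) -> float:
    """Expected fraction of k-mers indexed, following the 1 / 2^d rule."""
    if density < 0:
        raise ValueError("density must be non-negative")
    return 1 / (2 ** density)


for d in (0, 1, 3, 5):
    print(f"d={d}: about 1 in {2 ** d} k-mers indexed ({indexed_fraction(d):.4f})")
```

With the default of `3`, roughly one k-mer in eight is indexed on average.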

## Resulting Barcodes
After deconvolution, some barcodes may have a hyphenated suffix like `-1` or `-2` (e.g. `A01C33B41D93-1`).
This is how deconvolution methods create unique variants of barcodes to denote that identical barcodes
do not come from the same original molecules. QuickDeconvolution adds the `-0` suffix to barcodes it was unable
to deconvolve.
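
As a minimal sketch of the naming scheme described above, the following splits a barcode into its original part and its numeric suffix, if it has one. The helper name is hypothetical and the barcode is the example from the text. It assumes the suffix is always a hyphen followed by digits.

```python
from typing import Optional, Tuple


def split_barcode(barcode: str) -> Tuple[str, Optional[str]]:
    """Return (original barcode, suffix), with suffix None if there is none."""
    base, sep, suffix = barcode.rpartition("-")
    if sep and suffix.isdigit():
        return base, suffix
    return barcode, None


for bc in ("A01C33B41D93-1", "A01C33B41D93-0", "A01C33B41D93"):
    base, suffix = split_barcode(bc)
    if suffix is None:
        note = "no suffix"
    elif suffix == "0":
        note = "could not be deconvolved"
    else:
        note = f"deconvolved variant {suffix}"
    print(f"{bc}: original={base} ({note})")
```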

## Harpy Deconvolution Nuances
Some of the downstream linked-read tools Harpy uses expect linked-read barcodes to look like either the 16-base 10X
variety or a standard haplotag (AxxCxxBxxDxx). Their pattern-matching would not recognize deconvolved barcodes with
hyphenated suffixes. To remedy this, `MI` assignment in [!badge corners="pill" text="align bwa"](Align/bwa.md)
hyphens. To remedy this, `MI` assignment in [!badge corners="pill" text="align bwa"](Align/bwa.md)
and [!badge corners="pill" text="align strobe"](Align/strobe.md) will assign the deconvolved (hyphenated) barcode to a `DX:Z`
tag and restore the original barcode as the `BX:Z` tag.
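
The tag swap described above can be pictured on a plain dictionary of SAM tags. This is an illustrative sketch, not Harpy's actual code: the function name and the dictionary representation are assumptions, and records whose barcode has no hyphen are simply left untouched here, which the text above does not specify.

```python
def retag(tags: dict) -> dict:
    """Move a hyphenated BX barcode to DX and restore the original BX."""
    out = dict(tags)  # work on a copy, leaving the input unchanged
    bx = out.get("BX")
    if bx is not None and "-" in bx:
        out["DX"] = bx                      # keep the deconvolved barcode
        out["BX"] = bx.rsplit("-", 1)[0]    # original barcode, suffix removed
    return out


print(retag({"BX": "A01C33B41D93-1"}))
print(retag({"BX": "A01C33B41D93"}))
```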
19 changes: 9 additions & 10 deletions Modules/demultiplex.md
@@ -2,8 +2,7 @@
label: Demultiplex
description: Demultiplex raw sequences into haplotag barcoded samples
icon: versions
#visibility: hidden
order: 6
order: 9
---

# :icon-versions: Demultiplex Raw Sequences
@@ -29,14 +28,14 @@ harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq.
In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="demultiplex"] module is configured using these command-line arguments:

{.compact}
| argument | short name | type | default | required | description |
|:------------------|:----------:|:-----------|:-------:|:--------:|:------------------------------------------------------------------------|
| `R1_FQ` | | file path | | **yes** | The forward multiplexed FASTQ file |
| `R2_FQ` | | file path | | **yes** | The reverse multiplexed FASTQ file |
| `I1_FQ` | | file path | | **yes** | The forward FASTQ index file provided by the sequencing facility |
| `I2_FQ` | | file path | | **yes** | The reverse FASTQ index file provided by the sequencing facility |
| `METHOD` | | choice | | **yes** | Haplotag technology of the sequences [`gen1`] |
| `--schema` | `-s` | file path | | **yes** | Tab-delimited file of sample\<tab\>barcode |
| argument | short name | type | required | description |
|:------------------|:----------:|:-----------|:--------:|:------------------------------------------------------------------------|
| `METHOD` | | choice | **yes** | Haplotag technology of the sequences [`gen1`] |
| `R1_FQ` | | file path | **yes** | The forward multiplexed FASTQ file |
| `R2_FQ` | | file path | **yes** | The reverse multiplexed FASTQ file |
| `I1_FQ` | | file path | **yes** | The forward FASTQ index file provided by the sequencing facility |
| `I2_FQ` | | file path | **yes** | The reverse FASTQ index file provided by the sequencing facility |
| `--schema` | `-s` | file path | **yes** | Tab-delimited file of sample\<tab\>barcode |

## Haplotag Types
==- Generation 1 - `gen1`