update to 1.2

pdimens · Jul 5, 2024 · 1e084d1
1 parent 337f50b
commit 1e084d1
Showing 18 changed files with 159 additions and 81 deletions.
diff --git a/Modules/Align/index.yml b/Modules/Align/index.yml
@@ -1,2 +1,2 @@
 icon: quote
-order: 5
+order: 11
diff --git a/Modules/SV/index.yml b/Modules/SV/index.yml
@@ -1,2 +1,2 @@
 icon: project-roadmap
-order: 4
+order: 1
diff --git a/Modules/SV/leviathan.md b/Modules/SV/leviathan.md
@@ -66,9 +66,9 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
 | argument         | short name | type          | default | required | description                                        |
 |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
 | `INPUTS`         |            | file/directory paths  |         | **yes**  | Files or directories containing [input BAM files](/commonoptions.md#input-arguments)     |
+| `--extra-params` |    `-x`    | string        |         |    no             | Additional naibr arguments, in quotes              |
 | `--genome`       |    `-g`    | file path     |         |    yes | Genome assembly that was used to create alignments    |
 | `--populations`  |    `-p`    | file path     |         |    no             | Tab-delimited file of sample\<*tab*\>group         |
-| `--extra-params` |    `-x`    | string        |         |    no             | Additional naibr arguments, in quotes              |
 
 ### Single-sample variant calling
 When **not** using a population grouping file via `--populations`, variants will be called per-sample. 

diff --git a/Modules/SV/naibr.md b/Modules/SV/naibr.md
@@ -66,11 +66,11 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
 | argument         | short name | type          | default | required | description                                        |
 |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------|
 | `INPUTS`         |            | file/directory paths  |         | **yes**  | Files or directories containing [input BAM files](/commonoptions.md#input-arguments)     |
+| `--extra-params` |    `-x`    | string        |         |    no             | Additional naibr arguments, in quotes              |
 | `--genome`       |    `-g`    | file path     |         | **yes** | Genome assembly for phasing bam files     |
-| `--vcf`          |    `-v`    | file path     |         | **conditionally** | Phased vcf file for phasing bam files     |
 | `--molecule-distance` |  `-m` | integer       |  100000 |    no             | Base-pair distance threshold to separate molecules |
 | `--populations`  |    `-p`    | file path     |         |    no             | Tab-delimited file of sample\<*tab*\>group         |
-| `--extra-params` |    `-x`    | string        |         |    no             | Additional naibr arguments, in quotes              |
+| `--vcf`          |    `-v`    | file path     |         | **conditionally** | Phased vcf file for phasing bam files     |
 
 ### Molecule distance
 The `--molecule-distance` option is used to let the program determine how far apart alignments on a contig with the same

diff --git a/Modules/Simulate/index.yml b/Modules/Simulate/index.yml
@@ -1,2 +1,2 @@
 icon: flame
-order: 5
+order: 3
diff --git a/Modules/Simulate/simulate-linkedreads.md b/Modules/Simulate/simulate-linkedreads.md
@@ -48,14 +48,14 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op
 |:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------|
 | `HAP1_GENOME`       |            | file path |       | **yes**  | Haplotype 1 of the diploid genome to simulate reads   |
 | `HAP2_GENOME`       |            | file path |       | **yes**  | Haplotype 2 of the diploid genome to simulate reads   |
-| `--outer-distance`  |    `-d`    | integer   | 350   |   | Outer distance between paired-end reads (bp)                 |
-| `--distance-sd`     |    `-i`    | integer   |  15   |   | Standard deviation of read-pair distance                     |
 | `--barcodes`        |    `-b`    | file path |  [10X barcodes](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt)   |        | File of linked-read barcodes to add to reads   |
-| `--read-pairs`      |    `-n`    | number    |  600  |   | Number (in millions) of read pairs to simulate               |
-| `--mutation-rate`   |    `-r`    | number    | 0.001 |   | Random mutation rate for simulating reads (0 - 1.0)          |
+| `--distance-sd`     |    `-s`    | integer   |  15   |   | Standard deviation of read-pair distance                     |
 | `--molecule-length` |    `-l`    | integer   |  100  |   | Mean molecule length (kbp)                                   |
-| `--patitions`       |    `-p`    | integer   |  1500 |   | Number (in thousands) of partitions/beads to generate        |
 | `--molecules-per`   |    `-m`    | integer   |   10  |   | Average number of molecules per partition                    |
+| `--mutation-rate`   |    `-r`    | number    | 0.001 |   | Random mutation rate for simulating reads (0 - 1.0)          |
+| `--outer-distance`  |    `-d`    | integer   | 350   |   | Outer distance between paired-end reads (bp)                 |
+| `--partitions`      |    `-p`    | integer   |  1500 |   | Number (in thousands) of partitions/beads to generate        |
+| `--read-pairs`      |    `-n`    | number    |  600  |   | Number (in millions) of read pairs to simulate               |
 
 ## Mutation Rate
 The read simulation is two-part: first `dwgsim` generates forward and reverse FASTQ files from the provided genome haplotypes

diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md
@@ -49,11 +49,11 @@ specific variants to simulate. There are also these unifying options among the d
 | argument | short name | type |  description |
 | :-----|:-----|:-----|:-----|
 | `INPUT_GENOME`           |    | file path  |  The haploid genome to simulate variants onto. **REQUIRED**   |
-| `--prefix` | | string |  Naming prefix for output files (default: `sim.{module_name}`)|
-| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line |
 | `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid |
+| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line |
 | `--genes` | `-g` | file path |  GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) |
 | `--heterozygosity` | `-z` | float between [0,1] |  [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) |
+| `--prefix` | | string |  Naming prefix for output files (default: `sim.{module_name}`)|
 | `--randomseed` |  | integer |   Random seed for simulation |
 
 ==- 🟣 snps and indels
@@ -64,28 +64,29 @@ Given software limitations, simulating many SNPs (>10,000) will be noticeably sl
 
 A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)).
 An indel is a type of mutation that involves the insertion or deletion of one or more nucleotides in a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)).
-The snp and indel variants are combined in this module because `simuG` allows simulating them together. The
-ratio parameters control different things for snp and indel variants and have special meanings when setting
-the value to either `9999` or `0` :
-- `--titv-ratio`
-    - `9999`: transitions only
-    - `0`: transversions only
-- `--indel-ratio`
-    - `9999`: insertions only
-    - `0`: deletions only
+The snp and indel variants are combined in this module because `simuG` allows simulating them together. 
 
 {.compact}
 | argument          | short name | type       | default |  description                                                 |
 |:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------|
-| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate |
-| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate |
-| `--snp-count` | `-n` | integer | 0 | Number of random snps to simluate |
 | `--indel-count` |  `-m` | integer | 0 | Number of random indels to simulate |
-| `--titv-ratio` | `-r` | float  | 0.5 | Transition/Transversion ratio for snps |
+| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate |
 | `--indel-ratio` | `-d` | float  |  1 | Insertion/Deletion ratio for indels |
 | `--indel-size-alpha` | `-a` | float |  2.0 | Exponent Alpha for power-law-fitted indel size distribution|
 | `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution |
+| `--snp-count` | `-n` | integer | 0 | Number of random snps to simulate |
 | `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`|
+| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate |
+| `--titv-ratio` | `-r` | float  | 0.5 | Transition/Transversion ratio for snps |
+
+The ratio parameters for snp and indel variants have special meanings when
+set to either `0` or `9999`:
+
+{.compact}
+| ratio | `0` meaning | `9999` meaning   |
+|:---- |:---|:---|
+| `--indel-ratio` | deletions only | insertions only |
+| `--titv-ratio` | transversions only | transitions only |
 
 ==- 🔵 inversions
 ### inversion
@@ -94,35 +95,34 @@ Inversions are when a section of a chromosome appears in the reverse orientation
 {.compact}
 | argument          | short name | type       | default |  description     |
 |:------------------|:----------:|:-----------|:-------:|:----------------|
-| `--vcf` | `-v` | file path |  |  VCF file of known inversions to simulate |
 | `--count`| `-n` | integer | 0 |  Number of random inversions to simulate |
-| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
 | `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) |
+| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) |
+| `--vcf` | `-v` | file path |  |  VCF file of known inversions to simulate |
 
 ==- 🟢 copy number variants
 ### cnv
 A copy number variation (CNV) is when the number of copies of a particular gene varies
-between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation))
-The ratio parameters control different things and have special meanings when setting
-the value to either `9999` or `0` :
-- `--dup-ratio`
-    - `9999`: tandem duplications only
-    - `0`: dispersed duplications only
-- `--gain-ratio`
-    - `9999`: gain only
-    - `0`: loss only
+between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)).
 
 {.compact}
 | argument          | short name | type       | default |  description     |
 |:------------------|:----------:|:-----------|:-------:|:----------------|
 | `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate |
 | `--count` | `-n` | integer | 0 | Number of random CNVs to simulate |
-| `--min-size` | `-m` | integer |  1000 | Minimum cnv size (bp) |
-| `--max-size`|   `-x` | integer |100000 | Maximum cnv size (bp) |
-| `--max-copy` |  `-y` | integer | 10 | Maximum number of copies |
 | `--dup-ratio` | `-d` | float |  1 | Tandem/Dispersed duplication ratio |
 | `--gain-ratio` |`-l` | float |  1 | Relative ratio of DNA gain over DNA loss |
+| `--max-size`|   `-x` | integer |100000 | Maximum cnv size (bp) |
+| `--max-copy` |  `-y` | integer | 10 | Maximum number of copies |
+| `--min-size` | `-m` | integer |  1000 | Minimum cnv size (bp) |
 
+The ratio parameters have special meanings when set to either `0` or `9999`:
+
+{.compact}
+| ratio | `0` meaning | `9999` meaning   |
+|:---- |:---|:---|
+| `--dup-ratio` | dispersed duplications only | tandem duplications only |
+| `--gain-ratio` | loss only | gain only |
 
 ==- 🟡 translocations
 ### translocation
@@ -131,8 +131,8 @@ A translocation occurs when a chromosome breaks and the fragmented pieces re-att
 {.compact}
 | argument          | short name | type       | default |  description     |
 |:------------------|:----------:|:-----------|:-------:|:----------------|
-| `--vcf` | `-v` | file path |  |  VCF file of known inversions to simulate |
 | `--count`| `-n` | integer | 0 |  Number of random translocations to simulate |
+| `--vcf` | `-v` | file path |  |  VCF file of known translocations to simulate |
 
 ===
 

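Review note on the ratio tables in `simulate-variants.md`: the `0` and `9999` conventions can be sketched as below. This is only an illustration, not how `simuG` or Harpy implements it. The function name `numerator_fraction` is hypothetical, and treating an ordinary ratio `r` as the proportion `r / (1 + r)` is an assumption based on the usual meaning of a ratio.

```python
def numerator_fraction(ratio: float) -> float:
    """Expected fraction of events in the 'numerator' class of a ratio.

    For --indel-ratio the numerator class is insertions, for --titv-ratio
    it is transitions, and so on. Hypothetical helper for illustration only.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio == 0:
        # Documented special value: only the denominator class is simulated
        # (e.g. deletions only, transversions only).
        return 0.0
    if ratio == 9999:
        # Documented special value: only the numerator class is simulated
        # (e.g. insertions only, transitions only).
        return 1.0
    # Assumption: a ratio r means r numerator events per 1 denominator event.
    return ratio / (1.0 + ratio)
```

Under that assumption, a ratio of `1` corresponds to an even split between the two classes.
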
diff --git a/Modules/deconvolve.md b/Modules/deconvolve.md
@@ -0,0 +1,66 @@
+---
+label: Deconvolve
+description: Resolve clashing barcodes from different molecules 
+icon: tag
+order: 10
+---
+
+# :icon-tag: Resolve clashing barcodes from different molecules
+
+===  :icon-checklist: You will need
+- paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"]
+    - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] or [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] 
+    - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] or [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] 
+    - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"]
+===
+
+
+
+Running [!badge corners="pill" text="deconvolve"] is **optional**. In the alignment
+workflows ([!badge corners="pill" text="align bwa"](Align/bwa.md) 
+[!badge corners="pill" text="align strobe"](Align/strobe.md)), Harpy already uses a distance-based approach to
+deconvolve barcodes and assign `MI` tags (Molecular Identifier), whereas the
+[!badge corners="pill" text="align ema"](Align/ema.md) workflow has the
+deconvolution occur within the `ema` aligner itself. This workflow uses a reference-free method,
+[QuickDeconvolution](https://github.com/RolandFaure/QuickDeconvolution), which uses k-mers to look at "read clouds" (all reads with the same linked-read barcode)
+and decide which ones likely originate from different molecules. Regardless of whether you run 
+this workflow or not, [!badge corners="pill" text="harpy align"](Align/Align.md) will still perform its own deconvolution.
+
+!!!danger Won't work with EMA
+Reads with deconvolved barcodes will not work with [!badge corners="pill" text="align ema"](Align/ema.md),
+since EMA expects barcodes to have a specific, un-hyphenated format. If deconvolving, use either
+[!badge corners="pill" text="align bwa"](Align/bwa.md) or [!badge corners="pill" text="align strobe"](Align/strobe.md)
+for sequence alignment.
+!!!
+
+
+!!! Also in harpy qc
+This method of deconvolution is also available as an option in the [!badge corners="pill" text="qc"](qc.md) workflow.
+!!!
+
+```bash usage
+harpy deconvolve OPTIONS... INPUTS...
+```
+
+## :icon-terminal: Running Options
+{.compact}
+| argument              | short name | type            | default | required | description                                                          |
+|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------|
+| `INPUTS`           |            | file/directory paths  |         | **yes**  | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments)    |
+| `--density`        |  `-d`      | integer       |    3   |   | On average, $\frac{1}{2^d}$ kmers are indexed  |
+| `--dropout`        |  `-a`      | integer       |    0   |   | Minimum cloud size to deconvolve  |
+| `--kmer-length`    |  `-k`      | integer       |    21   |   | Size of k-mers to search for similarities  |
+| `--window-size`    |  `-w`      | integer       |    40   |   | Size of window guaranteed to contain at least one kmer  |
+
+## Resulting Barcodes
+After deconvolution, some barcodes may have a hyphenated suffix like `-1` or `-2` (e.g. `A01C33B41D93-1`).
+This is how deconvolution methods create unique variants of barcodes to denote that identical barcodes
+do not come from the same original molecules. QuickDeconvolution adds the `-0` suffix to barcodes it was unable
+to deconvolve.
+
+## Harpy Deconvolution Nuances
+Some of the downstream linked-read tools Harpy uses expect linked read barcodes to either look like the 16-base 10X
+variety or a standard haplotag (AxxCxxBxxDxx). Their pattern-matching would not recognize barcodes deconvoluted with
+hyphens. To remedy this, `MI` assignment in [!badge corners="pill" text="align bwa"](Align/bwa.md)
+and [!badge corners="pill" text="align strobe"](Align/strobe.md) will assign the deconvolved (hyphenated) barcode to a `DX:Z`
+tag and restore the original barcode as the `BX:Z` tag.
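
Review note on the "Harpy Deconvolution Nuances" text in `deconvolve.md`: the `BX:Z` / `DX:Z` split it describes can be sketched as below. This is a minimal illustration, not Harpy's actual code. The function name, the regular expression, and the choice to return `None` when a barcode has no suffix are all assumptions; only the haplotag shape (AxxCxxBxxDxx) and the hyphenated suffix come from the documentation.

```python
import re
from typing import Optional, Tuple

# Haplotag barcodes look like AxxCxxBxxDxx; a deconvolved barcode may carry
# a hyphenated numeric suffix such as "-1". The exact pattern is an assumption.
HAPLOTAG = re.compile(r"^(A\d{2}C\d{2}B\d{2}D\d{2})(?:-(\d+))?$")


def split_deconvolved_barcode(barcode: str) -> Tuple[str, Optional[str]]:
    """Return (BX value, DX value) for a possibly deconvolved haplotag barcode.

    BX is the original, un-hyphenated barcode. DX is the full hyphenated
    barcode, or None when there is no suffix (hypothetical behaviour).
    """
    match = HAPLOTAG.match(barcode)
    if match is None:
        raise ValueError(f"not a haplotag-style barcode: {barcode!r}")
    original, suffix = match.groups()
    deconvolved = barcode if suffix is not None else None
    return original, deconvolved
```

For the example barcode in the documentation, `A01C33B41D93-1`, this keeps `A01C33B41D93` for `BX:Z` and the full hyphenated string for `DX:Z`.
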
diff --git a/Modules/demultiplex.md b/Modules/demultiplex.md
@@ -2,8 +2,7 @@
 label: Demultiplex
 description: Demultiplex raw sequences into haplotag barcoded samples
 icon: versions
-#visibility: hidden
-order: 6
+order: 9
 ---
 
 # :icon-versions: Demultiplex Raw Sequences
@@ -29,14 +28,14 @@ harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq.
 In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="demultiplex"] module is configured using these command-line arguments:
 
 {.compact}
-| argument          | short name | type       | default | required | description                                                             |
-|:------------------|:----------:|:-----------|:-------:|:--------:|:------------------------------------------------------------------------|
-| `R1_FQ`           |            | file path  |         | **yes**  | The forward multiplexed FASTQ file                                      |
-| `R2_FQ`           |            | file path  |         | **yes**  | The reverse multiplexed FASTQ file                                      |
-| `I1_FQ`           |            | file path  |         | **yes**  | The forward FASTQ index file provided by the sequencing facility        |
-| `I2_FQ`           |            | file path  |         | **yes**  | The reverse FASTQ index file provided by the sequencing facility        |
-| `METHOD`          |            | choice     |         | **yes**  | Haplotag technology of the sequences  [`gen1`]                          |
-| `--schema`        |    `-s`    | file path  |         | **yes**  | Tab-delimited file of sample\<tab\>barcode                              |
+| argument          | short name | type       | required | description                                                             |
+|:------------------|:----------:|:-----------|:--------:|:------------------------------------------------------------------------|
+| `METHOD`          |            | choice     | **yes**  | Haplotag technology of the sequences  [`gen1`]                          |
+| `R1_FQ`           |            | file path  | **yes**  | The forward multiplexed FASTQ file                                      |
+| `R2_FQ`           |            | file path  | **yes**  | The reverse multiplexed FASTQ file                                      |
+| `I1_FQ`           |            | file path  | **yes**  | The forward FASTQ index file provided by the sequencing facility        |
+| `I2_FQ`           |            | file path  | **yes**  | The reverse FASTQ index file provided by the sequencing facility        |
+| `--schema`        |    `-s`    | file path  | **yes**  | Tab-delimited file of sample\<tab\>barcode                              |
 
 ## Haplotag Types
 ==- Generation 1 - `gen1`