diff --git a/Modules/Align/index.yml b/Modules/Align/index.yml index f4d3a5328..b38bcc4d9 100644 --- a/Modules/Align/index.yml +++ b/Modules/Align/index.yml @@ -1,2 +1,2 @@ icon: quote -order: 5 \ No newline at end of file +order: 11 \ No newline at end of file diff --git a/Modules/SV/index.yml b/Modules/SV/index.yml index 78a6be290..564943d7f 100644 --- a/Modules/SV/index.yml +++ b/Modules/SV/index.yml @@ -1,2 +1,2 @@ icon: project-roadmap -order: 4 \ No newline at end of file +order: 1 \ No newline at end of file diff --git a/Modules/SV/leviathan.md b/Modules/SV/leviathan.md index 375a5842f..77f990852 100644 --- a/Modules/SV/leviathan.md +++ b/Modules/SV/leviathan.md @@ -66,9 +66,9 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op | argument | short name | type | default | required | description | |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------| | `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | | `--genome` | `-g` | file path | | yes | Genome assembly that was used to create alignments | | `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | -| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | ### Single-sample variant calling When **not** using a population grouping file via `--populations`, variants will be called per-sample. 
diff --git a/Modules/SV/naibr.md b/Modules/SV/naibr.md index b7cae6e37..6af0d4a89 100644 --- a/Modules/SV/naibr.md +++ b/Modules/SV/naibr.md @@ -66,11 +66,11 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op | argument | short name | type | default | required | description | |:-----------------|:----------:|:--------------|:-------:|:--------:|:---------------------------------------------------| | `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | | `--genome` | `-g` | file path | | **yes** | Genome assembly for phasing bam files | -| `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files | | `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | | `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | -| `--extra-params` | `-x` | string | | no | Additional naibr arguments, in quotes | +| `--vcf` | `-v` | file path | | **conditionally** | Phased vcf file for phasing bam files | ### Molecule distance The `--molecule-distance` option is used to let the program determine how far apart alignments on a contig with the same diff --git a/Modules/Simulate/index.yml b/Modules/Simulate/index.yml index 7fa938bf3..30d86e216 100644 --- a/Modules/Simulate/index.yml +++ b/Modules/Simulate/index.yml @@ -1,2 +1,2 @@ icon: flame -order: 5 \ No newline at end of file +order: 3 \ No newline at end of file diff --git a/Modules/Simulate/simulate-linkedreads.md b/Modules/Simulate/simulate-linkedreads.md index 84f99c94b..d43d1c07e 100644 --- a/Modules/Simulate/simulate-linkedreads.md +++ b/Modules/Simulate/simulate-linkedreads.md @@ -48,14 +48,14 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op 
|:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------| | `HAP1_GENOME` | | file path | | **yes** | Haplotype 1 of the diploid genome to simulate reads | | `HAP2_GENOME` | | file path | | **yes** | Haplotype 1 of the diploid genome to simulate reads | -| `--outer-distance` | `-d` | integer | 350 | | Outer distance between paired-end reads (bp) | -| `--distance-sd` | `-i` | integer | 15 | | Standard deviation of read-pair distance | | `--barcodes` | `-b` | file path | [10X barcodes](https://github.com/aquaskyline/LRSIM/blob/master/4M-with-alts-february-2016.txt) | | File of linked-read barcodes to add to reads | -| `--read-pairs` | `-n` | number | 600 | | Number (in millions) of read pairs to simulate | -| `--mutation-rate` | `-r` | number | 0.001 | | Random mutation rate for simulating reads (0 - 1.0) | +| `--distance-sd` | `-s` | integer | 15 | | Standard deviation of read-pair distance | | `--molecule-length` | `-l` | integer | 100 | | Mean molecule length (kbp) | -| `--patitions` | `-p` | integer | 1500 | | Number (in thousands) of partitions/beads to generate | | `--molecules-per` | `-m` | integer | 10 | | Average number of molecules per partition | +| `--mutation-rate` | `-r` | number | 0.001 | | Random mutation rate for simulating reads (0 - 1.0) | +| `--outer-distance` | `-d` | integer | 350 | | Outer distance between paired-end reads (bp) | +| `--partitions` | `-p` | integer | 1500 | | Number (in thousands) of partitions/beads to generate | +| `--read-pairs` | `-n` | number | 600 | | Number (in millions) of read pairs to simulate | ## Mutation Rate The read simulation is two-part: first `dwgsim` generates forward and reverse FASTQ files from the provided genome haplotypes diff --git a/Modules/Simulate/simulate-variants.md b/Modules/Simulate/simulate-variants.md index 759d05731..ba1cfb169 100644 --- a/Modules/Simulate/simulate-variants.md +++ 
b/Modules/Simulate/simulate-variants.md @@ -49,11 +49,11 @@ specific variants to simulate. There are also these unifying options among the d | argument | short name | type | description | | :-----|:-----|:-----|:-----| | `INPUT_GENOME` | | file path | The haploid genome to simulate variants onto. **REQUIRED** | -| `--prefix` | | string | Naming prefix for output files (default: `sim.{module_name}`)| -| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line | | `--centromeres` | `-c` | file path | GFF3 file of centromeres to avoid | +| `--exclude-chr` | `-e` | file path | Text file of chromosomes to avoid, one per line | | `--genes` | `-g` | file path | GFF3 file of genes to avoid simulating over (see `snpindel` for caveat) | | `--heterozygosity` | `-z` | float between [0,1] | [% heterozygosity to simulate diploid later](#heterozygosity) (default: `0`) | +| `--prefix` | | string | Naming prefix for output files (default: `sim.{module_name}`)| | `--randomseed` | | integer | Random seed for simulation | ==- 🟣 snps and indels @@ -64,28 +64,29 @@ Given software limitations, simulating many SNPs (>10,000) will be noticeably sl A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA ([source](https://www.genome.gov/genetics-glossary/Single-Nucleotide-Polymorphisms)). An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA ([insertions](https://www.genome.gov/genetics-glossary/Insertion), [deletions](https://www.genome.gov/genetics-glossary/Deletion)). -The snp and indel variants are combined in this module because `simuG` allows simulating them together. 
The -ratio parameters control different things for snp and indel variants and have special meanings when setting -the value to either `9999` or `0` : -- `--titv-ratio` - - `9999`: transitions only - - `0`: transversions only -- `--indel-ratio` - - `9999`: insertions only - - `0`: deletions only +The snp and indel variants are combined in this module because `simuG` allows simulating them together. {.compact} | argument | short name | type | default | description | |:------------------|:----------:|:-----------|:-------:|:-------------------------------------------------------------| -| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate | -| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate | -| `--snp-count` | `-n` | integer | 0 | Number of random snps to simluate | | `--indel-count` | `-m` | integer | 0 | Number of random indels to simluate | -| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps | +| `--indel-vcf` | `-i` | file path | | VCF file of known indels to simulate | | `--indel-ratio` | `-d` | float | 1 | Insertion/Deletion ratio for indels | | `--indel-size-alpha` | `-a` | float | 2.0 | Exponent Alpha for power-law-fitted indel size distribution| | `--indel-size-constant` | `-l` | float | 0.5 | Exponent constant for power-law-fitted indel size distribution | +| `--snp-count` | `-n` | integer | 0 | Number of random snps to simulate | | `--snp-gene-constraints` | `-y` | string | | How to constrain randomly simulated SNPs {`noncoding`,`coding`,`2d`,`4d`} when using `--genes`| +| `--snp-vcf`| `-s` | file path | | VCF file of known snps to simulate | +| `--titv-ratio` | `-r` | float | 0.5 | Transition/Transversion ratio for snps | + +The ratio parameters for snp and indel variants have special meanings when setting +the value to either `0` or `9999`: + +{.compact} +| ratio | `0` meaning | `9999` meaning | +|:---- |:---|:---| +| `--indel-ratio` | deletions only | insertions only | +| 
`--titv-ratio` | transversions only | transitions only | ==- 🔵 inversions ### inversion @@ -94,35 +95,34 @@ Inversions are when a section of a chromosome appears in the reverse orientation {.compact} | argument | short name | type | default | description | |:------------------|:----------:|:-----------|:-------:|:----------------| -| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | | `--count`| `-n` | integer | 0 | Number of random inversions to simluate | -| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) | | `--max-size` | `-x` | integer | 100000 | Maximum inversion size (bp) | +| `--min-size` | `-m` | integer | 1000 | Minimum inversion size (bp) | +| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | ==- 🟢 copy number variants ### cnv A copy number variation (CNV) is when the number of copies of a particular gene varies -between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)) -The ratio parameters control different things and have special meanings when setting -the value to either `9999` or `0` : -- `--dup-ratio` - - `9999`: tandem duplications only - - `0`: dispersed duplications only -- `--gain-ratio` - - `9999`: gain only - - `0`: loss only +between individuals ([source](https://www.genome.gov/genetics-glossary/Copy-Number-Variation)). 
{.compact} | argument | short name | type | default | description | |:------------------|:----------:|:-----------|:-------:|:----------------| | `--vcf` | `-v` | file path | | VCF file of known copy number variants to simulate | | `--count` | `-n` | integer | 0 | Number of random cnv to simluate | -| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) | -| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) | -| `--max-copy` | `-y` | integer | 10 | Maximum number of copies | | `--dup-ratio` | `-d` | float | 1 | Tandem/Dispersed duplication ratio | | `--gain-ratio` |`-l` | float | 1 | Relative ratio of DNA gain over DNA loss | +| `--max-size`| `-x` | integer |100000 | Maximum cnv size (bp) | +| `--max-copy` | `-y` | integer | 10 | Maximum number of copies | +| `--min-size` | `-m` | integer | 1000 | Minimum cnv size (bp) | +The ratio parameters have special meanings when setting the value to either `0` or `9999`: + +{.compact} +| ratio | `0` meaning | `9999` meaning | +|:---- |:---|:---| +| `--dup-ratio` | dispersed duplications only | tandem duplications only | +| `--gain-ratio` | loss only | gain only | ==- 🟡 translocations ### translocation @@ -131,8 +131,8 @@ A translocation occurs when a chromosome breaks and the fragmented pieces re-att {.compact} | argument | short name | type | default | description | |:------------------|:----------:|:-----------|:-------:|:----------------| -| `--vcf` | `-v` | file path | | VCF file of known inversions to simulate | | `--count`| `-n` | integer | 0 | Number of random inversions to simluate | +| `--vcf` | `-v` | file path | | VCF file of known translocations to simulate | === diff --git a/Modules/deconvolve.md b/Modules/deconvolve.md new file mode 100644 index 000000000..94499bce9 --- /dev/null +++ b/Modules/deconvolve.md @@ -0,0 +1,66 @@ +--- +label: Deconvolve +description: Resolve clashing barcodes from different molecules +icon: tag +order: 10 +--- + +# :icon-tag: Resolve clashing barcodes from different 
molecules + +=== :icon-checklist: You will need +- paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"] + - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] or [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] + - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] or [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] +=== + + + +Running [!badge corners="pill" text="deconvolve"] is **optional**. In the alignment +workflows ([!badge corners="pill" text="align bwa"](Align/bwa.md) +[!badge corners="pill" text="align strobe"](Align/strobe.md)), Harpy already uses a distance-based approach to +deconvolve barcodes and assign `MI` tags (Molecular Identifier), whereas the +[!badge corners="pill" text="align ema"](Align/ema.md) workflow has the +deconvolution occur within the `ema` aligner itself. This workflow uses a reference-free method, +[QuickDeconvolution](https://github.com/RolandFaure/QuickDeconvolution), which uses k-mers to look at "read clouds" (all reads with the same linked-read barcode) +and decide which ones likely originate from different molecules. Regardless of whether you run +this workflow or not, [!badge corners="pill" text="harpy align"](Align/Align.md) will still perform its own deconvolution. 
+ +!!!danger Won't work with EMA +Reads with deconvolved barcodes will not work with [!badge corners="pill" text="align ema"](Align/ema.md), +since EMA expects barcodes to have a specific, un-hyphenated format. If deconvolving, use either +[!badge corners="pill" text="align bwa"](Align/bwa.md) or [!badge corners="pill" text="align strobe"](Align/strobe.md) +for sequence alignment. +!!! + + +!!! Also in harpy qc +This method of deconvolution is also available as an option in the [!badge corners="pill" text="qc"](qc.md) workflow +!!! + +```bash usage +harpy deconvolve OPTIONS... INPUTS... +``` + +## :icon-terminal: Running Options +{.compact} +| argument | short name | type | default | required | description | +|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--density` | `-d` | integer | 3 | | On average, $\frac{1}{2^d}$ kmers are indexed | +| `--dropout` | `-a` | integer | 0 | | Minimum cloud size to deconvolve | +| `--kmer-length` | `-k` | integer | 21 | | Size of k-mers to search for similarities | +| `--window-size` | `-w` | integer | 40 | | Size of window guaranteed to contain at least one kmer | + +## Resulting Barcodes +After deconvolution, some barcodes may have a hyphenated suffix like `-1` or `-2` (e.g. `A01C33B41D93-1`). +This is how deconvolution methods create unique variants of barcodes to denote that identical barcodes +do not come from the same original molecules. QuickDeconvolution adds the `-0` suffix to barcodes it was unable +to deconvolve. + +## Harpy Deconvolution Nuances +Some of the downstream linked-read tools Harpy uses expect linked read barcodes to either look like the 16-base 10X +variety or a standard haplotag (AxxCxxBxxDxx). 
Their pattern-matching would not recognize barcodes deconvoluted with +hyphens. To remedy this, `MI` assignment in [!badge corners="pill" text="align bwa"](Align/bwa.md) +and [!badge corners="pill" text="align strobe"](Align/strobe.md) will assign the deconvolved (hyphenated) barcode to a `DX:Z` +tag and restore the original barcode as the `BX:Z` tag. \ No newline at end of file diff --git a/Modules/demultiplex.md b/Modules/demultiplex.md index 2772b3f18..82c60c81b 100644 --- a/Modules/demultiplex.md +++ b/Modules/demultiplex.md @@ -2,8 +2,7 @@ label: Demultiplex description: Demultiplex raw sequences into haplotag barcoded samples icon: versions -#visibility: hidden -order: 6 +order: 9 --- # :icon-versions: Demultiplex Raw Sequences @@ -29,14 +28,14 @@ harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq. In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="demultiplex"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:------------------|:----------:|:-----------|:-------:|:--------:|:------------------------------------------------------------------------| -| `R1_FQ` | | file path | | **yes** | The forward multiplexed FASTQ file | -| `R2_FQ` | | file path | | **yes** | The reverse multiplexed FASTQ file | -| `I1_FQ` | | file path | | **yes** | The forward FASTQ index file provided by the sequencing facility | -| `I2_FQ` | | file path | | **yes** | The reverse FASTQ index file provided by the sequencing facility | -| `METHOD` | | choice | | **yes** | Haplotag technology of the sequences [`gen1`] | -| `--schema` | `-s` | file path | | **yes** | Tab-delimited file of sample\barcode | +| argument | short name | type | required | description | 
+|:------------------|:----------:|:-----------|:--------:|:------------------------------------------------------------------------| +| `METHOD` | | choice | **yes** | Haplotag technology of the sequences [`gen1`] | +| `R1_FQ` | | file path | **yes** | The forward multiplexed FASTQ file | +| `R2_FQ` | | file path | **yes** | The reverse multiplexed FASTQ file | +| `I1_FQ` | | file path | **yes** | The forward FASTQ index file provided by the sequencing facility | +| `I2_FQ` | | file path | **yes** | The reverse FASTQ index file provided by the sequencing facility | +| `--schema` | `-s` | file path | **yes** | Tab-delimited file of sample\<*tab*\>barcode | ## Haplotag Types ==- Generation 1 - `gen1` diff --git a/Modules/impute.md b/Modules/impute.md index ae29b4a2b..c06638eb5 100644 --- a/Modules/impute.md +++ b/Modules/impute.md @@ -2,7 +2,7 @@ label: Impute description: Impute genotypes for haplotagged data with Harpy icon: workflow -order: 3 +order: 8 --- # :icon-workflow: Impute Genotypes using Sequences @@ -56,10 +56,10 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op | argument | short name | type | default | required | description | |:---------------|:----------:|:------------|:-------------:|:--------:|:------------------------------------------------------------------------------------------------| | `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md) | -| `--vcf` | `-v` | file path | | **yes** | Path to VCF/BCF file | | `--extra-params` | `-x` | folder path | | no | Extra arguments to add to the STITCH R function, provided in quotes and R syntax | -| `--vcf-samples`| | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found the directory | | `--parameters` | `-p` | file path | | **yes** | STITCH [parameter file](#parameter-file) (tab-delimited) | +| `--vcf` | `-v` | file path | | **yes** | Path to VCF/BCF 
file | +| `--vcf-samples`| | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found in the directory | ### Extra STITCH parameters You may add [additional parameters](https://github.com/rwdavies/STITCH/blob/master/Options.md) to STITCH by way of the diff --git a/Modules/other.md b/Modules/other.md index fc6eca91c..c48a0cd75 100644 --- a/Modules/other.md +++ b/Modules/other.md @@ -1,8 +1,8 @@ --- label: Other -order: 1 icon: file-diff description: Generate extra files for analysis with Harpy +order: 7 --- # :icon-file-diff: Other Harpy modules diff --git a/Modules/phase.md b/Modules/phase.md index 6720327ff..355568109 100644 --- a/Modules/phase.md +++ b/Modules/phase.md @@ -2,7 +2,7 @@ label: Phase description: Phase haplotypes for haplotagged data with Harpy icon: stack -order: 2 +order: 6 --- # :icon-stack: Phase SNPs into Haplotypes @@ -38,13 +38,13 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op | argument | short name | type | default | required | description | |:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| | `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | -| `--vcf` | `-v` | file path | | **yes** | Path to BCF/VCF file | +| `--extra-params` | `-x` | string | | no | Additional Hapcut2 arguments, in quotes | | `--genome ` | `-g` | file path | | no | Path to genome if wanting to also use reads spanning indels | +| `--ignore-bx` | `-b` | toggle | | no | Ignore haplotag barcodes for phasing | | `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | | `--prune-threshold` | `-p` | integer (0-100) | 7 | no | PHRED-scale (%) threshold for pruning low-confidence SNPs | -| `--ignore-bx` | `-b` | toggle | | no | Ignore haplotag barcodes for 
phasing | +| `--vcf` | `-v` | file path | | **yes** | Path to BCF/VCF file | | `--vcf-samples` | | toggle | | no | [Use samples present in vcf file](#prioritize-the-vcf-file) for imputation rather than those found the directory | -| `--extra-params` | `-x` | string | | no | Additional Hapcut2 arguments, in quotes | ### Prioritize the vcf file Sometimes you want to run imputation on all the samples present in the `INPUTS`, but other times you may want diff --git a/Modules/preflight.md b/Modules/preflight.md index 6a523806c..9086b97b6 100644 --- a/Modules/preflight.md +++ b/Modules/preflight.md @@ -2,8 +2,7 @@ label: Preflight description: Run file format checks on haplotagged FASTQ/BAM files icon: rocket -#visibility: hidden -order: 6 +order: 5 --- # :icon-rocket: Pre-flight checks for input files @@ -12,6 +11,9 @@ order: 6 - at least 2 cores/threads available - [!badge corners="pill" text="preflight bam"]: SAM/BAM alignment files [!badge variant="secondary" text="BAM recommended"] - [!badge corners="pill" text="preflight fastq"]: paired-end reads from an Illumina sequencer in FASTQ format [!badge variant="secondary" text="gzip recommended"] + - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] or [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] + - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] or [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] 
=== Harpy does a lot of stuff with a lot of software and each of these programs expect the incoming data to follow particular formats (plural, unfortunately). diff --git a/Modules/qc.md b/Modules/qc.md index ead1d7785..283d3eddd 100644 --- a/Modules/qc.md +++ b/Modules/qc.md @@ -2,7 +2,7 @@ label: QC description: Quality trim haplotagged sequences with Harpy icon: codescan-checkmark -order: 6 +order: 4 --- # :icon-codescan-checkmark: Quality Trim Sequences @@ -16,8 +16,8 @@ order: 6 Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence -analyses is to remove these adapters and trim poor quality data. You can remove adapters -and quality trim sequences using the [!badge corners="pill" text="qc"] module: +analyses is to remove these adapters and trim poor quality data. You can remove adapters, +remove duplicates, deconvolve, and quality trim sequences using the [!badge corners="pill" text="qc"] module: ```bash usage harpy qc OPTIONS... INPUTS... 
@@ -31,13 +31,21 @@ harpy qc --threads 20 Sequences_Raw/ In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="qc"] module is configured using these command-line arguments: {.compact} -| argument | short name | type | default | required | description | -|:-----------------|:----------:|:------------|:-------:|:-------:|:------------------------------------------------------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | -| `--min-length` | `-n` | integer | 30 | no | Discard reads shorter than this length | -| `--max-length` | `-m` | integer | 150 | no | Maximum length to trim sequences down to | -| `--ignore-adapters` | `-x` | toggle | | no | Skip adapter trimming | -| `--extra-params` | `-x` | string | | no | Additional fastp arguments, in quotes | +| argument | short name | type | default | required | description | +|:-----------------|:----------:|:------------|:-------:|:-------:|:--------------------------------------------------------------------------------------------------| +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | +| `--deconvolve` | `-c` | toggle | | | Resolve barcode clashes between reads from different molecules | +| `--deconvolve-params` | `-p` | (int,int,int,int) | (21,40,3,0) | | Accepts the QuickDeconvolution parameters for `k`,`w`,`d`,`a`, in that order | +| `--deduplicate` | `-d` | toggle | | | Identify and remove PCR duplicates | +| `--extra-params` | `-x` | string | | | Additional fastp arguments, in quotes | +| `--min-length` | `-n` | integer | 30 | | Discard reads shorter than this length | +| `--max-length` | `-m` | integer | 150 | | Maximum length to trim sequences down to | +| `--trim-adapters` | `-a` | toggle | | | Detect 
and remove adapter sequences | + +By default, this workflow will only quality-trim the sequences. You can also opt-in to: +- [!badge variant="secondary" text="recommended"] find and remove sequencing adapters +- [!badge variant="secondary" text="recommended"] find and remove PCR duplicates +- resolve situations where reads from different molecules have the same barcode (see [!badge corners="pill" text="deconvolve"](deconvolve.md)) --- ## :icon-git-pull-request: QC Workflow @@ -54,6 +62,7 @@ graph LR end Inputs-->A:::clean A([fastp]) --> B([count barcodes]):::clean + A-->|--deconvolve|C([QuickDeconvolution]):::clean style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px ``` diff --git a/Modules/snp.md b/Modules/snp.md index 864b35532..85dd241bc 100644 --- a/Modules/snp.md +++ b/Modules/snp.md @@ -2,7 +2,7 @@ label: SNP description: Call SNPs and small indels icon: sliders -order: 5 +order: 2 --- # :icon-sliders: Call SNPs and small indels @@ -60,12 +60,12 @@ In addition to the [!badge variant="info" corners="pill" text="common runtime op {.compact} | argument | short name | type | default | required | description | |:-----------------|:----------:|:--------------------------------|:-------:|:--------:|:----------------------------------------------------| -| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input BAM files](/commonoptions.md#input-arguments) | +| `--extra-params` | `-x` | string | | no | Additional mpileup/freebayes arguments, in quotes | | `--genome` | `-g` | file path | | **yes** | Genome assembly for variant calling | -| `--regions` | `-r` | integer/file path/string | 50000 | no | Regions to call variants on ([see below](#regions)) | -| `--populations` | `-p` | file path | | no | Tab-delimited file of 
sample\<*tab*\>group | | `--ploidy` | `-x` | integer | 2 | no | Ploidy of samples | -| `--extra-params` | `-x` | string | | no | Additional mpileup/freebayes arguments, in quotes | +| `--populations` | `-p` | file path | | no | Tab-delimited file of sample\<*tab*\>group | +| `--regions` | `-r` | integer/file path/string | 50000 | no | Regions to call variants on ([see below](#regions)) | ### regions The `--regions` (`-r`) option lets you specify the genomic regions you want to call variants on. Keep in mind that diff --git a/development.md b/development.md index 3b92caa09..eb93e849a 100644 --- a/development.md +++ b/development.md @@ -175,6 +175,7 @@ build a new Dockerfile and tag it with the same git tag for Harpy's next release In doing so, it will also replace the tag of the container in all of Harpy's snakefiles from `latest` to the current Harpy version. In other words, during development the top of every snakefile reads `containerized: docker://pdimens/harpy:latest` and the automation replaces it with (e.g.) `containerized: docker://pdimens/harpy:1.17`. +The same applies to the software version, which is kept at `0.0.0` (in `pyproject.toml` and `__main__.py`) during development and is replaced with the tagged version by the automation. Tagging is easily accomplished with Git commands in the command line: ```bash # make sure you're on the main branch diff --git a/retype.yml b/retype.yml index 981af9acf..18f894eec 100644 --- a/retype.yml +++ b/retype.yml @@ -24,7 +24,7 @@ footer: copyright: "© Copyright {{ year }}. All rights reserved."
branding: title: Harpy - label: v1.1 + label: v1.2.0 logo: static/favicon.png logoDark: static/favicon.png logoAlign: left diff --git a/software.md b/software.md index 2e5faa903..9a5db987e 100644 --- a/software.md +++ b/software.md @@ -13,9 +13,9 @@ Issues with specific tools might warrant a discussion with the authors/developer ## Standalone Software {.compact} | Software | Links | Publication | -|:------------|:-------------------------:| :-------------------------------------------------------------------------------------------| +|:------------|:-------------------------| :-------------------------------------------------------------------------------------------| | bash | [website](https://www.gnu.org/software/bash/) | -| bcftools | [github](https://github.com/samtools/bcftools), [website](https://samtools.github.io/bcftools/bcftools.html) | | +| bcftools | [github](https://github.com/samtools/bcftools), [website](https://samtools.github.io/bcftools/bcftools.html) | | | bgzip | [website](http://www.htslib.org/doc/bgzip.html) | | | bwa | [github](https://github.com/lh3/bwa)| [publication](http://arxiv.org/abs/1303.3997) | | conda | [github](https://github.com/conda) | | @@ -29,8 +29,9 @@ Issues with specific tools might warrant a discussion with the authors/developer | NAIBR | [github](https://github.com/raphael-group/NAIBR), [github (fork)](https://github.com/pontushojer/NAIBR) |[publication](https://doi.org/10.1093/bioinformatics/btx712) | | plotly | [website](https://plotly.com/) | | | python | [website](https://www.python.org/) | | +| QuickDeconvolution | [github](https://github.com/RolandFaure/QuickDeconvolution) | [publication](https://doi.org/10.1093/bioadv/vbac068) | | R | [website](https://www.r-project.org/) | | -| samtools | [github](https://github.com/samtools/samtools), [website](http://www.htslib.org/) | | +| samtools | [github](https://github.com/samtools/samtools), [website](http://www.htslib.org/) | | | seqtk | 
[github](https://github.com/lh3/seqtk) | | | simuG | [github](https://github.com/aquaskyline/LRSIM) | [publication](https://doi.org/10.1093/bioinformatics/btz424) | | Snakemake | [github](https://github.com/snakemake/snakemake)| [publication](https://f1000research.com/articles/10-33/v1) | @@ -40,7 +41,7 @@ Issues with specific tools might warrant a discussion with the authors/developer ## Software Packages {.compact} | Package | Language | Links | Publication | -|:------------|:-----: |:-----------:|:------------------------------------------------------------------------------------------------------| +|:------------|:-----: |:-----------|:------------------------------------------------------------------------------------------------------| | click | python | [github](https://github.com/pallets/click) | | | pysam | python | [github](https://github.com/pysam-developers/pysam) | | | r-biocircos | R | [github](https://github.com/lvulliard/BioCircos.R) | |