diff --git a/Modules/Align/Align.md b/Modules/Align/Align.md index a7d446fb3..ab8d9b366 100644 --- a/Modules/Align/Align.md +++ b/Modules/Align/Align.md @@ -9,11 +9,11 @@ will need to align them to a reference genome before you can call variants. Harpy offers several aligners for this purpose: {.compact} -| aligner | linked-read aware | speed | link | -| :--- | :---: | :---:| :---: | -| [BWA](bwa.md) | no ❌ | fast ⚡ | [repo](https://github.com/lh3/bwa), [paper](http://arxiv.org/abs/1303.3997) | -| [EMA](ema.md) | yes ✅ | slow 🐢 |[repo](https://github.com/arshajii/ema), [paper](https://www.biorxiv.org/content/early/2017/11/16/220236) | -| [Minimap2](minimap.md) | no ❌ | fast ⚡ | [repo](https://github.com/lh3/minimap2) [paper](https://doi.org/10.1093/bioinformatics/btab705) | +| aligner | linked-read aware | speed | repository | publication | +| :--- | :---: | :---:| :---: | :---:| +| [BWA](bwa.md) | no ❌ | fast ⚡ | [github](https://github.com/lh3/bwa) | [paper](http://arxiv.org/abs/1303.3997) | +| [EMA](ema.md) | yes ✅ | slow 🐢 |[github](https://github.com/arshajii/ema) | [preprint](https://www.biorxiv.org/content/early/2017/11/16/220236) | +| [strobealign](strobe.md) | no ❌ | super fast ⚡⚡ | [github](https://github.com/ksahlin/strobealign) | [paper](https://doi.org/10.1186/s13059-022-02831-7) | -Despite the fact that EMA is the only barcode-aware aligner offered, when using BWA or Minimap2, Harpy retains the barcode information from the sequence headers and will +Despite the fact that EMA is the only barcode-aware aligner offered, when using BWA or strobealign, Harpy retains the barcode information from the sequence headers and will assign molecule identifiers (`MI:i` SAM tags) based on these barcodes and the [molecule distance threshold](../../haplotagdata.md/#barcode-thresholds). 
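The molecule assignment described above can be sketched roughly as follows. This is a minimal illustration and not Harpy's actual implementation: the function name `assign_molecules`, the three-column input layout (barcode, contig, start position), and the rule of measuring the gap between consecutive start positions are all assumptions made for the example. It starts a new molecule whenever the barcode or contig changes, or when the gap to the previous alignment exceeds the threshold.

```bash
# Hypothetical sketch: assign molecule IDs from barcode + distance threshold.
# Input (stdin): tab-separated barcode, contig, start position,
# pre-sorted by barcode, then contig, then position.
# The first argument is the distance threshold in base pairs (default 100000).
assign_molecules() {
    local threshold="${1:-100000}"
    awk -F'\t' -v OFS='\t' -v d="$threshold" '
        {
            # start a new molecule when the barcode or contig changes,
            # or when the gap to the previous alignment exceeds the threshold
            if (NR == 1 || $1 != bx || $2 != contig || ($3 - pos) > d) { mi++ }
            bx = $1; contig = $2; pos = $3
            print $1, $2, $3, "MI:i:" mi
        }
    '
}
```

With a threshold of 100000, two alignments sharing a barcode and lying 4900 bp apart would receive the same identifier, while a third alignment 295000 bp further along the same contig would start a new molecule.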
\ No newline at end of file diff --git a/Modules/Align/bwa.md b/Modules/Align/bwa.md index 5f1bde1e8..0e090cc19 100644 --- a/Modules/Align/bwa.md +++ b/Modules/Align/bwa.md @@ -10,6 +10,9 @@ order: 5 - at least 4 cores/threads available - a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] - paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) [!badge variant="secondary" text="gzipped recommended"] + - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] + - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] === Once sequences have been trimmed and passed through other QC filters, they will need to @@ -123,6 +126,7 @@ Align/bwa │ ├── sample1.markdup.log │ │── sample1.sort.log └── reports + ├── barcodes.summary.html ├── bwa.stats.html ├── Sample1.html └── data @@ -140,6 +144,7 @@ Align/bwa | `logs/*markdup.log` | stats provided by `samtools markdup` | | `logs/*sort.log` | output of `samtools sort` | | `reports/` | various counts/statistics/reports relating to sequence alignment | +| `reports/barcodes.summary.html` | 
interactive html report summarizing barcode-specific metrics across all samples | | `reports/bwa.stats.html` | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` | | `reports/Sample1.html` | interactive html report summarizing BX tag metrics and alignment coverage | | `reports/data/coverage/*.cov.gz` | output from samtools cov, used for plots | @@ -173,16 +178,13 @@ These are taken directly from the [BWA documentation](https://bio-bwa.sourceforg +++ :icon-graph: reports These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail. -||| Depth and coverage -Reports the depth of alignments in 10kb windows. +||| Alignment BX Information +An aggregate report of barcode-specific alignment information for all samples. ![reports/coverage/*.html](/static/report_align_coverage.png) -||| BX validation -Reports the number of valid/invalid barcodes in the alignments. -![reports/reads.bxstats.html](/static/report_align_bxstats.png) -||| Molecule size +||| Molecule size and Coverage Reports the inferred molecule sized based on barcodes in the alignments. 
![reports/BXstats/*.bxstats.html](/static/report_align_bxmol.png) -||| Alignment stats +||| Samtools Alignment stats Reports the general statistics computed by samtools `stats` and `flagstat` ![reports/samtools_*stat/*html](/static/report_align_flagstat.png) ||| diff --git a/Modules/Align/ema.md b/Modules/Align/ema.md index 147d8d3a8..84a8c46a5 100644 --- a/Modules/Align/ema.md +++ b/Modules/Align/ema.md @@ -10,6 +10,9 @@ order: 5 - at least 4 cores/threads available - a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] - paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) [!badge variant="secondary" text="gzipped recommended"] + - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] + - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] - patience because EMA is [!badge variant="warning" text="slow"] ==- Why EMA? The original haplotag manuscript uses BWA to map reads. 
The authors have since recommended @@ -144,8 +147,8 @@ Align/ema │ └── preproc │    └── Sample1.preproc.log └── reports + ├── barcodes.summary.html ├── ema.stats.html - ├── reads.bxcounts.html ├── Sample1.html └── data    ├── bxstats @@ -162,7 +165,7 @@ Align/ema | `logs/preproc/*.preproc.log` | everything `ema preproc` writes to `stderr` during operation | | `reports/` | various counts/statistics/reports relating to sequence alignment | | `reports/ema.stats.html` | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` | -| `reports/reads.bxcounts.html` | interactive html report summarizing `ema count` across all samples | +| `reports/barcodes.summary.html` | interactive html report summarizing barcode-specific metrics across all samples | | `reports/Sample1.html` | interactive html report summarizing BX tag metrics and alignment coverage | | `reports/data/coverage/*.cov.gz` | output from samtools cov, used for plots | | `reports/data/bxstats` | tabular data containing the information used to generate the BX stats in reports | @@ -184,16 +187,13 @@ These are taken directly from the [EMA documentation](https://github.com/arshaji These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail. -||| Depth and coverage -Reports the depth of alignments in 10kb windows. +||| Alignment BX Information +An aggregate report of barcode-specific alignment information for all samples. ![reports/coverage/*.html](/static/report_align_coverage.png) -||| BX validation -Reports the number of valid/invalid barcodes in the alignments. -![reports/reads.bxstats.html](/static/report_align_bxstats.png) -||| Molecule size +||| Molecule size and Coverage Reports the inferred molecule sized based on barcodes in the alignments. 
![reports/BXstats/*.bxstats.html](/static/report_align_bxmol.png) -||| Alignment stats +||| Samtools Alignment stats Reports the general statistics computed by samtools `stats` and `flagstat` ![reports/samtools_*stat/*html](/static/report_align_flagstat.png) ||| diff --git a/Modules/Align/minimap.md b/Modules/Align/strobe.md similarity index 51% rename from Modules/Align/minimap.md rename to Modules/Align/strobe.md index 9229cb34b..6f24369da 100644 --- a/Modules/Align/minimap.md +++ b/Modules/Align/strobe.md @@ -1,41 +1,59 @@ --- -label: Minimap -description: Align haplotagged sequences with Minimap2 +label: Strobe +description: Align haplotagged sequences with strobealign icon: dot order: 5 --- -# :icon-quote: Map Reads onto a genome with Minimap2 +# :icon-quote: Map Reads onto a genome with strobealign === :icon-checklist: You will need - at least 4 cores/threads available - a genome assembly in FASTA format: [!badge variant="success" text=".fasta"] [!badge variant="success" text=".fa"] [!badge variant="success" text=".fasta.gz"] [!badge variant="success" text=".fa.gz"] - paired-end fastq sequence file with the [proper naming convention](/haplotagdata/#naming-conventions) [!badge variant="secondary" text="gzipped recommended"] + - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] [!badge variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] + - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] + - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" 
text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] === Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input, such as those derived using [!badge corners="pill" text="harpy qc"](../qc.md). You can map reads onto a genome assembly with Harpy -using the [!badge corners="pill" text="align minimap"] module: +using the [!badge corners="pill" text="align strobe"] module: ```bash usage -harpy align minimap OPTIONS... INPUTS... +harpy align strobe OPTIONS... INPUTS... ``` ```bash example -harpy align minimap --genome genome.fasta Sequences/ +harpy align strobe --genome genome.fasta Sequences/ ``` ## :icon-terminal: Running Options -In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="align minimap"] module is configured using these command-line arguments: +In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="align strobe"] module is configured using these command-line arguments: {.compact} | argument | short name | type | default | required | description | |:-------------------|:----------:|:----------------------|:-------:|:--------:|:------------------------------------------------------| | `INPUTS` | | file/directory paths | | **yes** | Files or directories containing [input FASTQ files](/commonoptions.md#input-arguments) | | `--genome` | `-g` | file path | | **yes** | Genome assembly for read mapping | -| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | +| `--read-length` | `-r` | choice | `auto` | no | Average read length for creating index. 
Options: [auto, 50, 75, 100, 125, 150, 250, 400] | +| `--molecule-distance` | `-m` | integer | 100000 | no | Base-pair distance threshold to separate molecules | | `--quality-filter` | `-f` | integer (0-40) | 30 | no | Minimum `MQ` (SAM mapping quality) to pass filtering | | `--extra-params` | `-x` | string | | no | Additional EMA-align/BWA arguments, in quotes | +### Read Length +The strobealign program uses a new _strobemer_ design for aligning and requires its own way of indexing the genome. +The index must be configured for the average read length of **the sample** being aligned. If your samples are all about +the same length (on average), then you may specify a read length for `-r` from one of [`50`, `75`, `100`, `125`, `150`, `250`, `400`], +which correspond to base pairs. Specifying an average read length would create a genomic index once and align samples afterwards, +cutting down the time and disk usage for the workflow. If choosing `auto` (the default), strobealign will create an index on the fly +for each sample, guesstimating the average read length for that sample from the first 500 sequences in the FASTQ files. + +!!! Read lengths +Keep in mind that your sequences should have had their adapters removed by this point, which means the maximum length would be ~25bp less +than the total read length from your sequencer. In other words, if you have 2x150bp reads, your average +read length will likely not exceed 125bp after adapter removal. +!!! + ### Molecule distance The `--molecule-distance` option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier `MI:i` tag based on their @@ -76,16 +94,16 @@ if the primary alignment was marked as a duplicate. 
Duplicates get marked but ** ---- -## :icon-git-pull-request: Minimap workflow +## :icon-git-pull-request: Strobealign workflow +++ :icon-git-merge: details - ignores (but retains) barcode information - ultra-fast -- comparable accuracy to BWA MEM for sequences greater than 100bp +- [as-good-or-better accuracy](https://github.com/ksahlin/strobealign/blob/main/evaluation.md) compared with BWA MEM for sequences greater than 100bp - accuracy may be lower for sequences less than 100bp -The [minimap2](https://github.com/lh3/minimap2) workflow is nearly identical to the BWA workflow, +The [strobealign](https://github.com/ksahlin/strobealign) workflow is nearly identical to the BWA workflow, the only real difference being how the input genome is indexed and that alignment is performed with -`minimap2` instead of BWA. Duplicates are marked using `samtools markdup`. +`strobealign` instead of BWA. Duplicates are marked using `samtools markdup`. The `BX:Z` tags in the read headers are still added to the alignment headers, even though barcodes are not used to inform mapping. The `-m` threshold is used for alignment molecule assignment. @@ -112,19 +130,20 @@ graph LR style aln fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px ``` -+++ :icon-file-directory: minimap2 output -The default output directory is `Align/minimap` with the folder structure below. `Sample1` is a generic sample name for demonstration purposes. ++++ :icon-file-directory: strobealign output +The default output directory is `Align/strobealign` with the folder structure below. `Sample1` is a generic sample name for demonstration purposes. The resulting folder also includes a `workflow` directory (not shown) with workflow-relevant runtime files and information. 
``` -Align/minimap +Align/strobealign ├── Sample1.bam ├── Sample1.bam.bai ├── logs -│ ├── sample1.minimap.log +│ ├── sample1.strobealign.log │ ├── sample1.markdup.log │ │── sample1.sort.log └── reports - ├── minimap.stats.html + ├── barcodes.summary.html + ├── strobealign.stats.html ├── Sample1.html └── data    ├── bxstats @@ -137,58 +156,62 @@ Align/minimap |:---------|:------------------------------------------------------------------------------------------------------------| | `*.bam` | sequence alignments for each sample | | `*.bai` | sequence alignment indexes for each sample | -| `logs/*bwa.log` | output of minimap2 during run | +| `logs/*strobealign.log` | output of strobealign during run | | `logs/*markdup.log` | stats provided by `samtools markdup` | | `logs/*sort.log` | output of `samtools sort` | | `reports/` | various counts/statistics/reports relating to sequence alignment | -| `reports/minimap.stats.html` | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` | +| `reports/barcodes.summary.html` | interactive html report summarizing barcode-specific metrics across all samples | +| `reports/strobealign.stats.html` | report summarizing `samtools flagstat and stats` results across all samples from `multiqc` | | `reports/Sample1.html` | interactive html report summarizing BX tag metrics and alignment coverage | | `reports/data/coverage/*.cov.gz` | output from samtools cov, used for plots | | `reports/data/bxstats` | tabular data containing the information used to generate the BX stats in reports | -+++ :icon-code-square: minimap2 parameters -By default, Harpy runs `minimap2` with these parameters (excluding inputs and outputs): ++++ :icon-code-square: strobealign parameters +By default, Harpy runs `strobealign` with these parameters (excluding inputs and outputs): ```bash -minimap2 -ax sr -y --sam-hit-only -R 
\"@RG\\tID:samplename\\tSM:samplename\" +strobealign [--use-index -r ...] -t THREADS -U -C --rg-id={sample} --rg=SM:{sample} {input.genome} {input.fastq} ``` -The Minimap2 aligner has a lot of parameters that can be provided, too many to list here, so please refer to the -[Minimap2 documentation](https://lh3.github.io/minimap2/minimap2.html). The `-a` indicates to output a SAM format -file and the `-x sr` argument is a preset for short reads with these parameters: -- `-k21` -- `-w11` -- `--sr` -- `--frag=yes` -- `-A2` -- `-B8` -- `-O12,32` -- `-E2,1` -- `-b0` -- `-r100` -- `-p.5` -- `-N20` -- `-f1000,5000` -- `-n2` -- `-m20` -- `-s40` -- `-g100` -- `-2K50m` -- `--heap-sort=yes` -- `--secondary=no` +Below is a list of all `strobealign` command line arguments, excluding those Harpy already uses or those made redundant by Harpy's implementation of it. + +{.compact} +| argument | type | description | +| :---- | :---: | :---------- | +| -v | toggle | Verbose output | +| --aemb | toggle | Output the estimated abundance value of contigs, the format of output file is: contig_id abundance_value | +| --eqx | toggle | Emit =/X instead of M CIGAR operations | +| --no-PG | toggle | Do not output PG header | +| --details | toggle | Add debugging details to SAM records | +| --rg= | [TAG:VALUE...] | Add read group metadata to SAM header (can be specified multiple times). Example: SM:samplename | +| -N | integer | Retain at most INT secondary alignments (is upper bounded by -M and depends on -S) [0] | +| -m | integer | Maximum seed length. Defaults to r - 50. For reasonable values on -l and -u, the seed length distribution is usually determined by parameters l and u. Then, this parameter is only active in regions where syncmers are very sparse. | +| -k | integer | Strobe length, has to be below 32. [20] | +| -l | integer | Lower syncmer offset from k/(k-s+1). Start sample second syncmer k/(k-s+1) + l syncmers downstream [0] | +| -u | integer | Upper syncmer offset from k/(k-s+1). 
End sample second syncmer k/(k-s+1) + u syncmers downstream [7] | +| -c | integer | Bitcount length between 2 and 63. [8] | +| -s | integer | Submer size used for creating syncmers [k-4]. Only even numbers on k-s allowed. A value of s=k-4 roughly represents w=10 as minimizer window [k-4]. It is recommended not to change this parameter unless you have a good understanding of syncmers as it will drastically change the memory usage and results with non default values. | +| -b | integer | No. of top bits of hash to use as bucket indices (8-31)[determined from reference size] | +| -A | integer | Matching score [2] | +| -B | integer | Mismatch penalty [8] | +| -O | integer | Gap open penalty [12] | +| -E | integer | Gap extension penalty [1] | +| -L | integer | Soft clipping penalty [10] | +| -f | float | Top fraction of repetitive strobemers to filter out from sampling [0.0002] | +| -S | float | Try candidate sites with mapping score at least S of maximum mapping score [0.5] | +| -M | integer | Maximum number of mapping sites to try [20] | +| -R | integer | Rescue level. Perform additional search for reads with many repetitive seeds filtered out. This search includes seeds of R*repetitive_seed_size_filter (default: R=2). Higher R than default makes strobealign significantly slower but more accurate. R <= 1 deactivates rescue and is the fastest. | + +++ :icon-graph: reports These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail. -||| Depth and coverage -Reports the depth of alignments in 10kb windows. +||| Alignment BX Information +An aggregate report of barcode-specific alignment information for all samples. ![reports/coverage/*.html](/static/report_align_coverage.png) -||| BX validation -Reports the number of valid/invalid barcodes in the alignments. 
-![reports/reads.bxstats.html](/static/report_align_bxstats.png) -||| Molecule size +||| Molecule size and Coverage Reports the inferred molecule sized based on barcodes in the alignments. ![reports/BXstats/*.bxstats.html](/static/report_align_bxmol.png) -||| Alignment stats +||| Samtools Alignment stats Reports the general statistics computed by samtools `stats` and `flagstat` ![reports/samtools_*stat/*html](/static/report_align_flagstat.png) ||| diff --git a/Modules/SV/SV.md b/Modules/SV/SV.md index 3e8d0fc83..c020e9930 100644 --- a/Modules/SV/SV.md +++ b/Modules/SV/SV.md @@ -26,4 +26,4 @@ the [!badge corners="pill" text="sv naibr"](naibr.md) module can use that to pha ### LEVIATHAN LEVIATHAN relies on split-read information in the sequence alignments to call variants. The EMA aligner does not report split read alignments, instead it reports secondary alignments. -It is recommended to use BWA- or Minimap2-generated alignments if intending to call variants with [!badge corners="pill" text="sv leviathan"](leviathan.md). \ No newline at end of file +It is recommended to use BWA- or strobealign-generated alignments if intending to call variants with [!badge corners="pill" text="sv leviathan"](leviathan.md). \ No newline at end of file diff --git a/Modules/SV/leviathan.md b/Modules/SV/leviathan.md index c21bfc9f0..375a5842f 100644 --- a/Modules/SV/leviathan.md +++ b/Modules/SV/leviathan.md @@ -17,7 +17,7 @@ order: 1 !!!warning EMA-mapped reads Leviathan relies on split-read information in the sequence alignments to call variants. The EMA aligner does not report split read alignments, instead it reports secondary alignments. It is recommended to use -BWA- or Minimap2-generated alignments if intending to call variants with leviathan. +BWA- or strobealign-generated alignments if intending to call variants with leviathan. !!! 
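As a quick sanity check before running leviathan, one could count how many alignment records carry a supplementary-alignment tag. This is only a sketch under stated assumptions: the SAM specification records split (chimeric) alignments in the `SA:Z` tag, but whether LEVIATHAN relies on exactly this tag is an assumption here, and `count_split_records` is a hypothetical helper name, not part of Harpy.

```bash
# Hypothetical helper: count SAM records that carry an SA:Z tag.
# Reads SAM text on stdin, skips header lines, and prints the count.
count_split_records() {
    awk -F'\t' '
        /^@/ { next }
        {
            # optional tags start at column 12 of a SAM record
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^SA:Z:/) { n++; break }
            }
        }
        END { print n + 0 }
    '
}

# Usage (assumes samtools is installed; Sample1.bam is a placeholder name):
# samtools view -h Sample1.bam | count_split_records
```

A count of zero on a reasonably sized alignment file would suggest the aligner did not report split reads.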
==- :icon-file: sample grouping file [!badge variant="ghost" text="optional"] This file is optional and only useful if you want variant calling to happen on a per-population level. diff --git a/Modules/other.md b/Modules/other.md index 74f2d0e53..fc6eca91c 100644 --- a/Modules/other.md +++ b/Modules/other.md @@ -12,9 +12,35 @@ Some parts of Harpy (variant calling, imputation) want or need extra files. You {.compact} | module | description | |:---------------|:---------------------------------------------------------------------------------| +| `resume` | Continue a Harpy workflow from an existing output folder | | `popgroup` | Create generic sample-group file using existing sample file names (fq.gz or bam) | | `stitchparams` | Create template STITCH parameter file | +### resume +When calling a workflow (e.g. [!badge corners="pill" text="qc"](qc.md)), Harpy performs various file checks and validations, sets up the Snakemake command, +output folder(s), etc. In the event you want to continue a failed or manually terminated workflow without overwriting the workflow +files (e.g. `config.yaml`), you can use [!badge corners="pill" text="harpy resume"]. + +```bash usage +harpy resume [--conda] DIRECTORY +``` + +#### arguments +{.compact} +| argument | short name | type | default | required | description | +|:----------------------|:----------:|:----------------|:-------:|:--------:|:---------------------------------------------------------------------| +| `DIRECTORY` | | file/directory paths | | **yes** | Output directory of an existing harpy workflow | +| `--conda` | | toggle | | | Generate a `.harpy_envs/` folder with the necessary conda environments | + +The `DIRECTORY` is the output directory of a previous harpy-invoked workflow, which **must** have the `workflow/config.yaml` file. 
+For example, if you previously ran `harpy align bwa -o align-bwa ...`, then you would use `harpy resume align-bwa`, +which would have the necessary `workflow/config.yaml` (and other necessary things) required to successfully continue the workflow. +Using [!badge corners="pill" text="resume"] does **not** overwrite any preprocessing files in the target directory (whereas rerunning the workflow would), +which means you can also manually modify the `config.yaml` file (advanced, not recommended unless you are confident with what you're doing). + +[!badge corners="pill" text="resume"] also requires an existing and populated `.harpy_envs/` directory in the current work directory, like the kind all +main `harpy` workflows would create. If one is not present, you can use `--conda` to create one. + ### popgroup Creates a sample grouping file for variant calling diff --git a/Modules/qc.md b/Modules/qc.md index 0fedcc511..ead1d7785 100644 --- a/Modules/qc.md +++ b/Modules/qc.md @@ -9,9 +9,8 @@ order: 6 === :icon-checklist: You will need - at least 2 cores/threads available - paired-end [fastq](../haplotagdata.md/#naming-conventions) sequence files [!badge variant="secondary" text="gzip recommended"] - - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] - - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - note that this **does not include** [!badge variant="danger" text=".1"] or [!badge variant="danger" text="_1"] conventions for forward/reverse + - **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text=".1"] or [!badge 
variant="success" text="_1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] + - **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text=".2"] or [!badge variant="success" text="_2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] === diff --git a/commonoptions.md b/commonoptions.md index 77806fbdb..10a0c6a40 100644 --- a/commonoptions.md +++ b/commonoptions.md @@ -26,12 +26,12 @@ to avoid unexpected behavior. !!! !!!warning clashing names -Harpy will symlink just the file names into `workflow/input` regardless of their origin, -meaning that files in different directories that have the same name (ignoring extensions) will -clash. As an example, both `folderA/sample001.bam` and `folderB/sample001.bam` will become symlinked -as `workflow/input/sample001.bam`, with one symlink overwriting the other, leaving you with one missing -sample. During parsing, Harpy will inform you of naming clashes and terminate to protect you against -this behavior. +Given the regex pattern matching Harpy employs under the hood and the isolation of just the sample names for Snakemake rules, +files in different directories that have the same name (ignoring extensions) will clash. For example, `lane1/sample1.F.fq` +and `lane2/sample1.F.fq` would both derive the sample name `sample1`, which, in a workflow like [!badge corners="pill" text="align"](Modules/Align/Align.md) +would both result in `output/sample1.bam`, creating a problem. 
This also holds true for the same sample name but different extension, such +as `sample1.F.fq` and `sample1_R1.fq.gz`, which would again derive `sample1` as the sample name and create a naming clash for workflow outputs. +During parsing, Harpy will inform you of naming clashes and terminate to protect you against this behavior. !!! ## Common command-line options @@ -53,14 +53,14 @@ configured using these arguments: | `--quiet` | `-q` | toggle | | Suppress Snakemake printing to console | | `--help` | `-h` | | | Show the module docstring | -As as example, you could call [!badge corners="pill" text="align minimap"](Modules/Align/minimap.md) and specify 20 threads with no output to console: +As an example, you could call [!badge corners="pill" text="align strobe"](Modules/Align/strobe.md) and specify 20 threads with no output to console: ```bash -harpy align minimap --threads 20 --quiet samples/trimmedreads +harpy align strobe --threads 20 --quiet samples/trimmedreads # identical to # -harpy align minimap -t 20 -q samples/trimmedreads +harpy align strobe -t 20 -q samples/trimmedreads ``` --- diff --git a/haplotagdata.md b/haplotagdata.md index e19479347..ed4a969bb 100644 --- a/haplotagdata.md +++ b/haplotagdata.md @@ -66,9 +66,8 @@ most common FASTQ naming styles are supported: - **sample names**: Alphanumeric and [!badge variant="success" text="."] [!badge variant="success" text="_"] [!badge variant="success" text="-"] - you can mix and match special characters, but that's bad practice and not recommended - examples: `Sample.001`, `Sample_001_year4`, `Sample-001_population1.year2` <- not recommended -- **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] -- **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" 
text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - - note that this **does not include** [!badge variant="danger" text=".1"] or [!badge variant="danger" text="_1"] conventions for forward/reverse +- **forward**: [!badge variant="success" text="_F"] [!badge variant="success" text=".F"] [!badge variant="success" text="_1"] [!badge variant="success" text=".1"] [!badge variant="success" text="_R1_001"] [!badge variant="success" text=".R1_001"] [!badge variant="success" text="_R1"] [!badge variant="success" text=".R1"] +- **reverse**: [!badge variant="success" text="_R"] [!badge variant="success" text=".R"] [!badge variant="success" text="_2"] [!badge variant="success" text=".2"] [!badge variant="success" text="_R2_001"] [!badge variant="success" text=".R2_001"] [!badge variant="success" text="_R2"] [!badge variant="success" text=".R2"] - **fastq extension**: [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="success" text=".FQ"] [!badge variant="success" text=".FASTQ"] - **gzipped**: supported and recommended - **not gzipped**: supported diff --git a/retype.yml b/retype.yml index 6cdd23a1d..981af9acf 100644 --- a/retype.yml +++ b/retype.yml @@ -2,10 +2,10 @@ input: . output: .retype url: pdimens.github.io/harpy/ meta: - title: " | Harpy haplotagging pipeline" + title: " | Harpy haplotag" links: - text: GitHub - link: https://github.com/pdimens/HARPY + link: https://github.com/pdimens/harpy icon: mark-github target: blank - text: Therkildsen Lab @@ -24,7 +24,7 @@ footer: copyright: "© Copyright {{ year }}. All rights reserved." branding: title: Harpy - label: v1.0 + label: v1.1 logo: static/favicon.png logoDark: static/favicon.png logoAlign: left diff --git a/software.md b/software.md index 929c56789..2e5faa903 100644 --- a/software.md +++ b/software.md @@ -11,41 +11,43 @@ If any tools were missed, please let us know! 
 Issues with specific tools might warrant a discussion with the authors/developers on the repositories of these projects.
 ## Standalone Software
-| Software | Links |
-|:------------|:--------------------------------------------------------------------------------------------------------------------|
-| bash | [website](https://www.gnu.org/software/bash/) |
-| bcftools | [website](https://samtools.github.io/bcftools/bcftools.html) |
-| bgzip | [website](http://www.htslib.org/doc/bgzip.html) |
-| bwa | [website](https://github.com/lh3/bwa), [publication](http://arxiv.org/abs/1303.3997) |
-| conda | [website](https://github.com/conda) |
-| EMA | [website](https://github.com/arshajii/ema), [publication](https://www.biorxiv.org/content/early/2017/11/16/220236) |
-| fastp | [website](https://github.com/OpenGene/fastp), [publication](https://doi.org/10.1093/bioinformatics/bty560) |
-| HapCUT2 | [website](https://github.com/vibansal/HapCUT2), [publication](https://doi.org/10.1101/gr.213462.116) |
-| LEVIATHAN | [website](https://github.com/morispi/LEVIATHAN), [publication](https://doi.org/10.1101/2021.03.25.437002) |
-| LRez | [website](https://github.com/morispi/LRez), [publication](https://academic.oup.com/bioinformaticsadvances/article/1/1/vbab022/6375438?login=false) |
-| LRSIM | [webiste](https://github.com/aquaskyline/LRSIM) [publication](http://doi.org/10.1016/j.csbj.2017.10.002) |
-| mamba | [website](https://github.com/mamba-org/mamba) |
-| minimap2 | [website](https://github.com/lh3/minimap2) [publication](https://doi.org/10.1093/bioinformatics/btab705) |
-| NAIBR | [website](https://github.com/raphael-group/NAIBR), [fork](https://github.com/pontushojer/NAIBR), [publication](https://doi.org/10.1093/bioinformatics/btx712) |
-| plotly | [website](https://plotly.com/) |
-| python | [website](https://www.python.org/) |
-| R | [website](https://www.r-project.org/) |
-| samtools | [website](http://www.htslib.org/) |
-| seqtk | [website](https://github.com/lh3/seqtk) |
-| simuG | [website](https://github.com/aquaskyline/LRSIM) [publication](https://doi.org/10.1093/bioinformatics/btz424) |
-| Snakemake | [website](https://github.com/snakemake/snakemake), [publication](https://f1000research.com/articles/10-33/v1) |
-| whatshap | [website](https://github.com/whatshap/whatshap), [publication](https://doi.org/10.1101/085050) |
+{.compact}
+| Software | Links | Publication |
+|:------------|:-------------------------:| :-------------------------------------------------------------------------------------------|
+| bash | [website](https://www.gnu.org/software/bash/) | |
+| bcftools | [github](https://github.com/samtools/bcftools), [website](https://samtools.github.io/bcftools/bcftools.html) | |
+| bgzip | [website](http://www.htslib.org/doc/bgzip.html) | |
+| bwa | [github](https://github.com/lh3/bwa)| [publication](http://arxiv.org/abs/1303.3997) |
+| conda | [github](https://github.com/conda) | |
+| EMA | [github](https://github.com/arshajii/ema) | [publication](https://www.biorxiv.org/content/early/2017/11/16/220236) |
+| fastp | [github](https://github.com/OpenGene/fastp)| [publication](https://doi.org/10.1093/bioinformatics/bty560) |
+| HapCUT2 | [github](https://github.com/vibansal/HapCUT2)| [publication](https://doi.org/10.1101/gr.213462.116) |
+| LEVIATHAN | [github](https://github.com/morispi/LEVIATHAN)| [publication](https://doi.org/10.1101/2021.03.25.437002) |
+| LRez | [github](https://github.com/morispi/LRez) | [publication](https://academic.oup.com/bioinformaticsadvances/article/1/1/vbab022/6375438?login=false) |
+| LRSIM | [github](https://github.com/aquaskyline/LRSIM) |[publication](http://doi.org/10.1016/j.csbj.2017.10.002) |
+| mamba | [github](https://github.com/mamba-org/mamba) | |
+| NAIBR | [github](https://github.com/raphael-group/NAIBR), [github (fork)](https://github.com/pontushojer/NAIBR) |[publication](https://doi.org/10.1093/bioinformatics/btx712) |
+| plotly | [website](https://plotly.com/) | |
+| python |
[website](https://www.python.org/) | |
+| R | [website](https://www.r-project.org/) | |
+| samtools | [github](https://github.com/samtools/samtools), [website](http://www.htslib.org/) | |
+| seqtk | [github](https://github.com/lh3/seqtk) | |
+| simuG | [github](https://github.com/aquaskyline/LRSIM) | [publication](https://doi.org/10.1093/bioinformatics/btz424) |
+| Snakemake | [github](https://github.com/snakemake/snakemake)| [publication](https://f1000research.com/articles/10-33/v1) |
+| strobealign | [github](https://github.com/ksahlin/strobealign)| [publication](https://doi.org/10.1186/s13059-022-02831-7) |
+| whatshap | [github](https://github.com/whatshap/whatshap) |[publication](https://doi.org/10.1101/085050) |
 ## Software Packages
-| Package | Language | Links |
-|:------------|:-----: |:--------------------------------------------------------------------------------------------------------------------|
-| click | python | [website](https://github.com/pallets/click) |
-| pysam | python | [webiste](https://github.com/pysam-developers/pysam) |
-| r-biocircos | R | [website](https://github.com/lvulliard/BioCircos.R) |
-| r-circlize | R | [website](https://github.com/jokergoo/circlize), [publication](https://doi.org/10.1093/bioinformatics/btu393) |
-| r-highcharter | R | [website (source)](https://www.highcharts.com/), [website (R-package)](https://github.com/jbkunst/highcharter/) |
-| r-tidyverse | R | [website](https://www.tidyverse.org/), [publication](https://doi.org/10.21105/joss.01686) |
-| r-DT | R | [website](https://rstudio.github.io/DT/), [js-website](http://datatables.net) |
-| rich | python | [website](https://github.com/Textualize/rich) |
-| rich-click | python | [website](https://github.com/ewels/rich-click) |
-| STITCH | R | [website](https://github.com/rwdavies/STITCH), [publication](https://doi.org/10.1038%2Fng.3594) |
+{.compact}
+| Package | Language | Links | Publication |
+|:------------|:-----:
|:-----------:|:------------------------------------------------------------------------------------------------------|
+| click | python | [github](https://github.com/pallets/click) | |
+| pysam | python | [github](https://github.com/pysam-developers/pysam) | |
+| r-biocircos | R | [github](https://github.com/lvulliard/BioCircos.R) | |
+| r-circlize | R | [github](https://github.com/jokergoo/circlize) | [publication](https://doi.org/10.1093/bioinformatics/btu393) |
+| r-highcharter | R | [website (source)](https://www.highcharts.com/) [website (R-package)](https://github.com/jbkunst/highcharter/) | |
+| r-tidyverse | R | [website](https://www.tidyverse.org/) | [publication](https://doi.org/10.21105/joss.01686) |
+| r-DT | R | [website](https://rstudio.github.io/DT/), [js-website](http://datatables.net) | |
+| rich | python | [github](https://github.com/Textualize/rich) | |
+| rich-click | python | [github](https://github.com/ewels/rich-click) | |
+| STITCH | R | [github](https://github.com/rwdavies/STITCH) | [publication](https://doi.org/10.1038%2Fng.3594) |