diff --git a/Modules/Align/bwa.md b/Modules/Align/bwa.md index ba741bf30..de5951b2d 100644 --- a/Modules/Align/bwa.md +++ b/Modules/Align/bwa.md @@ -18,10 +18,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly using the `align` module: ```bash usage -harpy align bwa OPTIONS... +harpy align bwa|ema OPTIONS... ``` ```bash example -harpy align --genome genome.fasta --directory Sequences/ +harpy align bwa --genome genome.fasta --directory Sequences/ ``` ## :icon-terminal: Running Options diff --git a/Modules/extrafiles.md b/Modules/extrafiles.md deleted file mode 100644 index 4bf0e07b1..000000000 --- a/Modules/extrafiles.md +++ /dev/null @@ -1,71 +0,0 @@ ---- -label: Extra -order: 7 -icon: file-diff -description: Generate extra files for analysis with Harpy ---- - -# :icon-file-diff: Generate Extra Files -Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using the `harpy extra` module: -```bash -harpy extra OPTIONS... -``` - -The arguments represent different sub-commands and can be run in any order or combination to generate the files you need. 
- -## :icon-terminal: Running Options -| argument | short name | type | default | required | description | -|:------------------|:----------:|:---------------|:-------:|:--------:|:---------------------------------------------------------------------------------| -| `--popgroup` | `-p` | folder path | | no | Create generic sample-group file using existing sample file names (fq.gz or bam) | -| `--stitch-params` | `-s` | file path | | no | Create template STITCH parameter file | -| `--hpc` | `-h` | string [slurm] | slurm | no | Create HPC scheduling profile for cluster submission | -| `--help` | | | | | Show the module docstring | - - -### popgroup -||| `--popgroup` -**Sample grouping file for variant calling** - -This file is entirely optional and useful if you want variant calling to happen on a per-population level using mpileup via `harpy variants -p`. -- takes the format of sample\group -- all the samples will be assigned to group `1` since file names don't always provide grouping information, so make sure to edit the second column to reflect your data correctly. -- the file will look like: -```less popgroups.txt -sample1 pop1 -sample2 pop1 -sample3 pop2 -sample4 pop1 -sample5 pop3 -``` -||| - -### stitch-params -||| `--stitch-params` -**STITCH parameter file** - -Typically, one runs STITCH multiple times, exploring how results vary with -different model parameters. The solution Harpy uses for this is to have the user -provide a tab-delimited dataframe file where the columns are the 6 STITCH model -parameters and the rows are the values for those parameters. To make formatting -easier, a template file is generated for you, just replace the values and add/remove -rows as necessary. See the [Imputation section](/Modules/impute.md) for details on these parameters. -||| - -### hpc -||| `--hpc` -**HPC cluster profile** -!!!warning -HPC support is not yet natively integrated into Harpy. 
Until then, you can manually -use the [Snakemake HPC infrastructure](https://snakemake.readthedocs.io/en/stable/executing/cluster.html) with the `-s` flag. -!!! - -For snakemake to work in harmony with an HPC scheduler, a "profile" needs to -be provided that tells Snakemake how it needs to interact with the HPC scheduler -to submit your jobs to the cluster. Using `harpy extra --hpc ` will create -the necessary folder and profile yaml file for you to use. To use the profile, call -the intended Harpy module with an extra ``--snakemake` argument: -```bash -# use the slurm profile -harpy module --option1 --option2 --snakemake "--profile slurm/" -``` -||| \ No newline at end of file diff --git a/Modules/impute.md b/Modules/impute.md index 6cddf6972..9e0b6d366 100644 --- a/Modules/impute.md +++ b/Modules/impute.md @@ -33,7 +33,7 @@ harpy impute OPTIONS... ```bash example # create stitch parameter file 'stitch.params' -harpy extra -s stitch.params +harpy stitchparams -o stitch.params # run imputation harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --directory Align/ema --parameters stitch.params @@ -62,7 +62,7 @@ Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model parameters and the rows are the values for those parameters. The parameter file -is required and can be created manually or with `harpy extra -s `. +is required and can be created manually or with `harpy stitchparams -o `. If created using harpy, the resulting file includes largely meaningless values that you will need to adjust for your study. 
The parameter must follow a particular format: - tab or comma delimited diff --git a/Modules/othermodules.md b/Modules/othermodules.md new file mode 100644 index 000000000..57115eb57 --- /dev/null +++ b/Modules/othermodules.md @@ -0,0 +1,69 @@ +--- +label: Other +order: 1 +icon: file-diff +description: Generate extra files for analysis with Harpy +--- + +# :icon-file-diff: Other Harpy modules +Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules: +The arguments represent different sub-commands and can be run in any order or combination to generate the files you need. + +## :icon-terminal: Other modules +| module | description | +|:---------------|:---------------------------------------------------------------------------------| +| `popgroup` | Create generic sample-group file using existing sample file names (fq.gz or bam) | +| `stitchparams` | Create template STITCH parameter file | +| `hpc` | Create HPC scheduling profile for cluster submission | + +### popgroup +#### Sample grouping file for variant calling +##### arguments +- `-o`, `--output`: name of the output file +- `-d`, `--directory`: name of the directory of input files, either fastq or bam. + +This file is entirely optional and useful if you want SNP variant calling to happen on a +per-population level via `harpy snp ... -p` or on samples pooled-as-populations via `harpy sv ... -p`. +- takes the format of sample\group +- all the samples will be assigned to group `pop1` since file names don't always provide grouping information + - so make sure to edit the second column to reflect your data correctly. 
+- the file will look like: +```less popgroups.txt +sample1 pop1 +sample2 pop1 +sample3 pop2 +sample4 pop1 +sample5 pop3 +``` + +### stitchparams +#### STITCH parameter file +##### arguments +- `-o`, `--output`: name of the output file + +Typically, one runs STITCH multiple times, exploring how results vary with +different model parameters. The solution Harpy uses for this is to have the user +provide a tab-delimited dataframe file where the columns are the 6 STITCH model +parameters and the rows are the values for those parameters. To make formatting +easier, a template file is generated for you; just replace the values and add/remove +rows as necessary. See the [Imputation section](/Modules/impute.md) for details on these parameters. + +### hpc +#### HPC cluster profile +!!!warning +HPC support is not yet natively integrated into Harpy. Until then, you can manually +use the [Snakemake HPC infrastructure](https://snakemake.readthedocs.io/en/stable/executing/cluster.html) with the `-s` flag. +!!! +##### arguments +- `-o`, `--output`: name of the output file +- `-s`, `--system`: name of the scheduling system + - options: `slurm` (more to come) + +For Snakemake to work in harmony with an HPC scheduler, a "profile" needs to +be provided that tells Snakemake how it needs to interact with the HPC scheduler +to submit your jobs to the cluster. Using `harpy hpc -s ` will create +the necessary folder and profile yaml file for you to use. To use the profile, call +the intended Harpy module with an additional `--snakemake` argument: +```bash use the slurm profile +harpy module --option1 --option2 --snakemake "--profile slurm.profile" +``` \ No newline at end of file diff --git a/commonoptions.md b/commonoptions.md index 6fe55bbf3..42ab21496 100644 --- a/commonoptions.md +++ b/commonoptions.md @@ -8,7 +8,7 @@ order: 4 Every Harpy module has a series of configuration parameters.
These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments, -the genome assembly, etc. All modules (except `extra`) also share a series of common runtime +the genome assembly, etc. All main modules (e.g. `qc`) also share a series of common runtime parameters that don't impact the results of the module, but instead control the speed/verbosity/etc. of calling the module. These runtime parameters are listed in the modules' help strings and can be configured using these arguments: diff --git a/development.md b/development.md index ffd6387b6..059184c05 100644 --- a/development.md +++ b/development.md @@ -14,7 +14,8 @@ development and how to contribute to it, if you were inclined to do so. Before we get into the technical details, you, dear reader, need to understand why Harpy is the way it is. Harpy may be a pipeline for other software, but there is a lot of extra stuff built in to make it user -friendly. Not just friendly, but _compassionate_. That means there is a lot +friendly. Not just friendly, but _compassionate_. The guiding ethos for Harpy is +**"We don't hate the user"**. That means there is a lot of code that checks input files, runtime details, etc. to exit before Snakemake takes over. This is done to minimize time wasted on minor errors that only show their ugly heads 18 hours into a 96 hour process. With that in mind: @@ -92,10 +93,9 @@ build script is also stored in `misc/meta.yml` and `misc/build.sh`. The yaml fil is the metadata of the package, including software deps and their versions. The build script is how conda will install all of Harpy's parts. In order to modify these files for a new release, you need to fork `bioconda/bioconda-recipes`, -create a new branch, modify the Harpy `meta.yml` and `build.sh` files, then open -a pull request onto the `master` branch of `bioconda/bioconda-recipes`. 
There is -also an automation that submits a pull request on your behalf when you change the -version number. +create a new branch, modify the Harpy `meta.yml` (and possibly `build.sh`) files. Bioconda +has a bot that looks for changes to the version number in the `meta.yml` file +and will automatically submit a Pull Request when it notices that's been changed. ## The Harpy repository ### structure diff --git a/haplotagdata.md b/haplotagdata.md index 107a1f5bd..ec224189e 100644 --- a/haplotagdata.md +++ b/haplotagdata.md @@ -50,7 +50,7 @@ sequences, then it will make sure the `BX:Z:` tag is moved to the end of the ali !!! ### Read length -Reads must be at least 15 base pairs in length for alignment. The `trim` module removes reads <15bp. +Reads must be at least 30 base pairs in length for alignment. The `qc` module removes reads <50bp. ### Compression Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. @@ -60,7 +60,7 @@ Compressed files are expected to end with the extension `.gz`. Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accomodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations. -To that end, for the `demultiplex`, `trim`, and `align` modules, the +To that end, for the `demultiplex`, `qc`, and `align` modules, the most common FASTQ naming styles are supported: - **sample names**: Alphanumeric and `.`, `-`, `_` - you can mix and match special characters, but that's bad practice and not recommended diff --git a/index.md b/index.md index d80efa541..e5ff76d41 100644 --- a/index.md +++ b/index.md @@ -35,16 +35,17 @@ Great! Only want to call variants? Awesome!
All modules are called by `harpy > qc >> align >> snp >> impute >> phase + demultiplex >> qc >> align >> snp >> impute >> phase >> sv Documentation: https://pdimens.github.io/harpy/ -╭─ Options ───────────────────────────────────────────────────╮ -│ --version Show the version and exit. │ -│ --help -h Show this message and exit. │ -╰─────────────────────────────────────────────────────────────╯ -╭─ Modules ───────────────────────────────────────────────────╮ -│ demultiplex Demultiplex haplotagged FASTQ files │ -│ qc Remove adapters and quality trim sequences │ -│ align Align sample sequences to a reference genome │ -│ snp Call SNPs and small indels │ -│ sv Call large structural variants │ -│ impute Impute genotypes using variants and sequences │ -│ phase Phase SNPs into haplotypes │ -╰─────────────────────────────────────────────────────────────╯ -╭─ Other Commands ────────────────────────────────────────────╮ -│ preflight Run file format checks on haplotag data │ -│ extra Create various optional/necessary input files │ -╰─────────────────────────────────────────────────────────────╯ +╭─ Options ──────────────────────────────────────────────────╮ +│ --version Show the version and exit. │ +│ --help -h Show this message and exit. 
│ +╰────────────────────────────────────────────────────────────╯ +╭─ Modules ──────────────────────────────────────────────────╮ +│ demultiplex Demultiplex haplotagged FASTQ files │ +│ qc Remove adapters and quality trim sequences │ +│ align Align sample sequences to a reference genome │ +│ snp Call SNPs and small indels │ +│ sv Call large structural variants │ +│ impute Impute genotypes using variants and sequences │ +│ phase Phase SNPs into haplotypes │ +╰────────────────────────────────────────────────────────────╯ +╭─ Other Commands ───────────────────────────────────────────╮ +│ preflight Run file format checks on haplotag data │ +│ popgroup Create a sample grouping file │ +│ stitchparams Create a template STITCH parameter file │ +│ hpc Create a config file to run Harpy on an HPC │ +╰────────────────────────────────────────────────────────────╯ ``` diff --git a/snakemake.md b/snakemake.md index 6ecbf7860..065da352e 100644 --- a/snakemake.md +++ b/snakemake.md @@ -7,7 +7,7 @@ order: 2 # :icon-terminal: Adding Snakamake parameters Harpy relies on Snakemake under the hood to handle file and job dependencies. Most of these details have been abstracted away from the end-user, but every -module of Harpy (except `extra`) has an optional flag `-s` (`--snakemake`) +module of Harpy (except `hpc`, `popgroup`, and `stitchparams`) has an optional flag `-s` (`--snakemake`) that you can use to augment the Snakemake workflow if necessary. Whenever you use this flag, your argument must be enclosed in quotation marks, for example: ```bash @@ -49,7 +49,7 @@ Sometimes you want to generate a specific intermediate file (or files) rather th you want the beadtag report Harpy makes from the output of `EMA count`. To do this, just list the file/files (relative to your working directory) without any flags. 
Example for the beadtag report: ```bash -harpy align -g genome.fasta -d Trim/ -t 4 -s "Align/ema/stats/reads.bxstats.html" +harpy align -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html" ``` This of course necessitates knowing the names of the files ahead of time. See the individual modules for a breakdown of expected outputs. diff --git a/software.md b/software.md index f5161a971..73c868a9b 100644 --- a/software.md +++ b/software.md @@ -23,11 +23,13 @@ Issues with specific tools might warrant a discussion with the authors/developer | LEVIATHAN | [website](https://github.com/morispi/LEVIATHAN) | [publication](https://doi.org/10.1101/2021.03.25.437002) | | LRez | [website](https://github.com/morispi/LRez) | [publication](https://academic.oup.com/bioinformaticsadvances/article/1/1/vbab022/6375438?login=false) | | mamba | [website](https://github.com/mamba-org/mamba) | | -| NAIBR | [website](https://github.com/raphael-group/NAIBR) + [fork](https://github.com/pontushojer/NAIBR) | [publication](https://doi.org/10.1093/bioinformatics/btx712) | +| NAIBR | [website](https://github.com/raphael-group/NAIBR) + [fork](https://github.com/pontushojer/NAIBR) | [publication](https://doi.org/10.1093/bioinformatics/btx712) | | python | [website](https://www.python.org/) | | +| rich | [website](https://github.com/Textualize/rich) | | | rich-click | [website](https://github.com/ewels/rich-click) | | | sambamba | [website](https://github.com/biod/sambamba) | [publication](https://doi.org/10.1093/bioinformatics/btv098) | | samtools | [website](http://www.htslib.org/) | | | seqtk | [website](https://github.com/lh3/seqtk) | | | Snakemake | [website](https://github.com/snakemake/snakemake) | [publication](https://f1000research.com/articles/10-33/v1) | | STITCH | [website](https://github.com/rwdavies/STITCH) | [publication](https://doi.org/10.1038%2Fng.3594) | +| whatshap | [website](https://github.com/whatshap/whatshap) | [publication](https://doi.org/10.1101/085050) | \
No newline at end of file diff --git a/static/errormsg.png b/static/errormsg.png index b3eba9b40..8c52812e8 100644 Binary files a/static/errormsg.png and b/static/errormsg.png differ