Docs #26

Merged
merged 5 commits into from
Feb 8, 2024
4 changes: 2 additions & 2 deletions Modules/Align/bwa.md
@@ -18,10 +18,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly
using the `align` module:

```bash usage
harpy align bwa OPTIONS...
harpy align bwa|ema OPTIONS...
```
```bash example
harpy align --genome genome.fasta --directory Sequences/
harpy align bwa --genome genome.fasta --directory Sequences/
```

## :icon-terminal: Running Options
71 changes: 0 additions & 71 deletions Modules/extrafiles.md

This file was deleted.

4 changes: 2 additions & 2 deletions Modules/impute.md
@@ -33,7 +33,7 @@ harpy impute OPTIONS...

```bash example
# create stitch parameter file 'stitch.params'
harpy extra -s stitch.params
harpy stitchparams -o stitch.params

# run imputation
harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --directory Align/ema --parameters stitch.params
@@ -62,7 +62,7 @@ Typically, one runs STITCH multiple times, exploring how results vary with
different model parameters (explained in next section). The solution Harpy uses for this is to have the user
provide a tab-delimited dataframe file where the columns are the 6 STITCH model
parameters and the rows are the values for those parameters. The parameter file
is required and can be created manually or with `harpy extra -s <filename>`.
is required and can be created manually or with `harpy stitchparams -o <filename>`.
If created using harpy, the resulting file includes largely meaningless values
that you will need to adjust for your study. The parameter file must follow a particular format:
- tab or comma delimited
69 changes: 69 additions & 0 deletions Modules/othermodules.md
@@ -0,0 +1,69 @@
---
label: Other
order: 1
icon: file-diff
description: Generate extra files for analysis with Harpy
---

# :icon-file-diff: Other Harpy modules
Some parts of Harpy (variant calling, imputation) want or need extra files. You can create the files needed by different modules using the extra modules listed below.
Each one is its own sub-command, and they can be run in any order or combination to generate the files you need.

## :icon-terminal: Other modules
| module | description |
|:---------------|:---------------------------------------------------------------------------------|
| `popgroup` | Create generic sample-group file using existing sample file names (fq.gz or bam) |
| `stitchparams` | Create template STITCH parameter file |
| `hpc` | Create HPC scheduling profile for cluster submission |

### popgroup
#### Sample grouping file for variant calling
##### arguments
- `-o`, `--output`: name of the output file
- `-d`, `--directory`: name of the directory of input files, either fastq or bam.

This file is entirely optional and useful if you want SNP variant calling to happen on a
per-population level via `harpy snp ... -p` or on samples pooled-as-populations via `harpy sv ... -p`.
- takes the format of sample\<tab\>group
- all the samples will be assigned to group `pop1` since file names don't always provide grouping information
- so make sure to edit the second column to reflect your data correctly.
- the file will look like:
```less popgroups.txt
sample1 pop1
sample2 pop1
sample3 pop2
sample4 pop1
sample5 pop3
```
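
As a rough illustration of the layout described above, the sketch below builds a similar two-column file by hand in shell. It is not Harpy's implementation: the function name `make_popgroups` and the extension-stripping rules are assumptions made for this example, and every sample is assigned to `pop1`, as the text describes.

```bash
# Hypothetical helper (not part of Harpy): print "sample<TAB>pop1" for each
# .bam or .fq.gz file found directly inside the given directory.
# The body runs in a subshell so its variables do not leak into the caller.
make_popgroups() (
    dir="$1"
    for f in "$dir"/*.bam "$dir"/*.fq.gz; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        name="$(basename "$f")"
        name="${name%.bam}"              # drop a .bam extension, if present
        name="${name%.fq.gz}"            # drop a .fq.gz extension, if present
        printf '%s\tpop1\n' "$name"
    done | sort -u                       # one line per unique sample name
)
```

Something like `make_popgroups Align/ema > popgroups.txt` would then produce a starting file whose second column still needs editing. Paired-end FASTQ names would need extra suffix stripping, which this sketch leaves out.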

### stitchparams
#### STITCH parameter file
##### arguments
- `-o`, `--output`: name of the output file

Typically, one runs STITCH multiple times, exploring how results vary with
different model parameters. The solution Harpy uses for this is to have the user
provide a tab-delimited dataframe file where the columns are the 6 STITCH model
parameters and the rows are the values for those parameters. To make formatting
easier, a template file is generated for you; just replace the values and add/remove
rows as necessary. See the [Imputation section](/Modules/impute.md) for details on these parameters.
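
A malformed parameter file is an easy mistake to make, so a quick sanity check before running imputation can help. The sketch below is an illustration under stated assumptions rather than anything Harpy ships: the function name `check_param_columns` is hypothetical, and it only verifies that every non-empty line of a tab-delimited file has six fields, matching the six model parameters mentioned above.

```bash
# Hypothetical check (not part of Harpy): succeed only if every non-empty
# line of the tab-delimited file "$1" has exactly 6 fields.
check_param_columns() {
    awk -F '\t' '
        NF == 0 { next }                                   # ignore blank lines
        NF != 6 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END     { exit bad }                               # 0 when no bad line was seen
    ' "$1"
}
```

For example, `check_param_columns stitch.params && echo "looks ok"` prints the message only when the column count is consistent. A comma-delimited file would need a different `-F` value.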

### hpc
#### HPC cluster profile
!!!warning
HPC support is not yet natively integrated into Harpy. Until then, you can manually
use the [Snakemake HPC infrastructure](https://snakemake.readthedocs.io/en/stable/executing/cluster.html) with the `-s` (`--snakemake`) flag of the main modules.
!!!
##### arguments
- `-o`, `--output`: name of the output file
- `-s`, `--system`: name of the scheduling system
- options: `slurm` (more to come)

For Snakemake to work in harmony with an HPC scheduler, a "profile" needs to
be provided that tells Snakemake how it needs to interact with the HPC scheduler
to submit your jobs to the cluster. Using `harpy hpc -s <hpc-type>` will create
the necessary folder and profile yaml file for you to use. To use the profile, call
the intended Harpy module with an additional `--snakemake` argument:
```bash use the slurm profile
harpy module --option1 <value1> --option2 <value2> --snakemake "--profile slurm.profile"
```
2 changes: 1 addition & 1 deletion commonoptions.md
@@ -8,7 +8,7 @@ order: 4

Every Harpy module has a series of configuration parameters. These are arguments you need to input
to configure the module to run on your data, such as the directory with the reads/alignments,
the genome assembly, etc. All modules (except `extra`) also share a series of common runtime
the genome assembly, etc. All main modules (e.g. `qc`) also share a series of common runtime
parameters that don't impact the results of the module, but instead control the speed/verbosity/etc.
of calling the module. These runtime parameters are listed in the modules' help strings and can be
configured using these arguments:
10 changes: 5 additions & 5 deletions development.md
@@ -14,7 +14,8 @@ development and how to contribute to it, if you were inclined to do so.
Before we get into the technical details, you, dear reader, need to understand
why Harpy is the way it is. Harpy may be a pipeline for other software, but
there is a lot of extra stuff built in to make it user
friendly. Not just friendly, but _compassionate_. That means there is a lot
friendly. Not just friendly, but _compassionate_. The guiding ethos for Harpy is
**"We don't hate the user"**. That means there is a lot
of code that checks input files, runtime details, etc. to exit before
Snakemake takes over. This is done to minimize time wasted on minor
errors that only show their ugly heads 18 hours into a 96 hour process. With that in mind:
@@ -92,10 +93,9 @@ build script is also stored in `misc/meta.yml` and `misc/build.sh`. The yaml file
is the metadata of the package, including software deps and their versions. The
build script is how conda will install all of Harpy's parts. In order to modify
these files for a new release, you need to fork `bioconda/bioconda-recipes`,
create a new branch, modify the Harpy `meta.yml` and `build.sh` files, then open
a pull request onto the `master` branch of `bioconda/bioconda-recipes`. There is
also an automation that submits a pull request on your behalf when you change the
version number.
create a new branch, and modify the Harpy `meta.yml` (and possibly `build.sh`) files. Bioconda
has a bot that looks for changes to the version number in the `meta.yml` file
and will automatically submit a Pull Request when it notices a change.

## The Harpy repository
### structure
4 changes: 2 additions & 2 deletions haplotagdata.md
@@ -50,7 +50,7 @@ sequences, then it will make sure the `BX:Z:` tag is moved to the end of the ali
!!!

### Read length
Reads must be at least 15 base pairs in length for alignment. The `trim` module removes reads <15bp.
Reads must be at least 30 base pairs in length for alignment. The `qc` module removes reads <50bp.
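
To see how many reads in a file fall under a length cutoff, a sketch like the one below can be used. It is illustrative only: the function name `count_short_reads` is hypothetical, it assumes FASTQ input with exactly four lines per record, and it is not how Harpy itself filters reads.

```bash
# Hypothetical helper (not part of Harpy): count reads shorter than the
# cutoff given as "$1" in a four-line-per-record FASTQ read from stdin.
count_short_reads() {
    awk -v min_len="$1" '
        NR % 4 == 2 && length($0) < min_len { n++ }   # line 2 of each record is the sequence
        END { print n + 0 }                           # print 0 when nothing matched
    '
}
```

For a compressed file, something like `gzip -dc sample.fq.gz | count_short_reads 30` would report the number of reads shorter than 30 bp.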

### Compression
Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway.
@@ -60,7 +60,7 @@ Compressed files are expected to end with the extension `.gz`.
Unfortunately, there are many different ways of naming FASTQ files, which makes it
difficult to accommodate every wacky iteration currently in circulation.
While Harpy tries its best to be flexible, there are limitations.
To that end, for the `demultiplex`, `trim`, and `align` modules, the
To that end, for the `demultiplex`, `qc`, and `align` modules, the
most common FASTQ naming styles are supported:
- **sample names**: Alphanumeric and `.`, `-`, `_`
- you can mix and match special characters, but that's bad practice and not recommended
47 changes: 25 additions & 22 deletions index.md
@@ -35,16 +35,17 @@ Great! Only want to call variants? Awesome! All modules are called by `harpy <mo

| Module | Description |
|:--------------|:----------------------------------------------|
| `extra` | Create various associated or necessary files |
| `preflight` | Run various format checks for FASTQ and BAM files |
| `demultiplex` | Demultiplex haplotagged FASTQ files |
| `qc` | Remove adapters and quality trim sequences |
| `qc` | Remove adapters and quality trim sequences |
| `align` | Align sample sequences to a reference genome |
| `snp` | Call SNPs and small indels |
| `snp` | Call SNPs and small indels |
| `sv` | Call large structural variants |
| `impute` | Impute genotypes using variants and sequences |
| `phase` | Phase SNPs into haplotypes |

| `popgroup` | Create a sample grouping file |
| `stitchparams` | Create a template STITCH parameter file |
| `hpc` | Create a config file to run Harpy on an HPC |

## Using Harpy
You can call `harpy` without any arguments (or with `--help`) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with `--help` to see their usage (e.g. `harpy align --help`).
@@ -56,25 +57,27 @@ You can call `harpy` without any arguments (or with `--help`) to print the docst
reads, map sequences, call variants, impute genotypes, and
phase haplotypes of Haplotagging data. Batteries included.

demultiplex >> qc >> align >> snp >> impute >> phase
demultiplex >> qc >> align >> snp >> impute >> phase >> sv

Documentation: https://pdimens.github.io/harpy/

╭─ Options ───────────────────────────────────────────────────╮
│ --version Show the version and exit. │
│ --help -h Show this message and exit. │
╰─────────────────────────────────────────────────────────────╯
╭─ Modules ───────────────────────────────────────────────────╮
│ demultiplex Demultiplex haplotagged FASTQ files │
│ qc Remove adapters and quality trim sequences │
│ align Align sample sequences to a reference genome │
│ snp Call SNPs and small indels │
│ sv Call large structural variants │
│ impute Impute genotypes using variants and sequences │
│ phase Phase SNPs into haplotypes │
╰─────────────────────────────────────────────────────────────╯
╭─ Other Commands ────────────────────────────────────────────╮
│ preflight Run file format checks on haplotag data │
│ extra Create various optional/necessary input files │
╰─────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────╮
│ --version Show the version and exit. │
│ --help -h Show this message and exit. │
╰────────────────────────────────────────────────────────────╯
╭─ Modules ──────────────────────────────────────────────────╮
│ demultiplex Demultiplex haplotagged FASTQ files │
│ qc Remove adapters and quality trim sequences │
│ align Align sample sequences to a reference genome │
│ snp Call SNPs and small indels │
│ sv Call large structural variants │
│ impute Impute genotypes using variants and sequences │
│ phase Phase SNPs into haplotypes │
╰────────────────────────────────────────────────────────────╯
╭─ Other Commands ───────────────────────────────────────────╮
│ preflight Run file format checks on haplotag data │
│ popgroup Create a sample grouping file │
│ stitchparams Create a template STITCH parameter file │
│ hpc Create a config file to run Harpy on an HPC │
╰────────────────────────────────────────────────────────────╯
```
4 changes: 2 additions & 2 deletions snakemake.md
@@ -7,7 +7,7 @@ order: 2
# :icon-terminal: Adding Snakemake parameters
Harpy relies on Snakemake under the hood to handle file and job dependencies.
Most of these details have been abstracted away from the end-user, but every
module of Harpy (except `extra`) has an optional flag `-s` (`--snakemake`)
module of Harpy (except `hpc`, `popgroup`, and `stitchparams`) has an optional flag `-s` (`--snakemake`)
that you can use to augment the Snakemake workflow if necessary. Whenever you
use this flag, your argument must be enclosed in quotation marks, for example:
```bash
@@ -49,7 +49,7 @@ Sometimes you want to generate a specific intermediate file (or files) rather th
you want the beadtag report Harpy makes from the output of `EMA count`. To do this, just list the file/files (relative
to your working directory) without any flags. Example for the beadtag report:
```bash
harpy align -g genome.fasta -d Trim/ -t 4 -s "Align/ema/stats/reads.bxstats.html"
harpy align -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html"
```
This of course necessitates knowing the names of the files ahead of time. See the individual modules for a breakdown of expected outputs.
