pdimens · pdimens · Feb 8, 2024 · Oct 27, 2023 · Oct 27, 2023 · Nov 2, 2023
diff --git a/Modules/Align/bwa.md b/Modules/Align/bwa.md
@@ -18,10 +18,10 @@ such as those derived using `harpy qc`. You can map reads onto a genome assembly
 using the `align` module:
 
 ```bash usage
-harpy align bwa OPTIONS...
+harpy align bwa|ema OPTIONS...
 ```
 ```bash example
-harpy align --genome genome.fasta --directory Sequences/ 
+harpy align bwa --genome genome.fasta --directory Sequences/ 
 ```
 
 ## :icon-terminal: Running Options

diff --git a/Modules/extrafiles.md b/Modules/extrafiles.md
diff --git a/Modules/impute.md b/Modules/impute.md
@@ -33,7 +33,7 @@ harpy impute OPTIONS...
 
 ```bash example
 # create stitch parameter file 'stitch.params'
-harpy extra -s stitch.params 
+harpy stitchparams -o stitch.params 
 
 # run imputation
 harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --directory Align/ema --parameters stitch.params
@@ -62,7 +62,7 @@ Typically, one runs STITCH multiple times, exploring how results vary with
 different model parameters (explained in next section). The solution Harpy uses for this is to have the user
 provide a tab-delimited dataframe file where the columns are the 6 STITCH model 
 parameters and the rows are the values for those parameters. The parameter file 
-is required and can be created manually or with `harpy extra -s <filename>`.
+is required and can be created manually or with `harpy stitchparams -o <filename>`.
 If created using harpy, the resulting file includes largely meaningless values 
 that you will need to adjust for your study. The parameter must follow a particular format:
 - tab or comma delimited

diff --git a/Modules/othermodules.md b/Modules/othermodules.md
@@ -0,0 +1,69 @@
+---
+label: Other
+order: 1
+icon: file-diff
+description: Generate extra files for analysis with Harpy
+---
+
+# :icon-file-diff: Other Harpy modules
+Some parts of Harpy (variant calling, imputation) want or need extra files. You can create the files required by different modules using these extra modules.
+Each one is an independent sub-command, and they can be run in any order or combination to generate the files you need.
+
+## :icon-terminal: Other modules
+| module         | description                                                                      |
+|:---------------|:---------------------------------------------------------------------------------|
+| `popgroup`     | Create generic sample-group file using existing sample file names (fq.gz or bam) |
+| `stitchparams` | Create template STITCH parameter file                                            |
+| `hpc`          | Create HPC scheduling profile for cluster submission                             |
+
+### popgroup
+#### Sample grouping file for variant calling
+##### arguments
+- `-o`, `--output`: name of the output file
+- `-d`, `--directory`: name of the directory of input files, either fastq or bam.
+
+This file is entirely optional and useful if you want SNP variant calling to happen on a
+per-population level via `harpy snp ... -p` or on samples pooled-as-populations via `harpy sv ... -p`.
+- takes the format of sample\<tab\>group
+- all the samples will be assigned to group `pop1` since file names don't always provide grouping information
+    - so make sure to edit the second column to reflect your data correctly.
+- after editing, the file will look like:
+```less popgroups.txt
+sample1 pop1
+sample2 pop1
+sample3 pop2
+sample4 pop1
+sample5 pop3
+```
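
Because every sample starts out in `pop1`, the second column usually has to be edited. A minimal sketch of doing that from the command line, assuming the file is tab-delimited as described above; `set_group` is a hypothetical helper for illustration and is not part of Harpy:

```shell
# Hypothetical helper (not part of Harpy): set the group of one sample
# in a tab-delimited sample<tab>group file and print the result to stdout.
set_group() {
    # $1 = popgroup file, $2 = sample name, $3 = new group label
    awk -v s="$2" -v g="$3" \
        'BEGIN { FS = OFS = "\t" } $1 == s { $2 = g } { print }' "$1"
}

# Example: start from an all-pop1 file and move sample3 to pop2
printf 'sample1\tpop1\nsample2\tpop1\nsample3\tpop1\n' > popgroups.txt
set_group popgroups.txt sample3 pop2
```

The helper writes to stdout rather than editing in place, so the original file is left untouched until you redirect the output to a new file and check it.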
+
+### stitchparams
+#### STITCH parameter file
+##### arguments
+- `-o`, `--output`: name of the output file
+
+Typically, one runs STITCH multiple times, exploring how results vary with
+different model parameters. The solution Harpy uses for this is to have the user
+provide a tab-delimited dataframe file where the columns are the 6 STITCH model 
+parameters and the rows are the values for those parameters. To make formatting
+easier, a template file is generated for you; just replace the values and add or remove
+rows as necessary. See the [Imputation section](/Modules/impute.md) for details on these parameters.
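
After adding or removing rows by hand, it can be worth confirming the file still has the expected shape. A minimal sketch, assuming a tab-delimited file with 6 columns as described above; `check_six_columns` is a hypothetical helper and not something Harpy provides:

```shell
# Hypothetical check (not part of Harpy): succeed only if every non-empty
# line of the file has exactly 6 tab-delimited fields.
check_six_columns() {
    # $1 = parameter file to inspect
    awk -F '\t' '
        NF > 0 && NF != 6 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

Running `check_six_columns stitch.params && echo "looks ok"` would report any offending line numbers and return a non-zero exit status if a row has the wrong number of fields. It only checks the column count, not whether the values are valid for STITCH.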
+
+### hpc
+#### HPC cluster profile
+!!!warning
+HPC support is not yet natively integrated into Harpy. Until then, you can manually
+use the [Snakemake HPC infrastructure](https://snakemake.readthedocs.io/en/stable/executing/cluster.html) with the `-s` flag.
+!!!
+##### arguments
+- `-o`, `--output`: name of the output file
+- `-s`, `--system`: name of the scheduling system
+    - options: `slurm` (more to come)
+
+For Snakemake to work in harmony with an HPC scheduler, a "profile" needs to
+be provided that tells Snakemake how it needs to interact with the HPC scheduler
+to submit your jobs to the cluster. Using `harpy hpc -s <hpc-type>` will create
+the necessary folder and profile yaml file for you to use. To use the profile, call
+the intended Harpy module with an additional `--snakemake` argument:
+```bash use the slurm profile
+harpy module --option1 <value1> --option2 <value2> --snakemake "--profile slurm.profile"
+```
diff --git a/commonoptions.md b/commonoptions.md
@@ -8,7 +8,7 @@ order: 4
 
 Every Harpy module has a series of configuration parameters. These are arguments you need to input
 to configure the module to run on your data, such as the directory with the reads/alignments,
-the genome assembly, etc. All modules (except `extra`) also share a series of common runtime
+the genome assembly, etc. All main modules (e.g. `qc`) also share a series of common runtime
 parameters that don't impact the results of the module, but instead control the speed/verbosity/etc.
 of calling the module. These runtime parameters are listed in the modules' help strings and can be 
 configured using these arguments:

diff --git a/development.md b/development.md
@@ -14,7 +14,8 @@ development and how to contribute to it, if you were inclined to do so.
 Before we get into the technical details, you, dear reader, need to understand
 why Harpy is the way it is. Harpy may be a pipeline for other software, but 
 there is a lot of extra stuff built in to make it user 
-friendly. Not just friendly, but _compassionate_. That means there is a lot
+friendly. Not just friendly, but _compassionate_. The guiding ethos for Harpy is
+**"We don't hate the user"**. That means there is a lot
 of code that checks input files, runtime details, etc. to exit before 
 Snakemake takes over. This is done to minimize time wasted on minor 
 errors that only show their ugly heads 18 hours into a 96 hour process. With that in mind:
@@ -92,10 +93,9 @@ build script is also stored in `misc/meta.yml` and `misc/build.sh`. The yaml fil
 is the metadata of the package, including software deps and their versions. The
 build script is how conda will install all of Harpy's parts. In order to modify 
 these files for a new release, you need to fork `bioconda/bioconda-recipes`, 
-create a new branch, modify the Harpy `meta.yml` and `build.sh` files, then open
-a pull request onto the `master` branch of `bioconda/bioconda-recipes`. There is 
-also an automation that submits a pull request on your behalf when you change the
-version number. 
+create a new branch, and modify the Harpy `meta.yml` (and possibly `build.sh`) files. Bioconda
+has a bot that looks for changes to the version number in the `meta.yml` file
+and automatically submits a pull request when it notices that the version has changed.
 
 ## The Harpy repository
 ### structure

diff --git a/haplotagdata.md b/haplotagdata.md
@@ -50,7 +50,7 @@ sequences, then it will make sure the `BX:Z:` tag is moved to the end of the ali
 !!!
 
 ### Read length
-Reads must be at least 15 base pairs in length for alignment. The `trim` module removes reads <15bp.
+Reads must be at least 30 base pairs in length for alignment. The `qc` module removes reads <50bp.
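
To see how many reads in a file fall below a length threshold, something like the following could be used. This is a minimal sketch assuming standard 4-line FASTQ records; `count_short_reads` is a hypothetical helper, not a Harpy command, and the file name in the usage comment is a placeholder:

```shell
# Hypothetical helper (not part of Harpy): count reads shorter than a
# minimum length in an uncompressed 4-line-per-record FASTQ read from stdin.
count_short_reads() {
    # $1 = minimum read length (defaults to 50)
    awk -v min="${1:-50}" \
        'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }'
}

# Usage with a gzipped file (placeholder file name):
# gzip -dc sample.R1.fq.gz | count_short_reads 50
```

The sequence is the second line of every 4-line record, so the check only looks at lines where `NR % 4 == 2`. FASTQ files with wrapped (multi-line) sequences would need a different approach.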
 
 ### Compression
 Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway.
@@ -60,7 +60,7 @@ Compressed files are expected to end with the extension `.gz`.
 Unfortunately, there are many different ways of naming FASTQ files, which makes it 
 difficult to accomodate every wacky iteration currently in circulation.
 While Harpy tries its best to be flexible, there are limitations. 
-To that end, for the `demultiplex`, `trim`, and `align` modules, the 
+To that end, for the `demultiplex`, `qc`, and `align` modules, the 
 most common FASTQ naming styles are supported:
 - **sample names**: Alphanumeric and `.`, `-`, `_`
     - you can mix and match special characters, but that's bad practice and not recommended

diff --git a/index.md b/index.md
@@ -35,16 +35,17 @@ Great! Only want to call variants? Awesome! All modules are called by `harpy <mo
 
 | Module        | Description                                   |
 |:--------------|:----------------------------------------------|
-| `extra`       | Create various associated or necessary files  |
 | `preflight`   | Run various format checks for FASTQ and BAM files |
 | `demultiplex` | Demultiplex haplotagged FASTQ files           |
-| `qc`        | Remove adapters and quality trim sequences    |
+| `qc`          | Remove adapters and quality trim sequences    |
 | `align`       | Align sample sequences to a reference genome  |
-| `snp`          | Call SNPs and small indels                   |
+| `snp`         | Call SNPs and small indels                    |
 | `sv`          | Call large structural variants                |
 | `impute`      | Impute genotypes using variants and sequences |
 | `phase`       | Phase SNPs into haplotypes                    |
-
+| `popgroup`      | Create a sample grouping file               |
+| `stitchparams`  | Create a template STITCH parameter file     |
+| `hpc`           | Create a config file to run Harpy on an HPC |
 
 ## Using Harpy
 You can call `harpy` without any arguments (or with `--help`) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with `--help` to see their usage  (e.g. `harpy align --help`).
@@ -56,25 +57,27 @@ You can call `harpy` without any arguments (or with `--help`) to print the docst
  reads, map sequences, call variants, impute genotypes, and    
  phase haplotypes of Haplotagging data. Batteries included.    
 
- demultiplex >> qc >> align >> snp >> impute >> phase          
+ demultiplex >> qc >> align >> snp >> impute >> phase >> sv          
 
  Documentation: https://pdimens.github.io/harpy/               
 
-╭─ Options ───────────────────────────────────────────────────╮
-│ --version      Show the version and exit.                   │
-│ --help     -h  Show this message and exit.                  │
-╰─────────────────────────────────────────────────────────────╯
-╭─ Modules ───────────────────────────────────────────────────╮
-│ demultiplex  Demultiplex haplotagged FASTQ files            │
-│ qc           Remove adapters and quality trim sequences     │
-│ align        Align sample sequences to a reference genome   │
-│ snp          Call SNPs and small indels                     │
-│ sv           Call large structural variants                 │
-│ impute       Impute genotypes using variants and sequences  │
-│ phase        Phase SNPs into haplotypes                     │
-╰─────────────────────────────────────────────────────────────╯
-╭─ Other Commands ────────────────────────────────────────────╮
-│ preflight  Run file format checks on haplotag data          │
-│ extra      Create various optional/necessary input files    │
-╰─────────────────────────────────────────────────────────────╯
+╭─ Options ──────────────────────────────────────────────────╮
+│ --version      Show the version and exit.                  │
+│ --help     -h  Show this message and exit.                 │
+╰────────────────────────────────────────────────────────────╯
+╭─ Modules ──────────────────────────────────────────────────╮
+│ demultiplex  Demultiplex haplotagged FASTQ files           │
+│ qc           Remove adapters and quality trim sequences    │
+│ align        Align sample sequences to a reference genome  │
+│ snp          Call SNPs and small indels                    │
+│ sv           Call large structural variants                │
+│ impute       Impute genotypes using variants and sequences │
+│ phase        Phase SNPs into haplotypes                    │
+╰────────────────────────────────────────────────────────────╯
+╭─ Other Commands ───────────────────────────────────────────╮
+│ preflight     Run file format checks on haplotag data      │
+│ popgroup      Create a sample grouping file                │
+│ stitchparams  Create a template STITCH parameter file      │
+│ hpc           Create a config file to run Harpy on an HPC  │
+╰────────────────────────────────────────────────────────────╯
 ```
diff --git a/snakemake.md b/snakemake.md
@@ -7,7 +7,7 @@ order: 2
 # :icon-terminal: Adding Snakamake parameters
 Harpy relies on Snakemake under the hood to handle file and job dependencies.
 Most of these details have been abstracted away from the end-user, but every
-module of Harpy (except `extra`) has an optional flag `-s` (`--snakemake`) 
+module of Harpy (except `hpc`, `popgroup`, and `stitchparams`) has an optional flag `-s` (`--snakemake`) 
 that you can use to augment the Snakemake workflow if necessary. Whenever you
 use this flag, your argument must be enclosed in quotation marks, for example:
 ```bash
@@ -49,7 +49,7 @@ Sometimes you want to generate a specific intermediate file (or files) rather th
 you want the beadtag report Harpy makes from the output of `EMA count`. To do this, just list the file/files (relative
 to your working directory) without any flags. Example for the beadtag report:
 ```bash
-harpy align -g genome.fasta -d Trim/ -t 4 -s "Align/ema/stats/reads.bxstats.html"
+harpy align -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html"
 ```
 This of course necessitates knowing the names of the files ahead of time. See the individual modules for a breakdown of expected outputs.