Each of the main Harpy modules (e.g. `qc` or `phase`) follows the format of

```bash
harpy module options arguments
```

where `module` is something like `impute` or `snp mpileup` and `options` are the runtime parameters,
which can include things like an input `--vcf` file, `--molecule-distance`, etc. After the options
is where you provide the input files/directories without flags and following standard BASH expansion
rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions.
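For instance, a call mixing a whole directory with a wildcard expansion might look like this (hypothetical paths):

```bash
# hypothetical inputs: everything in data/run1/ plus matching files from data/run2/
harpy qc data/run1/ data/run2/sample9.R*.fq.gz
```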
Every Harpy module has a series of configuration parameters. These are arguments you need to input
to configure the module to run on your data, such as the directory with the reads/alignments,
the genome assembly, etc. All main modules (e.g. `qc`) also share a series of common runtime
parameters that don't impact the results of the module, but instead control the speed/verbosity/etc.
of calling the module. These runtime parameters are listed in the modules' help strings and can be
configured using these arguments:
| argument | short name | type | default | required | description |
|:---|:---|:---|:---|:---|:---|
| `--output-dir` | `-o` | string | varies | no | Name of output directory |
| `--threads` | `-t` | integer | 4 | no | Number of threads to use |
| `--conda` | | toggle | | no | Use local conda environments instead of preconfigured Singularity container |
| `--skipreports` | | toggle | | no | Skip the processing and generation of HTML reports in a workflow |
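Combined, a hypothetical module call using these runtime arguments (long-form flag names as shown in the table above) might look like:

```bash
# 8 threads, a custom output directory, and no HTML reports
harpy qc --threads 8 --output-dir QC_batch2 --skipreports data/
```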
understanding or as a point of reference when writing the Methods within a manuscript. The presence of the folder
and the contents therein also allow you to rerun the workflow manually. The workflow folder may contain the following:

| item | description |
|:---|:---|
| … | useful to understand math behind plots/tables or borrow code from |
| `*.summary` | Plain-text overview of the important parts of the workflow; useful for bookkeeping and writing Methods |
You will notice that many of the workflows will create a `Genome` folder in the working
directory. This folder is to make it easier for Harpy to store the genome and the associated
indexing/etc. files across workflows without having to redo things unnecessarily. Your input
genome will be symlinked into that directory (not copied, unless a workflow requires gzipping/decompressing),
but all the other files (`.fai`, `.bwt`, `.bed`, etc.) will be created in that directory.
## Installing Harpy for development

The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the `resources/buildlocal.sh`
script to move all the necessary files to the `/bin/` path within your active conda environment.
### Step 2: install the conda environment dependencies

```bash
# install the dependencies with conda/mamba
mamba env create --name harpy --file resources/harpy.yaml
```

This will create a conda environment named `harpy` with all the bits necessary to successfully run Harpy. You can change the name of this environment by specifying `--name something`.
### Step 4: install the Harpy files

Call the `resources/buildlocal.sh` bash script to finish the installation.
This will build the `harpy` python program, and copy all the additional files Harpy needs to run
to the `bin/` directory of your active conda environment.

```bash
# install harpy and the necessary files
bash resources/buildlocal.sh
```
| branch | purpose |
|:---|:---|
| `main` | the source code of the current release; used for new bioconda releases |
| `dev` | staging and testing area for new code prior to creating the next release |
1. within your fork, create a new branch, named something relevant to what you intend to do (e.g., `naibr_bugfix`, `add_deepvariant`)
2. add and modify code with your typical coding workflow, pushing your changes to your Harpy fork
3. when it's ready for inclusion into Harpy (and testing), create a Pull Request to merge your changes into the Harpy `main` branch
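Steps 1 and 2 typically translate to Git commands like these (hypothetical branch name):

```bash
# create and switch to a new branch in your fork
git checkout -b naibr_bugfix
# after committing your changes, push the branch to your fork
git push origin naibr_bugfix
```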
## containerization

As of Harpy v1.0, the software dependencies that the Snakemake workflows use are pre-configured as a Docker image
that is uploaded to Dockerhub. Updating or editing this container can be done automatically or manually.

### automatically

The testing GitHub Action will automatically create a Dockerfile with `harpy containerize` (a hidden harpy command)
and build a new Docker container, then upload it to Dockerhub with the `latest` tag. This process is
triggered on push or pull request with changes to either `src/harpy/conda_deps` or
`src/harpy/snakefiles/containerize.smk` on `main`.
### manually

The Dockerfile for that container is created by using a hidden harpy command, `harpy containerize`:

```bash
# auto-generate Dockerfile
harpy containerize
```

which does all of the work for us. The result is a Dockerfile that has all of the conda environments
written into it. After creating the Dockerfile, the image must then be built.

```bash
# build the Docker image
cd resources
docker build -t pdimens/harpy .
```
This will take a bit because the R dependencies are hefty. Once that's done, the image can be pushed to Dockerhub:

```bash
# push image to Dockerhub
docker push pdimens/harpy
```

This containerize → Dockerfile → build → push process will push the changes to Dockerhub with the `latest` tag, which is suitable for
the development cycle. When the container needs to be tagged to be associated with the release of a new Harpy version, you will need to
add a tag to the `docker build` step:

```bash
# build tagged Docker image
cd resources
docker build -t pdimens/harpy:TAG .
```

where `TAG` is the Harpy version, such as `1.0`, `1.4.1`, `2.1`, etc. As such, during development, the `containerized: docker://pdimens/harpy:TAG` declaration at the top of the snakefiles should use the `latest` tag, and when ready for release, be changed to match the Harpy
version. So, if the Harpy version is `1.4.12`, then the associated docker image should also be tagged with `1.4.12`. The tag should remain `latest`
(unless there is a very good reason otherwise) since automatic Docker tagging happens upon releases of new Harpy versions.
CI (Continuous Integration) is a term describing automated actions that do
things to/with your code and are triggered by how you interact with a repository.
Harpy has a series of GitHub Actions triggered by interactions with the `main` branch (in `.github/workflows`)
to test the Harpy modules depending on which files are being changed by the push or
pull request. It's set up such that, for example, when files associated with
demultiplexing are altered, it will run `harpy demultiplex` on the test data.
## Releases

There is an automation that gets triggered every time Harpy is tagged with the new version. It strips out the unnecessary files and will
upload a cleaned tarball to the new release (reducing file size by orders of magnitude). The automation will also
build a new Dockerfile, tag it with the same git tag for Harpy's next release, and push it to Dockerhub.
In doing so, it will also replace the tag of the container in all of Harpy's snakefiles from `latest` to the
current Harpy version. In other words, during development the top of every snakefile reads
`containerized: docker://pdimens/harpy:latest` and the automation replaces it with (e.g.) `containerized: docker://pdimens/harpy:1.17`.

Tagging is easily accomplished with Git commands in the command line:

```bash
# make sure you're on the main branch
git checkout main
```
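From there, creating and pushing the tag might look like this (hypothetical version number):

```bash
# tag the current commit with the new version
git tag 1.17
# push the tag to the repository, which triggers the release automation
git push origin --tags
```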
## where the barcodes go

Chromium 10X linked-reads use a format where the barcode is the leading 16 bases
of the forward (R1) read. However, haplotagging data does not use that format and many of the tools
implemented in Harpy won't work correctly with the 10X format. Once demultiplexed, haplotagging sequences should look
like regular FASTQ files of inserts and the barcode is stored in a `BX:Z:AxxCxxBxxDxx` tag
in the read header. Again, do not include the barcode in the sequence.
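For example, a demultiplexed haplotagged read header might look like this (hypothetical read name and barcode):

```
@A00587:101:XXXXXXXXX:1:1101:1371:1000 BX:Z:A04C31B59D82
```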
### A caveat

The Leviathan structural variant caller expects the `BX:Z:` tag at the end of the alignment
record, so if you intend on using that variant caller, you will need to make sure the `BX:Z:`
tag is the last one in the sequence alignment (BAM file). If you use any method within
`harpy align`, the `BX:Z:` tag is guaranteed to be at the end of the alignment record.
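If your alignments come from elsewhere, a quick spot-check of where the tag sits might look like this (assuming `samtools` is installed and your file is `aligned.bam`):

```bash
# print the last tab-separated field of the first few alignment records;
# for Leviathan, it should be the BX:Z: barcode tag
samtools view aligned.bam | head -n 3 | awk -F'\t' '{print $NF}'
```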
### Read length

Reads must be at least 30 base pairs in length for alignment. By default, the `qc` module removes reads <30bp.
Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway.
Compressed files are expected to end with the extension `.gz`.
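If your reads are uncompressed, gzipping them is a one-liner (hypothetical filenames):

```bash
# compress paired reads so they end in .gz
gzip sample_001.R1.fastq sample_001.R2.fastq
```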
Unfortunately, there are many different ways of naming FASTQ files, which makes it
difficult to accommodate every wacky iteration currently in circulation.
While Harpy tries its best to be flexible, there are limitations.
To that end, for the `demultiplex`, `qc`, and `align` modules, the
most common FASTQ naming styles are supported:

- sample names: alphanumeric and `.`, `_`, `-`
  - you can mix and match special characters, but that's bad practice and not recommended
  - examples: `Sample.001`, `Sample_001_year4`, `Sample-001_population1.year2` ← not recommended
- forward/reverse: `_F`, `.F`, `_R1`, `.R1`, `_R1_001`, `.R1_001`, etc.
If you use `bamutils clipOverlap` on alignments that are used for the `impute` or
`phase` modules, they will cause both programs to error. We don't know why, but they do.

Solution: do not clip overlapping alignments for BAM files you intend to use for
the `impute` or `phase` modules. Harpy does not clip overlapping alignments, so
alignments produced by Harpy should work just fine.
Once sequences have been trimmed and passed through other QC filters, they will need to
be aligned to a reference genome. This module within Harpy expects filtered reads as input,
such as those derived using `harpy qc`. You can map reads onto a genome assembly with Harpy
using the `align bwa` module:

```bash
# usage
harpy align bwa OPTIONS... INPUTS...
```
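A hypothetical invocation, assuming the assembly is in `genome.fasta` and QC'd reads are in `QC/` (the flag names here are assumptions; check `harpy align bwa --help`):

```bash
# align QC'd reads to the assembly with 8 threads
harpy align bwa --genome genome.fasta --threads 8 QC/
```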
### Running Options

In addition to the common runtime options, the `align bwa` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:---|:---|:---|:---|:---|:---|
| … | | | | | Minimum MQ (SAM mapping quality) to pass filtering |
Once sequences have been trimmed and passed through other QC filters, they will need to
be aligned to a reference genome. This module within Harpy expects filtered reads as input,
such as those derived using `harpy qc`. You can map reads onto a genome assembly with Harpy
using the `align ema` module:

```bash
# usage
harpy align ema OPTIONS... INPUTS...
```
### Running Options

In addition to the common runtime options, the `align ema` module is configured using these command-line arguments:
Once sequences have been trimmed and passed through other QC filters, they will need to
be aligned to a reference genome. This module within Harpy expects filtered reads as input,
such as those derived using `harpy qc`. You can map reads onto a genome assembly with Harpy
using the `align minimap` module:

```bash
# usage
harpy align minimap OPTIONS... INPUTS...
```
### Running Options

In addition to the common runtime options, the `align minimap` module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|:---|:---|:---|:---|:---|:---|
| … | | | | | Minimum MQ (SAM mapping quality) to pass filtering |
paired-end reads from an Illumina sequencer in FASTQ format (gzip recommended)

When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files.
haplotag technology you are using (read Haplotag Types).

```bash
# usage
harpy demultiplex METHOD OPTIONS... R1_FQ R2_FQ I1_FQ I2_FQ

# example using wildcards
harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq.gz Plate_1_S001_I*.fastq.gz
```
### Running Options

In addition to the common runtime options, the `demultiplex` module is configured using these command-line arguments:
| argument | short name | type | default | required | description |
|:---|:---|:---|:---|:---|:---|
| `R1_FQ` | | file path | | yes | The forward multiplexed FASTQ file |
| `R2_FQ` | | file path | | yes | The reverse multiplexed FASTQ file |
| `I1_FQ` | | file path | | yes | The forward FASTQ index file provided by the sequencing facility |
| `I2_FQ` | | file path | | yes | The reverse FASTQ index file provided by the sequencing facility |
| `METHOD` | | choice | | yes | Haplotag technology of the sequences [`gen1`] |
| `--schema` | `-s` | file path | | yes | Tab-delimited file of `sample<tab>barcode` |
do not demultiplex the sequences. Requires the use of bcl2fastq without a sample sheet and with the settings
`--use-bases-mask=Y151,I13,I13,Y151` and `--create-fastq-for-index-reads`. With Generation I beadtags, the C barcode is sample-specific,
meaning a single sample should have the same C barcode for all of its sequences.

### demultiplexing schema

Since Generation I haplotags use a unique Cxx barcode per sample, that's the barcode
that will be used to identify sequences by sample. You will need to provide a simple text
file to `--schema` (`-s`) with two columns, the first being the sample name, the second being
the Cxx barcode (e.g., `C19`). This file is to be tab or space delimited and must have no column names.
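An example sample sheet, with hypothetical sample names and Cxx barcodes:

```
Sample_01	C01
Sample_02	C02
Sample_03	C19
```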
### demultiplexing output

The default output directory is `Demultiplex` with the folder structure below. `Sample1` and `Sample2` are
generic sample names for demonstration purposes. The resulting folder also includes a `workflow` directory
(not shown) with workflow-relevant runtime files and information.
a variant call format file: `.vcf`, `.vcf.gz`, or `.bcf`
## Curation of input VCF file

To work well with STITCH, Harpy needs the input variant call file to meet specific criteria.
Where labelled with **automatic**, Harpy will perform those curation steps on your input
variant call file. Where labelled with **manual**, you will need to perform these curation
tasks yourself prior to running the `impute` module.

### Variant call file criteria

- **automatic** Biallelic SNPs only
- **automatic** VCF is sorted by position
- **manual** No duplicate positions

  ```bash
  # example to remove duplicate positions
  bcftools norm -D in.vcf -o out.vcf
  ```

- **manual** No duplicate sample names

  ```bash
  # count the occurrence of samples
  bcftools query -l file.bcf | sort | uniq -c
  ```

  You will need to remove duplicate samples how you see fit.
After variants have been called, you may want to impute missing genotypes to get the
most from your data. Harpy uses STITCH to impute genotypes, a haplotype-based
method that is linked-read aware. Imputing genotypes requires a variant call file
containing SNPs, such as that produced by `harpy snp`, and preferably filtered in some capacity.
You can impute genotypes with Harpy using the `impute` module:

```bash
# usage
harpy impute OPTIONS... INPUTS...
```
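A hypothetical run, assuming the SNPs are in `variants.bcf`, the STITCH parameter file is `params.stitch`, and alignments live in `Align/` (the `--parameters` flag name is an assumption; check `harpy impute --help`):

```bash
# impute genotypes using a variant call file and a STITCH parameter file
harpy impute --vcf variants.bcf --parameters params.stitch Align/
```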
### Running Options

In addition to the common runtime options, the `impute` module is configured using these command-line arguments:
### Prioritize the vcf file

Sometimes you want to run imputation on all the samples present in the `INPUTS`, but other times you may want
to only impute the samples present in the `--vcf` file. By default, Harpy assumes you want to use all the samples
present in the `INPUTS` and will inform you of errors when there is a mismatch between the sample files
present and those listed in the `--vcf` file. You can instead use the `--vcf-samples` flag if you want Harpy to build a workflow
around the samples present in the `--vcf` file. When using this toggle, Harpy will inform you when samples in the `--vcf` file
are missing from the provided `INPUTS`.
different model parameters (explained in the next section). The solution Harpy uses for this is to have the user
provide a tab-delimited dataframe file where the columns are the 6 STITCH model
parameters and the rows are the values for those parameters. The parameter file
is required and can be created manually or with `harpy stitchparams`.
If created using harpy, the resulting file includes largely meaningless values
that you will need to adjust for your study. The parameter file must follow a particular format:
See the section below for detailed information on each parameter. This
table serves as an overview of the parameters.
example file (as a table)

This is the table view of the tab-delimited file, shown here for clarity.

Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules:
| module | description |
|:---|:---|
| `popgroup` | Create a sample grouping file for variant calling |
| `stitchparams` | Create template STITCH parameter file |
## popgroup

Creates a sample grouping file for variant calling.

```bash
# usage
harpy popgroup -o OUTPUTFILE INPUTS

# usage example
harpy popgroup -o samples.groups data/
```

### arguments
| argument | short name | type | default | required | description |
|:---|:---|:---|:---|:---|:---|
| `INPUTS` | | file/directory paths | | yes | Files or directories containing input FASTQ/BAM files |
| `--output` | `-o` | file path | | yes | name of the output file |
This optional file is useful if you want SNP variant calling to happen on a
per-population level via `harpy snp` or on samples pooled-as-populations via `harpy sv`.

The file takes the format of `sample<tab>group`. All the samples will be assigned to group `pop1` since file names don't always provide grouping information,
so make sure to edit the second column to reflect your data correctly.
```
sample4 pop1
sample5 pop3
```
## stitchparams

Create a template parameter file for the `impute` module. The file is formatted correctly and serves
as a starting point for using parameters that make sense for your study.

```bash
# usage
harpy stitchparams -o OUTPUTFILE

# example
harpy stitchparams -o params.stitch
```
### arguments

| argument | short name | type | default | required | description |
|:---|:---|:---|:---|:---|:---|
| `--output` | `-o` | file path | | yes | name of the output file |
Typically, one runs STITCH multiple times, exploring how results vary with
different model parameters. The solution Harpy uses for this is to have the user
provide a tab-delimited dataframe file where the columns are the 6 STITCH model
parameters and the rows are the values for those parameters. To make formatting
easier, a template file is generated for you; just replace the values and add/remove
rows as necessary. See the section for the `impute` module for details on these parameters.
The template file (e.g. `params.stitch`) will look like:

```
model usebx bxlimit k s ngen
diploid TRUE 50000 3 2 10
diploid TRUE 50000 3 1 5
```