diff --git a/commonoptions/index.html b/commonoptions/index.html

Input Arguments

Each of the main Harpy modules (e.g. qc or phase) follows the format of

harpy module options arguments
where module is something like impute or snp mpileup and options are the runtime parameters, which can include things like an input --vcf file, --molecule-distance, etc. After the options is where you provide the input files/directories without flags, following standard BASH expansion rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions.
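For example, a hypothetical qc call mixing a whole directory, individual files, and a wildcard expansion (paths are illustrative):

example of mixed inputs
harpy qc -t 4 data/pop1 data/pop2/Sample_101.F.fq.gz data/pop2/Sample_101.R.fq.gz data/pop3/Sample_2*.fq.gz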

Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments, the genome assembly, etc. All main modules (e.g. qc) also share a series of common runtime parameters that don't impact the results of the module, but instead control the speed/verbosity/etc. of calling the module. These runtime parameters are listed in the modules' help strings and can be configured using these arguments:

| argument | short name | type | default | required | description |
|---|---|---|---|---|---|
|  | -o | string | varies | no | Name of output directory |
| --threads | -t | integer | 4 | no | Number of threads to use |
| --conda |  | toggle |  |  | Use local conda environments instead of preconfigured Singularity container |
| --skipreports | -r | toggle |  | no | Skip the processing and generation of HTML reports in a workflow |
| --snakemake | -s | string |  | no | Additional Snakemake options, in quotes |
| --quiet | -q | toggle |  | no | Suppress Snakemake printing to console |
| --help | -h |  |  |  | Show the module docstring |
As an example, you could call align minimap and specify 20 threads with no output to console:
harpy align minimap --threads 20 --quiet samples/trimmedreads

# identical to #

harpy align minimap -t 20 -q samples/trimmedreads

understanding or as a point of reference when writing the Methods within a manuscript. The presence of the folder and the contents therein also allow you to rerun the workflow manually. The workflow folder may contain the following:


You will notice that many of the workflows will create a Genome folder in the working directory. This folder makes it easier for Harpy to store the genome and the associated indexing/etc. files across workflows without having to redo things unnecessarily. Your input genome will be symlinked into that directory (not copied, unless a workflow requires gzipping/decompressing), but all the other files (.fai, .bwt, .bed, etc.) will be created in that directory.
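As an illustrative sketch, after an alignment workflow the folder might contain something like (file names assumed for demonstration):

Genome/
├── genome.fasta -> /path/to/genome.fasta
├── genome.fasta.fai
└── genome.fasta.bwt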

diff --git a/development/index.html b/development/index.html

Installing Harpy for development

The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the resources/buildlocal.sh script to move all the necessary files to the /bin/ path within your active conda environment.


install the dependencies with conda/mamba
mamba env create --name harpy --file resources/harpy.yaml

This will create a conda environment named harpy with all the bits necessary to successfully run Harpy. You can change the name of this environment by specifying --name something.


Call the resources/buildlocal.sh bash script to finish the installation. This will build the harpy python program, and copy all the additional files Harpy needs to run to the bin/ directory of your active conda environment.

install harpy and the necessary files
bash resources/buildlocal.sh


The dev workflow is reasonably standard:

  1. create a fork of Harpy
  2. within your fork, create a new branch, name it something relevant to what you intend to do (e.g., naibr_bugfix, add_deepvariant)
  3. add and modify code with your typical coding workflow, pushing your changes to your Harpy fork
  4. when it's ready for inclusion into Harpy (and testing), create a Pull Request to merge your changes into the Harpy main branch
# containerization

As of Harpy v1.0, the software dependencies that the Snakemake workflows use are pre-configured as a Docker image that is uploaded to Dockerhub. Updating or editing this container can be done automatically or manually.

# automatically

The testing GitHub Action will automatically create a Dockerfile with harpy containerize (a hidden harpy command) and build a new Docker container, then upload it to Dockerhub with the latest tag. This process is triggered on push or pull request with changes to either src/harpy/conda_deps or src/harpy/snakefiles/containerize.smk on main.

# manually

The Dockerfile for that container is created by using a hidden harpy command, harpy containerize:

auto-generate Dockerfile
harpy containerize
+

which does all of the work for us. The result is a Dockerfile that has all of the conda environments written into it. After creating the Dockerfile, the image must then be built.

build the Docker image
cd resources
docker build -t pdimens/harpy .
This will take a bit because the R dependencies are hefty. Once that's done, the image can be pushed to Dockerhub:

push image to Dockerhub
docker push pdimens/harpy
This containerize -> Dockerfile -> build -> push process will push the changes to Dockerhub with the latest tag, which is suitable for the development cycle. When the container needs to be tagged to be associated with the release of a new Harpy version, you will need to add a tag to the docker build step:
build tagged Docker image
cd resources
docker build -t pdimens/harpy:TAG .

where TAG is the Harpy version, such as 1.0, 1.4.1, 2.1, etc. As such, during development, the containerized: docker://pdimens/harpy:TAG declaration at the top of the snakefiles should use the latest tag, and when ready for release, changed to match the Harpy version. So, if the Harpy version is 1.4.12, then the associated docker image should also be tagged with 1.4.12. The tag should remain latest (unless there is a very good reason otherwise) since automatic Docker tagging happens upon releases of new Harpy versions.


CI (Continuous Integration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the main branch (in .github/workflows) to test the Harpy modules depending on which files are being changed by the push or pull request. It's set up such that, for example, when files associated with demultiplexing are altered, it will run harpy demultiplex on the test data.

Releases

There is an automation that gets triggered every time Harpy is tagged with the new version. It strips out the unnecessary files and will upload a cleaned tarball to the new release (reducing filesize by orders of magnitude). The automation will also build a new Dockerfile, tag it with the same git tag for Harpy's next release, and push it to Dockerhub. In doing so, it will also replace the tag of the container in all of Harpy's snakefiles from latest to the current Harpy version. In other words, during development the top of every snakefile reads containerized: docker://pdimens/harpy:latest and the automation replaces it with (e.g.) containerized: docker://pdimens/harpy:1.17. Tagging is easily accomplished with Git commands in the command line:

# make sure you're on the main branch
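# a minimal sketch of the remaining steps; the version number is illustrative
git pull
git tag 1.4.12
git push origin 1.4.12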
diff --git a/haplotagdata/index.html b/haplotagdata/index.html

where the barcodes go

Chromium 10X linked-reads use a format where the barcode is the leading 16 bases of the forward (R1) read. However, haplotagging data does not use that format and many of the tools implemented in Harpy won't work correctly with the 10X format. Once demultiplexed, haplotagging sequences should look like regular FASTQ files of inserts and the barcode is stored in a BX:Z:AxxCxxBxxDxx tag in the read header. Again, do not include the barcode in the sequence.
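For illustration, a demultiplexed haplotagging record might look like this (read ID, sequence, and barcode values are made up):

@SEQ_ID_001 BX:Z:A01C34B11D87
ATGGCATACGGACCATTAGG
+
FFFFFFFFFFFFFFFFFFFF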


A caveat

The Leviathan structural variant caller expects the BX:Z: tag at the end of the alignment record, so if you intend on using that variant caller, you will need to make sure the BX:Z: tag is the last one in the sequence alignment (BAM file). If you use any method within harpy align, the BX:Z: tag is guaranteed to be at the end of the alignment record.


Read length

Reads must be at least 30 base pairs in length for alignment. By default, the qc module removes reads <30bp.


Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension .gz.


Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accommodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations. To that end, for the demultiplex, qc, and align modules, the most common FASTQ naming styles are supported (see the example filenames after this list):

  • sample names: Alphanumeric characters and . _ -
    • you can mix and match special characters, but that's bad practice and not recommended
    • examples: Sample.001, Sample_001_year4, Sample-001_population1.year2 <- not recommended
  • forward: _F, .F, _R1_001, .R1_001, _R1, .R1
  • reverse: _R, .R, _R2_001, .R2_001, _R2, .R2
    • note that this does not include .1 or _1 conventions for forward/reverse
  • fastq extension: .fq, .fastq (or uppercase variants)
  • gzipped: supported and recommended
  • not gzipped: supported
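As an illustration, these hypothetical pairs would all follow the supported conventions:

Sample_001_R1.fq.gz / Sample_001_R2.fq.gz
Sample.001.F.fastq / Sample.001.R.fastq
Sample-001_R1_001.fastq.gz / Sample-001_R2_001.fastq.gz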

| item | contents | utility |
|---|---|---|
| *.summary | plain-text overview of the important parts of the workflow | useful for bookkeeping and writing Methods |
|  |  | useful to understand math behind plots/tables or borrow code from |

| branch | purpose |
|---|---|
| main | the source code of the current release, used for new bioconda releases |
| dev | staging and testing area for new code prior to creating the next release |
| docs |  |

| module | description |
|---|---|
| preflight | Run various format checks for FASTQ and BAM files |
| demultiplex | Demultiplex haplotagged FASTQ files |
| qc | Remove adapters and quality trim sequences |
| align | Align sample sequences to a reference genome |
| snp | Call SNPs and small indels |
| sv | Call large structural variants |
| impute | Impute genotypes using variants and sequences |
| phase | Phase SNPs into haplotypes |
| popgroup | Create a sample grouping file |
| stitchparams | Create a template STITCH parameter file |
| hpc | Create a config file to run Harpy on an HPC |
diff --git a/install/index.html b/install/index.html
diff --git a/issues/index.html b/issues/index.html

Failures during imputation or phasing

If you use bamutils clipOverlap on alignments that are used for the impute or phase modules, they will cause both programs to error. We don't know why, but they do.

Solution: Do not clip overlapping alignments for bam files you intend to use for the impute or phase modules. Harpy does not clip overlapping alignments, so alignments produced by Harpy should work just fine.

diff --git a/modules/align/bwa/index.html b/modules/align/bwa/index.html

  • at least 4 cores/threads available
  • a genome assembly in FASTA format: .fasta, .fa, .fasta.gz, .fa.gz
  • paired-end fastq sequence files with the proper naming convention (gzipped recommended)

Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input, such as those derived using harpy qc. You can map reads onto a genome assembly with Harpy using the align bwa module:

usage
harpy align bwa OPTIONS... INPUTS...

Running Options

In addition to the common runtime options, the align bwa module is configured using these command-line arguments:

| argument | short name | type | description |
|---|---|---|---|
|  |  |  | Minimum MQ (SAM mapping quality) to pass filtering |
| --extra-params | -x | string |  |
+
diff --git a/modules/align/ema/index.html b/modules/align/ema/index.html

  • at least 4 cores/threads available
  • a genome assembly in FASTA format: .fasta, .fa, .fasta.gz, .fa.gz
  • paired-end fastq sequence files with the proper naming convention (gzipped recommended)
  • patience, because EMA is slow

Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input, such as those derived using harpy qc. You can map reads onto a genome assembly with Harpy using the align ema module:

usage
harpy align ema OPTIONS... INPUTS...

Running Options

In addition to the common runtime options, the align ema module is configured using these command-line arguments:


diff --git a/modules/align/minimap/index.html b/modules/align/minimap/index.html

  • at least 4 cores/threads available
  • a genome assembly in FASTA format: .fasta, .fa, .fasta.gz, .fa.gz
  • paired-end fastq sequence files with the proper naming convention (gzipped recommended)

Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input, such as those derived using harpy qc. You can map reads onto a genome assembly with Harpy using the align minimap module:

usage
harpy align minimap OPTIONS... INPUTS...

Running Options

In addition to the common runtime options, the align minimap module is configured using these command-line arguments:

| argument | short name | type | description |
|---|---|---|---|
|  |  |  | Minimum MQ (SAM mapping quality) to pass filtering |
| --extra-params | -x | string |  |
@@ -612,10 +622,10 @@

diff --git a/modules/demultiplex/index.html b/modules/demultiplex/index.html

  • at least 2 cores/threads available
  • paired-end reads from an Illumina sequencer in FASTQ format (gzip recommended)

When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files, which need to be demultiplexed by sample according to the haplotag technology you are using (read Haplotag Types).

usage
harpy demultiplex METHOD OPTIONS... R1_FQ R2_FQ I1_FQ I2_FQ

example using wildcards
harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq.gz Plate_1_S001_I*.fastq.gz


Running Options

In addition to the common runtime options, the demultiplex module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|---|---|---|---|---|---|
| R1_FQ |  | file path |  | yes | The forward multiplexed FASTQ file |
| R2_FQ |  | file path |  | yes | The reverse multiplexed FASTQ file |
| I1_FQ |  | file path |  | yes | The forward FASTQ index file provided by the sequencing facility |
| I2_FQ |  | file path |  | yes | The reverse FASTQ index file provided by the sequencing facility |
| METHOD |  | choice | gen1 | yes | Haplotag technology of the sequences [gen1] |
| --schema | -s | file path |  | yes | Tab-delimited file of sample<tab>barcode |

do not demultiplex the sequences. Requires the use of bcl2fastq without sample-sheet and with the settings --use-bases-mask=Y151,I13,I13,Y151 and --create-fastq-for-index-reads. With Generation I beadtags, the C barcode is sample-specific, meaning a single sample should have the same C barcode for all of its sequences.

# demultiplexing schema

Since Generation I haplotags use a unique Cxx barcode per sample, that's the barcode that will be used to identify sequences by sample. You will need to provide a simple text file to --schema (-s) with two columns, the first being the sample name, the second being the Cxx barcode (e.g., C19). This file is to be tab or space delimited and must have no column names.

example sample sheet
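Sample_01	C01
Sample_02	C02
Sample_03	C19

(sample names and Cxx barcodes above are illustrative)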

The default output directory is Demultiplex with the folder structure below. Sample1 and Sample2 are generic sample names for demonstration purposes. The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.

Demultiplex/
 ├── Sample1.F.fq.gz
 ├── Sample1.R.fq.gz
 ├── Sample2.F.fq.gz

   └── demultiplex.QC.html

diff --git a/modules/impute/index.html b/modules/impute/index.html

  • a tab-delimited parameter file
  • sequence alignments in BAM format: .bam
  • a variant call format file: .vcf, .vcf.gz, .bcf
To work well with STITCH, Harpy needs the input variant call file to meet specific criteria. Where labelled with automatic, Harpy will perform those curation steps on your input variant call file. Where labelled with manual, you will need to perform these curation tasks yourself prior to running the impute module.

# Variant call file criteria
  1. automatic: Biallelic SNPs only
  2. automatic: VCF is sorted by position
  3. manual: No duplicate positions
    • example to remove duplicate positions
      bcftools norm -D in.vcf -o out.vcf
  4. manual: No duplicate sample names
    • count the occurrence of samples
      bcftools query -l file.bcf | sort | uniq -c
    • you will need to remove duplicate samples how you see fit

After variants have been called, you may want to impute missing genotypes to get the most from your data. Harpy uses STITCH to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file containing SNPs, such as that produced by harpy snp and preferably filtered in some capacity. You can impute genotypes with Harpy using the impute module:

usage
harpy impute OPTIONS... INPUTS...

Running Options

In addition to the common runtime options, the impute module is configured using these command-line arguments:


Prioritize the vcf file

Sometimes you want to run imputation on all the samples present in the INPUTS, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples present in the INPUTS and will inform you of errors when there is a mismatch between the sample files present and those listed in the --vcf file. You can instead use the --vcf-samples flag if you want Harpy to build a workflow around the samples present in the --vcf file. When using this toggle, Harpy will inform you when samples in the --vcf file are missing from the provided INPUTS.
# @@ -373,7 +429,9 @@

different model parameters (explained in next section). The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model parameters and the rows are the values for those parameters. The parameter file -is required and can be created manually or with harpy stitchparams -o <filename>. +is required and can be created manually or with + harpy stitchparams +. If created using harpy, the resulting file includes largely meaningless values that you will need to adjust for your study. The parameter must follow a particular format:


    See the section below for detailed information on each parameter. This table serves as an overview of the parameters.


This is the table view of the tab-delimited file, shown here for clarity.

Filtering for biallelic contigs
diff --git a/modules/othermodules/index.html b/modules/other/index.html

Other Harpy modules

Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules:

| module | description |
|---|---|
| popgroup | Create a sample grouping file |
| stitchparams | Create template STITCH parameter file |
| hpc | Create HPC scheduling profile for cluster submission |

popgroup

Creates a sample grouping file for variant calling

usage
harpy popgroup -o OUTPUTFILE INPUTS
usage example
harpy popgroup -o samples.groups data/

# arguments
# arguments -

+

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
argumentshort nametypedefaultrequireddescription
INPUTSfile/directory pathsyesFiles or directories containing input FASTQ/BAM files
--output-ofile pathyesname of the output file
This optional file is useful if you want SNP variant calling to happen on a per-population level via harpy snp or on samples pooled-as-populations via harpy sv.

  • takes the format of sample<tab>group
  • takes the format of sample + tab +group
  • all the samples will be assigned to group pop1 since file names don't always provide grouping information
    • so make sure to edit the second column to reflect your data correctly.
sample4	pop1
sample5	pop3

# stitchparams

- -

- # - STITCH parameter file -

-
Create a template parameter file for the impute module. The file is formatted correctly and serves as a starting point for using parameters that make sense for your study.

usage
harpy stitchparams -o OUTPUTFILE

example
harpy stitchparams -o params.stitch
# arguments

| argument | short name | type | default | required | description |
|---|---|---|---|---|---|
| --output | -o | file path |  | yes | name of the output file |

Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model parameters and the rows are the values for those parameters. To make formatting easier, a template file is generated for you, just replace the values and add/remove rows as necessary. See the section for the impute module for details on these parameters. The template file will look like:

params.stitch
model	usebx	bxlimit	k	s	ngen
diploid	TRUE	50000	3	2	10
diploid	TRUE	50000	3	1	5
diff --git a/modules/phase/index.html b/modules/phase/index.html

  • at least 2 cores/threads available
  • sequence alignments, in BAM format: .bam
  • a variant call format file of genotypes: .vcf, .bcf
  • optional: a reference genome in FASTA format: .fasta, .fa, .fasta.gz, .fa.gz

You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing genotypes into haplotypes requires alignment files, such as those produced by align bwa, and a variant call file, such as one produced by snp freebayes or impute, preferably filtered in some capacity. Phasing only works on SNP data, and will not work for structural variants produced by sv leviathan or sv naibr. You can phase genotypes into haplotypes with Harpy using the phase module:

usage
harpy phase OPTIONS... INPUTS...

Running Options

In addition to the common runtime options, the phase module is configured using these command-line arguments:


Prioritize the vcf file

Sometimes you want to run imputation on all the samples present in the INPUTS, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples present in the INPUTS and will inform you of errors when there is a mismatch between the sample files present and those listed in the --vcf file. You can instead use the --vcf-samples flag if you want Harpy to build a workflow around the samples present in the --vcf file. When using this toggle, Harpy will inform you when samples in the --vcf file are missing from the provided INPUTS.
The molecule distance and pruning thresholds are considered the most impactful parameters for running HapCut2.



diff --git a/modules/preflight/index.html b/modules/preflight/index.html


Below is a table of the format specifics preflight fastq checks for FASTQ files. Since 10X data doesn't use the haplotagging data format, you will find little value in running preflight fastq on 10X FASTQ files. Take note of the language such as when "any" and "all" are written.


Below is a table of the format specifics preflight bam checks for SAM/BAM files. Take note of the language such as when "any" and "all" are written.
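As a usage sketch (directory paths are illustrative):

harpy preflight fastq samples/rawreads
harpy preflight bam Align/bwa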

diff --git a/modules/qc/index.html b/modules/qc/index.html

  • at least 2 cores/threads available
  • paired-end fastq sequence files (gzip recommended)

Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analyses is to remove these adapters and trim poor quality data. You can remove adapters and quality trim sequences using the qc module:

usage
harpy qc OPTIONS... INPUTS...

Running Options

In addition to the common runtime options, the qc module is configured using these command-line arguments:


      └── Sample2.fastp.json
| argument | short name | type | default | required | description |
|---|---|---|---|---|---|
| INPUTS |  | file/directory paths |  | yes | Files or directories containing input FASTQ files |
| --min-length | -n | integer | 30 | no | Discard reads shorter than this length |
| --max-length | -m | integer | 150 | no | Maximum length to trim sequences down to |
| --ignore-adapters |  | toggle |  | no | Skip adapter trimming |
| --extra-params | -x | string |  |  |  |
--ignore-adapters-xtogglenoSkip adapter trimming
--extra-params -x string
+
diff --git a/modules/simulate/simulate-linkedreads/index.html b/modules/simulate/simulate-linkedreads/index.html

# Simulate Linked Reads
Simulate linked reads from a genome

  • two haplotypes of a reference genome in FASTA format: .fasta, .fa, .fasta.gz, .fa.gz
  • optional: a file of 16-basepair barcodes to tag linked reads with

The original LRSIM is a lengthy Perl script that, like Harpy, outsources to various other programs (SURVIVOR, DWGSIM, samtools, msort) and acts as a workflow through these programs. The Harpy version of LRSIM keeps only the novel LRSIM code that creates linked reads from reads simulated by DWGSIM. The rest of LRSIM's components are reincorporated into the Snakemake workflow governing the simulate linkedreads module, while removing the SURVIVOR part since simulate {snpindel,...} are used for that purpose.

# Notable differences
  • dependencies are expected to be on the PATH, not hardcoded to the folder LRSIM is running from
  • -r parameter changed to folder prefix since Harpy uses -g for the haplotypes
  • outputs are coded a little differently for flexibility (and use the -r parameter for some parts)
  • SURVIVOR variant simulation functionality removed entirely
  • DWGSIM, samtools, msort, and extractReads functionality moved into Harpy workflow
  • uses newer version of DWGSIM

You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants (like those created by simulate {snpindel,...}) and linked-read sequences. This module will create (diploid) linked-read sequences from two genome haplotypes. To accomplish this, Harpy uses a modified version of LRSIM, and converts the LRSIM 10X-style output into Haplotag-style reads. To simulate linked reads, use:

usage
harpy simulate linkedreads OPTIONS... HAP1_GENOME HAP2_GENOME

example
harpy simulate linkedreads -t 4 -n 2 -l 100 -p 50 data/genome.hap1.fasta data/genome.hap2.fasta
# Running Options

In addition to the common runtime options, the simulate linkedreads module is configured using these command-line arguments:

| argument | short name | type | default | required | description |
|---|---|---|---|---|---|
| HAP1_GENOME |  | file path |  | yes | Haplotype 1 of the diploid genome to simulate reads |
| HAP2_GENOME |  | file path |  | yes | Haplotype 2 of the diploid genome to simulate reads |
| --outer-distance | -d | integer | 350 |  | Outer distance between paired-end reads (bp) |
| --distance-sd | -i | integer | 15 |  | Standard deviation of read-pair distance |
| --barcodes | -b | file path | 10X barcodes |  | File of linked-read barcodes to add to reads |
| --read-pairs | -n | number | 600 |  | Number (in millions) of read pairs to simulate |
| --mutation-rate | -r | number | 0.001 |  | Random mutation rate for simulating reads (0 - 1.0) |
| --molecule-length | -l | integer | 100 |  | Mean molecule length (kbp) |
| --partitions | -p | integer | 1500 |  | Number (in thousands) of partitions/beads to generate |
| --molecules-per | -m | integer | 10 |  | Average number of molecules per partition |
# Mutation Rate

The read simulation is two-part: first dwgsim generates forward and reverse FASTQ files from the provided genome haplotypes (HAP1_GENOME and HAP2_GENOME), then LRSIM takes over and creates linked-reads from that. The --mutation-rate option controls the random mutation rate dwgsim uses when creating FASTQ files from your provided genome haplotypes. This parameter adds SNPs/variation in addition to the error rate assumed for the Illumina platform. If you don't want any more SNPs added to the reads beyond sequencing error, set this value to --mutation-rate 0.

# Simulating a single sample

If you intend to simulate a "single individual" (i.e. use this module once), then you might want no additional SNPs beyond the variants you may have already introduced into the genome and set --mutation-rate 0.

# Simulating multiple samples

If you intend on simulating "multiple individuals" (i.e. use this module multiple times on the same genome haplotypes), it may make sense to set this value larger than 0 so there is some "natural" variation between your simulated individuals.

# Partitions

TL;DR: 10X partitions ≈ haplotag beads

The option --partitions refers to the reaction "bubbles" in the original 10X linked-read chemistry. The 10X protocol involved emulsion reactions where microscopic bubbles resulting from emulsion were each their own reaction micro-environment. Each of these "partitions" (aka bubbles, etc.) was to contain a unique linked-read barcode that would be ligated onto the sample DNA, thus creating the linked read barcoding. In an ideal situation, there would be a single molecule per emulsion partition, which was rarely the case because it's really difficult to achieve that. In haplotag terms, think of partitions as being synonymous with tagmentation beads. In both 10X and haplotag simulation cases, you're controlling for how many clashing barcodes there might be, where reads that aren't from the same molecule have the same linked-read barcode. This is why linked-read software (and corresponding Harpy modules) have an option to set the barcode threshold.

# Barcodes

Barcodes, if provided, must be given as 16-basepair nucleotide sequences, one per line. If not provided, Harpy will download the standard 10X Genomics 4M-with-alts-february-2016.txt barcode set from the LRSIM repository and use those. The barcode file should look like:

input.barcodes.txt
ATATGTACTCATACCA
GGATGTACTCATTCCA
TCACGTACTCATACCA
etc...

# 10X to Haplotag conversion

Harpy will convert the simulated 10X-style reads, where the 16-basepair barcode is at the beginning of read 1, to haplotag format, where the barcode is coded in the sequence header under the BX:Z tag with the format AxxCxxBxxDxx, where xx is a number between 00 and 96. Using this format, a 00 would invalidate the entire barcode due to a segment failing to be associated with a beadtag segment. In the simulated data, since 10X barcodes don't feature segments, failure to associate the first 16 bases of read 1 with barcodes provided to --barcodes will appear as BX:Z:A00C00B00D00. The original 10X barcode (or first 16 bases of read 1) will be removed from the sequence and stored in the TX:Z sequence header tag, e.g. TX:Z:ATATGTACTCATACCA. The paired reverse read will also have these tags. The diagram below attempts to simplify this visually.

[figure: 10X linked read barcode conversion into AxxCxxBxxDxx haplotag barcode format]
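As an illustration, a simulated read 1 before and after conversion might look like this (read ID and barcode values are made up):

# 10X style: the first 16 bp of read 1 are the barcode
@SIM_READ_1
ATATGTACTCATACCATTACGGAGCAT

# haplotag style after conversion: barcode moved to header tags
@SIM_READ_1 BX:Z:A01C22B47D93 TX:Z:ATATGTACTCATACCA
TTACGGAGCAT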

# Choosing parameters

LRSIM does internal calculations to determine the number of reads per molecule based on --read-pairs, --partitions, and --molecules-per. Understanding how these parameters affect the resulting sequences will help inform your decisions for those parameters:

\text{Reads Per Molecule} = 0.499 + \frac{N \times 1{,}000{,}000}{\left(\frac{P \times 1{,}000}{H}\right) \times M \times H}

\text{where: } N = \text{number of read pairs to simulate (in millions)},\ H = \text{number of haplotypes (fixed at 2)},\ P = \text{number of partitions (in thousands)},\ M = \text{molecules per partition}
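As a worked example using the default values from the table above (N = 600, P = 1500, M = 10, H = 2):

\text{Reads Per Molecule} = 0.499 + \frac{600 \times 1{,}000{,}000}{\left(\frac{1500 \times 1{,}000}{2}\right) \times 10 \times 2} = 0.499 + \frac{6 \times 10^{8}}{1.5 \times 10^{7}} \approx 40.5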

# Parameter calculator

Conveniently, we provide a calculator to help you make informed decisions for these parameters.

# Simulate Linkedreads Workflow
+
graph LR
    subgraph Inputs
        A[genome haplotype 1]
        B[genome haplotype 2]
    end
    Inputs-->D([dwgsim])
    D-->L([LRSIM])
    L-->H([convert to haplotag])
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
diff --git a/modules/simulate/simulate-variants/index.html b/modules/simulate/simulate-variants/index.html

  • a reference genome in FASTA format: .fasta, .fa, .fasta.gz, .fa.gz

You may want to benchmark haplotag data on different kinds of genomic variants.

There are 4 submodules with very obvious names:

| submodule | description |
|---|---|
| snpindel | simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels) |
| inversion | simulates inversions |
| cnv | simulates copy number variants |
| translocation | simulates translocations |

While there are several differences between the submodule command line options, each has available all the common runtime options like other Harpy modules. Each requires an input genome at the end of the command line, and each requires either a --count of variants to randomly simulate, or a --vcf of specific variants to simulate. There are also these unifying options among the different variant types:
# @@ -401,7 +419,7 @@

SNPs can be slow
-

argument
+
@@ -414,7 +432,7 @@
SNPs can be slow
- + @@ -480,7 +498,7 @@
SNPs can be slow
- +

# @@ -489,7 +507,7 @@

Inversions are when a section of a chromosome appears in the reverse orientation (source).

-

argument
--snp-vcf-v-s file path VCF file of known snps to simulate
+
@@ -533,7 +551,7 @@

- +

# @@ -559,7 +577,7 @@

-

argument
+
@@ -624,7 +642,7 @@

- +

# @@ -633,7 +651,7 @@

A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes (source).

-

argument
+
@@ -750,7 +768,9 @@

Simulate Diploid Assembly

Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. Due to the roundabout complexity of the process, attempts were made to use color to help keep track of the original haploid genome and the resulting genome haplotypes. If you haven't already, please read the sections about simulating known variants and heterozygosity. The idea here is that due to the limitations of simuG, we can only simulate one type of variant at a time and we will take advantage of the VCF files created by

graph LR
    geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf)
    hap-->hap1(inversion.hap1.vcf)
    hap-->hap2(inversion.hap2.vcf)
    style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px


into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype genomes.

graph LR
    subgraph id1 ["Haplotype 1 Inputs"]
    hap1(inversion.hap1.vcf)---geno(haploid genome)
    end
    id1-->|simulate inversion -v|hapgeno(haplotype-1 genome)
    style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style hapgeno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px
    style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px

graph LR
    subgraph id2 ["Haplotype 2 Inputs"]
    hap2(inversion.hap2.vcf)---geno(haploid genome)
    end
    id2-->|simulate inversion -v|hapgeno2(haplotype-2 genome)
    style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style hapgeno2 fill:#bd8fcb,stroke:#a460b7,stroke-width:2px
    style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px


graph LR
    geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf)
    hap-->hap1(snpindel.hap1.vcf)
    hap-->hap2(snpindel.hap2.vcf)
    style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px


graph LR
    subgraph id1 ["Haplotype 1 inputs"]
    hap1(snpindel.hap1.vcf)---geno(haplotype-1 genome)
    end
    id1-->|simulate inversion -v|genohap1(haplotype-1 genome with new variants)
    style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px
    style genohap1 fill:#90c8be,stroke:#000000,stroke-width:2px

graph LR
    subgraph id2 ["Haplotype 2 inputs"]
    hap1(snpindel.hap2.vcf)---geno(haplotype-2 genome)
    end
    id2-->|simulate inversion -v|genohap2(haplotype-2 genome with new variants)
    style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style geno fill:#bd8fcb,stroke:#a460b7,stroke-width:2px
    style genohap2 fill:#bd8fcb,stroke:#000000,stroke-width:2px


| argument | short name | type | default | required | description |
|---|---|---|---|---|---|
|  |  |  |  |  | Genome assembly for variant calling |
| --regions | -r | integer/file path/string | 50000 | no | Regions to call variants on (see below) |
| --populations |  |  |  |  |  |
# regions

The --regions (-r) option lets you specify the genomic regions you want to call variants on. Keep in mind that mpileup uses 1-based positions for genomic intervals, whereas freebayes uses 0-based positions. Harpy will perform variant calling in parallel over these intervals and they can be specified in three ways:


This is the default method (-r 50000), where Harpy will create 50 kbp non-overlapping genomic intervals across +the entire genome. Intervals towards the end of contigs that are shorter than the specified interval +size are still used. These invervals look like:

+
+
chromosome_1    1   50000
chromosome_1    50001 100000
chromosome_1    100001  121761    <- reached the end of the contig
+

Following the mpileup and freebayes format, you can specify a single genomic interval of interest to call variants on. This interval must be in the format contig:start-end where:

+
    +
  • contig is the exact name of a contig in the supplied genome
  • start is an integer specifying the start position of the interval for that contig
  • end is an integer specifying the end position of the interval for that contig
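For example, a hypothetical run over a single interval (contig name and paths are illustrative; other required options, like the genome, omitted for brevity):

harpy snp mpileup -r chromosome_1:1-50000 Align/bwa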

A BED-like file of contig<whitespace>start<whitespace>end can be provided to call variants on only specific genomic intervals. This file will look like:

chromosome_1    1   45000
chromosome_1    723123 999919
chromosome_5    22421   564121


Grouping samples changes the way the variant callers compute certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include it.

# Filtering Variants

The discussion around filtering SNPs and indels is massive and many researchers go about it differently, each very opinionated as to why their method is the best. As a starting point, have a look at how the authors of HTSlib give a technical overview of variant filtering. It's a dense read, but does offer insights and considerations for SNP/indel filtering. Here are some of the basic things to be mindful of for variant filtering:

+ + +

The best and fastest way to filter variants is to use bcftools, which has a bit of a learning curve, but its power is unmatched. Filtering can be achieved using either bcftools view or bcftools filter and the filtering expression can either be -i to include sites or -e to exclude sites matching the expression:

+
+
# bcftools view approach
bcftools view -i 'EXPRESSION' input.vcf > output.vcf
bcftools view -e 'EXPRESSION' input.vcf > output.vcf

# bcftools filter approach
bcftools filter -i 'EXPRESSION' input.vcf > output.vcf
bcftools filter -e 'EXPRESSION' input.vcf > output.vcf
+

In either case, you can add -Ob to output a compressed bcf (recommended) file instead of an uncompressed vcf file (default). The EXPRESSION is extremely flexible and multiple expressions can be chained with || (OR) and && (AND).

+
+
filtering expression examples
# -e to EXCLUDE
bcftools view -Ob -e 'QUAL <= 10 || DP > 35 || MQBZ < -3 || RPBZ < -3 || RPBZ > 3 || FORMAT/SP > 32 || SCBZ > 3' in.vcf > out.bcf

# -i to INCLUDE, this example would result in the same output as the -e example
# (note the negated conditions are joined with && rather than ||)
bcftools filter -Ob -i 'QUAL > 10 && DP <= 35 && MQBZ >= -3 && RPBZ >= -3 && RPBZ <= 3 && FORMAT/SP <= 32 && SCBZ <= 3' in.vcf > out.bcf
+
# genotype quality (QUAL)

You will obviously want higher quality genotype calls to remove false positives. The HTSlib guide suggests at least 50 (e.g. -i 'QUAL>=50'), but we typically filter much higher at 90 or more (e.g. -i 'QUAL>=90').

# read depth (DP)

Variant sites with too few reads backing up the genotype might be false positives, although this may not hold true for very low-coverage data. Conversely, a maximum cutoff is important because sites with very high read depths (relative to the distribution of read depth) are likely repetitive regions mapping to multiple parts of the genome. You could use fixed values for these thresholds that make sense for your data. One scalable approach is to define the thresholds as quantiles, such as the 0.01 and 0.99 quantiles of read depths, which would remove the sites with the lowest 1% and highest 1% of read depths. These are example quantiles and they don't need to be symmetric. It would be best to plot the distribution of site depths to assess what makes sense for your data. Unfortunately, bcftools does not have internal routines to calculate quantiles, but you can do it all from the command line using a combination of bcftools query and awk (separated onto 3 lines here for demonstration purposes; a sketch of applying the resulting cutoffs follows the list below):

+
+
find a specific depth quantile
+
bcftools query -f "%DP\n" input.bcf |\
+  sort -n |\
+  awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
+
+
    +
  • line 1: extract the depth for every site in the vcf file, one per line
  • +
  • line 2: sort the values numerically
  • +
  • line 3: find the 95th quantile +
      +
    • change the 0.95 in NR*0.95 to whatever quantile you want
    • +
    • the - 0.5 part rounds down and may need to be adjusted for your quantile
    • +
    +
  • +
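Once you have both cutoffs, they slot directly into an exclusion expression. A minimal sketch, assuming hypothetical cutoffs of 8 (the 0.01 quantile) and 120 (the 0.99 quantile):

# remove sites in the lowest and highest 1% of site depths (8 and 120 are hypothetical values)
bcftools view -Ob -e 'DP < 8 || DP > 120' input.bcf > output.dpfiltered.bcf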
+ +

+ # + minor allele frequency (MAF) +

+
+

It's usually advisable to set a minor allele frequency (MAF) threshold and remove sites whose MAF falls below it. The reasoning is that if a MAF is too low, it might be due to incorrectly called genotypes in a very small handful of individuals (e.g. one or two). The MAF threshold is again dependent on your data, although it's fairly common to use 0.05 (e.g. -i 'MAF>0.05') to 0.10 (e.g. -i 'MAF>0.10').
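Applied on its own, a MAF filter might look like this (a minimal sketch):

# keep only sites with a minor allele frequency above 5%
bcftools view -Ob -i 'MAF > 0.05' input.bcf > output.maf05.bcf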

+ +

+ # + missing data (F_MISSING) +

+
+

Missing data is, frankly, not terribly useful. The amount of missing data you're willing to tolerate will depend on your study, but it's common to remove sites with >20% missing data (e.g. -e 'F_MISSING>0.2'). This can be as strict (or lenient) as you want; it's not uncommon to see very conservative filtering at 10% or 5% missing data. However, you can impute missing genotypes to recover missing data! Harpy can leverage linked-read information to impute genotypes with the impute module. You should try imputing genotypes before filtering out sites based on missingness.
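Putting the pieces of this section together, a first-pass filter combining quality, frequency, and missingness might look like this (a sketch; every threshold here is an example, not a recommendation for your data):

# keep sites with QUAL >= 90, MAF above 5%, and under 20% missing genotypes
bcftools view -Ob -i 'QUAL>=90 && MAF>0.05 && F_MISSING<0.2' input.bcf > output.filtered.bcf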


@@ -403,7 +612,7 @@

-

The workflow is parallelized over genomic intervals (--windowsize). All intermediate outputs are removed, leaving +

The workflow is parallelized over genomic intervals (--regions). All intermediate outputs are removed, leaving you only the raw variants file (in .bcf format), the index of that file, and some stats about it.

@@ -417,15 +626,12 @@

subgraph Inputs aln[BAM alignments]---gen[genome] end - Inputs --> B & A - A([split contigs]) --> B([bcftools mpileup]) + Inputs --> B([bcftools mpileup]) B-->C([bcftools call]) C-->D([index BCFs]) D-->E([combine BCFs]) C-->E - E-->G([normalize variants]) E-->F([generate reports]) - G-->F style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px

@@ -439,13 +645,10 @@

subgraph Inputs aln[BAM alignments]---gen[genome] end - Inputs --> B & A - A([split contigs]) --> B([freebayes]) + Inputs --> B([freebayes]) B-->D([index BCFs]) D-->E([combine BCFs]) - E-->G([normalize variants]) E-->F([generate reports]) - G-->F style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px @@ -455,8 +658,6 @@

The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.

SNP/method
-├── variants.normalized.bcf
-├── variants.normalized.bcf.csi
 ├── variants.raw.bcf
 ├── variants.raw.bcf.csi
 ├── logs
@@ -470,13 +671,11 @@ 

└── reports    ├── contig1.stats    ├── contig2.stats -    ├── variants.normalized.html -    ├── variants.normalized.stats    ├── variants.raw.html    └── variants.raw.stats

- +
@@ -489,10 +688,6 @@

- - - - @@ -532,7 +727,7 @@

By default, Harpy runs mpileup with these parameters (excluding inputs and outputs):

-
bcftools mpileup --region contigname:START-END --annotate AD --output-type b
+
bcftools mpileup --region contigname:START-END --annotate AD --output-type b --ploidy ploidy

The mpileup module of bcftools has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the mpileup documentation to explore ways to configure your mpileup run.

@@ -541,7 +736,7 @@

By default, Harpy runs freebayes with these parameters (excluding inputs and outputs):

-
freebayes -f genome.fasta -L bam.list -p ploidy
+
freebayes -f genome.fasta -r contigname:START-END -L bam.list -p ploidy

Freebayes has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the freebayes documentation to explore ways to configure your freebayes run.

diff --git a/modules/sv/leviathan/index.html b/modules/sv/leviathan/index.html index 5b7e6569e..2d80be891 100644 --- a/modules/sv/leviathan/index.html +++ b/modules/sv/leviathan/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -239,9 +239,21 @@

  • at least 4 cores/threads available
  • -
  • sequence alignments, in .bam format
  • -
  • genome assembly in FASTA format
  • -
  • (optional) sample grouping file (see below)
  • +
  • sequence alignments, in BAM format: + .bam +
  • +
  • genome assembly in FASTA format: + .fasta + + .fa + + .fasta.gz + + .fa.gz +
  • +
  • + optional + sample grouping file (see below)
@@ -258,24 +270,32 @@

EMA-mapped reads

Leviathan relies on split-read information in the sequence alignments to call variants. The EMA aligner does not report split read alignments, instead it reports secondary alignments. It is recommended to use -BWA-generated alignments if intending to call variants with leviathan.

+BWA- or Minimap2-generated alignments if intending to call variants with leviathan.

- - + +

This file is optional and only useful if you want variant calling to happen on a per-population level.

    -
  • takes the format of sample<tab>group +
  • takes the format of sample<tab>group (a hypothetical example follows this list)
    • spaces can be used as delimiters too
  • the groups can be numbers or text (i.e. meaningful population names)
  • you can comment out lines with # for Harpy to ignore them
  • -
  • create with harpy popgroup -d <samplefolder> or manually
  • -
  • if created with harpy popgroup, all the samples will be assigned to group pop1 +
  • create with + harpy popgroup + or manually
  • +
  • if created with + harpy popgroup +, all the samples will be assigned to group pop1
    • make sure to edit the second column to reflect your data correctly.
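A hypothetical grouping file might look like the one below, with tabs (or spaces) separating sample names from group names and one line commented out so Harpy ignores it:

sample_01	pop1
sample_02	pop1
sample_03	pop2
#sample_04	pop2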
    @@ -311,10 +331,14 @@
    known quirk
    -

    After reads have been aligned, e.g. with harpy align, you can use those alignment files +

    After reads have been aligned, e.g. with align bwa, you can use those alignment files (.bam) to call structural variants in your data using LEVIATHAN. To make sure your data will work seamlessly with LEVIATHAN, the alignments in the input BAM files should end with a BX:Z:AxxCxxBxxDxx tag. Use preflight bam if you want to double-check file format validity.

    usage
    @@ -330,9 +354,13 @@

    Running Options

    -

    In addition to the common runtime options, the harpy sv leviathan module is configured using these command-line arguments:

    +

    In addition to the + common runtime options +, the + sv leviathan + module is configured using these command-line arguments:

    -

itemvcf file produced from variant calling, contains all samples and loci
variants.normalized.bcfleft-aligned (parsimonious) variants with multiallelic sites decomposed and duplicates removed
variants.*.bcf.csi index file for variants.*.bcf
+
@@ -357,8 +385,8 @@

- - + + @@ -467,7 +495,7 @@

└── sample2.sv.stats
-

-| -g | file path | conditionally | Genome assembly for phasing bam files |
+| -g | file path | yes | Genome assembly that was used to create alignments |
--populations
+
diff --git a/modules/sv/naibr/index.html b/modules/sv/naibr/index.html index cbeaeee6f..6f399498c 100644 --- a/modules/sv/naibr/index.html +++ b/modules/sv/naibr/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -239,25 +239,47 @@

  • at least 4 cores/threads available
  • -
  • sequence alignments, in .bam format
  • -
  • genome assembly in FASTA format
  • -
  • (optional) phased VCF file
  • -
  • (optional) sample grouping file (see below)
  • +
  • sequence alignments, in BAM format: + .bam +
  • +
  • genome assembly in FASTA format: + .fasta + + .fa + + .fasta.gz + + .fa.gz +
  • +
  • + optional + phased VCF file
  • +
  • + optional + sample grouping file (see below)
- - + +

This file is optional and only useful if you want variant calling to happen on a per-population level.

    -
  • takes the format of sample<tab>group +
  • takes the format of sample + tab +group
    • spaces can be used as delimiters too
  • the groups can be numbers or text (i.e. meaningful population names)
  • you can comment out lines with # for Harpy to ignore them
  • -
  • create with harpy popgroup -d <samplefolder> or manually
  • -
  • if created with harpy popgroup, all the samples will be assigned to group pop1 +
  • create with + harpy popgroup + or manually
  • +
  • if created with + harpy popgroup +, all the samples will be assigned to group pop1
    • make sure to edit the second column to reflect your data correctly.
    @@ -293,7 +315,9 @@
    known quirk
    -

    After reads have been aligned, e.g. with harpy align, you can use those alignment files +

    After reads have been aligned, e.g. with + align bwa +, you can use those alignment files (.bam) to call structural variants in your data using NAIBR. While our testing shows that NAIBR tends to find known inversions that LEVIATHAN misses, the program requires haplotype phased bam files as input. That means the alignments have a PS or HP tag that indicate @@ -317,9 +341,13 @@

    Running Options

    -

    In addition to the common runtime options, the harpy sv naibr module is configured using these command-line arguments:

    +

    In addition to the + common runtime options +, the + sv naibr + module is configured using these command-line arguments:

    -

item
+
@@ -438,7 +466,9 @@

In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. -The --vcf option is optional and not used by NAIBR. However, to use harpy sv naibr with +The --vcf option is optional and not used by NAIBR. However, to use + sv naibr + with bam files that are not phased, you will need to include --vcf, which Harpy uses with whatshap haplotag to phase your input BAM files prior to variant calling. See the whatshap documentation for more details on that process.
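If you are unsure whether your alignments are already phased, a quick spot-check is to scan the first few thousand records for a phasing tag (a sketch; input.bam is a placeholder):

# look for a PS or HP tag in the first 5000 alignment records
samtools view input.bam | head -n 5000 | grep -m 1 -E 'PS:i:|HP:i:' || echo "no phasing tags found; consider providing --vcf"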

@@ -449,20 +479,24 @@

This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various -ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode -information. The resulting phased VCF file can then be used as input here. For reasons unclear to us, we have found -NAIBR to have better detection of known structural variants when using BWA alignments that have been back-phased with -a phased VCF file than using alignments that were phased when mapped with EMA. That makes this process longer and more -circuitous (see the workflow diagram), but the results were noticeably better.

+ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode information. The resulting phased VCF file can then be used as input here. Your VCF file should be filtered in some capacity to keep high quality data.

---
 title: Calling variants with NAIBR, starting with unphased alignments
 ---
 graph LR
     aln[alignments]-->|harpy snp|snps([SNPs])
     snps-->|bcftools filter -i 'QUAL>95' ...|filt([filtered SNPs])
-    filt-->|harpy phase|phasesnp([phased haplotypes])
+    filt-->|harpy phase|phasesnp([phased haplotypes])  
+    subgraph id1 ["Harpy does this part"]
     phasesnp-->|whatshap haplotag|aln2
-    aln2([phased alignments])-->|NAIBR|results((structural variants))
+ aln2([phased alignments])-->|NAIBR|results((structural variants)) + end + style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px +

@@ -480,18 +514,19 @@

This fork includes improved accuracy as well as quality-of-life updates.

graph LR
     subgraph id1 ["Phase"]
-    aln[unphased BAM alignments]-->phased([phased alignments])
+    aln[unphased alignments]---vcf[phased VCF]
     end
+    id1-->phased([phased alignments])
     subgraph id2 ["Population calling"]
     popsplit([merge by population])
     end
-    id1-->id2
+    phased-->id2
     popsplit-->A
-    id1-->A
+    phased-->A
     A([index alignments]) --> B([NAIBR])
     Z([create config file]) --> B
     popsplit --> Z
-    id1 --> Z
+    phased --> Z
     B-->C([generate reports])
     style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
     style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
@@ -524,7 +559,8 @@

├── sample1.vcf └── sample2.vcf -

-| item | description |
+| item | description |
+|:--------------|:-------------------------------------------------------------|
+| *.bedpe | structural variants identified by NAIBR |
+| configs/ | the configuration files harpy generated for each sample |
{"id":"KHix/2G/7tW210/fnu0zhLSb14v3lGeMY7+","key":"0L65KIYXNF1I6elJ90lFeYnwrs4yvsyLsq/peAaRiBk.dXfaXqi2YT+h3kvFglAPEdE+c8YWjxE9nZLrJYg25EWSErIeGGO+XMKz/cj4UlFYf3QLFKu2mbkb59/MhYILEg.45","base":"/HARPY/","host":"pdimens.github.io","version":"1.0.0","useRelativePaths":true,"documentName":"index.html","appendDocumentName":false,"trailingSlash":true,"preloadSearch":false,"cacheBustingToken":"3.5.0.770859684227","cacheBustingStrategy":"query","sidebarFilterPlaceholder":"Filter","toolbarFilterPlaceholder":"Filter","showSidebarFilter":true,"filterNotFoundMsg":"No member names found containing the query \"{query}\"","maxHistoryItems":15,"homeIcon":"","access":[{"value":"public","label":"Public"},{"value":"protected","label":"Protected"}],"toolbarLinks":[{"id":"fields","label":"Fields"},{"id":"properties","label":"Properties"},{"id":"methods","label":"Methods"},{"id":"events","label":"Events"}],"sidebar":[{"n":"/","l":"Home","s":""},{"n":"install","l":"Install","s":""},{"n":"modules","l":"Modules","c":false,"i":[{"n":"demultiplex","l":"Demultiplex","s":""},{"n":"preflight","l":"Preflight","s":""},{"n":"qc","l":"QC","s":""},{"n":"align","l":"Align","c":false,"i":[{"n":"bwa","l":"BWA","s":""},{"n":"ema","l":"EMA","s":""},{"n":"minimap","l":"Minimap","s":""}],"s":""},{"n":"simulate","l":"Simulate","c":false,"i":[{"n":"simulate-linkedreads","l":"Linked Reads","s":""},{"n":"simulate-variants","l":"Variants","s":""}],"s":""},{"n":"snp","l":"SNP","s":""},{"n":"sv","l":"SV","c":false,"i":[{"n":"leviathan","l":"Leviathan","s":""},{"n":"naibr","l":"Naibr","s":""}],"s":""},{"n":"impute","l":"Impute","s":""},{"n":"phase","l":"Phase","s":""},{"n":"other","l":"Other","s":""}],"s":""},{"n":"haplotagdata","l":"Haplotag data","s":""},{"n":"commonoptions","l":"Common Options","s":""},{"n":"issues","l":"Common Issues","s":""},{"n":"snakemake","l":"Sneaky Snakemake","s":""},{"n":"software","l":"Software","s":""},{"n":"development","l":"Development","s":""}],"search":{"mode":1,"minChars":2,"maxResults":20,"placeholder":"Search","hotkeys":["k"],"noResultsFoundMsg":"Sorry, no results found.","recognizeLanguages":true,"languages":[0],"preload":false},"resources":{"History_Title_Label":"History","History_ClearLink_Label":"Clear","History_NoHistory_Label":"No history items","API_AccessFilter_Label":"Access","API_ParameterSection_Label":"PARAMETERS","API_SignatureSection_Label":"SIGNATURE","API_CopyHint_Label":"Copy","API_CopyNameHint_Label":"Copy name","API_CopyLinkHint_Label":"Copy link","API_CopiedAckHint_Label":"Copied!","API_MoreOverloads_Label":"more","API_MoreDropdownItems_Label":"More","API_OptionalParameter_Label":"optional","API_DefaultParameterValue_Label":"Default value","API_InheritedFilter_Label":"Inherited","Search_Input_Placeholder":"Search","Toc_Contents_Label":"Contents","Toc_RelatedClasses_Label":"Related Classes","History_JustNowTime_Label":"just now","History_AgoTime_Label":"ago","History_YearTime_Label":"y","History_MonthTime_Label":"mo","History_DayTime_Label":"d","History_HourTime_Label":"h","History_MinuteTime_Label":"m","History_SecondTime_Label":"s"}}; diff --git a/resources/js/search.json b/resources/js/search.json index 03ca41c45..6ad60b90b 100644 --- a/resources/js/search.json +++ b/resources/js/search.json @@ -1 +1 @@ -[[{"i":"#","p":["Using Harpy to process your haplotagged data"]},{"l":"Home","p":["Harpy is a haplotagging data processing pipeline for Linux-based systems. 
It uses all the magic of Snakemake under the hood to handle the worklfow decision-making, but as a user, you just interact with it like a normal command-line"]},{"i":"what-is-haplotagging","l":"What is haplotagging?","p":["Linked-read sequencing exists to combine the throughput and accuracy of short-read sequencing with the long range haplotype information of long-read sequencing. Haplotagging is an implementation of linked-read sequencing developed by"]},{"l":"Harpy Modules","p":["Harpy is modular, meaning you can use different parts of it independent from each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called by"]},{"l":"Using Harpy","p":["You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage (e.g."]}],[{"l":"Install HARPY","p":["Harpy is now hosted on Bioconda! That means to install it, you just need to have mamba(or conda) on your Linux-based system and install it with a simple command. You can install Harpy into an existing environment or create a new one for it (recommended)."]}],[{"i":"#","p":["Demultiplex raw sequences into haplotag barcoded samples"]},{"l":"Demultiplex Raw Sequences","p":["When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. These files contain sequences for all of your samples and need to be demultiplexed using barcodes to"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy demultiplex module is configured using these command-line arguments:"]},{"l":"Haplotag Types"},{"l":"Gen I Demultiplex Workflow"}],[{"i":"#","p":["Run file format checks on haplotagged FASTQ/BAM files"]},{"l":"Pre-flight checks for input files","p":["Harpy does a lot of stuff with a lot of software and each of these programs expect the incoming data to follow particular formats (plural, unfortunately). These formatting opinions/specifics are at the mercy of the original developers and while there are times when Harpy can (and does)"]},{"l":"when to run"},{"l":"Running Options","p":["In addition to the common runtime options, the harpy preflight fastq|bam module is configured using these command-line arguments:"]},{"l":"Workflow"}],[{"i":"#","p":["Quality trim haplotagged sequences with Harpy"]},{"l":"Quality Trim Sequences","p":["Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analyses is to remove these adapters and trim poor quality data. You can remove adapters"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy qc module is configured using these command-line arguments:"]},{"l":"QC Workflow"}],[{"i":"#","p":["Align haplotagged sequences with BWA MEM"]},{"l":"Map Reads onto a genome with BWA MEM","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. 
This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align bwa module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["Harpy uses samtools markdup to mark putative PCR duplicates. By using the --barcode-tag BX option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate"]},{"l":"BWA workflow"}],[{"i":"#","p":["Align haplotagged sequences with EMA"]},{"l":"Map Reads onto a genome with EMA","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align ema module is configured using these command-line arguments:"]},{"l":"Barcode whitelist","p":["Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode \"whitelist.\" This file is a simple text file that has one barcode per line so a given software knows what barcodes to expect in your data."]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["EMA marks duplicates in the resulting alignments, however the read with invalid barcodes are aligned separately with BWA. Therefore, Harpy uses samtools markdup to mark putative"]},{"l":"EMA workflow"}],[{"i":"#","p":["Align haplotagged sequences with Minimap2"]},{"l":"Map Reads onto a genome with Minimap2","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align minimap module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["Harpy uses samtools markdup to mark putative PCR duplicates. By using the --barcode-tag BX option, it considers the linked-read barcode for more accurate duplicate detection. 
Duplicate"]},{"l":"Minimap workflow"}],[{"i":"#","p":["Simulate snps, indels, inversions, cnv, translocations"]},{"l":"Simulate Genomic Variants","p":["Simulate snps, indels, inversions, cnv, translocations"]},{"l":"Modules","p":["There are 4 submodules with very obvious names:"]},{"l":"Running Options","p":["While there are serveral differences between the submodule command line options, each has available all the common runtime options like other Harpy modules. Each requires and input genome at the"]},{"l":"Simulate known variants","p":["Rather than simulating random variants, you can use a VCF file as input to any of the submodules to have simuG simulate the variants (of that type) from the VCF file. This becomes particularly"]},{"l":"Heterozygosity","p":["Each submodule has a --heterozygosity parameter where you can specify the heterozygosity of an intended diploid genome, should you use the resulting VCF(s) to simulate variants again."]},{"l":"Simulate Diploid Assembly","p":["Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. If you haven't already, please read the sections about simulating known variants"]},{"l":"Step 1","p":["Simulate random variants onto your haploid assembly with --heterozygosity(-z) set above 0. We aren't interested in the resulting genome, but rather the positions of the variants"]},{"l":"Step 2","p":["Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype"]},{"l":"Step 3","p":["Use the one of the new genome haplotypes for simulating other kinds of variants. Again, use --heterozygosity(-z) with a value greater than 0. Like Step 1, we're only interested in the haplotype VCF files (positions of variants) and not the resulting"]},{"l":"Step 4","p":["Use the resulting haplotype VCFs to simulate known variants onto the haplotype genomes from Step 2."]},{"l":"Step 5","p":["Repeat Step 3 and Step 4 to your heart's content."]}],[{"i":"#","p":["Call SNPs and small indels"]},{"l":"Call SNPs and small indels","p":["After reads have been aligned, e.g., with harpy align, you can use those alignment files(.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy snp module is configured using these command-line arguments:"]},{"l":"windowsize","p":["To speed things along, Harpy will call variants in parallel on different contig intervals, then merge everything at the end. You can control the level of parallelization by using"]},{"l":"populations","p":["Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include"]},{"l":"SNP calling workflow"}],[{"i":"#","p":["Call structural variants using Leviathan"]},{"l":"Call Structural Variants using LEVIATHAN","p":["(like indels, insertions, duplications, breakends)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv leviathan module is configured using these command-line arguments:"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. 
Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"LEVIATHAN workflow"}],[{"i":"#","p":["Call structural variants using NAIBR (plus)"]},{"l":"Call Structural Variants using NAIBR","p":["(like indels, insertions, duplications)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv naibr module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still considered as originating from the same DNA molecule. See"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"optional vcf file","p":["In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. The --vcf option is optional and not used by NAIBR. However, to use harpy sv naibr"]},{"i":"a-phased-input---vcf","l":"a phased input --vcf","p":["This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode"]},{"l":"NAIBR workflow"}],[{"i":"#","p":["Impute genotypes for haplotagged data with Harpy"]},{"l":"Impute Genotypes using Sequences","p":["After variants have been called, you may want to impute missing genotypes to get the most from your data. Harpy uses STITCH to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy impute module is configured using these command-line arguments:"]},{"l":"Extra STITCH parameters","p":["You may add additional parameters to STITCH by way of the--extra-params(or -x) option. Since STITCH is a function in the R language, the parameters you add must be in R syntax (e.g."]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Parameter file","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). 
The solution Harpy uses for this is to have the user"]},{"l":"STITCH Parameters"},{"l":"Imputation Workflow"}],[{"i":"#","p":["Phase haplotypes for haplotagged data with Harpy"]},{"l":"Phase SNPs into Haplotypes","p":["You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy phase module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Molecule distance","p":["The molecule distance refers to the base-pair distance dilineating separate molecules. In other words, when two alignments on a single contig share the same barcode, how far away from each other are we willing to say they were and still consider them having"]},{"l":"Pruning threshold","p":["The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing low-confidence SNPs from consideration. With Harpy, you configure this value as an integer"]},{"l":"Phasing Workflow"}],[{"i":"#","p":["Generate extra files for analysis with Harpy"]},{"l":"Other Harpy modules","p":["Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules: The arguments represent different sub-commands and can be run in any order or combination to generate the files you need."]},{"l":"Other modules"},{"l":"popgroup"},{"l":"Sample grouping file for variant calling"},{"l":"arguments","p":["This file is entirely optional and useful if you want SNP variant calling to happen on a per-population level via harpy snp ... -p or on samples pooled-as-populations via harpy sv ... -p"]},{"l":"stitchparams"},{"l":"STITCH parameter file"},{"i":"arguments-1","l":"arguments","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model"]},{"l":"hpc"},{"l":"HPC cluster profile"},{"i":"arguments-2","l":"arguments","p":["For snakemake to work in harmony with an HPC scheduler, a \"profile\" needs to be provided that tells Snakemake how it needs to interact with the HPC scheduler to submit your jobs to the cluster. Using"]}],[{"l":"Haplotag data"},{"l":"Data Format"},{"l":"Barcodes","p":["While barcodes are actually combinatorial bases, in the read headers they are represented with the format AxxCxxBxxDxx, where each barcode segment is denoted as Axx(or Bxx, etc.)."]},{"l":"barcode protocol varieties","p":["If you think haplotagging is as simple as exactly 96^4 unique barcodes, you would only be half-correct. The original haplotagging protocol in Meier et al. is good, but the authors (and others) have been working to improve this linked-read technology to improve"]},{"l":"where the barcodes go","p":["Chromium 10X linked-reads have a particular format where the barcode is the leading 16 bases of the read. However, haplotagging data does not use that format, nor do the tools implemented in Harpy work correctly with it. 
Once demultiplexed, haplotagging sequences should look"]},{"l":"Read headers","p":["Like mentioned, the haplotag barcode is expected to be stored in the BX:Z: tag in the read header. This information is retained through the various Harpy steps. An example read header could look like:"]},{"l":"Read length","p":["Reads must be at least 30 base pairs in length for alignment. The qc module removes reads <50bp."]},{"l":"Compression","p":["Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension"]},{"l":"Naming conventions","p":["Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accomodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations."]},{"l":"Barcode thresholds","p":["By the nature of linked read technologies, there will (almost always) be more DNA fragments than unique barcodes for them. As a result, it's common for barcodes to reappear in sequences. Rather than incorrectly assume that all sequences/alignments with the same barcode"]}],[{"l":"Common Harpy Options"},{"l":"Input Arguments","p":["Each of the main Harpy modules (e.g. qc or phase) follows the format of"]},{"l":"Common command-line options","p":["Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments,"]},{"l":"The workflow folder","p":["When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and is very useful to understand what the module did, be it for your own"]},{"l":"The Genome folder","p":["You will notice that many of the workflows will create a Genome folder in the working directory. This folder is to make it easier for Harpy to store the genome and the associated"]}],[{"l":"Common Issues","p":["Lots of stuff can go wrong during an analysis. The intent of this page is to highlight common issues you may experience during analysis and ways to address these issues."]},{"l":"Problem installing with conda","p":["Conda is an awesome package manager, but it's slow and uses a ton of memory as dependencies increase. Harpy has a lot of dependencies and you might stall out conda trying to install it. Use mamba instead-- it'll work where conda fails."]},{"l":"Failures during imputation or phasing","p":["If you use bamutils clipOverlap on alignments that are used for the impute or phase modules, they will cause both programs to error. We don't know why, but they do."]},{"i":"alignment-file-name-and-id-tag-mismatch","l":"Alignment file name and ID: tag mismatch","p":["Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) into the alignment header (the @RG ID:name SM:name tag). It likewise expects, through various steps,"]}],[{"l":"Adding Snakamake parameters","p":["Harpy relies on Snakemake under the hood to handle file and job dependencies. Most of these details have been abstracted away from the end-user, but every module of Harpy (except"]},{"l":"Common use cases","p":["You likely wont need to invoke --snakemake very often, if ever. 
However, here are common use cases for this parameter."]}],[{"l":"Software used in Harpy","p":["HARPY is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in HARPY's many moving pieces."]}],[{"l":"Developing Harpy","p":["Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so."]},{"l":"Installing Harpy for development","p":["The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the misc/buildlocal.sh script to move all the necessary files to the"]},{"i":"harpys-components","l":"Harpy's components"},{"l":"source code","p":["Harpy runs in two stages:"]},{"l":"Bioconda recipe","p":["For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and build script is also stored in"]},{"l":"The Harpy repository"},{"l":"structure","p":["Harpy exists as a Git repository and has 5 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no"]},{"l":"development workflow","p":["The dev workflow is reasonably standard:"]},{"l":"Automations"},{"l":"Testing","p":["CI ( C ontinuous I ntegration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the"]},{"l":"Releases","p":["To save on disk space, there is an automation to strip out the unnecessary files and upload a cleaned tarball to the new release. This is triggered automatically when a new version is tagged."]}]] \ No newline at end of file +[[{"i":"#","p":["Using Harpy to process your haplotagged data"]},{"l":"Home","p":["Harpy is a haplotagging data processing pipeline for Linux-based systems. It uses all the magic of Snakemake under the hood to handle the worklfow decision-making, but as a user, you just interact with it like a normal command-line"]},{"i":"what-is-haplotagging","l":"What is haplotagging?","p":["Linked-read sequencing exists to combine the throughput and accuracy of short-read sequencing with the long range haplotype information of long-read sequencing. Haplotagging is an implementation of linked-read sequencing developed by"]},{"l":"Harpy Modules","p":["Harpy is modular, meaning you can use different parts of it independent from each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called by"]},{"l":"Using Harpy","p":["You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage (e.g."]}],[{"l":"Install HARPY","p":["Harpy is now hosted on Bioconda! That means to install it, you just need to have mamba(or conda) on your Linux-based system and install it with a simple command. 
You can install Harpy into an existing environment or create a new one for it (recommended)."]}],[{"i":"#","p":["Demultiplex raw sequences into haplotag barcoded samples"]},{"l":"Demultiplex Raw Sequences","p":["When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. These files contain sequences for all of your samples and need to be demultiplexed using barcodes to"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Haplotag Types"},{"l":"Gen I Demultiplex Workflow"}],[{"i":"#","p":["Run file format checks on haplotagged FASTQ/BAM files"]},{"l":"Pre-flight checks for input files","p":["Harpy does a lot of stuff with a lot of software and each of these programs expect the incoming data to follow particular formats (plural, unfortunately). These formatting opinions/specifics are at the mercy of the original developers and while there are times when Harpy can (and does)"]},{"l":"when to run"},{"l":"Running Options","p":["In addition to the , the and modules are configured using only command-line input arguments:"]},{"l":"Workflow"}],[{"i":"#","p":["Quality trim haplotagged sequences with Harpy"]},{"l":"Quality Trim Sequences","p":["Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analyses is to remove these adapters and trim poor quality data. You can remove adapters"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"QC Workflow"}],[{"i":"#","p":["Align haplotagged sequences with BWA MEM"]},{"l":"Map Reads onto a genome with BWA MEM","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["Harpy uses samtools markdup to mark putative PCR duplicates. By using the --barcode-tag BX option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate"]},{"l":"BWA workflow"}],[{"i":"#","p":["Align haplotagged sequences with EMA"]},{"l":"Map Reads onto a genome with EMA","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Barcode whitelist","p":["Some linked-read methods (e.g. 
10x, Tellseq) require the inclusion of a barcode \"whitelist.\" This file is a simple text file that has one barcode per line so a given software knows what barcodes to expect in your data."]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["EMA marks duplicates in the resulting alignments, however the read with invalid barcodes are aligned separately with BWA. Therefore, Harpy uses samtools markdup to mark putative"]},{"l":"EMA workflow"}],[{"i":"#","p":["Align haplotagged sequences with Minimap2"]},{"l":"Map Reads onto a genome with Minimap2","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"Marking PCR duplicates","p":["Harpy uses samtools markdup to mark putative PCR duplicates. By using the --barcode-tag BX option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate"]},{"l":"Minimap workflow"}],[{"i":"#","p":["Simulate linked reads from a genome"]},{"l":"Simulate Linked Reads","p":["Simulate linked reads from a genome"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Mutation Rate","p":["The read simulation is two-part: first dwgsim generates forward and reverse FASTQ files from the provided genome haplotypes( HAP1_GENOME and HAP2_GENOME), then LRSIM takes over and creates linked-reads from that. The"]},{"l":"Simulating a single sample","p":["If you intend to simulate a \"single individual\" (i.e. use this module once), then you might want no additonal SNPs beyond the variants you may have already introduced into the genome and set"]},{"l":"Simulating multiple samples","p":["If you intend on simulating \"multiple individuals\" (i.e. use this module multiple times on the same genome haplotypes), it may make sense to set this value larger than 0 so there is some \"natural\" variation between your simulated individuals."]},{"l":"Partitions","p":["TL;DR: 10X partitions ≈ haplotag beads"]},{"l":"Barcodes","p":["Barcodes, if provided, must be given as 16-basepair nucleotide sequences, one per line. If not provided, Harpy will download the standard 10X Genomics 4M-with-alts-february-2016.txt"]},{"l":"10X to Haplotag conversion","p":["Harpy will convert the simulated 10X-style reads, where the 16-basepair barcode is at the beginning of read 1, to haplotag format, where the barcode is coded in the sequence header under the"]},{"l":"Choosing parameters","p":["LRSIM does internal calculations to determine the number of reads per molecule based on --read-pairs,--partitions, and --molecules-per. 
Understanding how these parameters affect the resulting sequences"]},{"l":"Parameter calculator","p":["Conveniently, we provide a calculator to help you make informed decisions for these parameters:"]},{"l":"Simulate Linkedreads Workflow"}],[{"i":"#","p":["Simulate snps, indels, inversions, cnv, translocations"]},{"l":"Simulate Genomic Variants","p":["Simulate snps, indels, inversions, cnv, translocations"]},{"l":"Modules","p":["There are 4 submodules with very obvious names:"]},{"l":"Running Options","p":["While there are serveral differences between the submodule command line options, each has available all the like other Harpy modules. Each requires and input genome at the end of the command line, and each requires either a"]},{"l":"Simulate known variants","p":["Rather than simulating random variants, you can use a VCF file as input to any of the submodules to have simuG simulate the variants (of that type) from the VCF file. This becomes particularly"]},{"l":"Heterozygosity","p":["Each submodule has a --heterozygosity parameter where you can specify the heterozygosity of an intended diploid genome, should you use the resulting VCF(s) to simulate variants again."]},{"l":"Simulate Diploid Assembly","p":["Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. Due to the roundabout complexity of the process, attempts were made to use color to help keep track of the"]},{"l":"Step 1","p":["Simulate random variants onto your haploid assembly with --heterozygosity(-z) set above 0. We aren't interested in the resulting genome, but rather the positions of the variants"]},{"l":"Step 2","p":["Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype"]},{"l":"Step 3","p":["Use the one of the new genome haplotypes for simulating other kinds of variants. Again, use --heterozygosity(-z) with a value greater than 0. Like Step 1, we're only interested in the haplotype VCF files (positions of variants) and not the resulting"]},{"l":"Step 4","p":["Use the resulting haplotype VCFs to simulate known variants onto the haplotype genomes from Step 2."]},{"l":"Step 5","p":["Repeat Step 3 and Step 4 to your heart's content."]}],[{"i":"#","p":["Call SNPs and small indels"]},{"l":"Call SNPs and small indels","p":["After reads have been aligned, e.g., with , you can use those alignment files(.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup or with"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"regions","p":["The --regions(-r) option lets you specify the genomic regions you want to call variants on. Keep in mind that mpileup uses 1-based positions for genomic intervals, whereas freebayes"]},{"l":"populations","p":["Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include"]},{"l":"Filtering Variants","p":["The discussion around filtering SNPs and indels is massive and many researchers go about it differently, each very opinionated as to why their method is the best. As a starting point, have a look at how the authors of"]},{"i":"genotype-quality-qual","l":"genotype quality (QUAL)","p":["You will obviously want higher quality genotype calls to remove false positives. 
The HTSlib guide suggests at least 50(e.g. -i 'QUAL=50'), but we typically filter much higher at"]},{"i":"read-depth-dp","l":"read depth (DP)","p":["Variant sites with too few reads backing up the genotype might be false positives, although this may not hold true for very low-coverage data. Conversely, a maximum cut off is important because sites with very high read depths (relative to the distribution of read depth)"]},{"i":"minor-allele-frequency-maf","l":"minor allele frequency (MAF)","p":["It's usually advisable to set a minor allele frequency threshold with which to remove sites below that threshold. The reasoning is that if a MAF is too low, it might be because of incorrectly called genotypes in a very small handful of individuals (e.g. one or two)."]},{"i":"missing-data-f_missing","l":"missing data (F_MISSING)","p":["Missing data is, frankly, not terribly useful. The amount of missing data you're willing to tolerate will depend on your study, but it's common to remove sites with >20% missing data (e.g."]},{"l":"SNP calling workflow"}],[{"i":"#","p":["Call structural variants using Leviathan"]},{"l":"Call Structural Variants using LEVIATHAN","p":["(like indels, insertions, duplications, breakends)"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"LEVIATHAN workflow"}],[{"i":"#","p":["Call structural variants using NAIBR (plus)"]},{"l":"Call Structural Variants using NAIBR","p":["(like indels, insertions, duplications)"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still considered as originating from the same DNA molecule. See"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"optional vcf file","p":["In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. The --vcf option is optional and not used by NAIBR. However, to use with"]},{"i":"a-phased-input---vcf","l":"a phased input --vcf","p":["This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use to phase your SNPs into haplotypes using the haplotag barcode information. 
The resulting phased VCF file can then be used as input here."]},{"l":"NAIBR workflow"}],[{"i":"#","p":["Impute genotypes for haplotagged data with Harpy"]},{"l":"Impute Genotypes using Sequences","p":["After variants have been called, you may want to impute missing genotypes to get the most from your data. Harpy uses STITCH to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Extra STITCH parameters","p":["You may add additional parameters to STITCH by way of the--extra-params(or -x) option. Since STITCH is a function in the R language, the parameters you add must be in R syntax (e.g."]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the INPUTS, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Parameter file","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user"]},{"l":"STITCH Parameters"},{"l":"Imputation Workflow"}],[{"i":"#","p":["Phase haplotypes for haplotagged data with Harpy"]},{"l":"Phase SNPs into Haplotypes","p":["You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing"]},{"l":"Running Options","p":["In addition to the , the module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the INPUTS, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Molecule distance","p":["The molecule distance refers to the base-pair distance dilineating separate molecules. In other words, when two alignments on a single contig share the same barcode, how far away from each other are we willing to say they were and still consider them having"]},{"l":"Pruning threshold","p":["The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing low-confidence SNPs from consideration. With Harpy, you configure this value as an integer"]},{"l":"Phasing Workflow"}],[{"i":"#","p":["Generate extra files for analysis with Harpy"]},{"l":"Other Harpy modules","p":["Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules:"]},{"l":"Other modules"},{"l":"popgroup","p":["Creates a sample grouping file for variant calling"]},{"l":"arguments","p":["This optional file is useful if you want SNP variant calling to happen on a per-population level via or on samples pooled-as-populations via ."]},{"l":"stitchparams","p":["Create a template parameter file for the module. The file is formatted correctly and serves as a starting point for using parameters that make sense for your study."]},{"i":"arguments-1","l":"arguments","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. 
The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model"]}],[{"l":"Haplotag data"},{"l":"Data Format"},{"l":"Barcodes","p":["While barcodes are actually combinatorial bases, in the read headers they are represented with the format AxxCxxBxxDxx, where each barcode segment is denoted as Axx(or Bxx, etc.)."]},{"l":"barcode protocol varieties","p":["If you think haplotagging is as simple as exactly 96^4 unique barcodes, you would only be half-correct. The original haplotagging protocol in Meier et al. is good, but the authors (and others) have been working to improve this linked-read technology to improve"]},{"l":"where the barcodes go","p":["Chromium 10X linked-reads use a format where the barcode is the leading 16 bases of the forward (R1) read. However, haplotagging data does not use that format and many of the tools"]},{"l":"Read headers","p":["Like mentioned, the haplotag barcode is expected to be stored in the BX:Z: tag in the read header. This information is retained through the various Harpy steps. An example read header could look like:"]},{"l":"Read length","p":["Reads must be at least 30 base pairs in length for alignment. By default, the module removes reads <30bp."]},{"l":"Compression","p":["Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension"]},{"l":"Naming conventions","p":["Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accomodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations."]},{"l":"Barcode thresholds","p":["By the nature of linked read technologies, there will (almost always) be more DNA fragments than unique barcodes for them. As a result, it's common for barcodes to reappear in sequences. Rather than incorrectly assume that all sequences/alignments with the same barcode"]}],[{"l":"Common Harpy Options"},{"l":"Input Arguments","p":["Each of the main Harpy modules (e.g. or ) follows the format of"]},{"l":"Common command-line options","p":["Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments,"]},{"l":"The workflow folder","p":["When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and is very useful to understand what the module did, be it for your own"]},{"l":"The Genome folder","p":["You will notice that many of the workflows will create a Genome folder in the working directory. This folder is to make it easier for Harpy to store the genome and the associated"]}],[{"l":"Common Issues","p":["Lots of stuff can go wrong during an analysis. The intent of this page is to highlight common issues you may experience during analysis and ways to address these issues."]},{"l":"Problem installing with conda","p":["Conda is an awesome package manager, but it's slow and uses a ton of memory as dependencies increase. Harpy has a lot of dependencies and you might stall out conda trying to install it. Use mamba instead-- it'll work where conda fails."]},{"l":"Failures during imputation or phasing","p":["If you use bamutils clipOverlap on alignments that are used for the or modules, they will cause both programs to error. 
We don't know why, but they do."]},{"i":"alignment-file-name-and-id-tag-mismatch","l":"Alignment file name and ID: tag mismatch","p":["Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) into the alignment header (the @RG ID:name SM:name tag). It likewise expects, through various steps,"]}],[{"l":"Adding Snakamake parameters","p":["Harpy relies on Snakemake under the hood to handle file and job dependencies. Most of these details have been abstracted away from the end-user, but every module of Harpy (except"]},{"l":"Common use cases","p":["You likely wont need to invoke --snakemake very often, if ever. However, here examples of some possible use cases for this parameter."]}],[{"l":"Software used in Harpy","p":["Harpy is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in Harpy's many moving pieces."]},{"l":"Standalone Software"},{"l":"Software Packages"}],[{"l":"Developing Harpy","p":["Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so."]},{"l":"Installing Harpy for development","p":["The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the resources/buildlocal.sh script to move all the necessary files to the"]},{"i":"harpys-components","l":"Harpy's components"},{"l":"source code","p":["Harpy runs in two stages:"]},{"l":"Bioconda recipe","p":["For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and build script is also stored in"]},{"l":"The Harpy repository"},{"l":"structure","p":["Harpy exists as a Git repository and has 5 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no"]},{"l":"development workflow","p":["The dev workflow is reasonably standard:"]},{"l":"containerization","p":["As of Harpy v1.0, the software dependencies that the Snakemake workflows use are pre-configured as a Docker image that is uploaded to Dockerhub. Updating or editing this container can be done automatically or manually."]},{"l":"automatically","p":["The testing GitHub Action will automatically create a Dockerfile with (a hidden harpy command) and build a new Docker container, then upload it to dockerhub with the latest tag. This process is triggered on"]},{"l":"manually","p":["The dockerfile for that container is created by using a hidden harpy command"]},{"l":"Automations"},{"l":"Testing","p":["CI ( C ontinuous I ntegration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the"]},{"l":"Releases","p":["There is an automation that gets triggered every time Harpy is tagged with the new version. It strips out the unnecessary files and will upload a cleaned tarball to the new release (reducing filesize by orders of magnitude). 
The automation will also"]}]] \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index afa74c8eb..89ed212b5 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/snakemake/index.html b/snakemake/index.html index df5fccf67..3870c0ab0 100644 --- a/snakemake/index.html +++ b/snakemake/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +

@@ -232,15 +232,15 @@

Harpy relies on Snakemake under the hood to handle file and job dependencies. Most of these details have been abstracted away from the end-user, but every -module of Harpy (except hpc, popgroup, and stitchparams) has an optional flag -s (--snakemake) +module of Harpy (except popgroup and stitchparams) has an optional flag --snakemake that you can use to augment the Snakemake workflow if necessary. Whenever you use this flag, your argument must be enclosed in quotation marks, for example:

-
harpy qc -d rawseq -s "--dry-run"
+
harpy qc --snakemake "--dry-run" rawseq

This means you can add several Snakemake arguments at once, as long as the entire thing is enclosed in quotes:

-
harpy qc -d rawseq -s "--dry-run --debug --shadow-prefix /scratch"
+
harpy qc --snakemake "--dry-run --debug --shadow-prefix /scratch" rawseq
@@ -266,8 +266,9 @@
reserved/forbidden arguments
  • --configfile
  • --rerun-incomplete
  • --nolock
  • -
  • --use-conda
  • --conda-prefix
  • +
  • --software-deployment-method
  • +
  • --rerun-triggers
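  These options are presumably reserved because Harpy already supplies them to Snakemake internally, so repeating them inside --snakemake would conflict with the workflow's own configuration. A minimal sketch of the distinction (the failing call is hypothetical; --keep-going is a standard Snakemake option that is not on the reserved list):

  # avoid: --nolock is on the reserved list above and is already set by Harpy
  harpy qc --snakemake "--nolock" rawseq

  # fine: --keep-going is not reserved and passes through to Snakemake untouched
  harpy qc --snakemake "--keep-going" rawseq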
  • @@ -279,7 +280,7 @@

    You likely won't need to invoke --snakemake very often, if ever. However, -here are common use cases for this parameter.

    +here are examples of some possible use cases for this parameter.

    @@ -294,7 +295,7 @@
    ahead of time. It's also useful for debugging during development. Here is an example of dry-running variant calling:

    -
    harpy snp mpileup -g genome.fasta  -d Align/ema -s "--dry-run"
    +
    harpy snp mpileup -g genome.fasta --snakemake "--dry-run" Align/ema
    @@ -303,7 +304,7 @@
    you want the beadtag report Harpy makes from the output of EMA count. To do this, just list the file/files (relative to your working directory) without any flags. Example for the beadtag report:

    -
    harpy align bwa -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html"
    +
    harpy align bwa -g genome.fasta -t 4 --snakemake "Align/ema/reports/bxstats.html" QC/

    This of course necessitates knowing the names of the files ahead of time. See the individual modules for a breakdown of expected outputs.
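    Because the quoted string is handed to Snakemake as extra command-line arguments, you should be able to request several target files at once by space-separating them inside the quotes. A sketch, where the second report path is purely illustrative (check the module's documented outputs for real file names):

    harpy align bwa -g genome.fasta -t 4 --snakemake "Align/ema/reports/bxstats.html Align/ema/reports/coverage.html" QC/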

    @@ -322,7 +323,7 @@
    you may use --shadow-prefix <dirname> where <dirname> is the path to the mandatory directory you need to work out of. By configuring this "shadow directory" setting, Snakemake will automatically move the files in/out of that directory for you:

    -
    harpy sv leviathan -g genome.fasta  -d Align/bwa --threads 8 -p samples.groups -s "--shadow-prefix /SCRATCH/username/"
    +
    harpy sv leviathan -g genome.fasta --threads 8 -p samples.groups --snakemake "--shadow-prefix /SCRATCH/username/" Align/bwa
    diff --git a/software/index.html b/software/index.html index 3557e4333..880da410d 100644 --- a/software/index.html +++ b/software/index.html @@ -4,7 +4,7 @@ - + @@ -12,7 +12,7 @@ Software used in Harpy | Harpy - + @@ -21,22 +21,22 @@ - + - + - + - + - +
    @@ -229,8 +229,15 @@

    Software used in Harpy

    -

    HARPY is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in HARPY's many moving pieces.

    +

    Harpy is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in Harpy's many moving pieces. +If any tools were missed, please let us know!

    Issues with specific tools might warrant a discussion with the authors/developers on the repositories of these projects.

    + +

    + # + Standalone Software +

    +
    argument
    @@ -257,10 +264,6 @@

    - - - - @@ -285,10 +288,18 @@

    + + + + + + + + @@ -305,45 +316,94 @@

    + + + + + + + + + + + + + + + + + + + + +
website, publication
click website
conda website
website, publication
LRSIM website, publication
mamba website
minimap2 website, publication
NAIBR website, fork, publication
website
samtools website
seqtk website
simuG website, publication
Snakemake website, publication
whatshap website, publication
    +
    + +

    + # + Software Packages +

    +
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - - - - - - - - - - - + - - - -
Package Language Links
click python website
pysam python website
r-biocircos R website
r-circlize R website, publication
r-highcharter R website (source), website (R-package)
r-tidyverse R website, publication
r-DT R website, js-website
rich python website
rich-click python website
samtools website
seqtk website
Snakemake website, publication
STITCH R website, publication
whatshap website, publication
diff --git a/static/lr_conversion.png b/static/lr_conversion.png new file mode 100644 index 000000000..009399333 Binary files /dev/null and b/static/lr_conversion.png differ diff --git a/static/lr_conversion.svg b/static/lr_conversion.svg new file mode 100644 index 000000000..cb64466f0 --- /dev/null +++ b/static/lr_conversion.svg @@ -0,0 +1,422 @@
[new figure: lr_conversion.svg — schematic of linked-read format conversion contrasting a 10X-style read (barcode inline at the start of the forward read) with a haplotag-style pair, where the @seq_id/1 and @seq_id/2 headers carry the barcode in TX:Z: and BX:Z: tags]
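To make the diagram concrete: assuming, as the figure suggests, that conversion moves the inline 10X barcode out of the sequence and into header tags (the original bases in TX:Z: and the equivalent AxxCxxBxxDxx haplotag barcode in BX:Z:), a converted read pair's headers might look like this, with all values purely illustrative:

@seq_id/1 TX:Z:ATACGAGTACAGATGGAGGAT BX:Z:A01C01B01D01
@seq_id/2 TX:Z:ATACGAGTACAGATGGAGGAT BX:Z:A01C01B01D01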