diff --git a/404.html b/404.html index 1fa5a0df3..0f77cd4b9 100644 --- a/404.html +++ b/404.html @@ -4,7 +4,7 @@ - + @@ -29,11 +29,11 @@ - + - + - +
@@ -54,7 +54,7 @@ Harpy - +
diff --git a/commonoptions/index.html b/commonoptions/index.html index 4647d872c..6b97b537e 100644 --- a/commonoptions/index.html +++ b/commonoptions/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
@@ -58,7 +58,7 @@ Harpy - +
@@ -230,6 +230,12 @@

Common Harpy Options

+ +

+ # + Common command-line options +

+

Every Harpy module has a series of configuration parameters. These are the arguments you provide to configure the module to run on your data, such as the directory with the reads/alignments, the genome assembly, etc. All main modules (e.g. qc) also share a series of common runtime @@ -258,6 +264,22 @@

Number of threads to use +--print-only + +toggle + +no +Perform internal validations, build the workflow environment, and print the resulting Snakemake command, but don't run anything + + +--skipreports +-r +toggle + +no +Skip the processing and generation of HTML reports in a workflow + + --snakemake -s string @@ -292,6 +314,49 @@

harpy align bwa -t 20 -d samples/trimmedreads -q + +
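The common options above can be combined to preview a workflow before committing resources to it. A minimal sketch, reusing the hypothetical `samples/trimmedreads` directory from the example above:

```bash
# dry-run: perform the validations and print the generated Snakemake
# command without executing anything
harpy align bwa -t 20 -d samples/trimmedreads --print-only

# real run, skipping the HTML report-generation steps
harpy align bwa -t 20 -d samples/trimmedreads --skipreports
```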

+ # + The workflow folder +

+
+

When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is
+both necessary for the module to run and useful for understanding what the module did, whether for your own
+records or as a point of reference when writing the Methods section of a manuscript. The presence of the folder
+and the contents therein also allow you to rerun the workflow manually. The workflow folder may contain the following:

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
itemcontentsutility
*.smkSnakefile with the full recipe of the workflowuseful for understanding the workflow
config.ymlConfiguration file generated from command-line arguments and consumed by the Snakefileuseful for bookkeeping
report/*.RmdRMarkdown files used to generate the fancy reportsuseful to understand math behind plots/tables or borrow code from
*.workflow.summaryPlain-text overview of the important parts of the workflowuseful for bookkeeping and writing Methods
+
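Because the workflow folder contains both the Snakefile and the generated `config.yml`, a run can be reproduced by hand. A sketch under assumptions (the module here is `harpy align bwa` with output directory `Align/bwa`, and the snakefile name and paths are illustrative, so check the actual contents of your workflow folder first):

```bash
# inspect what the module generated
ls Align/bwa/workflow/

# rerun the workflow manually, reusing the generated Snakefile and the
# config built from your original command-line arguments
snakemake --snakefile Align/bwa/workflow/align.smk \
          --configfile Align/bwa/workflow/config.yml \
          --cores 8
```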

# diff --git a/development/index.html b/development/index.html index ef8f3416a..be2036a46 100644 --- a/development/index.html +++ b/development/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + +
@@ -60,7 +60,7 @@ Harpy - +
diff --git a/haplotagdata/index.html b/haplotagdata/index.html index f9ae5ff5b..b9a8ef70b 100644 --- a/haplotagdata/index.html +++ b/haplotagdata/index.html @@ -4,7 +4,7 @@ - + @@ -32,11 +32,11 @@ - + - + - + @@ -59,7 +59,7 @@ Harpy - + diff --git a/index.html b/index.html index c0ffc99e8..776fb85d5 100644 --- a/index.html +++ b/index.html @@ -4,7 +4,7 @@ - + @@ -34,11 +34,11 @@ - + - + - +
@@ -59,7 +59,7 @@ Harpy - +
diff --git a/install/index.html b/install/index.html index 980bf30f3..f29165b98 100644 --- a/install/index.html +++ b/install/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
@@ -58,7 +58,7 @@ Harpy - +
diff --git a/issues/index.html b/issues/index.html index ce2ffe9df..dbbe5f256 100644 --- a/issues/index.html +++ b/issues/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
@@ -58,7 +58,7 @@ Harpy - +
diff --git a/modules/align/bwa/index.html b/modules/align/bwa/index.html index fbe5ac516..71c011dd4 100644 --- a/modules/align/bwa/index.html +++ b/modules/align/bwa/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -63,7 +63,7 @@ Harpy - + @@ -404,14 +404,10 @@

Align/bwa
 ├── Sample1.bam
 ├── Sample1.bam.bai
-├── align
-│   ├── Sample1.bam
-│   └── Sample1.bam.bai
 ├── logs
-│   ├── harpy.align.log
 │   └── markduplicates
 │       └── Sample1.markdup.log
-└── stats
+└── reports
     ├── bwa.stats.html
     ├── BXstats
     │   ├── Sample1.bxstats.html
@@ -442,47 +438,39 @@ 

sequence alignment indexes for each sample -align/*bam* -symlinks to the alignment files for snakemake purporses - - -logs/harpy.align.log -relevant runtime parameters for the align module - - logs/markduplicates everything sambamba markdup writes to stderr during operation -stats/ +reports/ various counts/statistics/reports relating to sequence alignment -stats/bwa.stats.html +reports/bwa.stats.html report summarizing samtools flagstat and stats results across all samples from multiqc -stats/reads.bxstats.html +reports/reads.bxstats.html interactive html report summarizing valid vs invalid barcodes across all samples -stats/BXstats/*.bxstats.html +reports/BXstats/*.bxstats.html interactive html report summarizing inferred molecule size -stats/coverage/*.html +reports/coverage/*.html summary plots of alignment coverage per contig -stats/coverage/data/*.gencov.gz +reports/coverage/data/*.gencov.gz output from samtools bedcov from all alignments, used for plots -stats/BXstats/ +reports/BXstats/ reports summarizing molecule size and reads per molecule -stats/BXstats/data/ +reports/BXstats/data/ tabular data containing the information used to generate the BXstats reports diff --git a/modules/align/ema/index.html b/modules/align/ema/index.html index 22063f5c0..61f9c449d 100644 --- a/modules/align/ema/index.html +++ b/modules/align/ema/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -63,7 +63,7 @@ Harpy - + @@ -299,6 +299,22 @@

Genome assembly for read mapping +--platform +-p +string +haplotag +yes +Linked read technology: haplotag or 10x + + +--whitelist +-w +file path + +no +Path to barcode whitelist (--platform 10x only) + + --directory -d folder path @@ -307,14 +323,6 @@

Directory with sample sequences ---molecule-distance --m -integer -100000 -no -Base-pair distance threshold to consider molecules as separate - - --ema-bins -e integer (1-1000) @@ -341,16 +349,16 @@

- +

- # - Molecule distance + # + Barcode whitelist

-

Unlike the manual MI:i assignment in the BWA workflow, the EMA aligner will assign -a unique Molecular Identifier MI:i tag to alignments using its own heuristics. -Instead, the EMA workflow uses this value to calculate statistics for the haplotag -barcodes identified in the alignments.

+

Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode "whitelist." This file is a
+simple text file that has one barcode per line so the software knows which barcodes to expect in your data.
+If you are processing 10x data, you will need to include the whitelist file (usually provided by 10x).
+Conveniently, haplotag data doesn't require this file.
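As a sketch of what such a whitelist looks like and how one might sanity-check it before a run (the barcodes and the 16-bp length below are illustrative assumptions, not values from a real 10x whitelist):

```python
# Sanity-check a barcode whitelist: one nucleotide barcode per line.
# The barcode length and example barcodes are illustrative assumptions.
def check_whitelist(lines, length=16):
    """Return the list of barcodes, raising ValueError on a malformed line."""
    barcodes = []
    for i, line in enumerate(lines, start=1):
        bc = line.strip()
        if not bc:
            continue  # tolerate blank lines
        if len(bc) != length or set(bc) - set("ACGT"):
            raise ValueError(f"line {i}: {bc!r} is not a {length}-bp ACGT barcode")
        barcodes.append(bc)
    return barcodes

# example: a tiny three-barcode whitelist
demo = ["AAACCCAAGAAACACT", "AAACCCAAGAAACCAT", "AAACCCAAGAAACCCA"]
print(check_whitelist(demo))
```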

# @@ -424,18 +432,14 @@

Align/ema
 ├── Sample1.bam
 ├── Sample1.bam.bai
-├── align
-│   ├── Sample1.bam
-│   └── Sample1.bam.bai
 ├── count
 │   └── Sample1.ema-ncnt
 ├── logs
-│   ├── harpy.align.log
 │   ├── markduplicates
 │   │   └── Sample1.markdup.nobarcode.log
 │   └── preproc
 │       └── Sample1.preproc.log
-└── stats
+└── reports
     ├── ema.stats.html
     ├── reads.bxcounts.html
     ├── BXstats
@@ -467,18 +471,10 @@ 

sequence alignment indexes for each sample -align/*bam* -symlinks to the alignment files for snakemake purporses - - count/ output of ema count -logs/harpy.align.log -relevant runtime parameters for the align module - - logs/markduplicates/ everything sambamba markdup writes to stderr during operation on alignments with invalid/missing barcodes @@ -487,39 +483,39 @@

everything ema preproc writes to stderr during operation -stats/ +reports/ various counts/statistics/reports relating to sequence alignment -stats/ema.stats.html +reports/ema.stats.html report summarizing samtools flagstat and stats results across all samples from multiqc -stats/reads.bxstats.html +reports/reads.bxstats.html interactive html report summarizing ema count across all samples -stats/coverage/*.html +reports/coverage/*.html summary plots of alignment coverage per contig -stats/coverage/data/*.all.gencov.gz +reports/coverage/data/*.all.gencov.gz output from samtools bedcov from all alignments, used for plots -stats/coverage/data/*.bx.gencov.gz +reports/coverage/data/*.bx.gencov.gz output from samtools bedcov from alignments with valid BX barcodes, used for plots -stats/BXstats/ +reports/BXstats/ reports summarizing molecule size and reads per molecule -stats/BXstats/*.bxstats.html +reports/BXstats/*.bxstats.html interactive html report summarizing inferred molecule size -stats/BXstats/data/ +reports/BXstats/data/ tabular data containing the information used to generate the BXstats reports diff --git a/modules/demultiplex/index.html b/modules/demultiplex/index.html index 400173bac..d3dbf7780 100644 --- a/modules/demultiplex/index.html +++ b/modules/demultiplex/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -61,7 +61,7 @@ Harpy - + @@ -366,9 +366,8 @@

├── Sample1.R.fq.gz ├── Sample2.F.fq.gz ├── Sample2.R.fq.gz -└── logs -    ├── demultiplex.QC.html -    └── harpy.demultiplex.log

+└── reports +    └── demultiplex.QC.html

@@ -388,13 +387,9 @@

- + - - - -
Reverse-reads from multiplexed input --file belonging to samples from the samplesheet
logs/demultiplex.QC.htmlreports/demultiplex.QC.html QC report summarizing the demultiplexing results
logs/harpy.demultiplex.logrelevant runtime parameters for demultiplexing
diff --git a/modules/impute/index.html b/modules/impute/index.html index d960fbc79..93be612d5 100644 --- a/modules/impute/index.html +++ b/modules/impute/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -63,7 +63,7 @@ Harpy - + @@ -303,14 +303,6 @@

Path to VCF/BCF file ---vcf-samples - -toggle - -no -Use samples present in vcf file for imputation rather than those found the directory - - --directory -d folder path @@ -319,6 +311,22 @@

Directory with sequence alignments +--extra-params -x string + +no +Extra arguments to add to the STITCH R function, provided in quotes and R syntax + + +--vcf-samples + +toggle + +no +Use samples present in vcf file for imputation rather than those found in the directory + + --parameters -p file path

+ +

+ # + Extra STITCH parameters +

+
+

You may add additional parameters to STITCH by way of the
+--extra-params (or -x) option. Since STITCH is a function in the R language, the parameters you add must be in R
+syntax (e.g. regionStart=0, populations=c("GBA","CUE")). The argument should be wrapped in quotes (like in other Harpy modules);
+however, if your additional parameters require the use of quotes (like the previous example), then wrap the -x argument
+in single quotes. Otherwise, the format should take the form of "arg1=value, arg2=value2". Example:

+
+
harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500'
+

# diff --git a/modules/othermodules/index.html b/modules/othermodules/index.html index e84c97bb9..969f0a4ad 100644 --- a/modules/othermodules/index.html +++ b/modules/othermodules/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
@@ -58,7 +58,7 @@ Harpy - +
diff --git a/modules/phase/index.html b/modules/phase/index.html index a4a3fab15..c56dda6cc 100644 --- a/modules/phase/index.html +++ b/modules/phase/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -61,7 +61,7 @@ Harpy - + @@ -438,8 +438,6 @@

│   ├── Sample1.linked.frags │   └── logs │      └── Sample1.linked.log -├── logs -│   └── harpy.phase.log ├── reports │   ├── blocks.summary.gz │   └── phase.html @@ -500,10 +498,6 @@

everything linkFragments prints to stderr -logs/harpy.phase.log -relevant runtime parameters for the phase module - - reports/blocks.summary.gz summary information of all the samples' block files diff --git a/modules/preflight/index.html b/modules/preflight/index.html index 6b98cfd50..99f5fba6f 100644 --- a/modules/preflight/index.html +++ b/modules/preflight/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + +
@@ -60,7 +60,7 @@ Harpy - +
diff --git a/modules/qc/index.html b/modules/qc/index.html index d86be750e..b5f16ef47 100644 --- a/modules/qc/index.html +++ b/modules/qc/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -61,7 +61,7 @@ Harpy - + @@ -333,7 +333,6 @@

│   ├── summary.bx.valid.html │   └── trim.report.html └── logs - ├── harpy.trim.log    ├── err    │   ├── Sample1.log    │   └── Sample2.log @@ -363,10 +362,6 @@

all debug/diagnostic files that aren't the trimmed reads fastp creates -logs/harpy.trim.log -relevant runtime parameters for the trim module - - logs/err what fastp prints to stderr when running diff --git a/modules/snp/index.html b/modules/snp/index.html index 0e33b22d8..4f989df92 100644 --- a/modules/snp/index.html +++ b/modules/snp/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -61,7 +61,7 @@ Harpy - + @@ -254,8 +254,8 @@

  • the groups can be numbers or text (i.e. meaningful population names)
  • you can comment out lines with # for Harpy to ignore them
  • -
  • create with harpy extra -p <samplefolder> or manually
  • -
  • if created with harpy extra -p, all the samples will be assigned to group pop1, so make sure to edit the second column to reflect your data correctly.
  • +
  • create with harpy extra popgroup -d <samplefolder> or manually
  • +
  • if created with harpy extra popgroup, all the samples will be assigned to group pop1, so make sure to edit the second column to reflect your data correctly.
  • example file for --populations
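The grouping rules above (tab-delimited sample/group pairs, `#` lines ignored) can be sketched as a small parser; this is a hypothetical helper for illustration, not part of Harpy:

```python
# Parse a --populations grouping file: "sample<TAB>group" per line,
# with '#' comment lines and blank lines ignored (mirroring the rules above).
# Illustrative helper only -- not part of Harpy itself.
def read_popgroups(lines):
    groups = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out lines
        sample, group = line.split("\t")
        groups[sample] = group
    return groups

demo = [
    "# edit the second column to reflect your data",
    "Sample1\tpop1",
    "Sample2\tpop2",
]
print(read_popgroups(demo))  # → {'Sample1': 'pop1', 'Sample2': 'pop2'}
```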
    @@ -440,7 +440,7 @@

    -

    The harpy variants snp module creates a Variants/METHOD directory with the folder structure below where METHOD is what +

    The harpy snp module creates a Variants/METHOD directory with the folder structure below where METHOD is what you specify as the --method (mpileup or freebayes). contig1 and contig2 are generic contig names from an imaginary genome.fasta for demonstration purposes.

    @@ -454,11 +454,10 @@

    │   ├── contig1.METHOD.log │  ├── contig2.call.log # mpileup only │   ├── contig2.METHOD.log -│   ├── harpy.variants.log │   ├── sample.groups │   ├── samples.files │   └── samples.names -└── stats +└── reports    ├── contig1.stats    ├── contig2.stats    ├── variants.normalized.html @@ -496,10 +495,6 @@

    what bcftools mpileup or freebayes writes to stderr -logs/harpy.variants.log -relevant runtime parameters for the variants module - - logs/sample.groups if provided, a copy of the file provided to --populations with commented lines removed @@ -512,11 +507,11 @@

    list of sample names associated with alignment files used for variant calling -stats/*.stats +reports/*.stats output of bcftools stats -stats/variants.*.html +reports/variants.*.html report summarizing variants diff --git a/modules/sv/leviathan/index.html b/modules/sv/leviathan/index.html index 5c8220a47..43305ad66 100644 --- a/modules/sv/leviathan/index.html +++ b/modules/sv/leviathan/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -61,7 +61,7 @@ Harpy - +

    @@ -274,8 +274,8 @@
    EMA-mapped reads
  • the groups can be numbers or text (i.e. meaningful population names)
  • you can comment out lines with # for Harpy to ignore them
  • -
  • create with harpy extra -p <samplefolder> or manually
  • -
  • if created with harpy extra -p, all the samples will be assigned to group pop1 +
  • create with harpy extra popgroup -d <samplefolder> or manually
  • +
  • if created with harpy extra popgroup, all the samples will be assigned to group pop1
    • make sure to edit the second column to reflect your data correctly.
    @@ -453,7 +453,6 @@

    ├── sample1.bcf ├── sample2.bcf ├── logs -│   ├── harpy.variants.log │ ├── sample1.leviathan.log │ ├── sample1.candidates │ ├── sample2.leviathan.log @@ -483,10 +482,6 @@

    if provided, a copy of the file provided to --populations with commented lines removed -logs/*.leviathan.log -what LEVIATHAN writes to stderr during operation - - logs/*candidates candidate structural variants LEVIATHAN identified diff --git a/modules/sv/naibr/index.html b/modules/sv/naibr/index.html index 1c2e41eb4..1fdc8d09a 100644 --- a/modules/sv/naibr/index.html +++ b/modules/sv/naibr/index.html @@ -4,7 +4,7 @@ - + @@ -34,12 +34,12 @@ - + - + - - + + @@ -61,7 +61,7 @@ Harpy - +

  • @@ -256,8 +256,8 @@

  • the groups can be numbers or text (i.e. meaningful population names)
  • you can comment out lines with # for Harpy to ignore them
  • -
  • create with harpy extra -p <samplefolder> or manually
  • -
  • if created with harpy extra -p, all the samples will be assigned to group pop1 +
  • create with harpy extra popgroup -d <samplefolder> or manually
  • +
  • if created with harpy extra popgroup, all the samples will be assigned to group pop1
    • make sure to edit the second column to reflect your data correctly.
    @@ -493,7 +493,7 @@

    -

    The harpy variants --method naibr module creates a Variants/naibr (or naibr-pop) +

    The harpy sv --method naibr module creates a Variants/naibr (or naibr-pop) directory with the folder structure below. sample1 and sample2 are generic sample names for demonstration purposes.

    @@ -510,7 +510,6 @@

    │ ├── sample1.reformat.bedpe │ └── sample2.reformat.bedpe ├── logs -│   ├── harpy.variants.log │ ├── sample1.log │ └── sample2.log ├── reports @@ -525,7 +524,7 @@

    | *.bedpe | structural variants identified by NAIBR | | configs/ | the configuration files harpy generated for each sample | | filtered/ | the variants that failed NAIBR's internal filters | -| IGV/ | same as the output .bedpefiles but in IGV format | |logs/harpy.variants.log| relevant runtime parameters for the variants module | |logs/sample.groups | if provided, a copy of the file provided to--populationswith commented lines removed | |logs/*.log | what NAIBR writes tostderrduring operation | |reports/ | summary reports with interactive plots of detected SV | |vcf/ | the resulting variants, but in.VCF` format |

+| IGV/ | same as the output .bedpe files but in IGV format | | logs/sample.groups | if provided, a copy of the file provided to --populations with commented lines removed | | logs/*.log | what NAIBR writes to stderr during operation | | reports/ | summary reports with interactive plots of detected SV | | vcf/ | the resulting variants, but in .VCF format |

    diff --git a/resources/js/config.js b/resources/js/config.js index 053cad556..66545582d 100644 --- a/resources/js/config.js +++ b/resources/js/config.js @@ -1 +1 @@ -var __DOCS_CONFIG__ = {"id":"sPAhg3s6m8z1RsI215dOFr/iJQN5qdnK/Wy","key":"5UuTXmzulzCzt5be/CjvKxHT4ID3wsg8cPIDHPHrtmg.ZV3OMGIcfUYX3gbiJH0gYe0yQpuW/bmHIIiuRHVJXGGcmnP2NLR0Lh7pZYHdl7ExBRzsPYwRpPXe9/okj6VVQA.98","base":"/HARPY/","host":"pdimens.github.io","version":"1.0.0","useRelativePaths":true,"documentName":"index.html","appendDocumentName":false,"trailingSlash":true,"preloadSearch":false,"cacheBustingToken":"3.5.0.760737878453","cacheBustingStrategy":"query","sidebarFilterPlaceholder":"Filter","toolbarFilterPlaceholder":"Filter","showSidebarFilter":true,"filterNotFoundMsg":"No member names found containing the query \"{query}\"","maxHistoryItems":15,"homeIcon":"","access":[{"value":"public","label":"Public"},{"value":"protected","label":"Protected"}],"toolbarLinks":[{"id":"fields","label":"Fields"},{"id":"properties","label":"Properties"},{"id":"methods","label":"Methods"},{"id":"events","label":"Events"}],"sidebar":[{"n":"/","l":"Home","s":""},{"n":"install","l":"Install","s":""},{"n":"modules","l":"Modules","c":false,"i":[{"n":"demultiplex","l":"Demultiplex","s":""},{"n":"preflight","l":"Preflight","s":""},{"n":"qc","l":"QC","s":""},{"n":"align","l":"Align","c":false,"i":[{"n":"bwa","l":"BWA","s":""},{"n":"ema","l":"EMA","s":""}],"s":""},{"n":"snp","l":"SNP","s":""},{"n":"sv","l":"SV","c":false,"i":[{"n":"leviathan","l":"Leviathan","s":""},{"n":"naibr","l":"Naibr","s":""}],"s":""},{"n":"impute","l":"Impute","s":""},{"n":"phase","l":"Phase","s":""},{"n":"othermodules","l":"Other","s":""}],"s":""},{"n":"haplotagdata","l":"Haplotag data","s":""},{"n":"commonoptions","l":"Common Options","s":""},{"n":"issues","l":"Common Issues","s":""},{"n":"snakemake","l":"Sneaky 
Snakemake","s":""},{"n":"software","l":"Software","s":""},{"n":"development","l":"Development","s":""}],"search":{"mode":1,"minChars":2,"maxResults":20,"placeholder":"Search","hotkeys":["k"],"noResultsFoundMsg":"Sorry, no results found.","recognizeLanguages":true,"languages":[0],"preload":false},"resources":{"History_Title_Label":"History","History_ClearLink_Label":"Clear","History_NoHistory_Label":"No history items","API_AccessFilter_Label":"Access","API_ParameterSection_Label":"PARAMETERS","API_SignatureSection_Label":"SIGNATURE","API_CopyHint_Label":"Copy","API_CopyNameHint_Label":"Copy name","API_CopyLinkHint_Label":"Copy link","API_CopiedAckHint_Label":"Copied!","API_MoreOverloads_Label":"more","API_MoreDropdownItems_Label":"More","API_OptionalParameter_Label":"optional","API_DefaultParameterValue_Label":"Default value","API_InheritedFilter_Label":"Inherited","Search_Input_Placeholder":"Search","Toc_Contents_Label":"Contents","Toc_RelatedClasses_Label":"Related Classes","History_JustNowTime_Label":"just now","History_AgoTime_Label":"ago","History_YearTime_Label":"y","History_MonthTime_Label":"mo","History_DayTime_Label":"d","History_HourTime_Label":"h","History_MinuteTime_Label":"m","History_SecondTime_Label":"s"}}; +var __DOCS_CONFIG__ = {"id":"LhfowaO+kWy/1KnsR215zPJtpGE+j1tuWlR","key":"juKlxHqnC9bmHdr+w+TUsu/WMT+ula/JPrjafJ1VCfY.C+jSckLaTFs5huXLU05ySy5OxzJJlgEd2Wp/gQ/GeUI+/VL+xVnNQV3vVTKkghHhQiyrMf6OdCuK7ZtEVyyWZA.39","base":"/HARPY/","host":"pdimens.github.io","version":"1.0.0","useRelativePaths":true,"documentName":"index.html","appendDocumentName":false,"trailingSlash":true,"preloadSearch":false,"cacheBustingToken":"3.5.0.761346101834","cacheBustingStrategy":"query","sidebarFilterPlaceholder":"Filter","toolbarFilterPlaceholder":"Filter","showSidebarFilter":true,"filterNotFoundMsg":"No member names found containing the query 
\"{query}\"","maxHistoryItems":15,"homeIcon":"","access":[{"value":"public","label":"Public"},{"value":"protected","label":"Protected"}],"toolbarLinks":[{"id":"fields","label":"Fields"},{"id":"properties","label":"Properties"},{"id":"methods","label":"Methods"},{"id":"events","label":"Events"}],"sidebar":[{"n":"/","l":"Home","s":""},{"n":"install","l":"Install","s":""},{"n":"modules","l":"Modules","c":false,"i":[{"n":"demultiplex","l":"Demultiplex","s":""},{"n":"preflight","l":"Preflight","s":""},{"n":"qc","l":"QC","s":""},{"n":"align","l":"Align","c":false,"i":[{"n":"bwa","l":"BWA","s":""},{"n":"ema","l":"EMA","s":""}],"s":""},{"n":"snp","l":"SNP","s":""},{"n":"sv","l":"SV","c":false,"i":[{"n":"leviathan","l":"Leviathan","s":""},{"n":"naibr","l":"Naibr","s":""}],"s":""},{"n":"impute","l":"Impute","s":""},{"n":"phase","l":"Phase","s":""},{"n":"othermodules","l":"Other","s":""}],"s":""},{"n":"haplotagdata","l":"Haplotag data","s":""},{"n":"commonoptions","l":"Common Options","s":""},{"n":"issues","l":"Common Issues","s":""},{"n":"snakemake","l":"Sneaky Snakemake","s":""},{"n":"software","l":"Software","s":""},{"n":"development","l":"Development","s":""}],"search":{"mode":1,"minChars":2,"maxResults":20,"placeholder":"Search","hotkeys":["k"],"noResultsFoundMsg":"Sorry, no results found.","recognizeLanguages":true,"languages":[0],"preload":false},"resources":{"History_Title_Label":"History","History_ClearLink_Label":"Clear","History_NoHistory_Label":"No history items","API_AccessFilter_Label":"Access","API_ParameterSection_Label":"PARAMETERS","API_SignatureSection_Label":"SIGNATURE","API_CopyHint_Label":"Copy","API_CopyNameHint_Label":"Copy name","API_CopyLinkHint_Label":"Copy link","API_CopiedAckHint_Label":"Copied!","API_MoreOverloads_Label":"more","API_MoreDropdownItems_Label":"More","API_OptionalParameter_Label":"optional","API_DefaultParameterValue_Label":"Default 
value","API_InheritedFilter_Label":"Inherited","Search_Input_Placeholder":"Search","Toc_Contents_Label":"Contents","Toc_RelatedClasses_Label":"Related Classes","History_JustNowTime_Label":"just now","History_AgoTime_Label":"ago","History_YearTime_Label":"y","History_MonthTime_Label":"mo","History_DayTime_Label":"d","History_HourTime_Label":"h","History_MinuteTime_Label":"m","History_SecondTime_Label":"s"}}; diff --git a/resources/js/search.json b/resources/js/search.json index a24d8fb20..0845a6ad7 100644 --- a/resources/js/search.json +++ b/resources/js/search.json @@ -1 +1 @@ -[[{"i":"#","p":["Using Harpy to process your haplotagged data"]},{"l":"Home","p":["Harpy is a haplotagging data processing pipeline for Linux-based systems. It uses all the magic of Snakemake under the hood to handle the worklfow decision-making, but as a user, you just interact with it like a normal command-line"]},{"i":"what-is-haplotagging","l":"What is haplotagging?","p":["Linked-read sequencing exists to combine the throughput and accuracy of short-read sequencing with the long range haplotype information of long-read sequencing. Haplotagging is an implementation of linked-read sequencing developed by"]},{"l":"Harpy Modules","p":["Harpy is modular, meaning you can use different parts of it independent from each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called by"]},{"l":"Using Harpy","p":["You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage (e.g."]}],[{"l":"Install HARPY","p":["Harpy is now hosted on Bioconda! That means to install it, you just need to have mamba(or conda) on your Linux-based system and install it with a simple command. 
You can install Harpy into an existing environment or create a new one for it (recommended)."]}],[{"i":"#","p":["Demultiplex raw sequences into haplotag barcoded samples"]},{"l":"Demultiplex Raw Sequences","p":["When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. These files contain sequences for all of your samples and need to be demultiplexed using barcodes to"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy demultiplex module is configured using these command-line arguments:"]},{"l":"Haplotag Types"},{"l":"Gen I Demultiplex Workflow"}],[{"i":"#","p":["Run file format checks on haplotagged FASTQ/BAM files"]},{"l":"Pre-flight checks for input files","p":["Harpy does a lot of stuff with a lot of software and each of these programs expect the incoming data to follow particular formats (plural, unfortunately). These formatting opinions/specifics are at the mercy of the original developers and while there are times when Harpy can (and does)"]},{"l":"when to run"},{"l":"Running Options","p":["In addition to the common runtime options, the harpy preflight fastq|bam module is configured using these command-line arguments:"]},{"l":"Workflow"}],[{"i":"#","p":["Quality trim haplotagged sequences with Harpy"]},{"l":"Quality Trim Sequences","p":["Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analyses is to remove these adapters and trim poor quality data. 
You can remove adapters"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy qc module is configured using these command-line arguments:"]},{"l":"QC Workflow"}],[{"i":"#","p":["Align haplotagged sequences with BWA MEM"]},{"l":"Map Reads onto a genome with BWA MEM","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align bwa module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"BWA workflow"}],[{"i":"#","p":["Align haplotagged sequences with EMA"]},{"l":"Map Reads onto a genome with EMA","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align ema module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["Unlike the manual MI:i assignment in the BWA workflow, the EMA aligner will assign a unique Molecular Identifier MI:i tag to alignments using its own heuristics. Instead, the EMA workflow uses this value to calculate statistics for the haplotag"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. 
The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"EMA workflow"}],[{"i":"#","p":["Call SNPs and small indels"]},{"l":"Call SNPs and small indels","p":["After reads have been aligned, e.g., with harpy align, you can use those alignment files(.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy snp module is configured using these command-line arguments:"]},{"l":"windowsize","p":["To speed things along, Harpy will call variants in parallel on different contig intervals, then merge everything at the end. You can control the level of parallelization by using"]},{"l":"populations","p":["Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to beleive there is a biologically meaningful grouping scheme to your samples, then you should include"]},{"l":"SNP calling workflow"}],[{"i":"#","p":["Call structural variants using Leviathan"]},{"l":"Call Structural Variants using LEVIATHAN","p":["(like indels, insertions, duplications, breakends)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv leviathan module is configured using these command-line arguments:"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. 
Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"LEVIATHAN workflow"}],[{"i":"#","p":["Call structural variants using NAIBR (plus)"]},{"l":"Call Structural Variants using NAIBR","p":["(like indels, insertions, duplications)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv naibr module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still considered as originating from the same DNA molecule. See"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"optional vcf file","p":["In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. The --vcf option is optional and not used by NAIBR. However, to use harpy sv naibr"]},{"i":"a-phased-input---vcf","l":"a phased input --vcf","p":["This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode"]},{"l":"NAIBR workflow"}],[{"i":"#","p":["Impute genotypes for haplotagged data with Harpy"]},{"l":"Impute Genotypes using Sequences","p":["After variants have been called, you may want to impute missing genotypes to get the most from your data. 
Harpy uses STITCH to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy impute module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Parameter file","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user"]},{"l":"STITCH Parameters"},{"l":"Imputation Workflow"}],[{"i":"#","p":["Phase haplotypes for haplotagged data with Harpy"]},{"l":"Phase SNPs into Haplotypes","p":["You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy phase module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Molecule distance","p":["The molecule distance refers to the base-pair distance dilineating separate molecules. In other words, when two alignments on a single contig share the same barcode, how far away from each other are we willing to say they were and still consider them having"]},{"l":"Pruning threshold","p":["The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing low-confidence SNPs from consideration. 
With Harpy, you configure this value as an integer"]},{"l":"Phasing Workflow"}],[{"i":"#","p":["Generate extra files for analysis with Harpy"]},{"l":"Other Harpy modules","p":["Some parts of Harpy (variant calling, imputation) want or need extra files. You can create various files necessary for different modules using these extra modules: The arguments represent different sub-commands and can be run in any order or combination to generate the files you need."]},{"l":"Other modules"},{"l":"popgroup"},{"l":"Sample grouping file for variant calling"},{"l":"arguments","p":["This file is entirely optional and useful if you want SNP variant calling to happen on a per-population level via harpy snp ... -p or on samples pooled-as-populations via harpy sv ... -p"]},{"l":"stitchparams"},{"l":"STITCH parameter file"},{"i":"arguments-1","l":"arguments","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model"]},{"l":"hpc"},{"l":"HPC cluster profile"},{"i":"arguments-2","l":"arguments","p":["For snakemake to work in harmony with an HPC scheduler, a \"profile\" needs to be provided that tells Snakemake how it needs to interact with the HPC scheduler to submit your jobs to the cluster. Using"]}],[{"l":"Haplotag data"},{"l":"Data Format"},{"l":"Barcodes","p":["While barcodes are actually combinatorial bases, in the read headers they are represented with the format AxxCxxBxxDxx, where each barcode segment is denoted as Axx(or Bxx, etc.)."]},{"l":"barcode protocol varieties","p":["If you think haplotagging is as simple as exactly 96^4 unique barcodes, you would only be half-correct. The original haplotagging protocol in Meier et al. 
is good, but the authors (and others) have been working to improve this linked-read technology to improve"]},{"l":"where the barcodes go","p":["Chromium 10X linked-reads have a particular format where the barcode is the leading 16 bases of the read. However, haplotagging data does not use that format, nor do the tools implemented in Harpy work correctly with it. Once demultiplexed, haplotagging sequences should look"]},{"l":"Read headers","p":["Like mentioned, the haplotag barcode is expected to be stored in the BX:Z: tag in the read header. This information is retained through the various Harpy steps. An example read header could look like:"]},{"l":"Read length","p":["Reads must be at least 30 base pairs in length for alignment. The qc module removes reads <50bp."]},{"l":"Compression","p":["Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension"]},{"l":"Naming conventions","p":["Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accomodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations."]},{"l":"Barcode thresholds","p":["By the nature of linked read technologies, there will (almost always) be more DNA fragments than unique barcodes for them. As a result, it's common for barcodes to reappear in sequences. Rather than incorrectly assume that all sequences/alignments with the same barcode"]}],[{"l":"Common Harpy Options","p":["Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments,"]},{"l":"The Genome folder","p":["You will notice that many of the workflows will create a Genome folder in the working directory. 
This folder is to make it easier for Harpy to store the genome and the associated"]}],[{"l":"Common Issues","p":["Lots of stuff can go wrong during an analysis. The intent of this page is to highlight common issues you may experience during analysis and ways to address these issues."]},{"l":"Problem installing with conda","p":["Conda is an awesome package manager, but it's slow and uses a ton of memory as dependencies increase. Harpy has a lot of dependencies and you might stall out conda trying to install it. Use mamba instead-- it'll work where conda fails."]},{"l":"Failures during imputation or phasing","p":["If you use bamutils clipOverlap on alignments that are used for the impute or phase modules, they will cause both programs to error. We don't know why, but they do."]},{"i":"alignment-file-name-and-id-tag-mismatch","l":"Alignment file name and ID: tag mismatch","p":["Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) into the alignment header (the @RG ID:name SM:name tag). It likewise expects, through various steps,"]}],[{"l":"Adding Snakamake parameters","p":["Harpy relies on Snakemake under the hood to handle file and job dependencies. Most of these details have been abstracted away from the end-user, but every module of Harpy (except"]},{"l":"Common use cases","p":["You likely wont need to invoke --snakemake very often, if ever. However, here are common use cases for this parameter."]}],[{"l":"Software used in Harpy","p":["HARPY is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in HARPY's many moving pieces."]}],[{"l":"Developing Harpy","p":["Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. 
This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so."]},{"l":"Installing Harpy for development","p":["The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the misc/buildlocal.sh script to move all the necessary files to the"]},{"i":"harpys-components","l":"Harpy's components"},{"l":"source code","p":["Harpy runs in two stages:"]},{"l":"Bioconda recipe","p":["For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and build script is also stored in"]},{"l":"The Harpy repository"},{"l":"structure","p":["Harpy exists as a Git repository and has 5 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no"]},{"l":"development workflow","p":["The dev workflow is reasonably standard:"]},{"l":"Testing and CI","p":["CI ( C ontinuous I ntegration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the"]}]] \ No newline at end of file +[[{"i":"#","p":["Using Harpy to process your haplotagged data"]},{"l":"Home","p":["Harpy is a haplotagging data processing pipeline for Linux-based systems. It uses all the magic of Snakemake under the hood to handle the workflow decision-making, but as a user, you just interact with it like a normal command-line"]},{"i":"what-is-haplotagging","l":"What is haplotagging?","p":["Linked-read sequencing exists to combine the throughput and accuracy of short-read sequencing with the long range haplotype information of long-read sequencing. 
Haplotagging is an implementation of linked-read sequencing developed by"]},{"l":"Harpy Modules","p":["Harpy is modular, meaning you can use different parts of it independent from each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called by"]},{"l":"Using Harpy","p":["You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage (e.g."]}],[{"l":"Install HARPY","p":["Harpy is now hosted on Bioconda! That means to install it, you just need to have mamba(or conda) on your Linux-based system and install it with a simple command. You can install Harpy into an existing environment or create a new one for it (recommended)."]}],[{"i":"#","p":["Demultiplex raw sequences into haplotag barcoded samples"]},{"l":"Demultiplex Raw Sequences","p":["When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. These files contain sequences for all of your samples and need to be demultiplexed using barcodes to"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy demultiplex module is configured using these command-line arguments:"]},{"l":"Haplotag Types"},{"l":"Gen I Demultiplex Workflow"}],[{"i":"#","p":["Run file format checks on haplotagged FASTQ/BAM files"]},{"l":"Pre-flight checks for input files","p":["Harpy does a lot of stuff with a lot of software and each of these programs expect the incoming data to follow particular formats (plural, unfortunately). 
These formatting opinions/specifics are at the mercy of the original developers and while there are times when Harpy can (and does)"]},{"l":"when to run"},{"l":"Running Options","p":["In addition to the common runtime options, the harpy preflight fastq|bam module is configured using these command-line arguments:"]},{"l":"Workflow"}],[{"i":"#","p":["Quality trim haplotagged sequences with Harpy"]},{"l":"Quality Trim Sequences","p":["Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analyses is to remove these adapters and trim poor quality data. You can remove adapters"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy qc module is configured using these command-line arguments:"]},{"l":"QC Workflow"}],[{"i":"#","p":["Align haplotagged sequences with BWA MEM"]},{"l":"Map Reads onto a genome with BWA MEM","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align bwa module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used during the BWA alignment workflow to assign alignments a unique Molecular Identifier MI:i tag based on their haplotag barcode and the distance threshold you specify. See"]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. 
Set this value to 1"]},{"l":"BWA workflow"}],[{"i":"#","p":["Align haplotagged sequences with EMA"]},{"l":"Map Reads onto a genome with EMA","p":["Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input,"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy align ema module is configured using these command-line arguments:"]},{"l":"Barcode whitelist","p":["Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode \"whitelist.\" This file is a simple text file that has one barcode per line so a given software knows what barcodes to expect in your data."]},{"l":"Quality filtering","p":["The --quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1"]},{"l":"EMA workflow"}],[{"i":"#","p":["Call SNPs and small indels"]},{"l":"Call SNPs and small indels","p":["After reads have been aligned, e.g., with harpy align, you can use those alignment files(.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy snp module is configured using these command-line arguments:"]},{"l":"windowsize","p":["To speed things along, Harpy will call variants in parallel on different contig intervals, then merge everything at the end. You can control the level of parallelization by using"]},{"l":"populations","p":["Grouping samples changes the way the variant callers compute certain statistics when calling variants. 
If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include"]},{"l":"SNP calling workflow"}],[{"i":"#","p":["Call structural variants using Leviathan"]},{"l":"Call Structural Variants using LEVIATHAN","p":["(like indels, insertions, duplications, breakends)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv leviathan module is configured using these command-line arguments:"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"LEVIATHAN workflow"}],[{"i":"#","p":["Call structural variants using NAIBR (plus)"]},{"l":"Call Structural Variants using NAIBR","p":["(like indels, insertions, duplications)"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy sv naibr module is configured using these command-line arguments:"]},{"l":"Molecule distance","p":["The --molecule-distance option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still be considered as originating from the same DNA molecule. See"]},{"l":"Single-sample variant calling","p":["When not using a population grouping file via --populations, variants will be called per-sample. 
Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way"]},{"l":"Pooled-sample variant calling","p":["With the inclusion of a population grouping file via --populations, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false"]},{"l":"optional vcf file","p":["In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. The --vcf option is optional and not used by NAIBR. However, to use harpy sv naibr"]},{"i":"a-phased-input---vcf","l":"a phased input --vcf","p":["This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode"]},{"l":"NAIBR workflow"}],[{"i":"#","p":["Impute genotypes for haplotagged data with Harpy"]},{"l":"Impute Genotypes using Sequences","p":["After variants have been called, you may want to impute missing genotypes to get the most from your data. Harpy uses STITCH to impute genotypes, a haplotype-based method that is linked-read aware. Imputing genotypes requires a variant call file"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy impute module is configured using these command-line arguments:"]},{"l":"Extra STITCH parameters","p":["You may add additional parameters to STITCH by way of the--extra-params(or -x) option. Since STITCH is a function in the R language, the parameters you add must be in R syntax (e.g."]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. 
By default, Harpy assumes you want to use all the samples"]},{"l":"Parameter file","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user"]},{"l":"STITCH Parameters"},{"l":"Imputation Workflow"}],[{"i":"#","p":["Phase haplotypes for haplotagged data with Harpy"]},{"l":"Phase SNPs into Haplotypes","p":["You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing"]},{"l":"Running Options","p":["In addition to the common runtime options, the harpy phase module is configured using these command-line arguments:"]},{"l":"Prioritize the vcf file","p":["Sometimes you want to run imputation on all the samples present in the --directory, but other times you may want to only impute the samples present in the --vcf file. By default, Harpy assumes you want to use all the samples"]},{"l":"Molecule distance","p":["The molecule distance refers to the base-pair distance delineating separate molecules. In other words, when two alignments on a single contig share the same barcode, how far away from each other are we willing to say they were and still consider them having"]},{"l":"Pruning threshold","p":["The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing low-confidence SNPs from consideration. With Harpy, you configure this value as an integer"]},{"l":"Phasing Workflow"}],[{"i":"#","p":["Generate extra files for analysis with Harpy"]},{"l":"Other Harpy modules","p":["Some parts of Harpy (variant calling, imputation) want or need extra files. 
You can create various files necessary for different modules using these extra modules: The arguments represent different sub-commands and can be run in any order or combination to generate the files you need."]},{"l":"Other modules"},{"l":"popgroup"},{"l":"Sample grouping file for variant calling"},{"l":"arguments","p":["This file is entirely optional and useful if you want SNP variant calling to happen on a per-population level via harpy snp ... -p or on samples pooled-as-populations via harpy sv ... -p"]},{"l":"stitchparams"},{"l":"STITCH parameter file"},{"i":"arguments-1","l":"arguments","p":["Typically, one runs STITCH multiple times, exploring how results vary with different model parameters. The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model"]},{"l":"hpc"},{"l":"HPC cluster profile"},{"i":"arguments-2","l":"arguments","p":["For snakemake to work in harmony with an HPC scheduler, a \"profile\" needs to be provided that tells Snakemake how it needs to interact with the HPC scheduler to submit your jobs to the cluster. Using"]}],[{"l":"Haplotag data"},{"l":"Data Format"},{"l":"Barcodes","p":["While barcodes are actually combinatorial bases, in the read headers they are represented with the format AxxCxxBxxDxx, where each barcode segment is denoted as Axx(or Bxx, etc.)."]},{"l":"barcode protocol varieties","p":["If you think haplotagging is as simple as exactly 96^4 unique barcodes, you would only be half-correct. The original haplotagging protocol in Meier et al. is good, but the authors (and others) have been working to improve this linked-read technology to improve"]},{"l":"where the barcodes go","p":["Chromium 10X linked-reads have a particular format where the barcode is the leading 16 bases of the read. However, haplotagging data does not use that format, nor do the tools implemented in Harpy work correctly with it. 
Once demultiplexed, haplotagging sequences should look"]},{"l":"Read headers","p":["Like mentioned, the haplotag barcode is expected to be stored in the BX:Z: tag in the read header. This information is retained through the various Harpy steps. An example read header could look like:"]},{"l":"Read length","p":["Reads must be at least 30 base pairs in length for alignment. The qc module removes reads <50bp."]},{"l":"Compression","p":["Harpy generally doesn't require the input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to end with the extension"]},{"l":"Naming conventions","p":["Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accommodate every wacky iteration currently in circulation. While Harpy tries its best to be flexible, there are limitations."]},{"l":"Barcode thresholds","p":["By the nature of linked read technologies, there will (almost always) be more DNA fragments than unique barcodes for them. As a result, it's common for barcodes to reappear in sequences. Rather than incorrectly assume that all sequences/alignments with the same barcode"]}],[{"l":"Common Harpy Options"},{"l":"Common command-line options","p":["Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments,"]},{"l":"The workflow folder","p":["When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and is very useful to understand what the module did, be it for your own"]},{"l":"The Genome folder","p":["You will notice that many of the workflows will create a Genome folder in the working directory. 
This folder is to make it easier for Harpy to store the genome and the associated"]},{"l":"Common Issues","p":["Lots of stuff can go wrong during an analysis. The intent of this page is to highlight common issues you may experience during analysis and ways to address these issues."]},{"l":"Problem installing with conda","p":["Conda is an awesome package manager, but it's slow and uses a ton of memory as dependencies increase. Harpy has a lot of dependencies and you might stall out conda trying to install it. Use mamba instead-- it'll work where conda fails."]},{"l":"Failures during imputation or phasing","p":["If you use bamutils clipOverlap on alignments that are used for the impute or phase modules, they will cause both programs to error. We don't know why, but they do."]},{"i":"alignment-file-name-and-id-tag-mismatch","l":"Alignment file name and ID: tag mismatch","p":["Aligning a sample to a genome via Harpy will insert the sample name (based on the file name) into the alignment header (the @RG ID:name SM:name tag). It likewise expects, through various steps,"]}],[{"l":"Adding Snakemake parameters","p":["Harpy relies on Snakemake under the hood to handle file and job dependencies. Most of these details have been abstracted away from the end-user, but every module of Harpy (except"]},{"l":"Common use cases","p":["You likely won't need to invoke --snakemake very often, if ever. However, here are common use cases for this parameter."]}],[{"l":"Software used in Harpy","p":["HARPY is the sum of its parts, and out of tremendous respect for the developers involved in the included software, we would like to highlight the tools directly involved in HARPY's many moving pieces."]}],[{"l":"Developing Harpy","p":["Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. 
This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so."]},{"l":"Installing Harpy for development","p":["The process follows cloning the harpy repository, installing the preconfigured conda environment, and running the misc/buildlocal.sh script to move all the necessary files to the"]},{"i":"harpys-components","l":"Harpy's components"},{"l":"source code","p":["Harpy runs in two stages:"]},{"l":"Bioconda recipe","p":["For the ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe and build script is also stored in"]},{"l":"The Harpy repository"},{"l":"structure","p":["Harpy exists as a Git repository and has 5 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation, however there is no"]},{"l":"development workflow","p":["The dev workflow is reasonably standard:"]},{"l":"Testing and CI","p":["CI ( C ontinuous I ntegration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the"]}]] \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 0e116df94..100dd3570 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/snakemake/index.html b/snakemake/index.html index a446f97ed..8cdfe8d59 100644 --- a/snakemake/index.html +++ b/snakemake/index.html @@ -4,7 +4,7 @@ - + @@ -32,12 +32,12 @@ - + - + - - + +
    @@ -58,7 +58,7 @@ Harpy - +
    @@ -263,9 +263,11 @@
    reserved/forbidden arguments
  • --directory
  • --cores
  • --snakefile
  • -
  • --config
  • +
  • --configfile
  • --rerun-incomplete
  • --nolock
  • +
  • --use-conda
  • +
  • --conda-prefix
  • @@ -292,7 +294,7 @@
    ahead of time. It's also useful for debugging during development. Here is an example of dry-running variant calling:

    -
    harpy variants snp -g genome.fasta  -d Align/ema -s "--dry-run"
    +
    harpy snp mpileup -g genome.fasta  -d Align/ema -s "--dry-run"
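The quoted string handed to -s can carry more than one Snakemake argument at a time. A sketch, not from the docs themselves: the harpy invocation mirrors the dry-run example above, and --quiet is a standard Snakemake flag that trims the rule-by-rule output.

```shell
# Sketch: bundle several Snakemake flags into the single quoted -s string.
# --quiet is standard Snakemake; the harpy command mirrors the example above.
harpy snp mpileup -g genome.fasta -d Align/ema -s "--dry-run --quiet"
```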
    @@ -301,7 +303,7 @@
    you want the beadtag report Harpy makes from the output of EMA count. To do this, just list the file/files (relative to your working directory) without any flags. Example for the beadtag report:

    -
    harpy align -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html"
    +
    harpy align bwa -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html"

    This of course necessitates knowing the names of the files ahead of time. See the individual modules for a breakdown of expected outputs.
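Several target files can be requested at once by listing them space-separated inside the one quoted string. A sketch assuming the report path from the example above; the second path is purely hypothetical, standing in for another expected output.

```shell
# Sketch: request two targets in one run. The first path comes from the
# example above; the second is a hypothetical placeholder output name.
harpy align bwa -g genome.fasta -d QC/ -t 4 -s "Align/ema/stats/reads.bxstats.html Align/ema/stats/another.report.html"
```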

    @@ -320,7 +322,7 @@
    you may use --shadow-prefix <dirname> where <dirname> is the path to the mandatory directory you need to work out of. By configuring this "shadow directory" setting, Snakemake will automatically move the files in/out of that directory for you:

    -
    harpy variants sv --method leviathan -g genome.fasta  -d Align/bwa --threads 8 -p samples.groups -s "--shadow-prefix /SCRATCH/username/"
    +
    harpy sv leviathan -g genome.fasta  -d Align/bwa --threads 8 -p samples.groups -s "--shadow-prefix /SCRATCH/username/"
    diff --git a/software/index.html b/software/index.html index 5c6e07bff..2415fe2f0 100644 --- a/software/index.html +++ b/software/index.html @@ -4,7 +4,7 @@ - + @@ -32,11 +32,11 @@ - + - + - +
    @@ -57,7 +57,7 @@ Harpy - +