update to 1.14.2

pdimens · Dec 13, 2024 · b8338f6 · b8338f6
1 parent 993428b
commit b8338f6

Showing 5 changed files with 534 additions and 431 deletions.
diff --git a/Workflows/downsample.md b/Workflows/downsample.md
@@ -7,21 +7,19 @@ order: 10
 
 # :icon-fold-down: Downsample data by barcode
 
-===  :icon-checklist: You will need one of either
-- one alignment file [!badge variant="success" text=".bam"] [!badge variant="success" text=".sam"] [!badge variant="secondary" text="case insensitive"]
-- one set of paired-end reads in FASTQ format [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="gzip recommended"] [!badge variant="secondary" text="case insensitive"]
+===  :icon-checklist: You will need
+- One of either:
+  - one alignment file [!badge variant="success" text=".bam"] [!badge variant="success" text=".sam"] [!badge variant="secondary" text="case insensitive"]
+  - one set of paired-end reads in FASTQ format [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="gzip recommended"] [!badge variant="secondary" text="case insensitive"]
+- Barcodes in the `BX:Z` SAM tag for both BAM and FASTQ inputs
+  - See [Section 1 of the SAM Spec here](https://samtools.github.io/hts-specs/SAMtags.pdf) for details
+  - `BX:Z` tags **must be the last tag** in the FASTQ/BAM record
+    - use [bx_to_end.py](/utilities.md#bx_to_endpy) to move the BX tags to the ends, if needed
 ===
 
 While downsampling (subsampling) FASTQ and BAM files is relatively simple with tools such as `awk`, `samtools`, `seqtk`, `seqkit`, etc.,
 [!badge corners="pill" text="downsample"] allows you to downsample a BAM file (or paired-end FASTQ) _by barcodes_. That means you can
-keep all the reads associated with `d` number of barcodes. The `--invalid` proportion will determine what proportion of invalid barcodes appear in the barcode
-pool that gets subsampled, where `0` is none, `1` is all invalid barcodes, and a number in between is that proportion, e.g. `0.5` is half.
-Bear in mind that the barcode pool still gets subsampled, so the `--invalid` proportion doesn't necessarily reflect how many end up getting
-sampled, rather what proportion will be considered for sampling. 
-
-!!! Barcode tag
-Barcodes must be in the `BX:Z` SAM tag for both BAM and FASTQ inputs. See [Section 1 of the SAM Spec here](https://samtools.github.io/hts-specs/SAMtags.pdf).
-!!!
+keep all the reads associated with `d` number of barcodes.
 
 ```bash usage
 harpy downsample OPTIONS... INPUT(S)...
@@ -48,6 +46,15 @@ module is configured using the command-line arguments below.
 | `--prefix`      |    `-p`    | `downsampled` | Prefix for output files                                                                                                           |
 | `--random-seed` |            |               | Random seed for sampling [!badge variant="secondary" text="optional"]                                                             |
 
+## invalid barcodes
+The `--invalid` option determines what proportion of invalid barcodes appear in the barcode
+pool. Bear in mind that the barcode pool still gets subsampled, so the `--invalid` proportion
+doesn't necessarily reflect how many end up getting sampled, rather what proportion will be
+considered for sampling. The proportions equate to:
+- `0`: invalid barcodes are skipped
+- `1`: all invalid barcodes appear in the barcode pool that gets subsampled
+- `0`<`i`<`1`: that proportion of invalid barcodes appears in the barcode pool that gets subsampled
+
 ----
 ## :icon-git-pull-request: Downsample Workflow
 ```mermaid

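Reviewer note on the `--invalid` text added to `downsample.md`: the two-stage behavior it describes (build a barcode pool, then subsample `d` barcodes from it) can be sketched roughly as below. This is only an illustration and not Harpy's actual implementation. The function names, the `is_invalid` predicate, and the rounding of the invalid count are all assumptions made for the example.

```python
import random


def build_barcode_pool(barcodes, is_invalid, invalid_proportion, rng):
    """Return the barcodes that are eligible for subsampling.

    All valid barcodes enter the pool. Of the invalid barcodes, only the
    requested proportion (rounded to a whole count, an assumption) is admitted.
    """
    if not 0 <= invalid_proportion <= 1:
        raise ValueError("invalid_proportion must be between 0 and 1")
    valid = [bc for bc in barcodes if not is_invalid(bc)]
    invalid = [bc for bc in barcodes if is_invalid(bc)]
    n_invalid = round(len(invalid) * invalid_proportion)
    return valid + rng.sample(invalid, n_invalid)


def downsample_barcodes(barcodes, d, is_invalid, invalid_proportion=1.0, seed=None):
    """Pick up to `d` barcodes from the pool; reads with these barcodes are kept."""
    rng = random.Random(seed)
    unique = sorted(set(barcodes))  # sorted so a fixed seed is reproducible
    pool = build_barcode_pool(unique, is_invalid, invalid_proportion, rng)
    return set(rng.sample(pool, min(d, len(pool))))


if __name__ == "__main__":
    # Hypothetical rule for the demo: a "00" segment marks a barcode as invalid.
    def has_zero_segment(bc):
        return "00" in bc

    demo = ["A01C02B03D04", "A05C06B07D08", "A00C02B03D04", "A01C00B03D04"]
    print(downsample_barcodes(demo, d=2, is_invalid=has_zero_segment,
                              invalid_proportion=0.5, seed=1))
```

The sketch makes the caveat in the added text concrete: `invalid_proportion` only controls how many invalid barcodes enter the pool, so the share of invalid barcodes among the `d` that end up sampled can still differ from it.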
diff --git a/blog/filteringsnps.md b/blog/filteringsnps.md
@@ -70,5 +70,5 @@ fairly common to use `0.05` (e.g. `-i 'MAF>0.05'`) to `0.10` (e.g. `-i 'MAF>0.10
 Missing data is, frankly, not terribly useful. The amount of missing data you're willing to tolerate will depend on your study, but
 it's common to remove sites with >20% missing data (e.g. `-e 'F_MISSING>0.2'`). This can be as strict (or lenient) as you want; it's not uncommon to see very
 conservative filtering at 10% or 5% missing data. **However**, you can impute missing genotypes to recover
-missing data! Harpy can leverage linked-read information to impute genotypes with the [!badge corners="pill" text="impute"](../Modules/impute.md)
+missing data! Harpy can leverage linked-read information to impute genotypes with the [!badge corners="pill" text="impute"](../Workflows/impute.md)
 module. You should try to impute genotypes first before filtering out sites based on missingness.
diff --git a/static/bc_threshold.png b/static/bc_threshold.png