make own page
pdimens committed Dec 11, 2024
1 parent 44247a9 commit 94ce116
Showing 2 changed files with 69 additions and 41 deletions.
69 changes: 69 additions & 0 deletions Workflows/downsample.md
@@ -0,0 +1,69 @@
---
label: Downsample
description: Downsample data by barcode
icon: fold-down
order: 10
---

# :icon-fold-down: Downsample data by barcode

=== :icon-checklist: You will need one of either
- one alignment file [!badge variant="success" text=".bam"] [!badge variant="success" text=".sam"] [!badge variant="secondary" text="case insensitive"]
- one set of paired-end reads in FASTQ format [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="gzip recommended"] [!badge variant="secondary" text="case insensitive"]
===

Tools such as `awk`, `samtools`, `seqtk`, and `seqkit` make it relatively simple to downsample (subsample) FASTQ and BAM files.
[!badge corners="pill" text="downsample"] instead downsamples a BAM file (or a paired-end FASTQ pair) _by barcode_: it keeps every read
associated with `d` sampled barcodes. First, barcodes are extracted, then subsampled, then the reads associated with those barcodes are extracted.
The `--invalid` option sets the proportion of invalid barcodes that enter the pool from which barcodes are sampled: `0` excludes all invalid
barcodes, `1` includes all of them, and a value in between includes that proportion (e.g. `0.5` includes half).
Because the pool itself is then subsampled, `--invalid` controls how many invalid barcodes are considered for sampling, not how many end up in the output.
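
The pooling logic described above can be sketched in a few lines of shell. This is a rough illustration, not Harpy's actual implementation: the function names `build_pool` and `subsample_barcodes` are hypothetical, the input is assumed to be a text file with one barcode per line, and a barcode is assumed to be invalid when any of its segments is `00` (e.g. `A00C12B34D56`). It also assumes GNU `shuf` is available.

```bash
# Hypothetical sketch of barcode pooling; NOT Harpy's implementation.
# Assumption: a barcode is invalid when any segment is 00 (e.g. A00C12B34D56).

# build_pool BARCODE_FILE INVALID_PROPORTION
# Prints every valid barcode plus a random share of the invalid ones.
build_pool() {
    local barcodes=$1 prop=$2 n_invalid keep
    # Valid barcodes always enter the pool.
    awk '$0 !~ /[ABCD]00/' "$barcodes"
    # Count invalid barcodes and work out how many to admit (rounded down).
    n_invalid=$(awk '/[ABCD]00/ { n++ } END { print n + 0 }' "$barcodes")
    keep=$(awk -v n="$n_invalid" -v p="$prop" 'BEGIN { printf "%d", n * p }')
    if [ "$keep" -gt 0 ]; then
        awk '/[ABCD]00/' "$barcodes" | shuf -n "$keep"
    fi
}

# subsample_barcodes BARCODE_FILE INVALID_PROPORTION D
# Draws D barcodes at random from the pool built above.
subsample_barcodes() {
    build_pool "$1" "$2" | shuf -n "$3"
}
```

With a proportion of `0.5`, half of the invalid barcodes join the pool, but the final draw of `d` barcodes may still contain fewer invalid barcodes than that, or none at all.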

!!! Barcode tag
Barcodes must be in the `BX:Z` SAM tag for both BAM and FASTQ inputs. See [Section 1 of the SAM Spec here](https://samtools.github.io/hts-specs/SAMtags.pdf).
!!!
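
To check that alignment records carry barcodes in the expected tag, something like the following can list the unique `BX:Z` values. It is a minimal sketch: `extract_barcodes` is a hypothetical helper that reads SAM-formatted text on standard input, and it relies only on the SAM convention that optional tags start at the 12th tab-separated column.

```bash
# Hypothetical helper: print the unique BX:Z barcode values found in SAM text.
extract_barcodes() {
    awk -F'\t' '
        /^@/ { next }                          # skip SAM header lines
        {
            for (i = 12; i <= NF; i++)         # optional tags start at column 12
                if ($i ~ /^BX:Z:/) {
                    print substr($i, 6)        # drop the 5-character "BX:Z:" prefix
                    break
                }
        }
    ' | sort -u
}
```

For a BAM file, the records can be piped in with `samtools view sample1.bam | extract_barcodes`. Records without a `BX:Z` tag are silently ignored.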

```bash usage
harpy downsample OPTIONS... INPUT(S)...
```

```bash example
# BAM file
harpy downsample -d 1000 -i 0.3 -p sample1.sub1000 sample1.bam

# FASTQ file
harpy downsample -d 1000 -i 0 -p sample1.sub1000 sample1.F.fq.gz sample1.R.fq.gz
```

## :icon-terminal: Running Options
In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="downsample"]
module is configured using the command-line arguments below.

{.compact}
| argument | short name | default | description |
| :-------------- | :--------: | :-----------: | :-------------------------------------------------------------------------------------------------------------------------------- |
| `INPUT(S)` | | | [!badge variant="info" text="required"] One BAM file or both read files from a paired-end FASTQ pair |
| `--downsample` | `-d` | | [!badge variant="info" text="required"] Number of barcodes to downsample to |
| `--invalid` | `-i` | `1` | Proportion of invalid barcodes to include in the sampling pool |
| `--prefix` | `-p` | `downsampled` | Prefix for output files |
| `--random-seed` | | | Random seed for sampling [!badge variant="secondary" text="optional"] |

----
## :icon-git-pull-request: Downsample Workflow
```mermaid
graph LR
subgraph fastq
R1([read 1]):::clean---R2([read 2]):::clean
end
subgraph bam
bamfile([bam]):::clean
end
fastq-->|bam conversion|bam
bam-->sub([extract and\n subsample barcodes]):::clean
sub-->exreads([extract reads]):::clean
bam-->exreads
fastq-->exreads
style fastq fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
style bam fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
```
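
The final "extract reads" step of the workflow can be approximated with plain `awk`. This is only a sketch of the idea, not the code Harpy runs: `extract_reads` is a hypothetical function that takes a file of sampled barcodes (one per line) and filters SAM text from standard input, keeping header lines and any record whose `BX:Z` value is in that list.

```bash
# Hypothetical sketch of the read-extraction step; NOT Harpy's implementation.
# extract_reads BARCODE_LIST  (SAM text is read from standard input)
extract_reads() {
    awk -F'\t' '
        FILENAME == ARGV[1] { keep[$1] = 1; next }   # first file: sampled barcodes
        /^@/                { print; next }          # pass SAM header lines through
        {
            for (i = 12; i <= NF; i++)               # optional tags start at column 12
                if ($i ~ /^BX:Z:/ && (substr($i, 6) in keep)) {
                    print
                    break
                }
        }
    ' "$1" -
}
```

Assuming a hypothetical barcode list named `sampled.txt`, a BAM file could be filtered with `samtools view -h sample1.bam | extract_reads sampled.txt | samtools view -b -o sample1.sub.bam -`.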
41 changes: 0 additions & 41 deletions Workflows/other.md
@@ -11,54 +11,13 @@ On this page you'll find Harpy functions that do other, ancillary things.
{.compact}
| module | description |
| :------------- | :------------------------------------------------------------------------------- |
| `downsample` | Downsample BAM or FASTQ files by barcode |
| `imputeparams` | Create a template imputation parameter file |
| `resume` | Continue a Harpy workflow from an existing output folder |
| `popgroup` | Create generic sample-group file using existing sample file names (fq.gz or bam) |
| `view` | View a workflow log, config, or snakefile |

---

### downsample
While downsampling (subsampling) FASTQ and BAM files is relatively simple with tools such as `awk`, `samtools`, `seqtk`, `seqkit`, etc.,
Harpy offers the `downsample` module, which allows you to downsample a BAM file (or paired-end FASTQ) _by barcodes_. That means you can
keep all the reads associated with `d` number of barcodes. First, barcodes are extracted, then subsampled, then the reads associated
with those barcodes are extracted. The `--invalid` proportion will determine what proportion of invalid barcodes appear in the barcode
pool that gets subsampled, where `0` is none, `1` is all invalid barcodes, and a number in between is that proportion, e.g. `0.5` is half.
Bear in mind that the barcode pool still gets subsampled, so the `--invalid` proportion doesn't necessarily reflect how many end up getting
sampled, rather what proportion will be considered for sampling.

!!! Barcode tag
Barcodes must be in the `BX:Z` SAM tag for both BAM and FASTQ inputs. See [Section 1 of the SAM Spec here](https://samtools.github.io/hts-specs/SAMtags.pdf).
!!!

```bash usage
harpy downsample OPTIONS... INPUT(S)...
```

```bash example
# BAM file
harpy downsample -d 1000 -i 0.3 -p sample1.sub1000 sample1.bam

# FASTQ file
harpy downsample -d 1000 -i 0 -p sample1.sub1000 sample1.F.fq.gz sample1.R.fq.gz
```

#### arguments
{.compact}
| argument | short name | default | description |
| :-------------- | :--------: | :-----------: | :-------------------------------------------------------------------------------------------------------------------------------- |
| `INPUT(S)` | | | [!badge variant="info" text="required"] One BAM file or both read files from a paired-end FASTQ pair |
| `--downsample` | `-d` | | [!badge variant="info" text="required"] Number of barcodes to downsample to |
| `--invalid` | `-i` | `1` | Proportion of invalid barcodes to include in the sampling pool |
| `--prefix` | `-p` | `downsampled` | Prefix for output files |
| `--random-seed` | | | Random seed for sampling [!badge variant="secondary" text="optional"] |
| `--snakemake` | | | Additional Snakemake arguments, in quotes |
| `--threads` | `-t` | `4` | Number of threads to use |
| `--quiet` | | | Don't show output text while running |

---

### imputeparams
Create a template parameter file for the [!badge corners="pill" text="impute"](/Workflows/impute.md) module.
The file is formatted correctly and serves as a starting point for using parameters that make sense for your study.