ALSO: Allele Segregation Pipeline

Composed of a small set of Python, Shell, and R scripts, the ALSO pipeline segregates next-generation sequencing alignments to alleles of origin based on alignment scores.

News and Updates

The three most recent updates are shown below; a complete list of updates can be found here.

[0.9.8] - 2022-10-13

changed

changelog format from @GangLiTarheel.

Installation

argparser = 0.7.1
bedtools >= 2.29.0
parallel >= 20200101
Picard >= 2.26.4
pigz >= 2.3
pysam >= 0.19 # exclude-reads-from-bam.py
R >= 4.0
Rsamtools = 2.8.0
samtools >= 1.13
scales >= 1.1.1
subread = 2.0.1
Tidyverse = 1.3.1
tqdm >= 4.64.0 # exclude-reads-from-bam.py

Workflow

For sci-ATAC-seq experiments part of the 4dn-mouse-cross project, the user needs to run the following steps prior to using the ALSO pipeline:

Demux. (Example Code 1)
sci-ATAC-seq analysis pipeline from the Shendure Lab. (Example Code 2)
Preprocess the bam. (Example Code 3)

Example for calling driver_allelic-segregation.sh

#  Call from 2021_kga0_4dn-mouse-cross or a directory containing
#+ functions-in-progress.sh, functions-preprocessing-HPC.sh,
#+ get-AS-per-qname.R, find-set-intersection-set-complement.sh,
#+ generate-assignment-lists.R, and filter-qnames-by-assignment.sh
bash ./bin/workflow/driver_allelic-segregation.sh \
-u FALSE \
-l TRUE \
-d TRUE \
-m "512m" \
-x "4096m" \
-r "mm10" \
-s "CAST" \
-1 "path/to/mm10/Disteche_sample_N.dedup.corrected.bam" \
-2 "path/to/CAST/Disteche_sample_N.dedup.corrected.bam" \
-p "Disteche_sample_1_downsampled_test-again" \
-o "./path/to/directory/for/results" \
-b 100000 \
-c 1000000 \
-t 0 \
-a TRUE \
-n 4

# Arguments:
# -h  print this help message and exit
# -u  use safe mode: TRUE or FALSE [logical; default: FALSE]
# -l  run on GS HPC: TRUE or FALSE [logical; default: FALSE]
# -d  run pipeline in ${TMPDIR}: TRUE or FALSE [logical; default:
#     TRUE]
# -m  initial memory allocation pool for JVM [chr; default: "512m"]
# -x  maximum memory allocation pool for JVM [chr; default: "4096m"]
# -r  string for "sample #1" [chr]
# -s  string for "sample #2" [chr]
# -1  bam infile #1, including path [chr]
# -2  bam infile #2, including path [chr]
# -p  prefix for outfiles [chr]
# -o  results directory for outfiles [chr]; path will be made if it
#     does not exist
# -b  number of records to read into memory at one time when running
#     the script for Part #1, get-AS-per-qname.R [int > 0; default:
#     100000]
# -c  number of records to read into memory at one time when running
#     the script for Part #3, generate-assignment-lists.R [int > 0;
#     default: 1000000]
# -t  alignment score threshold [int >= 0; default: 0]; the absolute
#     value of the difference in alignment scores between "sample
#     #1" and "sample #2" must be greater than this value in order
#     for a sample-specific assignment to be made; if not greater than
#     this value, then the assignment will be "ambiguous"
# -a  count lines: TRUE or FALSE [logical; default: TRUE]
# -n  step in pipeline to run up to [int 1-4; default: 4]

## 05.11
# run for all 22 samples
chmod 751 *.sh
Date="05-11"
for i in {1..22} ## run 22 samples in parallel
do
    echo "Running: "
    echo $i
    strain="mm10"
    Job_name="mm10_pre"$i
    qsub -l mfree=12G -m bea -M [email protected] -N $Job_name submit.sh $i ${strain} > results/${strain}-preprocessed/2022-${Date}-${strain}-pre-${i}.submit
    echo "Submitted."

    strain="CAST-EiJ"
    Job_name="CAST_pre"$i
    qsub -l mfree=12G -m bea -M [email protected] -N $Job_name submit.sh $i ${strain} > results/${strain}-preprocessed/2022-${Date}-${strain}-pre-${i}.submit
    echo "Submitted."
done

Test code for proprocessing (workflow/03-filter-qname.sh)

#  Call from 2021_kga0_4dn-mouse-cross
file="Disteche_sample_1.dedup.bam"
bash ./filter-qnames.sh \
-u FALSE \
-c FALSE \
-l TRUE \
-i ./"${strain}"/get_unique_fragments/"${bam_input}" \
-o "${outpath}" \
-f TRUE \
-r FALSE \
-p 4 \
> "${outpath}"preprocess_${strain}_${sample_id}.o \
2>"${outpath}"preprocess_${strain}_${sample_id}.e

# Arguments:
# -u is for "safe mode" (set -Eeuxo)
# -c is for whether using on the GS HPC or not (T or F)
# -i is for infile
# -o is for outpath
# -p is for number of cores for parallelization (for calls to samtools)

Example for calling 03-remove-duplicate-qnames.sh #TODO Update this...

#  Call from 2021_kga0_4dn-mouse-cross or a directory containing
#+ functions-in-progress.sh and functions-preprocessing-HPC.sh
bash ./bin/workflow/03-remove-duplicate-qnames.sh \
-u FALSE \
-c TRUE \
-m "512m" \
-x "4048m" \
-i "${dir_data}/${infile}" \
-o "${dir_data}" \
-n TRUE \
-t FALSE \
-e TRUE \
-r TRUE \
-p "${parallelize}" \
> "${dir_log}/rm-dup-qnames_${strain}_${ID}.o.txt" \
2> "${dir_log}/rm-dup-qnames_${strain}_${ID}.e.txt"

# ./bin/workflow/03-remove-duplicate-qnames.sh:
# Run pipeline to filter duplicate QNAMEs from bam file.
#   - Step 01: Copy files of interest to ${TMPDIR}
#   - Step 02: Sort bam by QNAME
#   - Step 03: List and tally QNAMEs in the sorted bam file
#   - Step 04: Create txt.gz outfiles for QNAME > 2
#   - Step 05: Count lines in infile, outfiles (optional)
#   - Step 06: Tally entries in infile, outfiles (optional)
#   - Step 07: Exclude problematic QNAME reads from bam infile
#   - Step 08: Sort corrected bam by QNAME (optional)
#   - Step 09: List and tally QNAMEs in the corrected bam file
#              (optional)
#   - Step 10: Create txt.gz outfiles for QNAME >, <, = 2 (optional)
#   - Step 11: Remove temporary bams, move ${TMPDIR} outfiles to
#              ${outpath}
#
#
# Dependencies:
#   - parallel >= 20200101
#   - picard >= 2.27.1
#   - samtools >= 1.13
#
#
# Arguments:
# -h print this help message and exit
# -u use safe mode: "TRUE" or "FALSE" (logical)
# -c run on GS HPC: "TRUE" or "FALSE" (logical)
# -m initial memory allocation pool for JVM (chr; default "512m")
# -x maximum memory allocation pool for JVM (chr; default "1g")
# -i bam infile, including path (chr)
# -o path for outfiles (chr); path will be made if it does not exist
# -n count lines: "TRUE" or "FALSE" (logical)
# -t tally entries: "TRUE" or "FALSE" (logical)
# -e evaluate corrected bam: "TRUE" or "FALSE" (logical)
# -r remove intermediate files: "TRUE" or "FALSE" (logical)
# -p number of cores for parallelization (int >= 1; default: 1)

This ALSO pipeline takes as input two paired parental bam files (one aligned to a "sample 1" reference genome and the other aligned to a "sample 2" reference genome) that have been sorted and subjected to duplicate removal; ALSO outputs 6 bam files:

strain-1 bam comprised of only "strain 1" assignments
strain-1 bam comprised of only "strain 2" assignments
strain-1 bam comprised of only "ambiguous" assignments
strain-2 bam comprised of only "strain 1" assignments
strain-2 bam comprised of only "strain 2" assignments
strain-2 bam comprised of only "ambiguous" assignments

Proposed changes to ALSO

Regarding setting -d TRUE in various ALSO scripts: instead of copying files into and running all steps in ${TMPDIR}, make it so that only certain programs make use of ${TMPDIR}, e.g., Picard, Samtools; steps will be run in and files will be written to -o "./path/to/directory/for/results"
~~Index bam outfiles while writing them out with Picard: --CREATE_INDEX <Boolean>~~
~~Set Picard FilterSamReads --MAX_RECORDS_IN_RAM <Integer>; but seems applicable to only sorting output, which we currently do not do...~~
Remove use of Picard FilterSamReads
For using exclude-reads-from-bam.py, reverse-sort QNAME lists and sort bams with picard SortSam SORT_ORDER=queryname

Name		Name	Last commit message	Last commit date
Latest commit History 238 Commits
bin		bin
docs		docs
.gitignore		.gitignore
ALSO.png		ALSO.png
README.md		README.md
README.rst		README.rst
changelog.md		changelog.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ALSO: Allele Segregation Pipeline

News and Updates

[0.9.8] - 2022-10-13

changed

Installation

Workflow

Proposed changes to ALSO

About

Releases

Packages

Contributors 4

Languages

Noble-Lab/ALSO

Folders and files

Latest commit

History

Repository files navigation

ALSO: Allele Segregation Pipeline

News and Updates

[0.9.8] - 2022-10-13

changed

Installation

Workflow

Proposed changes to ALSO

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages