
jacob-greene/sciCTextract

 
 


sciCTextract

Simple sciCUT&Tag demultiplexing.

Installation

Installation within a conda environment or virtualenv is recommended. In an active environment with Python 3 installed, clone the git repository and, from its top-level directory, install with pip:

pip install .

This will install dependencies such as Biopython and put the sciCTextract command on your path.

Input Files

Demultiplexing requires four input Fastq files following Illumina naming conventions, including read headers of the form:

@VH00319:342:AACKYJMM5:1:1101:31410:1000 1:N:0:0
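As an illustration of the header layout, the sketch below splits a header of this form into its named parts. It is not taken from sciCTextract; the function name `parse_illumina_header` and the `IlluminaHeader` type are hypothetical, and the field names follow the usual Illumina convention (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control number and index).

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    """Fields of an Illumina-style Fastq header (hypothetical helper type)."""
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: str
    control: str
    index: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split '@inst:run:flowcell:lane:tile:x:y read:filter:control:index'."""
    if not line.startswith("@"):
        raise ValueError(f"not a Fastq header: {line!r}")
    # The read name and the comment are separated by a single space.
    name, _, comment = line[1:].strip().partition(" ")
    name_fields = name.split(":")
    comment_fields = comment.split(":")
    if len(name_fields) != 7 or len(comment_fields) != 4:
        raise ValueError(f"unexpected header layout: {line!r}")
    instrument, run, flowcell, lane, tile, x, y = name_fields
    read, filtered, control, index = comment_fields
    return IlluminaHeader(instrument, run, flowcell, int(lane), int(tile),
                          int(x), int(y), int(read), filtered, control, index)
```

A header that does not have seven colon-separated name fields and four comment fields raises `ValueError`, which makes it easy to spot files that were renamed or re-exported by another tool.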

On multi-lane flowcells, reads should not be split by lane unless lanes need to be processed independently (for example, to support XP workflows). A demultiplexing run requires exactly four Fastq files: paired sequence reads (_R1 and _R2) and paired index reads (_I1 and _I2).
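A quick pre-flight check can confirm that all four files are present before starting a run. The sketch below is only an illustration and not part of sciCTextract: the function `assign_roles` and the file-name pattern are assumptions based on Illumina-style names such as `Undetermined_S0_R1_001.fastq.gz`.

```python
import re
from pathlib import Path

# Assumed naming: ..._R1_001.fastq.gz, ..._I2_001.fastq.gz, and so on.
ROLE_PATTERN = re.compile(r"_(R1|R2|I1|I2)(?:_\d+)?\.f(?:ast)?q(?:\.gz)?$")
REQUIRED_ROLES = {"R1", "R2", "I1", "I2"}


def assign_roles(paths):
    """Map each of the four required roles to exactly one file path."""
    roles = {}
    for path in paths:
        match = ROLE_PATTERN.search(Path(path).name)
        if match is None:
            raise ValueError(f"cannot determine read role of {path!r}")
        role = match.group(1)
        if role in roles:
            raise ValueError(f"more than one {role} file given")
        roles[role] = path
    missing = REQUIRED_ROLES - roles.keys()
    if missing:
        raise ValueError(f"missing Fastq files for: {sorted(missing)}")
    return roles
```

Passing lane-split files (two `_R1` files, for instance) triggers the duplicate-role error, which matches the recommendation above to keep lanes together.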

The process also requires two barcode tables in comma-separated value (CSV) format: one for the Tn5 barcodes, which define the prefixes of the sample names, and one for the Primer barcodes, which define the suffixes.

The Tn5 barcode table should have this form at minimum:

Sample Name | Tn5_s7  | Tn5_s7_seq | Tn5_s5  | Tn5_s5_seq
Hs_H3K27ac  | P7_i7_1 | ATTACTCG   | P5_i5_1 | TATAGCCT
Hs_H3K27ac  | P7_i7_1 | ATTACTCG   | P5_i5_2 | ATAGAGGC

The Primer barcode table should have this form at minimum:

i7_index_seq | i5_index_seq | i7_index_id | i5_index_id | ID
GGACTCCT     | TAGATCGC     | P7_5        | P5_1        | 10pM
GGACTCCT     | CTCTCTAT     | P7_5        | P5_2        | 10pM

Note that barcodes should always be specified in forward-strand ("Workflow A") orientation. This allows the same barcode tables to be used with different types of Illumina instruments. All barcodes are currently required to be exactly 8 nt long.
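Because every barcode must be 8 nt, it can be worth checking a table before launching a run. The following is a minimal sketch using only the standard library, not the loader used by sciCTextract; the function name `load_barcode_table` is hypothetical, and the column names passed to it are the ones shown in the tables above.

```python
import csv
import io

VALID_BASES = set("ACGT")
BARCODE_LENGTH = 8  # stated requirement: all barcodes are 8 nt


def load_barcode_table(handle, seq_columns):
    """Read a CSV barcode table and check every sequence column."""
    reader = csv.DictReader(handle)
    missing = [c for c in seq_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for col in seq_columns:
            seq = row[col].strip().upper()
            if len(seq) != BARCODE_LENGTH or not set(seq) <= VALID_BASES:
                raise ValueError(
                    f"line {line_no}, column {col}: {seq!r} is not an "
                    f"{BARCODE_LENGTH} nt ACGT barcode"
                )
            row[col] = seq
        rows.append(row)
    return rows


# Example with the Tn5 table from above, written as CSV text.
tn5_csv = (
    "Sample Name,Tn5_s7,Tn5_s7_seq,Tn5_s5,Tn5_s5_seq\n"
    "Hs_H3K27ac,P7_i7_1,ATTACTCG,P5_i5_1,TATAGCCT\n"
    "Hs_H3K27ac,P7_i7_1,ATTACTCG,P5_i5_2,ATAGAGGC\n"
)
tn5_rows = load_barcode_table(io.StringIO(tn5_csv), ["Tn5_s7_seq", "Tn5_s5_seq"])
```

The same function can be applied to the Primer table by passing `["i7_index_seq", "i5_index_seq"]` as the sequence columns.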

Running

With four Fastq files in hand and two barcode tables defined, create an output directory (e.g., mkdir fastq_out) and launch demultiplexing, for example:

sciCTextract \
    --outdir fastq_out \
    --Tn5_Barcode Tn5_Barcode_Annotation.csv \
    --Primer_Barcode Primer_Barcode_Annotation.csv \
    Undetermined_S0_R1_001.fastq.gz \
    Undetermined_S0_R2_001.fastq.gz \
    Undetermined_S0_I1_001.fastq.gz \
    Undetermined_S0_I2_001.fastq.gz

Note that our current typical use is to run on the Illumina NextSeq 2000. Default settings should work for the NextSeq 1000/2000 and the NovaSeq 6000 (v1.5 or more recent). For instruments that use forward-strand workflows (MiSeq, HiSeq, MiniSeq Rapid, etc.), we provide the --forward-mode option, which overrides the default reverse-complementing of the i5 barcode reads.
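To make the orientation issue concrete, the sketch below shows how an i5 index read can be brought back to the forward-strand orientation used in the barcode tables. It illustrates the idea only and is not the tool's internal code: the helper names `reverse_complement` and `orient_i5` are hypothetical, and the `forward_mode` parameter merely mirrors the meaning of the --forward-mode option.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orient_i5(read_seq: str, forward_mode: bool = False) -> str:
    """Return the i5 index read in forward-strand (table) orientation.

    On reverse-complement workflows the i5 read is the reverse complement
    of the forward-strand barcode, so it is flipped back by default.
    With forward_mode=True the read is used as sequenced.
    """
    return read_seq if forward_mode else reverse_complement(read_seq)
```

For example, an i5 read of `GCGATCTA` from a reverse-complement instrument corresponds to the table entry `TAGATCGC`, while on a forward-strand instrument the read would already be `TAGATCGC`.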

Output

Output consists of one pair of gzip-compressed Fastq files per sample. The read headers are rewritten to include the error-corrected barcode sequences and to be compact while retaining enough information to unambiguously identify each source read. For example:

@HMH53BCX3:1:1105:11433:2512_GCGTTAAA_GTGTATCG_AGCGATAG_CAGGACGT 1:N:0:0
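Downstream scripts may want to recover the barcodes from such a header. The sketch below assumes the layout seen in the example: a colon-separated read ID, followed by four underscore-separated 8 nt barcodes, then the usual comment field. The function name `split_output_header` is hypothetical, and because the order of the four barcodes is not stated here, they are returned unlabelled.

```python
def split_output_header(line: str):
    """Split a rewritten header into (read_id, barcodes, comment)."""
    if not line.startswith("@"):
        raise ValueError(f"not a Fastq header: {line!r}")
    name, _, comment = line[1:].strip().partition(" ")
    # Barcodes are appended to the read ID with underscores.
    read_id, *barcodes = name.split("_")
    if len(barcodes) != 4 or any(len(b) != 8 for b in barcodes):
        raise ValueError(f"expected four 8 nt barcodes in {line!r}")
    return read_id, barcodes, comment


read_id, barcodes, comment = split_output_header(
    "@HMH53BCX3:1:1105:11433:2512_GCGTTAAA_GTGTATCG_AGCGATAG_CAGGACGT 1:N:0:0"
)
```

Here `read_id` holds the compact source-read identifier and `barcodes` holds the four error-corrected sequences in the order they appear in the header.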
