Skip to content

03.Data analysis

Jinyang Zhang edited this page Jul 18, 2023 · 1 revision

Step 1. Re-Basecall reads using hac/sup basecalling model

# e.g.
sudo su - minknow
/opt/ont/guppy/bin/guppy_basecall_client \
    -r --input_path path_to_input --save_path path_to_output \
    -c dna_r9.4.1_450bps_hac.cfg \
    --port ///tmp/.guppy/5555 \
    --barcode_kits "EXP-NBD114" \
    --compress_fastq

/opt/ont/guppy/bin/guppy_barcoder \
    -i path_to_output/pass \
    -s path_to_output/barcoded \
    --compress_fastq \
    --disable_pings

Step 2. Demultiplex reads acording to channel number

For instance, if you want to demultiplex reads from channel 1 to 256

python3 scripts/step1_demultiplex.py -i input.fastq.gz -o sample1.fastq.gz --start 1 --end 256
Usage: step1_demultiplex.py [OPTIONS]

Options:
  -i, --infile TEXT      input gzipped fastq file.  [required]
  -o, --outfile TEXT     output gzipped fastq file.  [required]
  -st, --start INTEGER   start channel number.  [required]
  -en, --end INTEGER     end channel number.  [required]
  -t, --threads INTEGER  number of threads.

Step 3. Adapter trimming

Trim nanopore sequencing adapters using Porechop

porechop -i sample1.fastq.gz -o sample1.trimmed.fastq.gz --threads 32 --check_reads 1000

Step 4. Consensus calling & orientation

Generate full-length consensus reads and non full-length reads

python3 scripts/step2_consensus.py \
    -i sample1.trimmed.fastq.gz \
    -s sequencing_summary_FAQ85160_399ee876.txt \
    -o ./output_sample1 \
    -p sample1 \
    -t 16 \
    --trimA
Usage: step2_consensus.py [OPTIONS]

Options:
  -i, --input PATH       input trimmed fastq.  [required]
  -s, --summary PATH     input sequencing summary generate by MinKNOW.  [required]
  -o, --outdir PATH      output directory.  [required]
  -p, --prefix TEXT      output prefix name.  [required]
  -r, --adapter PATH     Adapter sequences file. Defaults to embedded splint adapter sequences.
  -t, --threads INTEGER  number of threads. Defaults to number of cpu cores.
  --trimA                trim 3' poly(A) sequences
  • The output sample1.fl.fa is the full-length consensus reads with both 5' and 3' primers.
  • The output sample1.recovered.fa is the partial fragents that only have one 5'/3' primer.

Step 5. Isoform assembly & quantification

Use full-length reads for full-length isoform assembly and quantification

python step3_analysis.py \
    -i ./output_sample1 \
    -p sample1 \
    -r GRCh38.primary_assembly.genome.fa \
    -a gencode.v37.annotation.gtf \
    -t 16 \
    --assemble \
    --bed ../cancer_panel.bed
Usage: step3_analysis.py [OPTIONS]

Options:
  -i, --workspace PATH   directory of step2_consensus.py output  [required]
  -p, --prefix TEXT      sample prefix for step2_consensus.py  [required]
  -r, --genome PATH      reference genome fasta.  [required]
  -a, --gtf PATH         gene annotation gtf.  [required]
  -b, --bed PATH         bed file for target regions.  [required]
  -t, --threads INTEGER  number of threads. Defaults to number of cpu cores.
  --assemble             perform transcript isoform assemble.
  --help                 Show this message and exit.
  • The output sample1_isoforms.gtf is the assembled and annotated full-length transcript isoforms in GTF format (requires --assemble).
  • The output sample1_isoforms.transcripts.sf is transcript-level quantification results in the Salmon tsv format. The output contains five columns: transcript name, transcript length, effective length, CPM, number of reads.
  • The output sample1_isoforms.genes.sf is gene-level quantification results. The output contains three columns: gene name, CPM, number of reads.