The1stMartian/FastqAnalysisPipeline

The fastqAnalysis script is a simple sequencing (Seq) data analysis pipeline written in Python 3 for Linux. It maps fastq files to a user-specified genome using Bowtie2 and outputs both sorted .bam files and normalized wiggle (.wig) files that can be visualized in MochiView. It also creates a read-counts file based on the features listed in the accompanying .saf file. Read-count data is particularly useful for ChIP-Seq and RNA-Seq analyses.
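
As a rough illustration of the mapping step described above, the sketch below builds the command lines for one paired-end sample. It is not taken from the fastqAnalysis script: the function name, the output file names, and the use of samtools for sorting are assumptions. Only the Bowtie2 options `-x`, `-1`, `-2`, `-S` and `samtools sort -o` are standard flags of those tools.

```python
from pathlib import Path


def build_mapping_commands(sample, genome, fastq_dir="fastq", genome_dir="genomefiles"):
    """Return (align, sort) command lists for one paired-end sample.

    Hypothetical helper: the real pipeline may use different options
    and different output names.
    """
    r1 = Path(fastq_dir) / f"{sample}_R1.fq"
    r2 = Path(fastq_dir) / f"{sample}_R2.fq"
    # Bowtie2 takes the index *prefix* chosen during bowtie2-build, not a file.
    index_prefix = Path(genome_dir) / genome
    sam = f"{sample}.sam"          # assumed intermediate file name
    bam = f"{sample}.sorted.bam"   # assumed output file name
    align = ["bowtie2", "-x", str(index_prefix),
             "-1", str(r1), "-2", str(r2), "-S", sam]
    sort = ["samtools", "sort", "-o", bam, sam]
    return align, sort
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once Bowtie2 and samtools are on the PATH.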

Pre-requisites:

Pipeline:

  1. Normalize all fastq file names. The script expects paired-end reads named in this format:

    sampleName1_R1.fq
    sampleName1_R2.fq
    sampleName2_R1.fq
    sampleName2_R2.fq

    Place all files in a folder named "fastq" in the same directory as the fastqAnalysis.py script.
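
Before running the pipeline, it can help to confirm that every sample has both mates and that nothing in the folder is misnamed. The following is a minimal sketch of such a check, assuming the `sampleName_R1.fq` / `sampleName_R2.fq` convention shown above. The function name and return layout are illustrative and not part of the pipeline.

```python
import re
from collections import defaultdict

# Matches names such as "sampleName1_R1.fq" and captures sample and mate number.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fq$")


def group_fastq_pairs(filenames):
    """Return (complete, incomplete, unmatched) for a list of file names.

    complete   -> samples with both _R1 and _R2 files
    incomplete -> samples missing one mate
    unmatched  -> file names that do not follow the naming convention
    """
    mates = defaultdict(set)
    unmatched = []
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            mates[match.group("sample")].add(match.group("mate"))
        else:
            unmatched.append(name)
    complete = sorted(s for s, found in mates.items() if found == {"1", "2"})
    incomplete = sorted(s for s, found in mates.items() if found != {"1", "2"})
    return complete, incomplete, unmatched
```

A typical call would be `group_fastq_pairs(os.listdir("fastq"))`, run from the directory that contains the script.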

  2. If needed, create a genome for your model organism:

    • On the command line (Linux or a Linux subsystem), run: bowtie2-build <fasta_file_path> <chosen_genome_name>
    • Place the resulting output files into the folder named "genomefiles"
    • Create a .saf file (feature coordinates/strand) for your genome by opening an existing .saf file and saving it as <chosen_genome_name>.saf
    • To input features:
      - Download a features file from NCBI genome (example using bacteria)
      https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/
      - Enter your species name and click enter
      - Click "prokaryotes", NOT the species name link
      - Find your species and under the Accession tab, click the link formatted like "GCA_ ..."
      - Click the link for the FTP Directory for Genbank on the right
      - Click the link that ends with "_feature_table.txt.gz"
      - After downloading, unzip the file and open it in Excel
      - Copy/Paste the coordinates, systematic gene name, and strand information columns into the .saf file.
      - Make sure the new .saf file is in the genomefiles folder
    • Create .mochi formatted gene coordinates files for later use:
      - Copy an existing .mochi file as <chosen_genome_name>.mochi
      - Copy/Paste feature information from your features file into the .mochi file and save.
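
The copy/paste step for the .saf file can also be scripted. The sketch below rests on two assumptions that should be checked against your own files: that the unzipped feature table is tab-separated with the column headers `# feature`, `genomic_accession`, `start`, `end`, `strand`, and `locus_tag`, and that the pipeline's .saf files use the common five-column layout (GeneID, Chr, Start, End, Strand). Compare the output with an existing .saf file in the genomefiles folder before relying on it. The function names are illustrative.

```python
import csv
import io


def feature_table_to_saf(feature_table_text, feature_type="gene"):
    """Convert feature-table text into SAF rows, header row first.

    Assumes the column names listed in the text above; adjust them
    if your feature table differs.
    """
    reader = csv.DictReader(io.StringIO(feature_table_text),
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [("GeneID", "Chr", "Start", "End", "Strand")]
    for record in reader:
        # Keep one feature type so each locus is not listed twice (gene + CDS).
        if record["# feature"] != feature_type:
            continue
        rows.append((record["locus_tag"], record["genomic_accession"],
                     record["start"], record["end"], record["strand"]))
    return rows


def write_saf(rows, path):
    """Write the rows as a tab-separated .saf file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

For example, `write_saf(feature_table_to_saf(open("feature_table.txt").read()), "genomefiles/<chosen_genome_name>.saf")`, where `feature_table.txt` stands for the unzipped download.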
  3. Execute the mapping program: python fastqAnalysis_v8.py

  4. Set up MochiView:

  • Download MochiView from the Johnson Lab (UCSF) MochiView website
  • Ensure you have an updated version of Java installed (Windows or Linux)
  • Instructions for launching the app can be found here

Set up your genome:

  • Import your .fasta file:

  • Import gene coordinate files

  • Import your normalized .wig files

  • Plot the data (Click "New Plot")

  • Select feature annotations to be shown

  • Select the data set(s) you want to visualize and pick line or column displays (and color)

  • Plot the data

  • Results:

Troubleshooting:

  • One common error is that the genome name in the .fasta file doesn't match the name in the .wig file headers.

  • To see if this is causing a problem, look at the fasta file header using either "head -n 1 <fasta_file_path>" or simply by opening the file in Notepad. The file is large, so Notepad can be slow.

  • Write down the name in the first line. For example, for JH642 the first line is >NZ_CP007800, so the systematic name is "NZ_CP007800".

  • Now open the .wig files and look at the header. The name should match. For JH642 it looks like:

    track variableStep chrom=NZ_CP007800 span=1

    If your genome's systematic name doesn't appear after "chrom=", replace whatever is there with the proper name. No spaces!

  • Now try re-importing the .wig file.
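
The header comparison above can be automated. The following is a small sketch, assuming the FASTA name is the first word after ">" and that the wiggle header carries a `chrom=` field as in the JH642 example. The function names are illustrative and not part of the pipeline.

```python
import re


def fasta_sequence_name(header_line):
    """Return the systematic name: the first word after '>' in a FASTA header."""
    return header_line.strip().lstrip(">").split()[0]


def wig_chrom_name(track_line):
    """Return the value after 'chrom=' in a wiggle header line, or None."""
    match = re.search(r"chrom=(\S+)", track_line)
    return match.group(1) if match else None


def fix_wig_header(track_line, name):
    """Replace the chrom= value so that it matches the FASTA name.

    Lines without a chrom= field are returned unchanged.
    """
    return re.sub(r"chrom=\S*", lambda _: f"chrom={name}", track_line, count=1)
```

If `wig_chrom_name(...)` differs from `fasta_sequence_name(...)`, rewrite the header line with `fix_wig_header` and save the .wig file before re-importing.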

Notes on wiggle file normalization:

The normalization script allows all wiggle files to be compared directly by normalizing to the total number of reads. Without normalization, slight differences in the number of reads per library would throw off the scale. For example, if the same library were sequenced on two occasions, the resulting fastq files would likely contain different numbers of reads, so one library would simply show higher values than the other in visualization software. To account for differences in depth, the normalization script divides the value (read depth) at every nucleotide position by the sum of all values, then multiplies the result (which is typically very small) by 10^6 to produce values that are convenient to work with, typically greater than 1. The multiplier can be adjusted as needed by modifying the normalization script.
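
The arithmetic described above can be sketched in a few lines. This is an illustration of the calculation only, not the pipeline's normalization script; the function name and the list-based input are assumptions.

```python
def normalize_depths(depths, multiplier=1_000_000):
    """Scale per-position read depths by the sum of all depths.

    Each value is divided by the total and multiplied by `multiplier`
    (10^6 by default), so libraries of different depth share one scale.
    """
    total = sum(depths)
    if total == 0:
        raise ValueError("cannot normalize: all depth values are zero")
    return [value / total * multiplier for value in depths]
```

With this scaling, a library sequenced twice as deeply gives the same normalized profile: `normalize_depths([1, 3])` and `normalize_depths([2, 6])` both return `[250000.0, 750000.0]`.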
