04_GENE_PREDICTION.md

Predicting ORFs from contigs generated by metagenome assembly

  • Start by launching a VM, either as a droplet in Digital Ocean or locally via boot2docker, and ssh into it.
  • Pull the docker image bwawrik/bioinformatics:latest
docker pull bwawrik/bioinformatics:latest
  • Make a data directory
mkdir /data
  • Start the docker container and mount /data
docker run -t -i -v /data:/data bwawrik/bioinformatics:latest
  • Change your directory to /data
cd /data
  • Download the sample data and unzip it. The file contains some of the contigs that were generated from a metagenome dataset. The complete assembly FASTA file is much larger, so only a subsample of contigs is included here to illustrate the procedure.
wget https://github.com/bwawrik/MBIO5810/raw/master/sequence_data/pipeline_mg_contigs.fas.gz
gunzip *.gz
  • Now make an output directory
mkdir /data/output
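
Before predicting genes, it can help to confirm that the download unpacked into a usable FASTA file. The following is a minimal sketch, assuming the file name used above and that you are in /data; the helper name count_fasta_records is made up for this example and is not part of the docker image.

```shell
# Count the FASTA records (header lines starting with ">") in a file.
# count_fasta_records is a hypothetical helper name used only in this sketch.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Only run the check if the tutorial file is present in the current directory.
if [ -f pipeline_mg_contigs.fas ]; then
    echo "contigs: $(count_fasta_records pipeline_mg_contigs.fas)"
fi
```

A count of zero would suggest the download or the gunzip step did not work as expected.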

Prodigal

  • Predict ORFs as nucleotide (fna) and amino acid (faa) sequences
prodigal -d output/temp.orfs.fna -a output/temp.orfs.faa -i pipeline_mg_contigs.fas -m -o output/tempt.txt -p meta -q
cut -f1 -d " " output/temp.orfs.fna > output/prodigal.orfs.fna
cut -f1 -d " " output/temp.orfs.faa > output/prodigal.orfs.faa
rm -f output/temp*

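You can produce just one of the two files by passing only the '-d output/temp.orfs.fna' or the '-a output/temp.orfs.faa' flag. The cut commands keep only the first space-delimited field of each line, which trims the FASTA headers down to the sequence ID. The last command removes the temporary files.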
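
To see what the cut step does to a header, here is a small sketch run on a made-up record. The header below only imitates the general shape of a Prodigal header (an ID followed by space-separated annotations); the ID, the values, and the helper name strip_header_annotations are illustrative assumptions.

```shell
# Keep only the first space-delimited field of every line.
# Sequence lines contain no spaces, so they pass through unchanged.
# strip_header_annotations is a hypothetical helper name for this sketch.
strip_header_annotations() {
    cut -f1 -d " " "$@"
}

# A made-up ORF record; the annotation values are illustrative only.
printf '>contig_1_1 # 2 # 181 # 1 # ID=1_1;partial=10\nATGAAACGT\n' \
    | strip_header_annotations
# prints:
# >contig_1_1
# ATGAAACGT
```

Short, annotation-free headers tend to be easier to handle in downstream tools that treat the whole header line as the sequence name.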

FragGeneScan

  • First you need to copy the model files to the local directory. (This is a workaround; I'm not sure why it doesn't work without copying these files. Sorry!)
mkdir Ftrain
cp /opt/local/software/FragGeneScan1.19/train/* Ftrain
  • Now let's predict the ORFs
FragGene_Scan -s pipeline_mg_contigs.fas -o output/VigP03RayK31.FragGeneScan -w 1 -t complete
  • Clean up
rm -rf Ftrain
  • Run the N50.pl script on both results (see assembly tutorial).

Which one produces longer ORFs? Which produces more ORFs? Which is better, and why? What would be a better way to assess the quality of ORF calling?

wget https://github.com/bwawrik/MBIO5810/raw/master/perl_scripts/N50.pl
perl N50.pl output/VigP03RayK31.FragGeneScan.ffn
perl N50.pl output/prodigal.orfs.fna
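
As a rough cross-check on the questions above, the number of ORFs and their lengths can also be tallied directly with awk. This is an independent sketch, not a replacement for N50.pl; the helper name fasta_stats is made up, and the file names assume the outputs written by the commands earlier in this tutorial.

```shell
# Print three tab-separated values for a FASTA file:
# number of sequences, total bases, length of the longest sequence.
# fasta_stats is a hypothetical helper name used only in this sketch.
fasta_stats() {
    awk '
        /^>/ {
            # A new header closes the previous record, if any.
            if (len > 0) { total += len; if (len > max) max = len }
            n++; len = 0; next
        }
        { len += length($0) }
        END {
            # Close the final record.
            if (len > 0) { total += len; if (len > max) max = len }
            printf "%d\t%d\t%d\n", n, total, max
        }
    ' "$1"
}

# Report on whichever of the expected output files exist.
for f in output/prodigal.orfs.fna output/VigP03RayK31.FragGeneScan.ffn; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(fasta_stats "$f")"
    fi
done
```

Dividing the total by the sequence count gives the mean ORF length, which is one simple way to compare the two callers alongside N50.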

Retrieving your output

  • If you are using boot2docker or a local machine, there is no need for this step.
  • Log out of your VM or droplet.
  • Then use secure copy (scp) to retrieve your files to your local drive. In this example, I used a droplet with the IP 45.55.160.193 and retrieved the files to the desktop on my MacBook. Make sure you replace the IP with the one for your droplet, and "user" with your login name on the droplet.
scp user@45.55.160.193:/data/output/* ~/Desktop/
  • If you are using a PC, use an SFTP/SCP client program (for example WinSCP or FileZilla) to retrieve your files.
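
If the output directory holds many files, it may be more convenient to pack it into a single archive on the VM before logging out, and then copy that one file. A minimal sketch, assuming tar with gzip support is available on the VM; the helper name bundle_output and the archive name output.tar.gz are made up for this example.

```shell
# Pack a directory into a gzip-compressed tarball.
# bundle_output is a hypothetical helper: $1 = directory, $2 = archive path.
bundle_output() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# Only bundle if the tutorial output directory exists.
if [ -d /data/output ]; then
    bundle_output /data/output /data/output.tar.gz
fi
```

The archive can then be retrieved with the same scp command by pointing it at /data/output.tar.gz instead of /data/output/*.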