This document contains all of the code needed to generate the figures for the potato paper from the raw data. It can be executed with org-mode to generate the figures automatically. The following org-mode settings are required:
;;; use bash to evaluate shell code
(setq org-babel-sh-command "bash")
;;; export with CSS classes instead of explicit colours
;;; this uses the included CSS for syntax highlighting
;;; instead of your own emacs colours
(setq org-html-htmlize-output-type 'css)
(setq org-html-htmlize-font-prefix "org-")
;;; the same but with bootstrap export, requires `ox-twbs'
(setq org-twbs-htmlize-output-type 'css)
(setq org-twbs-htmlize-font-prefix "org-")
The code in this document will read input data from the data
directory, make
intermediate files in an output
directory, and put the final figures in a
figures
directory.
#!/bin/bash
mkdir -p output figures
These plots show the cumulative content of the assemblies as successively smaller contigs are added in. The table gives the names of the assemblies and the filenames we use for each. Much of the code in this section uses this table. It is read into the code blocks line by line into the name and filename variables.
name | filename |
---|---|
Supernova + BioNano | 10x-bn |
Supernova | 10x |
Discovar | discovar-contig |
Discovar + MP + DT + BioNano | discovar-mp-dt-bn |
Discovar + MP + Dovetail | discovar-mp-dt |
Discovar + MP + DT + PBJelly | discovar-mp-dt-jelly |
Discovar + MP + BioNano | discovar-mp-bn |
Discovar + MP | discovar-mp |
Falcon + BioNano | falcon-bn |
Falcon + DT + BioNano | falcon-dt-bn |
Falcon + Dovetail | falcon-dt |
Falcon | falcon |
First we calculate the lengths of the contigs/scaffolds in each assembly and sort in descending order of size.
#!/bin/bash
IFS=','
while read name filename; do
if [[ ! -s output/${filename}.length ]]; then
echo "Calculating lengths of ${filename}..."
gzip -cd data/${filename}.fasta.gz \
| awk '
/^>/ {
if (len) {
print len, name
}
split($0,s," ")
name=substr(s[1],2)
len=0
next
}
{
len += length($0)
}
END {
if (len) {
print len, name
}
}
' \
| sort -nr -k1,1 \
> output/${filename}.length
fi
done <<< "$table"
Now we calculate the data for the plot. The minimum length of contigs considered and the total length of all contigs of that size or bigger.
#!/bin/bash
IFS=','
while read name filename; do
if [[ ! -s output/${filename}.minlen-cumulative ]]; then
echo "Calculating cumulative lengths of ${filename}..."
awk '
BEGIN {
OFS="\t"
last = 0
sum = 0
}
{
if($1 != last && last != 0) {
print last, sum
}
last = $1
sum += $1
}
' \
output/${filename}.length \
> output/${filename}.minlen-cumulative
fi
done <<< "$table"
Now we prepare the table for ggplot by concatenating the minlen-cumulative
tables and putting the name of the assembly in the third column.
#!/bin/bash
IFS=','
while read name filename; do
awk -v OFS=',' -v fname="$filename" -v name="$name" \
'
{
print $1,$2,fname,name
}
' \
output/${filename}.minlen-cumulative
done <<< "$table" \
> output/all-assemblies.minlen-cumulative
Now we can plot these lines using ggplot:
Make PDF versions:
source(file="cumulative-content.R")
pdf(file="figures/bigcompplot.pdf", width=5, height=5)
print(bigcompplot)
dev.off()
pdf(file="figures/dvcompplot.pdf", width=5, height=5)
print(dvcompplot)
dev.off()
pdf(file="figures/pbcompplot.pdf", width=5, height=5)
print(pbcompplot)
dev.off()
pdf(file="figures/all.pdf", width=15, height=5)
grid.draw(cbind(ggplotGrob(bigcompplot),
ggplotGrob(dvcompplot),
ggplotGrob(pbcompplot),
size="last"))
dev.off()
#+RESULTS[f009dcf707ee2ca01361b8d13cb6cab67ede2e1a]:
#+RESULTS[aa1a6cf66be8279fe82bbabf45d2683693d7832b]:
#+RESULTS[7596061d4673f3c8a66b958e046b4692fe0a439d]:
For the KAT plots we have to count k-mers in the Discovar library reads and
the assemblies. The read files LIB12786_R1.fastq
and LIB12786_R2.fastq
need to be in the data
directory to run KAT as below. This generates three
matrix files with filenames ending in -main.mx
. KAT takes a long time and a
lot of memory to run so we provide these matrix files.
for asm in discovar-contig falcon-pilon 10x; do
kat comp -t 8 -o data/kat-comp-${asm} \
'data/LIB12786_R?.fastq' \
data/${asm}.fasta
done
KAT comes with its own plotting tools using Python and matplotlib, but we use the ggplot version to match the other figures in the paper.
Make PDF versions:
commandArgs <- function() c("data/kat-comp-discovar-contig-main.mx",
"200", "0", "5", "PCR free", "Discovar")
source(file="plot-comp.R")
pdf(file="figures/kat-comp-discovar.pdf", width=5, height=4, onefile=TRUE)
print(p)
dev.off()
p1 <- p
commandArgs <- function() c("data/kat-comp-falcon-pilon-main.mx",
"200", "0", "5", "PCR free", "Falcon")
source(file="plot-comp.R")
pdf(file="figures/kat-comp-falcon.pdf", width=5, height=4, onefile=TRUE)
print(p)
dev.off()
p2 <- p
commandArgs <- function() c("data/kat-comp-10x-main.mx",
"200", "0", "5", "PCR free", "Supernova")
source(file="plot-comp.R")
pdf(file="figures/kat-comp-10x.pdf", width=5, height=4, onefile=TRUE)
print(p)
dev.off()
p3 <- p
p2 <- p2 + scale_y_continuous(labels=NULL) + ylab(NULL)
p3 <- p3 + scale_y_continuous(labels=NULL) + ylab(NULL)
pdf(file="figures/kat-comp-all.pdf", width=15, height=5)
grid.draw(cbind(ggplotGrob(p1),
ggplotGrob(p2),
ggplotGrob(p3),
size="last"))
dev.off()
The PNG output from R does not look very good for some reason.
#+RESULTS[32137840f717255b2f052518ecf515d519ff97e8]:
#+RESULTS[c9d7b7e74d0350603766dedd8e8e153404a12e7e]:
#+RESULTS[6fdd7bd42b7f97cbd98fc31141e539780593c990]:
For the BAC plot we use a PacBio assembly of a BAC (labelled number 22). We
have several tracks of data: three contig alignments, three read alignments,
and a GC content and homopolymer track. The PacBio assembly of BAC 22 is
contained in bac22.fasta
.
For the contig alignments we use bwa-mem.
name | filename |
---|---|
Supernova | 10x |
Discovar | discovar-contig |
Falcon | falcon |
#!/bin/bash
bwa index data/bac22.fasta
IFS=','
while read name filename; do
bwa mem data/bac22.fasta data/${filename}.fasta \
| samtools view -b - > data/pb-${filename}.bam
sambamba sort data/pb-${filename}.bam
done <<< "$table"
The sam-to-R.py
script takes multiple SAM input files and outputs a table
for plotting with ggplot. Optional parameters are -m
for minimum length of
the alignment and -i
for minimum percent identity, both of which are
calculated from the CIGAR string for each alignment in the SAM file. The first
positional parameter provides a short name for each of the following
alignments, which is embedded in the table, and then the alignment files
follow. We use process substitution so we can use BAM files, ensuring that the
SAM header is present by using the -h
flag for samtools view.
To generate the table for our three contig tracks we run:
#!/bin/bash
python sam-to-R.py \
-m 1000 -i 99.9 dv,fal,10x \
<(samtools view -h data/pb-discovar-contig.sorted.bam) \
<(samtools view -h data/pb-falcon.sorted.bam) \
<(samtools view -h data/pb-10x.sorted.bam) \
> output/pb-r-table
For the read alignment we use bowtie2. The ord
variable tells the aligner if
the reads are in forward-reverse or reverse-forward order.
name | filename | ord |
---|---|---|
Discovar | discovar | fr |
Mate pair | mp | rf |
Dovetail | dovetail | fr |
We align each set of reads and suppress the unaligned reads to keep the file size smaller. Then they are filtered for reads with a high mapping quality and an edit distance with respect to the reference of no more than 1.
#!/bin/bash
bowtie2-build data/bac22.fasta data/bac22
IFS=','
while read name filename ord; do
bowtie2 -p 8 --no-unal --${ord} -x data/bac22 \
-1 data/${filename}-R1.fastq -2 data/${filename}-R2.fastq \
| samtools view -b - > data/${filename}-aln.bam
sambamba view -f bam -F "mapping_quality >= 30 and [NM] <= 1" \
data/${filename}-aln.bam \
> data/${filename}-aln-acc.bam
done <<< "$table"
For the paired-end (Discovar) we simply generate the depth at each reference position.
#!/bin/bash
sambamba sort data/discovar-aln-acc.bam
samtools depth -a data/discovar-aln-acc.sorted.bam \
> output/discovar-aln-acc.cov
For the mate pair and Dovetail we calculate the coverage the aligned fragments. First we need to sort the alignments by name to bring the pairs together:
#!/bin/bash
for filename in mp dovetail; do
samtools sort -n -O bam -T data/${filename}-aln-acc.sorted.name.bam \
<(samtools view -bh -F 1804 -q 1 data/${filename}-aln-acc.bam) \
> data/${filename}-aln.sorted.name.bam
done
Now we use bedtools to make a bed
file with the physical fragments:
#!/bin/bash
for filename in mp dovetail; do
bamToBed -i data/${filename}-aln.sorted.name.bam -bedpe \
| cut -f 1,2,6 | sort -k1,1 > output/${filename}.phys.bed
done
And finally calculate the coverage at each reference position:
#!/bin/bash
for filename in mp dovetail; do
bedtools genomecov -d -i output/${filename}.phys.bed \
-g data/bac22.fasta.fai > output/${filename}.phys.cov
done
For the GC content we use bedtools. We make 100bp windows and calculate the GC content in the windows.
#!/bin/bash
bedtools makewindows -g data/bac22.fasta.fai -w 100 \
> output/bac22.100bps.bed
bedtools nuc -fi data/bac22.fasta -bed output/bac22.100bps.bed \
> output/bac22.gc.txt
awk -v w=100 -vOFS='\t' -vFS='\t' 'NR > 1 {print $1,($2+$3)/2,$5}' \
output/bac22.gc.txt \
> output/bac22.gc.100bps
The homopolymers are calculated using the following python script:
#!/bin/bash
python ~/code/tools/homopolymers.py data/bac22.fasta 4 \
> output/bac22.homopolymers.4
Finally, after generating the data we can plot it all using ggplot.
#!/bin/bash
./bac-graphs2.R \
bac22 \
figures/bac22-pacbio \
output/discovar-aln-acc.cov \
output/mp.phys.cov \
output/dovetail.phys.cov \
output/bac22.gc.100bps \
output/bac22.homopolymers.4 \
output/pb-r-table \
10x,fal,dv
name | filename |
---|---|
Supernova + BioNano | 10x-bn |
Discovar + MP + DT + BioNano | discovar-mp-dt-bn |
Falcon + DT + BioNano | falcon-dt-bn |
Tuberosum reference | tuberosum |
We align the transcripts of the S. tuberosum genome to our assemblies using BLAST version 2.2.31. The transcripts were retrieved from here: http://solanaceae.plantbiology.msu.edu/data/PGSC_DM_v3.4_transcript-update.fasta.zip
for asm in 10x-bn discovar-mp-dt-bn falcon-dt-bn; do
blastn -outfmt '6 qseqid qstart qend qlen sseqid sstart send slen pident length evalue' \
-evalue 1e-3 \
-subject ${asm}.fasta \
-query PGSC_DM_v3.4_transcript-update.fasta \
> ${asm}-transcripts.blastn
done
We can adjust the minimum percentage identity of BLAST alignments that we
consider and calculate the percent coverage of each transcript in the
alignment. We use the transcript-coverage.py
script to calculate the percent
coverage for a transcript:
This is run after filtering the alignments with awk and sorting by alignment start position:
IFS=','
while read name filename; do
for i in {85..100}; do
./make-coverages data/${filename}-transcripts.blastn ${i} output/
done
done <<EOF
${table}
EOF
Now we can count the number of transcripts which appear with various thresholds:
for filename in 10x-bn discovar-mp-dt-bn falcon-dt-bn tuberosum; do
for cov in $(seq 0.85 0.01 1.0); do
for i in {85..100}; do
num=$(awk -v c=${cov} '$4 >= c' output/${filename}-transcripts-i${i}.coverage | wc -l)
echo -ne "${filename}\t${cov}\t${i}\t${num}\n"
done
done
done > output/transcript-asm-cov-id-count-2
#!/usr/bin/env Rscript
library("ggplot2")
library("reshape2")
library("grid")
t <- read.table("../genes-table", header=TRUE)
tm <- melt(t, id.vars="asm", variable.name="type", value.name="genes")
p <- ggplot(tm, aes(x=asm, y=genes, fill=type)) +
scale_fill_discrete(name="Type",
labels=c("cegcom"="Cegma complete",
"cegpar"="Cegma partial",
"buscom"="Busco complete",
"buspar"="Busco partial",
"busmis"="Busco missing")) +
scale_x_discrete(name="Assembly") +
scale_y_continuous(name="Number of genes") +
geom_bar(stat="identity", position="dodge") +
geom_bar(stat="identity", position="dodge", colour="black", show_guide=FALSE) +
theme(legend.position="bottom", legend.key.size=unit(3, "mm"),
legend.text=element_text(size=8),legend.title=element_text(size=8),
plot.margin=unit(c(2,2,0,0), "mm")) +
guides(fill=guide_legend(nrow=2,byrow=TRUE))
pdf(file="genes-bar.pdf", width=4, height=4)
print(p)
dev.off()
We generate alignments between the S. tuberosum reference and our assemblies
using nucmer
. We are interesting in our three “large” hybrid assemblies.
name | filename |
---|---|
Supernova + BioNano | 10x-bn |
Discovar + MP + DT + BioNano | discovar-mp-dt-bn |
Falcon + DT + BioNano | falcon-dt-bn-2 |
The file falcon-dt-bn-2
is generated by replacing the pipe symbols in the
fasta file with underscores because some or all of mummer’s scripts do not
support the pipe symbol.
#!/bin/bash
gzip -cd data/falcon-dt-bn.fasta.gz \
| sed 's/|/_/g' > data/falcon-dt-bn-2.fasta
The following code requires mummer and the S. tuberosum pseudomolecule
assembly version 4.03 in the data
directory. It is available here:
http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml
#!/bin/bash
IFS=','
while read name filename; do
nucmer -c 65 -l 65 \
-p output/${filename} \
data/PGSC_DM_v4.03_pseudomolecules.fasta \
data/${filename}.fasta
done <<< "$table"
First we filter the delta to only include alignments of at least 90% identity and 10kb in length.
#!/bin/bash
IFS=','
while read name filename; do
if [[ ! -s output/${filename}-i${i}-l${l}.delta ]]; then
echo "Filtering delta..."
delta-filter -q -r -i ${i} -l ${l} output/${filename}.delta \
> output/${filename}-i${i}-l${l}.delta
fi
done <<< "$table"
Now we generate the coords files which we’ll need later to sort and orientate the scaffolds for each chromosome. By default mummer will plot every scaffold in the assembly even if we select only one chromosome of the reference, therefore we need to find which scaffolds actually have alignments with each reference sequence.
#!/bin/bash
IFS=','
while read name filename; do
if [[ ! -s output/${filename}-i${i}-l${l}.coords ]]; then
echo "Making coords..."
show-coords -THrcl output/${filename}-i${i}-l${l}.delta \
> output/${filename}-i${i}-l${l}.coords
fi
done <<< "$table"
Mummer will also require the length of each assembly scaffold that we wish to plot.
#!/bin/bash
IFS=','
while read name filename; do
if [[ ! -s output/${filename}.length ]]; then
echo "Calculating lengths of ${filename}..."
cat data/${filename}.fasta \
| awk '
/^>/ {
if (len) {
print len, name
}
split($0,s," ")
name=substr(s[1],2)
len=0
next
}
{
len += length($0)
}
END {
if (len) {
print len, name
}
}
' \
| sort -nr -k1,1 \
> output/${filename}.length
fi
done <<< "$table"
Finally we make a mummerplot for each assembly and each chromosome. We are
going to manually select which assembly scaffolds to display and also manually
order and orientate them. mummerplot
attempts to do the ordering and
orientation itself but doesn’t do a very good job. We want to order the
scaffolds such that most of the bases in the alignment appear on the
diagonal. We do this by sorting by the average base position in the alignment
of each scaffold. We want to orientate the scaffolds such that as much as
possible is “forwards” (more red than blue in the plots). We do this by
counting how much of the alignment length is forward and reverse and setting
the orientation accordingly.
#!/bin/bash
IFS=','
while read name filename; do
for chr in {01..12}; do
echo "Doing chr ${chr}..."
# find scaffold order and orientation
awk '$12 == "ST4.03ch'${chr}'" {print $1,$2,$3,$4,$13;}' \
output/${filename}-i${i}-l${l}.coords \
| sort -n -k1,1 \
| sort -s -k5 \
| awk '
BEGIN {
lastn=""; lastm=0; lastb=0; lastl=0;
}
{
n=$5; s=$1; e=$2; p=(s+e)/2; b=e-s; l=$4-$3;
if (lastn=="") {
lastn=n; lastm=p; lastb=b; lastl=l;
} else if (lastn!=n) {
print lastn,int(lastm),lastl
lastn=n; lastm=p; lastb=b; lastl=l;
} else {
lastm=(lastm*lastb+p*b)/(lastb+b); lastb+=b; lastl+=l
}
}
END {
print lastn,int(lastm),lastl
}
' \
| sort -k1,1 \
> \
output/${filename}-chr${chr}-contigs-i${i}-l${l}.scaf-pos-orn
join -1 1 -2 2 \
output/${filename}-chr${chr}-contigs-i${i}-l${l}.scaf-pos-orn \
<(sort -k2,2 output/${filename}.length) \
| sort -n -k2,2 \
| awk '
{if ($3>0) {
sign="+"
} else {
sign="-"
}
printf "%s\t%d\t%s\n", $1, $4, sign;}
' \
> output/${filename}-chr${chr}-contigs-i${i}-l${l}.mumlist
mummerplot -t png -r ST4.03ch${chr} \
-p output/mummerplot-${filename}-i${i}-l${l}-${chr} \
output/${filename}-i${i}-l${l}.delta \
-Q output/${filename}-chr${chr}-contigs-i${i}-l${l}.mumlist
mv output/mummerplot-${filename}-i${i}-l${l}-${chr}.png figures/
done
done <<< "$table"
For the rest of the paper we ggplot for the plots so we now convert the mummerplots to ggplot and plot them. The data is prepared with the following script:
Where mummerplot.R
is:
Convert all the figures to ggplot:
IFS=','
while read name filename; do
for chr in {01..12}; do
./mummer2ggplot \
output/mummerplot-${filename}-i${i}-l${l}-${chr} \
figures
done
done <<EOF
$table
EOF
Here is a quick preview of the figures.
IFS=','
while read name filename; do
montage figures/mummerplot-${filename}-i${i}-l${l}-??-gg.png \
-geometry 240x240+10+5 -tile 3x4 \
figures/mummerplot-${filename}-i${i}-l${l}-all-gg.png
echo "*** ${name}"
echo "[[file:figures/mummerplot-${filename}-i${i}-l${l}-all.png]]"
done <<< "$table"