-
Notifications
You must be signed in to change notification settings - Fork 136
Read extraction
Often when investigating an incomplete/suspicious/off part of an assembly, it is useful to gather up the long reads which align to that region. This page describes a method for doing just that.
First, you'll need to save the sequence of interest to a fasta file. If you're examining an assembly graph in Bandage, you can save selected contigs to fasta or save a path through the graph to fasta. If you're analysing the assembly with some other tool, it too hopefully has output-to-fasta options.
Next, use this one-liner monstrosity to save reads which align to the sequence of interest:
zgrep -A 1 --no-group-separator -E $(minimap2 -x map-ont -c -t 8 sequence.fasta long_reads.fastq.gz | awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' | sort | uniq | paste -s -d '|') long_reads.fastq.gz | sed 's|^@|>|' > reads.fasta
To use:
- Replace
long_reads.fastq.gz
with the actual path to your reads. - Replace
sequence.fasta
with the actual path to your sequence of interest. - Replace
2000
with the minimum allowed read length. - Replace
1000
with the minimum allowed alignment length. - Replace
0.85
with the minimum allowed alignment identity.
Explanation:
-
minimap2 -x map10k -c -t 8 sequence.fasta long_reads.fastq.gz | awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' | sort | uniq | paste -s -d '|'
produces a pipe-delimited list of string names which is used as a grep query.-
minimap2 -x map10k -c -t 8 sequence.fasta long_reads.fastq.gz
is the actual alignment command.-
-x map10k
is the minimap2 preset for Nanopore-read-to-reference alignment. -
-c
tells minimap2 to do a base-by-base alignment, which allows us to get the alignment identity. -
-t 8
is the thread count, but minimap2 is pretty fast so this doesn't matter too much.
-
-
awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}'
gives the read name if the alignment is good enough.- A description of the columns in minimap2's PAF output can be found here.
-
$2 >= 2000
restricts output to when the read length is long enough. -
$11 >= 1000
restricts output to when the alignment length is long enough. -
$10/$11 >= 0.85
restricts output to when the alignment identity is good enough. - Adjust these thresholds as appropriate. E.g. if you have very deep long reads, you can get away with more stringent thresholds.
-
| sort | uniq
ensures we don't save the same read multiple times. -
| paste -s -d '|'
replaces newlines with pipes, so the result is ready to pass to grep.
-
-
zgrep -A 1 --no-group-separator -E $(...) long_reads.fastq.gz | sed 's|^@|>|'
gets the header and sequences lines for relevant reads out of the file of full reads.-
zgrep
works on gzipped files - change to justgrep
if your long reads aren't gzipped. -
-A 1
makes grep return both the matching line (header) and the following line (sequence). -
--no-group-separator
turns off the--
lines between matches so we can more easily turn the result into a fasta file. -
-E
uses extended regex which lets us have multiple queries separated by pipes. -
$(...)
is the output of the minimap/sort/awk/paste command which produces the grep queries. -
sed 's|^@|>|
replaces@
with>
so the result is fasta format.
-
-
reads.fasta
is made by this command and will contain the relevant long reads.
I struggled to make that command work on my Mac (some of the command line tools are frustratingly different). Here's an alternative that works on my Mac:
gunzip -c long_reads.fastq.gz | grep -A 1 --no-group-separator -E $(minimap2 -x map10k -c -t 8 sequence.fasta long_reads.fastq.gz | awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' | sort | uniq | tr '\n' '|' | sed 's/.$//') | sed 's|^@|>|' > reads.fasta
Note that you might need to brew install grep
to get the GNU version of grep instead of the BSD one Macs come with.
Sometimes, after extracting the reads, I like to sort them by their order on the reference/query sequence. This command does that:
minimap2 -a -x map10k -t 4 assembly.fasta reads.fasta | samtools view -F 258 -F 2048 | sort -u -k1,1 | sort -k3,3 -k4,4n | awk '{print ">"$1"\n"$10}' > reads_sorted.fasta
This command also flips reverse-strand reads to the forward strand, making reads on the same strand. This can make the reads a bit easier to work with, especially if you're going to be viewing their alignments in Bandage.
Explanation:
-
minimap2 -a -x map10k -t 4 assembly.fasta reads.fasta
aligns the reads to the sequence of interest, this time outputting in SAM format. -
samtools view -F 258 -F 2048
gets rid of non-primary alignments -
sort -u -k1,1
makes sure we have just one line per read -
sort -k3,3 -k4,4n
sorts the SAM by reference and position -
awk '{print ">"$1"\n"$10}'
converts the SAM to FASTA