Motivations
When we first started looking at structural variants in April 2016, there was no end-to-end structural variant detection pipeline available for Nanopore long reads.
We began our analysis with existing tools such as bwa-mem with lumpy and soon found that, although they work, they are understandably suboptimal in a number of ways. For instance, in the world of short reads, the chance that a read spans a complete tandem duplication is almost negligible. Thus, most structural variant identification software fails to identify such a tandem duplication even when such an alignment is provided. Most critically, whether a structural variant can be identified hinges on the alignment provided to the structural variant caller. Most aligners assume a low sequencing error rate and may not provide an optimal alignment. We therefore decided to explore tools in two domains, namely long-read alignment and structural variant calling.
As good alignment is critical to one's ability to call structural variants, we explored bwa-mem, lastal, damapper, graphmap, NextGenMap-LR, and blast for their ability to align long reads containing structural variants (tandem duplication, inversion, deletion, and insertion). It turns out that lastal, blastn, and damapper have very similar alignment performance (though not similar computational time). Bwa-mem comes close, but with noticeable differences in sensitivity. Graphmap tends to force-fit a sort of "global" alignment. NextGenMap-LR is promising, but was not usable at the time, as it focused on PacBio long reads and was unstable when working on ONT reads. This is by no means a thorough comparison, but it served to make a practical choice of aligner. In the end, we decided to go with lastal, as it has been used in the genome-to-genome alignment context and thus addresses both the sensitivity and computational time concerns.
As we explored the various structural variant callers, it became clear that most of the tools focus on, or perform very well on, only a subset of structural variants. There are also short-read-specific optimizations that prevent them from working correctly in long-read scenarios. Reporting in .vcf format is a good standardization, but it makes it hard for biologists to track a structural variant at the conceptual level when it is presented mostly at the breakpoint level. I have no doubt the community will continue to improve the standard to take care of more scenarios. For now, we report in a tab-delimited text file format for convenience and will work towards the .vcf format in the foreseeable future.