diff --git a/README.md b/README.md index 761fc6f6..266e5917 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,12 @@ strobealign Strobealign is a fast short-read aligner. It achieves the speedup by using a dynamic seed size obtained from syncmer-thinned strobemers. Strobealign is multithreaded, implements alignment (SAM) and mapping (PAF), and benchmarked for SE and PE reads of lengths between 100-300bp. A preprint describing **v0.4** is available [here](https://doi.org/10.1101/2021.06.18.449070). -**Current version is 0.6.1**. See the performance of v0.6.1 [here](https://github.com/ksahlin/StrobeAlign#v061-performance). +**Current version is 0.7**. See the performance of v0.7 [here](https://github.com/ksahlin/StrobeAlign#v061-performance). + +v0.7 implements: +1. New parallelization (scrapping OpenMP) with better fileIO and CPU usage makes it substaintially faster with many cores. +2. Remove parameter -r automatic inferece of read length +3. Bugfixes. v0.6.1 implements: 1. Runtime bugfix introduced in v0.6. @@ -33,7 +38,7 @@ If you want to compile from the source, you need to have a newer `g++` and [zlib git clone https://github.com/ksahlin/StrobeAlign cd StrobeAlign # Needs a newer g++ version. Tested with version 8 and upwards. -g++ -std=c++14 main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2 +g++ -std=c++14 main.cpp source/index.cpp source/ksw2_extz2_sse.c source/xxhash.c source/ssw_cpp.cpp source/ssw.c source/pc.cpp source/aln.cpp -lz -lpthread -o strobealign -O3 -mavx2 ``` ### Zlib linking error @@ -50,7 +55,7 @@ compilation terminated. add `-I/path/to/zlib/include -L/path/to/zlib/lib` to the compilation, that is ``` -g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2 +g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp source/index.cpp source/ksw2_extz2_sse.c source/xxhash.c source/ssw_cpp.cpp source/ssw.c source/pc.cpp source/aln.cpp -lz -lpthread -o strobealign -O3 -mavx2 ``` @@ -59,12 +64,10 @@ USAGE ### Alignment -Strobealign comes with a parameter `-r read_length` that sets suitable seed parameters for the rough read length. Specifically, it sets parameters `-k`, `-l` and `-u`. If not specified, it defaults to 150. The value of `r` does not have to match the exact read length. - For alignment to SAM file: ``` -strobealign -r ref.fa reads.fa > output.sam +strobealign ref.fa reads.fa > output.sam ``` To report secondary alignments, set parameter `-N [INT]` for maximum of `[INT]` secondary alignments. @@ -74,29 +77,29 @@ To report secondary alignments, set parameter `-N [INT]` for maximum of `[INT]` For mapping to PAF file (option -x): ``` -strobealign -r -x ref.fa reads.fa > output.sam +strobealign -x ref.fa reads.fa > output.sam ``` -V0.6.1 PERFORMANCE +V0.7 PERFORMANCE --------------- -We have in below three sections investigated accuracy and runtime metrics for v0.6.1 on SIM3 and REPEATS datasets included in the preprint, as well as performance of SNV and small indel calling for additional simulated and biological (GIAB) datasets. +We have in below three sections investigated accuracy and runtime metrics for v0.7 on SIM3 and REPEATS datasets included in the preprint, as well as performance of SNV and small indel calling for additional simulated and biological (GIAB) datasets. For the biological SNV and indel experiments, we used GIAB datasets (HG004; Mother) with 2x150bp reads (subsampled to ~26x coverage) and 2x250bp reads (~17x coverage). ## Mapping accuracy and runtime -Below shows the accuracy (panel A) runtime (panel B) and %-aligned reads (panel C) for the SIM3 (Fig 1) and REPEATS (Fig 2) datasets in the preprint using strobealign v0.6.1. On all but the 2x100 datasets, strobealign has comparable or higher accuracy than BWA MEM while being substantially faster. On the 2x100 datasets, strobealign has the second highest accuracy after BWA MEM on SIM3 while being substantially faster, and comparable accuracy to minimap2 and BWA MEM on the REPEATS dataset while being twice as fast. +Below shows the accuracy (panel A) runtime (panel B) and %-aligned reads (panel C) for the SIM3 (Fig 1) and REPEATS (Fig 2) datasets in the preprint using strobealign v0.7. On all but the 2x100 datasets, strobealign has comparable or higher accuracy than BWA MEM while being substantially faster. On the 2x100 datasets, strobealign has the second highest accuracy after BWA MEM on SIM3 while being substantially faster, and comparable accuracy to minimap2 and BWA MEM on the REPEATS dataset while being twice as fast. ![v0 6 1_sim3 001 jpeg 001](https://user-images.githubusercontent.com/1714667/155574282-35bd370c-e7f5-4e59-896d-465473e6a71f.jpeg) -Fig 1. Accuracy (panel A) runtime (panel B) and %-aligned reads (panel C) for the SIM3 dataset +Figure 1. Accuracy (panel A) runtime (panel B) and %-aligned reads (panel C) for the SIM3 dataset ![v0 6 1_repeats_experiment 001](https://user-images.githubusercontent.com/1714667/155572146-9d51b822-e9bc-4dda-8703-71ea0306330b.jpeg) -Fig 2. Accuracy (panel A) runtime (panel B) and %-aligned reads (panel C) for the REPEATS dataset +Figure 2. Accuracy (panel A) runtime (panel B) and %-aligned reads (panel C) for the REPEATS dataset ## Variant calling benchmark (simulated REPEATS) -A small SNV and INDEL calling benchmark with strobealign v0.6 is provided below. We used `bcftools` to call SNPs and indels on a simulated repetitive genome based on alignments from strobealign, BWA-MEM, and minimap2. The genome is a 16.8Mbp sequence consisting of 500 concatenated copies of a 40kbp sequence which is mutated through substitutions (5%) and removing segments of size 1bp-1kbp (0.5%) along the oringinal 20Mbp string. +A small SNV and INDEL calling benchmark with strobealign v0.7 is provided below. We used `bcftools` to call SNPs and indels on a simulated repetitive genome based on alignments from strobealign, BWA-MEM, and minimap2 (ran with 1 core). The genome is a 16.8Mbp sequence consisting of 500 concatenated copies of a 40kbp sequence which is mutated through substitutions (5%) and removing segments of size 1bp-1kbp (0.5%) along the oringinal 20Mbp string. Then, 2 million paired-end reads (lengths 100, 150, 200, 250, 300) from a related genome with high variation rate: 0.5% SNVs and 0.5% INDELs. The challange is to find the right location of reads in the repetitive genome to predict the SNVs and INDELs in the related genome. In the genome where the reads are simulated from there is about 78k SNVs and INDELS, respectively. Locations of true SNVs and INDELs and provided by the read simulator. The precision (P), recall (R), and [F-score](https://en.wikipedia.org/wiki/F-score) are computed based on the true variants (for details see section [Variant calling benchmark method](https://github.com/ksahlin/StrobeAlign#variant-calling-benchmark-method)). Results in table below. @@ -130,23 +133,31 @@ There are frequent indels in this dataset (every 200th bases on average) requiri ## Variant calling benchmark (simulated SIM3) -We simulated 2x150 and 2x250 reads at 30x coverage from a human genome with SNV and indel rate according to the SIM3 genome (described in the preprint). We aligned the reads to hg38 without alternative haplotypes as proposed [here](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use). +We simulated 2x150 and 2x250 reads at 30x coverage from a human genome with SNV and indel rate according to the SIM3 genome (described in the preprint). We aligned the reads to hg38 without alternative haplotypes as proposed [here](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use). We used 16 cores for all aligners. Results are shown for SNVs and indels separately in Figure 3. For SNVs, predictions with strobealign as the aligner have an [F-score](https://en.wikipedia.org/wiki/F-score) on par with most other aligners. BWA has the best performance on this dataset. However, indel predictions have both the highest recall and precision using strobealign. Minimap2 is the close second best aligner for calling indels on this dataset, having only 0.1% lower recall and precision to strobealign. ![sv_calling_sim 001](https://user-images.githubusercontent.com/1714667/156549014-66d7b015-877e-48c2-a3b2-131e7a4a2db0.jpeg) -Fig. 3. Recall precision and F-score for the aligners on 2x150 and 2x250 datasets from SIM3. +Figure 3. Recall precision and F-score for the aligners on 2x150 and 2x250 datasets from SIM3. ## Variant calling benchmark (GIAB) -We used Illumina paired-end reads from the [GIAB datasets](https://github.com/genome-in-a-bottle/giab_data_indexes) HG004 (Mother) with the 2x150bp reads (subsampled to ~26x coverage; using only the reads in `140818_D00360_0047_BHA66FADXX/Project_RM8392`) and 2x250bp reads (~17x coverage). We aligned the reads to hg38 without alternative haplotypes as proposed [here](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use). We obtain the "true" SNVs and INDELs from the GIAB gold standard predictions formed from several sequencing technologies. They are provided [here](https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/latest/GRCh38/). +We used Illumina paired-end reads from the [GIAB datasets](https://github.com/genome-in-a-bottle/giab_data_indexes) HG004 (Mother) with the 2x150bp reads (subsampled to ~26x coverage; using only the reads in `140818_D00360_0047_BHA66FADXX/Project_RM8392`) and 2x250bp reads (~17x coverage). We aligned the reads to hg38 without alternative haplotypes as proposed [here](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use). We used 16 cores for all aligners. We obtain the "true" SNVs and INDELs from the GIAB gold standard predictions formed from several sequencing technologies. They are provided [here](https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/latest/GRCh38/). Results are shown for SNVs and indels separately in Figure 4. For SNVs, predictions with strobealign as the aligner have the highest [F-score](https://en.wikipedia.org/wiki/F-score) of all benchmarked aligners on both datasets. Strobealign's alignments yield the highest precision at the cost of a slightly lower recall. As for indels, predictions have a low recall, precision, and F-score with all aligners. This may be because we benchmarked against all gold standard SVs for HG004 that were not SNVs (see below method for the evaluation). Overall, predictions using Bowtie2 are the most desirable on these datasets. ![sv_calling 001](https://user-images.githubusercontent.com/1714667/156128690-266ccfad-14c0-48a7-8fe7-de794d98cc65.jpeg) -Fig. 4. Recall precision and F-score for the aligners on 2x150 and 2x250 datasets from HG004. +Figure 4. Recall precision and F-score for the aligners on 2x150 and 2x250 datasets from HG004. + +## Runtime + +For the four larger datasets above we show the runtime of aligners using 16 cores in Figure 5. The two SIM3 datasets are denoted SIM150 and SIM250, and the two GIAB datasets are denoted BIO150 and BIO250. Urmap was excluded from the timing benchmark because we can only get it to run with 1 core on our server as reported [here](https://github.com/rcedgar/urmap/issues/8). Strobealign is the fastest aligner across datasets. While urmap could be faser (based on the singlethreaded benchmarks) strobealign has substaintially better accuracy and downstream SV calling statistics. + +![runtime_sv](https://user-images.githubusercontent.com/1714667/161267752-8ff8d0ca-512d-46bc-a5f7-c3548dc603a3.png) +Figure 5. + ## Variant calling benchmark method