Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vcfdist requires too much memory when using VCF with SVs included #34

Open
rickymagner opened this issue Nov 1, 2024 · 5 comments
Open

Comments

@rickymagner
Copy link

Hi, I've been trying to use Vcfdist to compare a SNP + SV callset against truth with both, and it seems to require a prohibitive amount of memory when the SVs are included (it runs relatively quickly with just subsetting to the SNPs + small INDELs). Here is the full command used:

vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6

In this example I took a NIST-Q100 fully phased truth set here (e.g. file GRCh38_HG2-T2TQ100-V1.1.vcf.gz) and ran it against itself as both truth and query VCF. With 64GB of RAM and 8 cpus it crashes, and running with 256GB or so with many cores runs for a prohibitively long time.

It seems like given how fast it is with the SNPs that there might be something unoptimized when comparing SVs, either in the algorithm or the implementation. I wonder if running a profiler can help see if there's a memory leak or an unreasonable combinatorial explosion in this case that can be managed better.

Here is the full logs for the command:

[INFO vcfdist 22:27:29] Command: 'vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6'
[INFO vcfdist 22:27:29]
[INFO vcfdist 22:27:29] [0/8] Loading reference FASTA '/cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta'
[INFO vcfdist 22:27:45]
[INFO vcfdist 22:27:45] [Q 0/8] Parsing QUERY VCF 'comp-svs.vcf.gz'
[WARN vcfdist 22:27:45] 'PS' tag not defined in QUERY VCF header, assuming one phase set per contig
[INFO vcfdist 22:27:46] Genotypes:
[INFO vcfdist 22:27:46] 0: 667
[INFO vcfdist 22:27:46] 1: 966
[INFO vcfdist 22:27:46] 0|1: 13558
[INFO vcfdist 22:27:46] 1|0: 13671
[INFO vcfdist 22:27:46] 1|1: 6085
[INFO vcfdist 22:27:46] .|.: 11557
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] 6085 homozygous and multi-allelic variants in QUERY VCF, split for evaluation
[WARN vcfdist 22:27:46] 18014 variants with unknown (./.) alleles in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 1335 variants spanned by deletion in QUERY VCF, skipped
[INFO vcfdist 22:27:46] 6952 variants outside selected regions in QUERY VCF, skipped
[INFO vcfdist 22:27:46] 39 variants on border of selected regions in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 4 large (size > 100000) variants in QUERY VCF, skipped
[WARN vcfdist 22:27:46] 10 complex (CPX) variants in QUERY VCF, split into INS + DEL
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Variant types:
[INFO vcfdist 22:27:46] REF: 1335
[INFO vcfdist 22:27:46] INS: 20761
[INFO vcfdist 22:27:46] DEL: 12787
[INFO vcfdist 22:27:46] CPX: 10
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Contigs:
[INFO vcfdist 22:27:46] [ 0] chr1: 1219 | 1182 variants
[INFO vcfdist 22:27:46] [ 1] chr2: 1210 | 1249 variants
[INFO vcfdist 22:27:46] [ 2] chr3: 857 | 881 variants
[INFO vcfdist 22:27:46] [ 3] chr4: 1066 | 1102 variants
[INFO vcfdist 22:27:46] [ 4] chr5: 881 | 874 variants
[INFO vcfdist 22:27:46] [ 5] chr6: 1104 | 1105 variants
[INFO vcfdist 22:27:46] [ 6] chr7: 1023 | 1053 variants
[INFO vcfdist 22:27:46] [ 7] chr8: 816 | 792 variants
[INFO vcfdist 22:27:46] [ 8] chr9: 640 | 626 variants
[INFO vcfdist 22:27:46] [ 9] chr10: 918 | 881 variants
[INFO vcfdist 22:27:46] [10] chr11: 790 | 798 variants
[INFO vcfdist 22:27:46] [11] chr12: 812 | 784 variants
[INFO vcfdist 22:27:46] [12] chr13: 661 | 645 variants
[INFO vcfdist 22:27:46] [13] chr14: 415 | 371 variants
[INFO vcfdist 22:27:46] [14] chr15: 406 | 395 variants
[INFO vcfdist 22:27:46] [15] chr16: 585 | 564 variants
[INFO vcfdist 22:27:46] [16] chr17: 653 | 623 variants
[INFO vcfdist 22:27:46] [17] chr18: 511 | 516 variants
[INFO vcfdist 22:27:46] [18] chr19: 605 | 581 variants
[INFO vcfdist 22:27:46] [19] chr20: 469 | 492 variants
[INFO vcfdist 22:27:46] [20] chr21: 363 | 364 variants
[INFO vcfdist 22:27:46] [21] chr22: 334 | 328 variants
[INFO vcfdist 22:27:46] [22] chrX: 797 | 202 variants
[INFO vcfdist 22:27:46] [23] chrY: 25 | 0 variants
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] QUERY VCF overview:
[INFO vcfdist 22:27:46] TOTAL: 52599
[INFO vcfdist 22:27:46] KEPT : 33568
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] [T 0/8] Parsing TRUTH VCF '/cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz'
[WARN vcfdist 22:27:46] 'PS' tag not defined in TRUTH VCF header, assuming one phase set per contig
[INFO vcfdist 22:27:46] Genotypes:
[INFO vcfdist 22:27:46] 0: 667
[INFO vcfdist 22:27:46] 1: 966
[INFO vcfdist 22:27:46] 0|1: 13558
[INFO vcfdist 22:27:46] 1|0: 13671
[INFO vcfdist 22:27:46] 1|1: 6085
[INFO vcfdist 22:27:46] .|.: 11557
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] 6085 homozygous and multi-allelic variants in TRUTH VCF, split for evaluation
[WARN vcfdist 22:27:46] 18014 variants with unknown (./.) alleles in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 1335 variants spanned by deletion in TRUTH VCF, skipped
[INFO vcfdist 22:27:46] 6952 variants outside selected regions in TRUTH VCF, skipped
[INFO vcfdist 22:27:46] 39 variants on border of selected regions in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 4 large (size > 100000) variants in TRUTH VCF, skipped
[WARN vcfdist 22:27:46] 10 complex (CPX) variants in TRUTH VCF, split into INS + DEL
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Variant types:
[INFO vcfdist 22:27:46] REF: 1335
[INFO vcfdist 22:27:46] INS: 20761
[INFO vcfdist 22:27:46] DEL: 12787
[INFO vcfdist 22:27:46] CPX: 10
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Contigs:
[INFO vcfdist 22:27:46] [ 0] chr1: 1219 | 1182 variants
[INFO vcfdist 22:27:46] [ 1] chr2: 1210 | 1249 variants
[INFO vcfdist 22:27:46] [ 2] chr3: 857 | 881 variants
[INFO vcfdist 22:27:46] [ 3] chr4: 1066 | 1102 variants
[INFO vcfdist 22:27:46] [ 4] chr5: 881 | 874 variants
[INFO vcfdist 22:27:46] [ 5] chr6: 1104 | 1105 variants
[INFO vcfdist 22:27:46] [ 6] chr7: 1023 | 1053 variants
[INFO vcfdist 22:27:46] [ 7] chr8: 816 | 792 variants
[INFO vcfdist 22:27:46] [ 8] chr9: 640 | 626 variants
[INFO vcfdist 22:27:46] [ 9] chr10: 918 | 881 variants
[INFO vcfdist 22:27:46] [10] chr11: 790 | 798 variants
[INFO vcfdist 22:27:46] [11] chr12: 812 | 784 variants
[INFO vcfdist 22:27:46] [12] chr13: 661 | 645 variants
[INFO vcfdist 22:27:46] [13] chr14: 415 | 371 variants
[INFO vcfdist 22:27:46] [14] chr15: 406 | 395 variants
[INFO vcfdist 22:27:46] [15] chr16: 585 | 564 variants
[INFO vcfdist 22:27:46] [16] chr17: 653 | 623 variants
[INFO vcfdist 22:27:46] [17] chr18: 511 | 516 variants
[INFO vcfdist 22:27:46] [18] chr19: 605 | 581 variants
[INFO vcfdist 22:27:46] [19] chr20: 469 | 492 variants
[INFO vcfdist 22:27:46] [20] chr21: 363 | 364 variants
[INFO vcfdist 22:27:46] [21] chr22: 334 | 328 variants
[INFO vcfdist 22:27:46] [22] chrX: 797 | 202 variants
[INFO vcfdist 22:27:46] [23] chrY: 25 | 0 variants
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] TRUTH VCF overview:
[INFO vcfdist 22:27:46] TOTAL: 52599
[INFO vcfdist 22:27:46] KEPT : 33568
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] Checking contigs:
[INFO vcfdist 22:27:46] All contig checks passed!
[INFO vcfdist 22:27:46]
[INFO vcfdist 22:27:46] [Q 3/8] Wavefront clustering QUERY VCF 'comp-svs.vcf.gz'
/cromwell_root/script: line 46: 17 Killed vcfdist comp-svs.vcf.gz /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -b /cromwell_root/fc-36a7fa15-2513-416e-bcbf-0322bb0e076a/nist_q100/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed -v 1 --max-threads 8 --max-ram 64 --largest-variant 100000 --credit-threshold 0.7 --phasing-threshold 0.6
@TimD1
Copy link
Owner

TimD1 commented Nov 2, 2024

Thanks for starting a discussion on this. At the moment, I would not recommend running vcfdist on datasets with variants larger than 10Kb.

The reason for this is that unlike other tools, vcfdist performs full alignment of all variants, and can guarantee that equivalent variants will always be found. With a 100Kb variant, vcfdist will have to do ~300Kb by ~300Kb alignment (since the variant can shift in repetitive regions). The alignment algorithm used in vcfdist v2 uses O(n*n) memory. With 4-byte integers, that's a 360GB dynamic programming matrix.

I'm currently working on vcfdist v3, which will use a graph wavefront alignment algorithm that requires O(n*s) memory, where s is the alignment score. This will scale better, but will still not be a magic bullet. I will likely have to limit the maximum alignment score, and provide special behavior when this limit is hit (which should only occur for FNs). I'll have to mark the truth variant that caused the largest score increase as a FN, and try aligning again without it.

I apologize for the inconvenience, but vcfdist won't be able to support variants that large for a while.

@rickymagner
Copy link
Author

Hi @TimD1, thanks for the reply. I tried running the same file against itself with --largest-variant capped at 10,000 but I still ran into a memory error. Is it possible there is something else causing excessive memory use, or even if the flag isn't properly implemented to filter out longer variants? I'm still trying to debug a bit with the file but thought I'd ask to see if you have some parameters you could use to get the above command to complete.

@TimD1
Copy link
Owner

TimD1 commented Nov 5, 2024

vcfdist was basically designed to achieve better accuracy at the expense of increased memory/computation. The alignment is working correctly, it just requires a lot of memory.

I would try things in the following order:

  1. Make sure your benchmarking BED includes only the regions you're interested in. If you need to use a particular BED, then there's nothing you can do here, but it's more efficient to filter out variants by region before evaluation than after.

  2. Reduce the maximum variant size with --largest-variant. You've already gone down from 100Kb to 10Kb, so you probably won't want to do this further. Just note that vcfdist memory usage and runtime is O(n*n) in regards to this parameter, so running with --largest-variant 1000 will be at least 100x faster and use 100x less memory, so you can run this very quickly to get some initial results.

  3. The default clustering algorithm is --cluster biwfa, which dynamically calculates the left and rightmost reaches of variants (which quickly gets expensive for SVs). This means that a 10Kb variant in a highly repetitive region may be able to shift left or right 50Kb and lead to an 100Kb by 100Kb alignment. Note that all changes from the default clustering may result in undetected equivalent variants between truth and query VCFs (e.g. false FPs), but this should happen at a very low rate (and only to equivalent variants that are located far away from one another on the reference) with large runtime/memory usage improvements. This trade-off is most likely worth it.

    1. My first suggestion for changing clustering is to use --cluster biwfa --max-iterations 1 (which doesn't allow re-calculation of the reaches if e.g. the reaches of two 10Kb variants overlap). This should impact your results very minimally.
    2. If that's not enough, my second suggestion is to use --cluster size 100, which will use a simpler clustering algorithm. Basically, the evaluation will assume that query variants are within 100 reference bases or the size of the variant (whichever is larger) from its corresponding counterpart in the truth VCF.
    3. The last option is to use --cluster gap 100, which basically assumes that evaluated variants are within 100 bases of corresponding variants in the truth VCF.

Note that cheaper clustering algorithms will impact the results of SV benchmarking more than SNPs or INDELs (because equivalent SVs can generally be located farther apart on the reference. Supplementary Table 3 of our manuscript has a comparison of a few of these clustering options. The wiki has more info on vcfdist's clustering parameters as well.

@rickymagner
Copy link
Author

Hi Tim, I've tried using your three suggestions, and still end up running out of memory even at 64GB. Here's an excerpt of the logs before the process gets killed:

...
[INFO vcfdist 19:29:41]
[INFO vcfdist 19:29:41] [4/8] Superclustering TRUTH and QUERY variants
[INFO vcfdist 19:29:42] Total superclusters: 2663733
[INFO vcfdist 19:29:42] Largest supercluster (bases): 30616
[INFO vcfdist 19:29:42] Largest supercluster (vars): 2182
[INFO vcfdist 19:29:42] Average supercluster (bases): 74.261
[INFO vcfdist 19:29:42] Average supercluster (vars): 4.758
[WARN vcfdist 19:29:42] 52 total superclusters contain variants from multiple phase sets (PS)
[INFO vcfdist 19:29:42] QUERY phase sets: 2217
[INFO vcfdist 19:29:42] QUERY phase block NG50: 2503804
[INFO vcfdist 19:29:42] TRUTH phase sets: 24
[INFO vcfdist 19:29:42] TRUTH phase block NG50: 156040895
[INFO vcfdist 19:29:42]
[INFO vcfdist 19:29:42] Sorting superclusters by size
[INFO vcfdist 19:29:42]
[INFO vcfdist 19:29:42] [5/8] Calculating precision and recall
[INFO vcfdist 19:29:42] Superclusters using 0.000 to 5.250 GB RAM each ( 12 threads): 2663544
[INFO vcfdist 19:29:42] Superclusters using 5.250 to 10.500 GB RAM each ( 6 threads): 150
[INFO vcfdist 19:29:42] Superclusters using 10.500 to 21.000 GB RAM each ( 3 threads): 36
[INFO vcfdist 19:29:42] Superclusters using 21.000 to 63.000 GB RAM each ( 1 threads): 3

I'm wondering if similar to the --max-variant flag there could be a --max-supercluster-size parameter that would essentially skip these 3 superclusters grabbing for way too much memory, assuming that's really the issue. I'm ok throwing away a small handful of variants just to be able to finish the compute. Do you think something like that might help or just a red herring? In that case it would be nice to still have a log of which variants/superclusters were skipped somewhere.

@rickymagner
Copy link
Author

I did a bit more digging into this and made some observations. I tried splitting by chromosome, and found some required much more memory than others, to the point of some shards failing while other succeed. The common thread was the Largest supercluster (vars) were higher for the failed shards, and lower for the successful ones. This seems to suggest that this is the limiting factor, and a flag like --max-supercluster-size with the option to write to something like a skipped_sites.vcf.gz might "solve" this issue entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants