Hello
Thank you for developing this tool; I have been using and testing it over the past few months. I am interested in detecting the number of somatic SNVs present in tumor versus normal single-cell samples. My samples are significantly larger than the test bone marrow BAM file you provide (my BAM files average about 21 GB). This is usually not an issue, but I have had problems running the germline module. The errors I encounter are usually caused by the tool running out of memory:
[2024-06-24 10:21:25,004] INFO Monopogen.py Checking dependencies...
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Lines total/split/realigned/skipped: 209691145/485374/116534/0
Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.HashMap.resize(HashMap.java:702)
at java.base/java.util.HashMap.putVal(HashMap.java:661)
at java.base/java.util.HashMap.put(HashMap.java:610)
at java.base/java.util.HashSet.add(HashSet.java:221)
at main.Main.restrictToVcfMarkers(Main.java:343)
at main.Main.allData(Main.java:313)
at main.Main.main(Main.java:111)
gzip: path/to/germline/chr1.gp.vcf.gz: No such file or directory
path/to/germline/chr1.gp.vcf.gz: No such file or directory
Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: path/to/germline/chr1.germline.vcf
null
at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
at vcf.VcfHeader.&lt;init&gt;(VcfHeader.java:119)
at vcf.VcfIt.&lt;init&gt;(VcfIt.java:190)
at vcf.VcfIt.create(VcfIt.java:175)
at vcf.VcfIt.create(VcfIt.java:150)
at main.Main.allData(Main.java:297)
at main.Main.main(Main.java:111)
This run still produces chr*.gl.vcf.gz, but it fails to produce the chr*.gp.log and chr*.gp.vcf.gz files, which are required to generate the subsequent files (chr*.phased.log and chr*.phased.vcf.gz). I suspected the out-of-memory error occurred while producing the chr*.gp.log and chr*.gp.vcf.gz files, so I went into the .../outputs/Scripts directory and inspected the runGermline_chr*.sh script.
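The second stack trace appears consistent with that cascade: because the gp step died, chr1.germline.vcf is produced without a #CHROM header line, and the phasing step then rejects it. A small guard like the following, a sketch only (the path and function name are illustrative, not part of Monopogen), can catch the truncated intermediate before the phasing step runs:

```shell
# Guard to run before the Beagle phasing step: succeeds only if the
# genotype-probability VCF exists, is non-empty, and still has its
# #CHROM header line (which is lost when the earlier step dies on an OOM).
check_gp_vcf() {
    [ -s "$1" ] && zcat "$1" | grep -q '^#CHROM'
}

# Demo on a tiny well-formed stand-in file (path illustrative):
mkdir -p germline
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\n' | gzip > germline/chr1.gp.vcf.gz
check_gp_vcf germline/chr1.gp.vcf.gz && echo "chr1.gp.vcf.gz looks intact"
```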
The second and fourth lines of that script began with

java -Xmx20g -jar

I suspected this was where the amount of memory given to Beagle was specified, so I changed the 20g to 200g (java -Xmx200g -jar) and re-ran the second through fifth lines of runGermline_chr*.sh, shown below, after the tool had already produced chr*.gl.vcf.gz but failed to make the chr*.gp.log and chr*.gp.vcf.gz files:

java -Xmx200g -jar /path/to/apps/beagle.27Jul16.86a.jar gl=/path/to/outputs//germline/chr1.gl.vcf.gz ref=/path/to/LD_panels/CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz chrom=chr1 out=/path/to/outputs/germline/chr1.gp impute=false modelscale=2 nthreads=24 gprobs=true niterations=0

zless -S /path/to/outputs/germline/chr1.gp.vcf.gz | grep -v 0/0 > /path/to/outputs/germline/chr1.germline.vcf

java -Xmx200g -jar /path/to/beagle.27Jul16.86a.jar gt=/path/to/outputs/germline/chr1.germline.vcf ref=/path/to/LD_panels/CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz chrom=chr1 out=/path/to/outputs/germline/chr1.phased impute=false modelscale=2 nthreads=24 gprobs=true niterations=0

rm /path/to/outputs/germline/chr1.germline.vcf
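For what it is worth, until a dedicated flag exists, one way to avoid editing each script by hand is to patch all of the generated scripts in one pass. This is only a sketch under the assumptions from this report (the outputs/Scripts layout and the hard-coded -Xmx20g value); the helper name is my own:

```shell
# Sketch: raise Beagle's heap in every generated per-chromosome germline
# script at once, instead of editing each file manually.
# Assumes the outputs/Scripts layout and the -Xmx20g default described above.
patch_heap() {
    scripts_dir="$1"
    new_heap="$2"
    sed -i "s/-Xmx20g/-Xmx${new_heap}/g" "$scripts_dir"/runGermline_chr*.sh
}

# Demo against a stand-in script (a real run would point at outputs/Scripts):
mkdir -p outputs/Scripts
echo 'java -Xmx20g -jar beagle.27Jul16.86a.jar gl=chr1.gl.vcf.gz' > outputs/Scripts/runGermline_chr1.sh
patch_heap outputs/Scripts 200g
grep Xmx outputs/Scripts/runGermline_chr1.sh
```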
Interestingly, after increasing the memory assigned to Beagle I was able to run the germline module successfully on chromosome 1 (the largest chromosome) of a tumor sample with more SNVs than normal tissue. I was then able to continue, run the somatic module successfully, and get the outputs required for plotting UMAPs with mutation profiles. I find this tool very interesting and would love to keep using it. Is there an easier way to change this value instead of editing it manually on every run? Is there a flag in the germline module where I can specify the memory allocated to Beagle? Is manually changing Beagle's memory the correct approach?
Thank you again for all your hard work!