stochastic bug causes themisto build to hang indefinitely (themisto_linux-v3.2.1) #40
Comments
Have you tried allocating more memory to your batch job? 2G might not be enough, since there can be some memory overhead on top of the amount passed to Themisto, which in your case is "--mem-gigas 2".
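As a rough illustration of that suggestion, the SLURM request can be derived from Themisto's budget plus some headroom. This is only a sketch: the 2 GB headroom is an assumed value, not a measured overhead, and the variable names are made up for the example.

```bash
# Sketch: keep the SLURM memory request above the budget given to Themisto.
themisto_mem_gb=2   # value passed to --mem-gigas in the command from this issue
headroom_gb=2       # ASSUMPTION: allowance for overhead outside Themisto's own budget
slurm_mem_gb=$(( themisto_mem_gb + headroom_gb ))

echo "request --mem=${slurm_mem_gb}G from SLURM, pass --mem-gigas ${themisto_mem_gb} to Themisto"
```

The point is only that `--mem` for sbatch and `--mem-gigas` for Themisto do not have to be the same number.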
Yes, I have tried this, allocating 6GB per job. I see another stochastic failure (1 out of ~5000 runs), where themisto hangs indefinitely. Here is the output (hung for 6 hours):
22.6940 Mon Jul 1 13:15:20 2024 Themisto-v3.2.1 Stage 1: 100%
I then re-ran the failed job, with exactly the same parameters (6GB of memory), and it finished in 10 seconds or so:
28.1510 Mon Jul 1 20:20:21 2024 Themisto-v3.2.1 Stage 1: 100% 6634.8530 Mon Jul 1 20:20:28 2024 Sorting KMC database
The part where it is hanging is very disk-heavy. My guess is that there is some hangup in the disk IO of the cluster, and Themisto is unable to recover from that. Unfortunately that part of the code is within the KMC library, so it's hard for me to fix. In the long term I'm planning to get rid of the KMC dependency altogether, so this should be fixed then.
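Until then, one possible workaround on the user side is to put a wall-clock limit on each build and retry when the limit is hit, since reruns in this thread finish in about 10 seconds. The sketch below assumes GNU coreutils `timeout` is available on the cluster. `run_with_retry` is a hypothetical helper, not part of Themisto, and the limits are placeholders to tune.

```bash
#!/usr/bin/env bash
# Sketch: run a command under a time limit and retry if it times out or fails.
# run_with_retry is a hypothetical helper, not part of Themisto or SLURM.
# Usage: run_with_retry MAX_ATTEMPTS LIMIT COMMAND [ARGS...]
run_with_retry() {
    local max_attempts=$1 limit=$2
    shift 2
    local attempt status
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        status=0
        # timeout sends SIGTERM after $limit (exit status 124),
        # then SIGKILL 10 seconds later if the process is still alive.
        timeout -k 10 "$limit" "$@" || status=$?
        if (( status == 0 )); then
            return 0
        fi
        echo "attempt ${attempt}/${max_attempts} exited with status ${status}" >&2
    done
    return 1
}

# Example call (commented out; the three variables are placeholders):
# run_with_retry 3 30m themisto build -k 31 -i "$input_list" \
#     --index-prefix "$index_prefix" --temp-dir "$temp_dir" \
#     --mem-gigas 2 --n-threads 4 --file-colors
```

One caveat: if the process is stuck in an uninterruptible disk wait, the signals may not take effect until the IO call returns, so a limit on the SLURM job itself may still be needed as a backstop.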
Hello,
This bug only happens in ~1-10 out of ~5000 themisto build runs. I am running themisto on ~4500 genomes, calling themisto build on each genome separately and using HPC to schedule the jobs in parallel. Sometimes I see ~5 runs hang indefinitely; most recently I saw 1 run hang indefinitely. If I re-run a failed run, themisto finishes normally. So this bug does not seem to be caused by the specific data, and it occurs in roughly 0.02% to 0.2% of themisto build runs.
Pure speculation on my part, but perhaps caused by some kind of rare race condition?
Here is how I am calling themisto on HPC:
sbatch -p scavenger --mem=2G --cpus-per-task=4 --wrap="themisto build -k 31 -i ../results/themisto_replicon_references/GCF_017165095.1_ASM1716509v1_genomic/GCF_017165095.1_ASM1716509v1_genomic.txt --index-prefix ../results/themisto_replicon_indices/GCF_017165095.1_ASM1716509v1_genomic --temp-dir ../results/themisto_replicon_indices/temp --mem-gigas 2 --n-threads 4 --file-colors"
9 hours later, this is what the log file looked like:
failed-slurm-11363085.out.txt
When I rerun this command, the run finishes in 8 seconds, here is the log file:
rerun-slurm-11372781.out.txt