anim analysis and classes object issue #442
I note 301570f73a84863d33e566c45554d3fd is in the attached classes.txt:

301570f73a84863d33e566c45554d3fd	GCA_015265475.1	Candidatus Sulfurimonas marisnigri strain SoZ1 chromosome, complete genome
Yes, it is also in my genomes folder (named accession_number.fna). I don't know where the problem could be at that point.
What's an easy way to download these? This fails:

$ pip install ncbi-acc-download
...
$ ncbi-acc-download --format fasta GCA_000012965.1
Failed to download file with id GCA_000012965.1 from NCBI
NCBI Entrez returned error code 400, are ID(s) GCA_000012965.1 valid?

Update: See kblin/ncbi-acc-download#12 (comment) for an explanation and a workaround.
@peterjc I downloaded these genomes, searched by taxonomy, through the NCBI datasets CLI. GCA_000012965.1 is the accession number of the submitted GenBank assembly, as reported on NCBI.
Thanks. That worked to download the genomes:

$ conda install ncbi-datasets-cli
...

and:

$ for F in $(cut -f 2 ../classes.txt); do \
    datasets download genome accession $F --filename $F.zip && \
    unzip -j $F.zip "*.fna"; \
  done
...

All 295 downloaded, and the MD5 checksums verified to match with:
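A minimal sketch of one such check, assuming the hashes in classes.txt are plain MD5 digests of each .fna file (not necessarily the exact command used here):

# Sketch: compute the MD5 of each downloaded FASTA so it can be compared
# against the first column of classes.txt.
import hashlib
from pathlib import Path

for fasta in sorted(Path("genomes").glob("*.fna")):
    print(hashlib.md5(fasta.read_bytes()).hexdigest(), fasta.name)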
In theory I can now try to reproduce your example. Roughly how long did yours take with 40 workers?
Using the latest default branch on Linux:

$ pyani --version
pyani version: 0.3.0-alpha
$ pyani createdb --dbpath pyani_0_3.db
$ pyani index --indir genomes/ # gave same classes.txt you shared earlier
$ pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis"
100%|█████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 1588343.62it/s]
100%|████████████████████████████████████████████████████████████████████████████| 86730/86730 [00:15<00:00, 5649.74it/s]
^C

That first part was fast, but then I aborted with ctrl+c (lots of traceback logging from Python's multiprocessing worker jobs). I think it probably was doing something (I should have checked with a process monitor such as top). Anyway, as per your report, I can reproduce the KeyError:

$ pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis" --classes classes.txt
Traceback (most recent call last):
File "/home/pjacock/miniforge3/bin/pyani", line 33, in <module>
sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
File "/home/pjacock/repositories/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
returnval = args.func(args)
File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 200, in subcmd_anim
genome_ids = add_run_genomes(
File "/home/pjacock/repositories/pyani/pyani/pyani_orm.py", line 558, in add_run_genomes
label_dict[key] = LabelTuple(label_data[key] or "", class_data[key] or "")
KeyError: '747a9c2b134793ff19f33afaefc75c54'

Thank you, reproducible bug reports are very helpful!
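For context, the failing line looks each genome's MD5 hash up in the dictionaries parsed from the labels and classes files, so any genome whose hash is absent from those files raises KeyError. A minimal defensive sketch of that lookup (build_label_dict is a hypothetical helper, not the project's actual fix) would fall back to empty strings and warn instead:

# Sketch: tolerant version of the lookup that fails above. dict.get()
# avoids the KeyError when a genome's hash is absent from labels/classes.
from collections import namedtuple
from typing import Dict, Iterable

LabelTuple = namedtuple("LabelTuple", "label class_label")

def build_label_dict(keys: Iterable[str], label_data: Dict[str, str],
                     class_data: Dict[str, str]) -> Dict[str, LabelTuple]:
    label_dict = {}
    for key in keys:
        if key not in label_data or key not in class_data:
            print(f"WARNING: genome hash {key} missing from labels/classes file")
        label_dict[key] = LabelTuple(label_data.get(key) or "", class_data.get(key) or "")
    return label_dict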
The first progress bar is turning the 295 genomes into 295 × 294 = 86730 pairwise combinations (every ordered pair, excluding self-comparisons), and is so quick it doesn't really need a progress bar at all. The second progress bar is building the list of 86730 jobs which will be submitted either to multiprocessing (local) or to a cluster; that takes a modest amount of time. Then sadly there is no progress bar feedback on actually running them! In theory that could be added to the multiprocessing (local job) runner code in https://github.com/widdowquinn/pyani/blob/master/pyani/run_multiprocessing.py, while it might be harder for a cluster; a sketch for the local case is below.
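A sketch, assuming the jobs arrive as a list of shell command lines (run_job and run_with_progress are illustrative names, not pyani's actual API):

# Sketch: wrap a multiprocessing pool in tqdm. imap_unordered yields
# results as workers finish, so the bar tracks completed comparisons.
import multiprocessing
import subprocess
from tqdm import tqdm

def run_job(cmdline: str) -> int:
    # Placeholder for running one nucmer/delta-filter command line.
    return subprocess.call(cmdline, shell=True)

def run_with_progress(cmdlines, workers=None):
    with multiprocessing.Pool(processes=workers) as pool:
        return list(tqdm(pool.imap_unordered(run_job, cmdlines),
                         total=len(cmdlines)))

Meanwhile, timing the actual run:

$ time pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis"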
...

However, there is progress.
Sadly this failed:
All the filter files got computed:
Recovery mode is quick, but fails - at least it fails with a more useful message:
I am guessing that could be a broken empty delta file, or it might be a bad alignment between two genomes. This needs a more helpful error message that includes the filename!
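One plausible shape for such a change (a sketch only; the actual patch referred to below was not shown here, and parse_delta stands in for pyani's delta-parsing routine) is to test for an empty file before parsing and raise with the path:

# Sketch: include the offending filename when a .delta/.filter file is
# empty or broken, so bad intermediates are easy to locate.
from pathlib import Path

def parse_delta(filename: Path):
    """Parse a nucmer .delta/.filter file, naming the file on failure."""
    text = Path(filename).read_text()
    if not text.strip():
        # Report *which* intermediate file is broken, not just that one is.
        raise ValueError(f"Empty delta/filter file: {filename}")
    ...  # existing parsing logic would continue here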
With this patch:
We get:

That file was empty. I removed it, and tried again:
Update: Looks like I had multiple failed delta-filter steps, so lots of empty files.

Update: Yeah, lots of empty failed intermediates. Frustratingly there are too many filter files to use simple wildcard expansion, but this gives an indication - most were empty:

$ dust pyani_0_3_output/nucmer_output/ -e "\.filter$"
...

Possible patch so that recovery mode rebuilds empty intermediate files (which isn't always the right thing to do); a script for clearing the empty files by hand follows the diff:

$ git diff pyani/pyani_files.py
diff --git a/pyani/pyani_files.py b/pyani/pyani_files.py
index 84c44f3..2cc716c 100644
--- a/pyani/pyani_files.py
+++ b/pyani/pyani_files.py
@@ -193,12 +193,13 @@ def load_classes_labels(path: Path) -> Dict[str, str]:
 
 
 # Collect existing output files when in recovery mode
-def collect_existing_output(dirpath: Path, program: str, args: Namespace) -> List[Path]:
+def collect_existing_output(dirpath: Path, program: str, args: Namespace, ignore_empty: bool = True) -> List[Path]:
     """Return a list of existing output files at dirpath.
 
     :param dirpath: Path, path to existing output directory
     :param program: str, name of program to use for comparisons
     :param args: Namespace, command-line arguments for the run
+    :param ignore_empty: bool, should empty files be excluded?
     """
     # Obtain collection of expected output files already present in directory
     if program == "nucmer":
@@ -208,4 +209,8 @@ def collect_existing_output(dirpath: Path, program: str, args: Namespace) -> List[Path]:
         pattern = "*/*.filter"
     elif program == "blastn":
         pattern = "*.blast_tab"
-    return sorted(dirpath.glob(pattern))
+    candidates = sorted(dirpath.glob(pattern))
+    if ignore_empty:
+        # We have to go to the disk again to query file size
+        candidates = [_ for _ in candidates if _.stat().st_size]
+    return candidates
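When there are too many files for shell wildcard expansion, a short script can count and clear the empty intermediates instead; a sketch, assuming the nucmer_output/ layout used above:

# Sketch: find (and optionally delete) empty .filter intermediates under
# the nucmer output directory, sidestepping shell argument-length limits.
from pathlib import Path

outdir = Path("pyani_0_3_output/nucmer_output")
empties = [p for p in outdir.glob("*/*.filter") if p.stat().st_size == 0]
print(f"{len(empties)} empty .filter files found")
for path in empties:
    path.unlink()  # remove, so a recovery-mode rerun regenerates them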
Progress, reaching the third progress bar now!
Removing the file, it is regenerated again, nearly empty:

$ cat pyani_0_3_output/nucmer_output/GCA_021044585.1_ASM2104458v1_genomic/GCA_021044585.1_ASM2104458v1_genomic_vs_GCA_001829655.1_ASM182965v1_genomic.filter
/home/pjacock/Sulfurimonas/genomes/GCA_021044585.1_ASM2104458v1_genomic.fna /home/pjacock/Sulfurimonas/genomes/GCA_001829655.1_ASM182965v1_genomic.fna
NUCMER

The delta file is the same, as is the other way round (swapping the FASTA order). Both FASTA files look fine, and their MD5s match yours. Reducing the test case to just those two fails the same way. Running nucmer by hand gives the same result - they appear to be too different from each other?
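For anyone repeating the by-hand check, a sketch along these lines should do it (the nucmer flags shown are illustrative and may not match pyani's exact invocation; MUMmer's nucmer and delta-filter must be on PATH):

# Sketch: run nucmer and delta-filter by hand on the problem pair, to see
# whether the alignment step itself produces an (almost) empty delta file.
import subprocess

ref = "genomes/GCA_021044585.1_ASM2104458v1_genomic.fna"
qry = "genomes/GCA_001829655.1_ASM182965v1_genomic.fna"
# nucmer writes test_pair.delta; delta-filter -1 keeps 1-to-1 alignments.
subprocess.run(["nucmer", "--mum", "-p", "test_pair", ref, qry], check=True)
with open("test_pair.filter", "w") as out:
    subprocess.run(["delta-filter", "-1", "test_pair.delta"], stdout=out, check=True)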
So the problem seems to be these two files?
You have at least one outlier in the dataset (one of the pair just mentioned - don't know which one yet), but there could be others. This is something pyANI should handle better. I'll get back to you tomorrow with more information on which sequence(s) I would suggest you leave out.
I would look at dropping all the Sulfurimonas sp. HSL entries. Also, it looks like you can halve your dataset (e.g. GCA_021044585.1 and GCF_021044585.1 are both Sulfurimonas sp. HSL-3221 chromosome). I forget the significance of the GCA and GCF prefixes, but pick one over the other? Since the number of comparisons scales with the square of the genome count, that should shrink the compute time to a quarter!
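A sketch of one way to drop the duplicates, assuming the tab-separated hash/accession/description layout of classes.txt shown above, and arbitrarily keeping the GCF_ copy whenever both prefixes are present:

# Sketch: collapse GCA_/GCF_ duplicates in classes.txt, keeping one row
# per underlying assembly. Assumes tab-separated <md5> <accession> <class>.
import csv

with open("classes.txt", newline="") as handle:
    rows = list(csv.reader(handle, delimiter="\t"))
best = {}
for row in rows:
    accession = row[1]                # e.g. GCA_021044585.1 or GCF_021044585.1
    key = accession.split("_", 1)[1]  # digits + version shared by a GCA/GCF pair
    if key not in best or accession.startswith("GCF_"):
        best[key] = row               # arbitrarily prefer the GCF_ copy
with open("classes_dedup.txt", "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(best.values())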
@peterjc thank you for the detailed input. I will give it a closer look.
Hi!

I'm trying to run pyani anim on a set of genomes downloaded from GTDB. I'm running it on python 3.10.15 and pyani version: 0.3.0-alpha, everything installed in a conda env through mamba. The problem is that running the analysis without --classes specified runs smoothly up to the point where it gets stuck here:

It has been stuck at that point for 5/6 hours now.

If I run it with --classes, the error message is the following:

My code with classes is as follows:

Attached there is a subset of the classes.txt file: classes.txt

Here is the output of listdeps: installed dependencies