Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

anim analysis and classes object issue #442

Closed
luigallucci opened this issue Dec 3, 2024 · 13 comments
Closed

anim analysis and classes object issue #442

luigallucci opened this issue Dec 3, 2024 · 13 comments

Comments

@luigallucci
Copy link

Hi!

I'm trying to run pyani anim on a set of genomes downloaded from GTDB. I'm running it on python 3.10.15 and pyani version: 0.3.0-alpha, everything installed on a conda env through mamba. The problem is that running the analysis withouth --classes specified run smoothly up to the point that it stuck here:

pyani anim   --indir "$GENOME_DIR"    --outdir "$OUTPUT_DIR"   --dbpath "$DB"   --workers 40   --name "Sulfurimonas_ANI_Analysis"                                                                             
 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 3555516.32it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 86730/86730 [04:00<00:00, 360.20it/s]

Is stucked to that point from 5/6 hours now.

If I run it with --classes, the error message is the following:

Traceback (most recent call last):
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/bin/pyani", line 10, in <module>
    sys.exit(run_main())
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/scripts/subcommands/subcmd_anim.py", line 200, in subcmd_anim
    genome_ids = add_run_genomes(
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/pyani_orm.py", line 558, in add_run_genomes
    label_dict[key] = LabelTuple(label_data[key] or "", class_data[key] or "")
KeyError: '301570f73a84863d33e566c45554d3fd'

My code with classes is as follow:

pyani anim \
  --indir "$GENOME_DIR"  \
  --outdir "$OUTPUT_DIR" \
  --dbpath "$DB" \
  --workers 40 \
  --name "ANI_AnalysisSulf" \
  --classes "$GENOME_DIR/classes.txt"

Attached there is a subset from the classes.txt file.

classes.txt

Here the output of listdeps.

installed dependencies

System information
	Platorm==Linux-5.15.0-126-generic-x86_64-with-glibc2.35
	Python==3.10.15 | packaged by conda-forge | (main, Oct 16 2024, 01:24:24) [GCC 13.3.0]
Installed pyani Python dependendencies...
	Pillow==11.0.0 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pillow-11.0.0.dist-info)
	biopython==1.84 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/biopython-1.84.dist-info)
	matplotlib==3.5.1 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/matplotlib-3.5.1.dist-info)
	namedlist==1.8 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/namedlist-1.8.dist-info)
	networkx==3.4 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/networkx-3.4.dist-info)
	numpy==1.23.5 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/numpy-1.23.5.dist-info)
	openpyxl==3.1.5 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/openpyxl-3.1.5.dist-info)
	pandas==1.4.4 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pandas-1.4.4.dist-info)
	scipy==1.14.1 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/scipy-1.14.1.dist-info)
	seaborn==0.13.2 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/seaborn-0.13.2.dist-info)
	sqlalchemy==2.0.36 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/SQLAlchemy-2.0.36.dist-info)
	tqdm==4.67.1 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/tqdm-4.67.1.dist-info)
Installed pyani development dependendencies...
	bandit==Not Installed (-)
	black==Not Installed (-)
	codecov==Not Installed (-)
	coverage==Not Installed (-)
	doc8==Not Installed (-)
	flake8==Not Installed (-)
	jinja2==3.1.4 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/jinja2-3.1.4.dist-info)
	mypy==Not Installed (-)
	pydocstyle==Not Installed (-)
	pylint==Not Installed (-)
	pytest==Not Installed (-)
	pytest-cov==Not Installed (-)
	sphinx==Not Installed (-)
Installed pyani pip-install dependendencies...
	pre-commit==Not Installed (-)
	pytest-ordering==Not Installed (-)
	sphinx-rtd-theme==Not Installed (-)
Installed third-party tool versions...
	blast+==Linux_2.16.0+ (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/bin/blastn)
	nucmer==Linux_3.1 (/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/bin/nucmer)
Traceback (most recent call last):
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/bin/pyani", line 10, in <module>
    sys.exit(run_main())
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/scripts/subcommands/subcmd_listdeps.py", line 86, in subcmd_listdeps
    for tool, version in get_tool_versions():
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/dependencies.py", line 119, in get_tool_versions
    yield (name, func())
  File "/bioinf/home/lgallucc/software/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pyani/aniblastall.py", line 112, in get_version
    ).group()
AttributeError: 'NoneType' object has no attribute 'group'
@peterjc
Copy link
Collaborator

peterjc commented Dec 3, 2024

I note 301570f73a84863d33e566c45554d3fd is in the classes.txt file attached:

301570f73a84863d33e566c45554d3fd	GCA_015265475.1	Candidatus Sulfurimonas marisnigri strain SoZ1 chromosome, complete genome

@luigallucci
Copy link
Author

yes, is also in my genomes folder (named by accession_number.fna)

I don't know where the problem could be at that point.

@peterjc
Copy link
Collaborator

peterjc commented Dec 3, 2024

What's an easy way to download these? This fails:

$ pip install ncbi-acc-download
...
$ ncbi-acc-download --format fasta GCA_000012965.1
Failed to download file with id GCA_000012965.1 from NCBI
NCBI Entrez returned error code 400, are ID(s) GCA_000012965.1 valid?

Update: See kblin/ncbi-acc-download#12 (comment) for an explanation and a workaround

@luigallucci
Copy link
Author

@peterjc I downloaded these genomes, searched by taxonomy through ncbi datasets cli.

datasets download genome taxon "$TAXON" --filename "$GENOME_ZIP"

GCA_000012965.1 this is the accession number related to the submitted GenBank assembly, as reported on NCBI.

@peterjc
Copy link
Collaborator

peterjc commented Dec 3, 2024

Thanks. That worked to download the genomes:

$ conda install ncbi-datasets-cli
...

and:

$ for F in $(cut -f 2 ../classes.txt); do \
    datasets download genome accession $F --filename $F.zip && \
    unzip -j $F.zip "*.fna"; \
  done
...

All 295 downloaded, and the MD5 checksums verified to match with:

$ diff <(cut -f 1 ../classes.txt | sort) <(md5sum *.fna | cut -f 1 -d " "  | sort)

In theory I can now try to reproduce your example. Roughly how long did yours take with 40 workers?

@peterjc
Copy link
Collaborator

peterjc commented Dec 4, 2024

Using latest default branch on Linux,

$ pyani --version
pyani version: 0.3.0-alpha
$ pyani createdb --dbpath pyani_0_3.db
$ pyani index --indir genomes/  # gave same classes.txt you shared earlier
$ pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis" 
100%|█████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 1588343.62it/s]
100%|████████████████████████████████████████████████████████████████████████████| 86730/86730 [00:15<00:00, 5649.74it/s]
^C

That first part was fast, but then I aborted with ctrl+c (lots of traceback logging from Python's multiprocessing worker jobs). I think it probably was doing something (I should have checked with top or similar).

Anyway, as per your report, I can reproduce the KeyError:

$ pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis" --classes classes.txt 
Traceback (most recent call last):
  File "/home/pjacock/miniforge3/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/home/pjacock/repositories/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 200, in subcmd_anim
    genome_ids = add_run_genomes(
  File "/home/pjacock/repositories/pyani/pyani/pyani_orm.py", line 558, in add_run_genomes
    label_dict[key] = LabelTuple(label_data[key] or "", class_data[key] or "")
KeyError: '747a9c2b134793ff19f33afaefc75c54'

Thank you, reproducible bug reports are very helpful!

@peterjc
Copy link
Collaborator

peterjc commented Dec 4, 2024

The first progress bar is turning the 295 genomes into 295 x 295 = 86730 pairwise combinations, and is so quick it doesn't really need a progress bar at all.

The second progress bar is building the list of 86730 jobs which will be submitted to either multiprocessing (local) or a cluster. That takes a modest amount of time.

Then sadly there is no progress bar feedback on actually running them! In theory that could be added to the multiprocessing (local job) runner code in https://github.com/widdowquinn/pyani/blob/master/pyani/run_multiprocessing.py while it might be harder for a cluster.

$ time pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis"
...

However, there is progress, top shows lots of jobs doing the delta-filter wrapper, and I can see nucmer output appearing. This was after about an hour:

$ ls pyani_0_3_output/nucmer_output/* -1 | grep -c ".filter"
61586

Sadly this failed:

$ time pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis"
100%|█████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 2655192.45it/s]
100%|███████████████████████████████████████████████████████████████████████████| 86730/86730 [00:08<00:00, 10362.90it/s]
[ERROR] [pyani.scripts.subcommands.subcmd_anim]: At least one NUCmer comparison failed. Please investigate (exiting)
Traceback (most recent call last):
  File "/home/pjacock/miniforge3/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/home/pjacock/repositories/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 296, in subcmd_anim
    run_anim_jobs(joblist, args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 401, in run_anim_jobs
    raise PyaniException("Multiprocessing run failed in ANIm")
pyani.PyaniException: Multiprocessing run failed in ANIm

real    76m8.429s
user    3517m32.283s
sys     98m47.371s

All the filter files got computed:

$ ls pyani_0_3_output/nucmer_output/* -1 | grep -c ".filter"
86730

Recovery mode is quick, but fails - at least with a more useful message:

$ time pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis" --recovery
100%|█████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 2649506.81it/s]
[WARNING] [pyani.scripts.subcommands.subcmd_anim]: Entering recovery mode
100%|███████████████████████████████████████████████████████████████████████████| 86730/86730 [00:07<00:00, 10877.53it/s]
  0%|                                                                                          | 0/86730 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/pjacock/miniforge3/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/home/pjacock/repositories/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 303, in subcmd_anim
    update_comparison_results(joblist, run, session, nucmer_version, args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 438, in update_comparison_results
    qaln_length, saln_length, pc_id, sim_errs = anim.parse_delta(job.outfile)
  File "/home/pjacock/repositories/pyani/pyani/anim.py", line 396, in parse_delta
    avrg_ID = sum(weighted_identical_bases) / sum(aligned_bases)
ZeroDivisionError: division by zero

real    0m21.147s
user    0m30.599s
sys     0m0.996s

I am guessing that could be a broken empty delta file, or it might be a bad alignment between two genomes. This needs to more helpful error message with the filename!

@peterjc
Copy link
Collaborator

peterjc commented Dec 4, 2024

With this patch:

$ git diff
diff --git a/pyani/scripts/subcommands/subcmd_anim.py b/pyani/scripts/subcommands/subcmd_anim.py
index 4c8c40d..cd29d80 100644
--- a/pyani/scripts/subcommands/subcmd_anim.py
+++ b/pyani/scripts/subcommands/subcmd_anim.py
@@ -435,7 +435,11 @@ def update_comparison_results(
         logger.debug(
             "\t%s vs %s; job %s", job.query.description, job.subject.description, job
         )
-        qaln_length, saln_length, pc_id, sim_errs = anim.parse_delta(job.outfile)
+        try:
+            qaln_length, saln_length, pc_id, sim_errs = anim.parse_delta(job.outfile)
+        except ZeroDivisionError:
+            # Should we log this and continue?
+            raise ZeroDivisionError(f"Problem with {job.outfile}") from None
         qcov = qaln_length / job.query.length
         scov = saln_length / job.subject.length
         run.comparisons.append(

We get: ZeroDivisionError: Problem with pyani_0_3_output/nucmer_output/GCA_000147355.1_ASM14735v1_genomic/GCA_000147355.1_ASM14735v1_genomic_vs_GCA_000012965.1_ASM1296v1_genomic.filter

That file was empty. I removed it, and tried again:

$ rm pyani_0_3_output/nucmer_output/GCA_000147355.1_ASM14735v1_genomic/GCA_000147355.1_ASM14735v1_genomic_vs_GCA_000012965.1_ASM1296v1_genomic.filter
$ time pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis" --recovery
100%|█████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 2459880.08it/s]
[WARNING] [pyani.scripts.subcommands.subcmd_anim]: Entering recovery mode
100%|███████████████████████████████████████████████████████████████████████████| 86730/86730 [00:07<00:00, 10939.85it/s]
[ERROR] [pyani.scripts.subcommands.subcmd_anim]: At least one NUCmer comparison failed. Please investigate (exiting)
Traceback (most recent call last):
  File "/home/pjacock/miniforge3/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/home/pjacock/repositories/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 296, in subcmd_anim
    run_anim_jobs(joblist, args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 401, in run_anim_jobs
    raise PyaniException("Multiprocessing run failed in ANIm")
pyani.PyaniException: Multiprocessing run failed in ANIm

real    0m22.296s
user    0m36.502s
sys     0m3.540s

The empty file is back. Did it work for you?

Update: Looks like I had multiple failed delta-filter steps, so lots of empty files.

Update: Yeah, lots of empty failed intermediates. Frustratingly there are too many filter files to use simple wildcard expansion, but this gives an indication - most were empty:

$ dust pyani_0_3_output/nucmer_output/ -e "\.filter$" 
...

Possible patch recover mode to rebuild empty intermediate files (which isn't always the right thing to do):

$ git diff pyani/pyani_files.py
diff --git a/pyani/pyani_files.py b/pyani/pyani_files.py
index 84c44f3..2cc716c 100644
--- a/pyani/pyani_files.py
+++ b/pyani/pyani_files.py
@@ -193,12 +193,13 @@ def load_classes_labels(path: Path) -> Dict[str, str]:
 
 
 # Collect existing output files when in recovery mode
-def collect_existing_output(dirpath: Path, program: str, args: Namespace) -> List[Path]:
+def collect_existing_output(dirpath: Path, program: str, args: Namespace, ignore_empty:bool=True) -> List[Path]:
     """Return a list of existing output files at dirpath.
 
     :param dirpath:  Path, path to existing output directory
     :param program:  str, name of program to use for comparisons
     :param args:  Namespace, command-line arguments for the run
+    :param ignore_empty: bool, should empty files be excluded?
     """
     # Obtain collection of expected output files already present in directory
     if program == "nucmer":
@@ -208,4 +209,8 @@ def collect_existing_output(dirpath: Path, program: str, args: Namespace) -> Lis
             pattern = "*/*.filter"
     elif program == "blastn":
         pattern = "*.blast_tab"
-    return sorted(dirpath.glob(pattern))
+    candidates = sorted(dirpath.glob(pattern))
+    if ignore_empty:
+        # We have to go to the disk again to query file size
+        candidates = [_ for _ in candidates if _.stat().st_size]
+    return candidates

@peterjc
Copy link
Collaborator

peterjc commented Dec 4, 2024

Progress, reaching the third progress bar now!

$ time pyani anim --indir genomes/ --outdir pyani_0_3_output/ --dbpath pyani_0_3.db --name "Sulfurimonas_ANI_Analysis" --recovery
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 2731390.02it/s]
[WARNING] [pyani.scripts.subcommands.subcmd_anim]: Entering recovery mode
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 86728/86728 [00:08<00:00, 9821.39it/s]
  2%|██▏                                                                                                       | 1831/86728 [00:07<05:43, 247.02it/s]
Traceback (most recent call last):
  File "/home/pjacock/miniforge3/envs/pyani-plus_py312/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pjacock/repositories/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
                ^^^^^^^^^^^^^^^
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 303, in subcmd_anim
    update_comparison_results(joblist, run, session, nucmer_version, args)
  File "/home/pjacock/repositories/pyani/pyani/scripts/subcommands/subcmd_anim.py", line 442, in update_comparison_results
    raise ZeroDivisionError(f"Problem with {job.outfile}") from None
ZeroDivisionError: Problem with pyani_0_3_output/nucmer_output/GCA_021044585.1_ASM2104458v1_genomic/GCA_021044585.1_ASM2104458v1_genomic_vs_GCA_001829655.1_ASM182965v1_genomic.filter

real	0m31.862s
user	0m40.162s
sys	0m4.642s

Removing the file it is regenerated again, nearly empty:

$ cat pyani_0_3_output/nucmer_output/GCA_021044585.1_ASM2104458v1_genomic/GCA_021044585.1_ASM2104458v1_genomic_vs_GCA_001829655.1_ASM182965v1_genomic.filter
/home/pjacock/Sulfurimonas/genomes/GCA_021044585.1_ASM2104458v1_genomic.fna /home/pjacock/Sulfurimonas/genomes/GCA_001829655.1_ASM182965v1_genomic.fna
NUCMER

The delta file is the same, as is the other way round (swapping the FASTA order). Both the FASTA files look fine, and their MD5 matches yours. Reducing the test case to just those two fails the same way.

Running nucmer by hand gives the same - they appear to be too different to each other?

@luigallucci
Copy link
Author

So the problem seems to be this two files?

@peterjc
Copy link
Collaborator

peterjc commented Dec 4, 2024

You have at least one outlier in the dataset (one of the pair just mentioned - don't know which one yet), but there could be others. This is something pyANI should handle better. I'll get back to you tomorrow with more information on which sequence(s) I would suggest you leave out.

@peterjc
Copy link
Collaborator

peterjc commented Dec 4, 2024

I would look at dropping all the Sulfurimonas sp. HSL entries.

Also, it looks like you can halve your dataset (e.g. GCA_021044585.1 and GCF_021044585.1 are both Sulfurimonas sp. HSL-3221 chromosome). I forget the significance of the GCA and GCF prefixes but pick one over the other? That should shrink the compute time to a quarter!

@luigallucci
Copy link
Author

@peterjc thank you for the advanced input. I will give it a better look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants