Error in running average_nucleotide_identity.py :( #126

Heraud04 · 2019-02-22T22:33:26Z

Hi,
I installed pyani using conda: conda install pyani

When I run the following command: average_nucleotide_identity.py -i allgenomes/ -o anita -m ANIm -g

The following error appears:

Traceback (most recent call last):
  File "/home/dennis/miniconda3/bin/average_nucleotide_identity.py", line 793, in <module>
    org_lengths = pyani_files.get_sequence_lengths(infiles)
  File "/home/dennis/miniconda3/lib/python3.6/site-packages/pyani/pyani_files.py", line 53, in get_sequence_lengths
    sum([len(s) for s in SeqIO.parse(fn, 'fasta')])
  File "/home/dennis/miniconda3/lib/python3.6/site-packages/pyani/pyani_files.py", line 53, in <listcomp>
    sum([len(s) for s in SeqIO.parse(fn, 'fasta')])
  File "/home/dennis/.local/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 637, in parse
    for r in i:
  File "/home/dennis/.local/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 184, in FastaIterator
    for title, sequence in SimpleFastaParser(handle):
  File "/home/dennis/.local/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 64, in SimpleFastaParser
    line = handle.readline()
  File "/home/dennis/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

========================================
pyani Version: 0.2.7
Python Version: 3.6.6
Operating System: Ubuntu 18.04 LTS

Thanks for your reply in advance!

widdowquinn · 2019-02-23T00:28:45Z

Hi @Heraud04,

The error message you're seeing is produced by the Biopython FASTA parser. It's complaining that there's an invalid byte (or character) in one of your input genome files.

Without seeing your data files, I can't tell exactly what the problem is. Would you be able to share a minimal (i.e. small) dataset that still throws this error, along with your command-line, so that I can investigate?

L.

widdowquinn · 2019-03-14T10:05:03Z

Hi @Heraud04

Did you find the problem in your input files, and can I close this issue?

Many thanks,

L.

AlisaGU · 2020-11-05T14:42:05Z

I have the same problem. When I tested pyani on a small genome set, it ran normally. But when I ran it on a larger set (about 100 genomes), errors occurred.

I don't know which genome caused the error.

Do you need any other information? @widdowquinn

widdowquinn · 2020-11-05T14:51:22Z

Hi @AlisaGU

The problem is the same as I noted for @Heraud04 - I believe the error is in one of your FASTA files, and not a problem with pyani. You'll need to check your files for validity and either exclude or fix the problematic file.

Cheers,

L.
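One way to narrow down which input file is at fault is to try decoding each file as UTF-8 outside of pyani and report the first byte that fails. The following is a minimal sketch using only the standard library, not part of pyani or Biopython; the function name and the list of file extensions are assumptions for illustration.

```python
from pathlib import Path

# Assumed set of FASTA extensions; adjust to match your input directory.
FASTA_SUFFIXES = {".fasta", ".fna", ".fa", ".fas"}


def find_undecodable_files(directory, suffixes=FASTA_SUFFIXES):
    """Return (path, byte offset, offending byte) for files that are not valid UTF-8."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        data = path.read_bytes()  # reads the whole file into memory
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that could not be decoded
            problems.append((path, err.start, data[err.start]))
    return problems
```

Calling find_undecodable_files("allgenomes/") and printing each tuple would list only the files that cannot be read as UTF-8. An offending byte of 0xff at offset 0, as in the traceback above, is consistent with a file that starts with a UTF-16 byte-order mark, although other causes are possible.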

AlisaGU · 2020-11-05T22:52:38Z

ok, thanks

widdowquinn · 2020-11-05T22:54:47Z

@AlisaGU - if you do find the problematic file, I'd like to know what the "difficult" characters are. Would you be able to send me the file (or enough of the file to reproduce the issue without giving away data you don't want to share…) when you find it?

AlisaGU · 2020-11-05T22:56:08Z

@widdowquinn It's my pleasure. I will send it to you.

AlisaGU · 2020-11-06T01:54:05Z

@widdowquinn It's weird. The program seems to run normally after I split the genome set into two subsets.
It's running now and no error has been reported.

luigallucci · 2024-12-03T10:58:48Z

Is there a solution to this? I'm running it with more than 100 genomes.

peterjc · 2024-12-03T11:24:04Z

You can try running the Linux tool file on the FASTA files to see which, if any, it thinks have an encoding other than ASCII or UTF-8:

$ file OP073605.fasta
OP073605.fasta: ASCII text

It will accept multiple files at once, e.g. with wild cards.
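Where the file tool is not available, a rough equivalent is to look at the first few bytes of each file for a byte-order mark (BOM). This is only a sketch with a hypothetical function name, and it detects BOMs only, not every non-UTF-8 encoding. The BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path):
    """Return a Python codec name if the file starts with a known BOM, else None."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for bom, codec in BOMS:
        if head.startswith(bom):
            return codec
    return None
```

A return value of "utf-16" would match the reported symptom, since the UTF-16 little-endian BOM is the byte pair 0xff 0xfe and 0xff can never start a valid UTF-8 sequence.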

luigallucci · 2024-12-03T11:46:48Z

@peterjc thanks for the reply. Yes, they are all ASCII text. I was wondering whether splitting the set could solve it, as @AlisaGU said. I will give it a try, but it would be nice to know what to fix.

peterjc · 2024-12-03T12:18:43Z

@luigallucci Oh. You are saying the file command says everything is ASCII, but still you are getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte from Python?

Let's try opening the files directly in Python, in the default text mode. Does this work?

for F in *.fasta; do print $F; python3 -c "import sys; print(len(open(sys.argv[1]).read()))" $F; done

Hopefully it will also give the UnicodeDecodeError, but tell us which file is problematic too.

luigallucci · 2024-12-03T12:27:46Z

@peterjc your command gives me:

Error: no "print" mailcap rules found for type "text/plain"
1734343

but for all the files in my folder.

I also tried grep --color='auto' -P -n '[^\x00-\x7F]' ./*.fna

but that doesn't highlight any lines.
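Two hedged observations on the output above. The 'no "print" mailcap rules' message suggests the shell resolved print to the mailcap utility rather than echoing the filename (echo was presumably intended). The number that follows it indicates that Python read that file in text mode without a decoding error. As a cross-check on the grep result that does not depend on the locale, a minimal Python sketch (the function name is hypothetical) can list any bytes above 0x7F by line and column:

```python
def non_ascii_bytes(path, limit=10):
    """Return up to `limit` (line number, column, byte value) tuples for bytes above 0x7F."""
    hits = []
    with open(path, "rb") as handle:  # binary mode: no decoding is attempted
        for line_no, line in enumerate(handle, start=1):
            for column, byte in enumerate(line, start=1):
                if byte > 0x7F:
                    hits.append((line_no, column, byte))
                    if len(hits) >= limit:
                        return hits
    return hits
```

An empty list for every .fna file would agree with both file and grep, and would point towards some other input (not the genome files) as the source of the byte 0xff.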

luigallucci · 2024-12-03T13:01:01Z

@peterjc I found where the problem is.

It's in my classes file, but I don't know where in the file it is or how to correct it. The file is the one generated by pyani index, with an additional column for the source type of the genomes. I added this column in Excel and thought that was the cause, but adding it another way (e.g. with Python, or by editing the text file directly) gave the same problem.
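If a hand-edited classes file turns out to be stored in an encoding other than UTF-8, one option is to rewrite it as plain UTF-8 before passing it to pyani. Some spreadsheet export options write UTF-16 with a byte-order mark, whose first byte is 0xff. The sketch below is an assumption-laden illustration, not a confirmed fix for this case: the function name is hypothetical, and the latin-1 fallback is a guess for files that have no BOM and are not valid UTF-8.

```python
import codecs


def reencode_as_utf8(src, dst, fallback="latin-1"):
    """Rewrite src as BOM-free UTF-8 with Unix line endings; return the codec used to read it."""
    with open(src, "rb") as handle:
        raw = handle.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        codec = "utf-16"      # the BOM tells Python the byte order
    elif raw.startswith(codecs.BOM_UTF8):
        codec = "utf-8-sig"   # drops the UTF-8 BOM while decoding
    else:
        codec = "utf-8"
    try:
        text = raw.decode(codec)
    except UnicodeDecodeError:
        codec = fallback      # assumption: a single-byte legacy encoding
        text = raw.decode(codec)
    with open(dst, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text.replace("\r\n", "\n"))
    return codec
```

For example, reencode_as_utf8("classes.txt", "classes_utf8.txt") writes a cleaned copy and leaves the original untouched, so the two can be compared before rerunning the analysis.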

peterjc · 2024-12-03T13:33:13Z

@luigallucci if you think the problematic non-ASCII characters are coming from your classes file, then I strongly doubt you are getting the same error as the one originally reported in this issue (#126), which comes from Bio/SeqIO/FastaIO.py when trying to parse a FASTA file.

Perhaps you should open a separate issue with a clear bug report (including versions of Python, pyANI, the operating system, the exact command used, and ideally sample data).

peterjc · 2024-12-03T21:37:54Z

@luigallucci I'm assuming you solved your unicode issue with classes.txt, but have another issue now logged as #442?

widdowquinn self-assigned this Feb 23, 2019

widdowquinn closed this as completed Oct 2, 2019
