Error in running average_nucleotide_identity.py :( #126

Closed
Heraud04 opened this issue Feb 22, 2019 · 16 comments
@Heraud04

Hi,
I installed pyani using conda: conda install pyani

When I run the following command: average_nucleotide_identity.py -i allgenomes/ -o anita -m ANIm -g

The following error appears:

Traceback (most recent call last):
  File "/home/dennis/miniconda3/bin/average_nucleotide_identity.py", line 793, in <module>
    org_lengths = pyani_files.get_sequence_lengths(infiles)
  File "/home/dennis/miniconda3/lib/python3.6/site-packages/pyani/pyani_files.py", line 53, in get_sequence_lengths
    sum([len(s) for s in SeqIO.parse(fn, 'fasta')])
  File "/home/dennis/miniconda3/lib/python3.6/site-packages/pyani/pyani_files.py", line 53, in <listcomp>
    sum([len(s) for s in SeqIO.parse(fn, 'fasta')])
  File "/home/dennis/.local/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 637, in parse
    for r in i:
  File "/home/dennis/.local/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 184, in FastaIterator
    for title, sequence in SimpleFastaParser(handle):
  File "/home/dennis/.local/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 64, in SimpleFastaParser
    line = handle.readline()
  File "/home/dennis/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

========================================
pyani Version: 0.2.7
Python Version: 3.6.6
Operating System: Ubuntu 18.04 LTS

Thanks for your reply in advance!

widdowquinn self-assigned this Feb 23, 2019
@widdowquinn
Owner

Hi @Heraud04,

The error message you're seeing is produced by the Biopython FASTA parser. It's complaining that there's an invalid byte (or character) in one of your input genome files.

Without seeing your data files, I can't tell exactly what the problem is. Would you be able to share a minimal (i.e. small) dataset that still throws this error, along with your command-line, so that I can investigate?

L.
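
One observation that may help narrow this down: a byte 0xff at position 0 is what a UTF-16 little-endian byte order mark (FF FE) looks like, though whether that is the cause here is only a guess without the file. A minimal Python sketch for checking the first bytes of a file against the standard byte order marks; the function name detect_bom is hypothetical and not part of pyani or Biopython:

```python
import codecs

# Known byte order marks. Longer marks come first so that a UTF-32 LE mark
# (FF FE 00 00) is not mistaken for a UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(path):
    """Return the encoding implied by a leading byte order mark, or None."""
    with open(path, "rb") as handle:
        head = handle.read(4)  # the longest mark is four bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A plain ASCII FASTA file should return None; any other result points at a file that was saved in a different encoding.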

@widdowquinn
Owner

Hi @Heraud04

Did you find the problem in your input files, and can I close this issue?

Many thanks,

L.

@AlisaGU

AlisaGU commented Nov 5, 2020

I have this problem too. When I tested it on a small genome set, it ran normally. But when I ran it on a big set (about 100 genomes), errors occurred.

I don't know which genome caused this error.

Do you need any other information? @widdowquinn

@widdowquinn
Owner

Hi @AlisaGU

The problem is the same as I noted for @Heraud04 - I believe the error is in one of your FASTA files, and not a problem with pyani. You'll need to check your files for validity and either exclude or fix the problematic file.

Cheers,

L.
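
For a large input set, one way to find the file to exclude or fix is to try decoding every file as UTF-8 and report the ones that fail. The following is a minimal sketch, not pyani functionality: the function name find_undecodable and the list of file extensions are assumptions for illustration.

```python
from pathlib import Path


def find_undecodable(directory, patterns=("*.fasta", "*.fna", "*.fa")):
    """Return (path, offset, byte) for each file that is not valid UTF-8."""
    problems = []
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first undecodable byte,
                # err.object is the full byte string being decoded
                problems.append((path, err.start, err.object[err.start]))
    return problems
```

Calling find_undecodable("allgenomes/") would return an empty list if every matching file decodes cleanly, and otherwise one entry per problematic file with the position and value of the first bad byte.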

@AlisaGU

AlisaGU commented Nov 5, 2020

ok, thanks

@widdowquinn
Owner

@AlisaGU - if you do find the problematic file, I'd like to know what the "difficult" characters are. Would you be able to send me the file (or enough of the file to reproduce the issue without giving away data you don't want to share…) when you find it?
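
If sharing the whole file is not an option, a short hex excerpt around the first undecodable byte may be enough to show what the "difficult" characters are. A rough sketch under that assumption; bad_byte_excerpt is a hypothetical helper, and bytes.hex with a separator needs Python 3.8 or later:

```python
from pathlib import Path


def bad_byte_excerpt(path, window=16):
    """Return (offset, hex excerpt) around the first invalid UTF-8 byte, or None."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Take up to `window` bytes on either side of the failure point
        start = max(0, err.start - window)
        end = min(len(data), err.start + window)
        return err.start, data[start:end].hex(" ")
    return None  # the whole file decoded cleanly
```

The excerpt contains only the bytes next to the failure, so it reveals very little sequence data.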

@AlisaGU

AlisaGU commented Nov 5, 2020

@widdowquinn It's my pleasure. I will send it to you.

@AlisaGU

AlisaGU commented Nov 6, 2020

@widdowquinn It's weird. The program seems to run normally after I split the genome set into two subsets.
It's running and no error has been reported.

@luigallucci

Is there a solution to that? I'm running it with more than 100 genomes.

@peterjc
Collaborator

peterjc commented Dec 3, 2024

You can try running the Linux tool file on the FASTA files to see which, if any, it thinks have an encoding other than ASCII or UTF-8:

$ file OP073605.fasta
OP073605.fasta: ASCII text

It will accept multiple files at once, e.g. with wild cards.

@luigallucci

@peterjc thanks for the reply. Yes, they are all ASCII text. I was wondering whether splitting the set could solve it, as @AlisaGU said. I will give it a try, but it would be nice to know what to fix.

@peterjc
Collaborator

peterjc commented Dec 3, 2024

@luigallucci Oh. You are saying the file command says everything is ASCII, but still you are getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte from Python?

Let's try opening the files directly in Python, in the default text mode. Does this work?

for F in *.fasta; do print $F; python3 -c "import sys; print(len(open(sys.argv[1]).read()))" $F; done

Hopefully it will also give the UnicodeDecodeError, but tell us which filename is problematic too.

@luigallucci

@peterjc your command gives me:

Error: no "print" mailcap rules found for type "text/plain"
1734343

and the same for every file in my folder.

I also tried grep --color='auto' -P -n '[^\x00-\x7F]' ./*.fna

but that doesn't highlight any line.
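
A note on the output above: the 'no "print" mailcap rules' message appears to come from the system print command (a mailcap helper on Ubuntu) rather than from Python, so the loop ran but never echoed the filenames; in bash, echo would be the usual choice. The number is the character count, which suggests the file was read without a decode error. As a cross-check on the grep result, the same search can be done at the byte level in Python. This is only a sketch, and non_ascii_positions is a hypothetical name:

```python
def non_ascii_positions(path, limit=5):
    """Return up to `limit` (line number, column, byte) tuples for bytes above 0x7F."""
    hits = []
    # Binary mode avoids any decoding, so invalid bytes cannot raise an error
    with open(path, "rb") as handle:
        for line_no, line in enumerate(handle, start=1):
            for col, byte in enumerate(line, start=1):
                if byte > 0x7F:
                    hits.append((line_no, col, byte))
                    if len(hits) >= limit:
                        return hits
    return hits
```

An empty list for every input file would agree with the grep and file results, and would suggest that the non-ASCII bytes are in some other file that the command reads.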

@luigallucci

@peterjc I found where the problem is.

It's in my classes file, but I don't know where exactly or how to correct it. The file is the one generated by pyani index, with an additional column for the source type of the genomes. I added this column in Excel and thought that was the cause, but no: adding it by another method (e.g. with Python, or by editing the text file directly) gave the same problem.
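
If a hand-edited text file such as the classes file was saved with a byte order mark or as UTF-16 (some spreadsheet export options do this), rewriting it as plain UTF-8 is one thing to try. The sketch below assumes that is the cause, which has not been confirmed in this thread; rewrite_as_utf8 is a hypothetical helper and not part of pyani:

```python
import codecs


def rewrite_as_utf8(src, dst):
    """Copy a text file to `dst` as UTF-8 without a byte order mark."""
    with open(src, "rb") as handle:
        raw = handle.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The mark selects the byte order and is dropped by the decoder
        text = raw.decode("utf-16")
    else:
        # Drops a UTF-8 mark if present; raises UnicodeDecodeError on invalid UTF-8
        text = raw.decode("utf-8-sig")
    # newline="" keeps the original line endings unchanged
    with open(dst, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return text
```

If the decode step raises an error, the file is in some other encoding and needs to be inspected by hand rather than guessed at.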

@peterjc
Collaborator

peterjc commented Dec 3, 2024

@luigallucci if you think the problematic non-ASCII characters are coming from your classes file, then I strongly doubt you are really getting the same error as the one originally reported in issue #126, which comes from Bio/SeqIO/FastaIO.py when trying to parse a FASTA file.

Perhaps you should open a separate issue with a clear bug report (including versions of Python, pyANI, the operating system, the exact command used, and ideally sample data).

@peterjc
Collaborator

peterjc commented Dec 3, 2024

@luigallucci I'm assuming you solved your unicode issue with classes.txt, but have another issue now logged as #442?
