GenBank ValueError with BioPython #71

Open
MrTomRod opened this issue Sep 23, 2022 · 4 comments

@MrTomRod

Hey! I ran into trouble with the GenBank output. I tried to use BioPython to parse a file that contains these lines:

LOCUS       scf_19                 41458 bp    DNA     linear   VRL 2022-09-23
DEFINITION  scf_19.
COMMENT     Annotated using VIBRANT v1.2.1
FEATURES             Location/Qualifiers
     source          /organism="scf_19"

Normally, BioPython has a fallback code path for parsing "invalidly spaced" GenBank files like yours.

But because the LOCUS line is shorter than 79 characters, the BioPython parser takes a different branch instead, which raises a ValueError on line 1438:

  File ".../lib64/python3.10/site-packages/Bio/GenBank/Scanner.py", line 1438, in _feed_first_line
    raise ValueError(
ValueError: LOCUS line does not contain - at position 71 in date:
LOCUS       scf_19                 41458 bp    DNA     linear   VRL 2022-09-23
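
The error message points at the date field. GenBank LOCUS dates are conventionally written DD-MMM-YYYY (for example 23-SEP-2022), while the line above carries an ISO-style date. A minimal sketch for spotting this without BioPython follows; `GENBANK_DATE` and `locus_date_problem` are hypothetical names introduced here for illustration, not part of BioPython or VIBRANT.

```python
import re

# Assumption: the date is the last whitespace-separated field of the LOCUS line
# and should look like DD-MMM-YYYY, e.g. 23-SEP-2022.
GENBANK_DATE = re.compile(r"^\d{2}-[A-Z]{3}-\d{4}$")


def locus_date_problem(locus_line):
    """Return a short description of the date problem, or None if none is found."""
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        return "not a LOCUS line"
    date = fields[-1]
    if GENBANK_DATE.match(date):
        return None
    return f"date {date!r} is not in DD-MMM-YYYY form"


line = "LOCUS       scf_19                 41458 bp    DNA     linear   VRL 2022-09-23"
print(locus_date_problem(line))
```

The check only looks at the shape of the date; it does not verify column positions or the other fields.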

If it's not too much trouble, please fix this issue.

@KrisKieft
Member

Hi,

I apologize, but I probably will not get around to fixing this issue. Please try other methods of building a GenBank file from the source genomes/proteins.

@MrTomRod
Author

No problem, there are easy workarounds.
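
One possible workaround, sketched here as an assumption rather than as what was actually used, is to rewrite the LOCUS date into DD-MMM-YYYY form before handing the text to a parser. The names `MONTHS`, `fix_locus_date`, and `fix_genbank_text` are hypothetical. This addresses only the date and assumes the rest of the LOCUS line follows the standard column layout; other problems in the file may still need handling.

```python
from datetime import datetime

# Spelled out explicitly so the result does not depend on the system locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def fix_locus_date(locus_line):
    """Rewrite a trailing YYYY-MM-DD date as DD-MMM-YYYY; otherwise return the line unchanged.

    Expects a single line without its trailing newline.
    """
    head, sep, date = locus_line.rpartition(" ")
    try:
        parsed = datetime.strptime(date, "%Y-%m-%d")
    except ValueError:
        return locus_line  # not an ISO date, leave it alone
    return f"{head}{sep}{parsed.day:02d}-{MONTHS[parsed.month - 1]}-{parsed.year:04d}"


def fix_genbank_text(text):
    """Apply fix_locus_date to every LOCUS line in a GenBank text."""
    lines = text.split("\n")
    fixed = [fix_locus_date(l) if l.startswith("LOCUS") else l for l in lines]
    return "\n".join(fixed)


print(fix_locus_date("LOCUS       scf_19                 41458 bp    DNA     linear   VRL 2022-09-23"))
```

The corrected text could then be written to a temporary file or wrapped in `io.StringIO` before parsing.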

@peterjc

peterjc commented Apr 4, 2023

The source feature is also malformed. This used to trigger only a warning in Biopython, but a recent regression made it raise an error instead; see biopython/biopython#4274.

@peterjc

peterjc commented Apr 5, 2023

The Biopython parser flags more GenBank issues in the example from biopython/biopython#4274, and these are probably general issues with the output:

  • Source feature without a location (issue above)
  • Not escaping double quotes in qualifier values as two double quotes
  • Sequence lines indented one more space than normal
  • Missing // line after the sequence at the end of the record
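
On the writer side, the four points above could be addressed roughly as follows. This is a sketch under the usual GenBank flat-file layout (feature key starting in column 6, location and qualifiers in column 22, base position right-aligned in the first nine columns); all function names are hypothetical and are not taken from VIBRANT or Biopython.

```python
def format_qualifier(key, value):
    """Build a qualifier line, escaping embedded double quotes by doubling them."""
    escaped = value.replace('"', '""')
    return " " * 21 + f'/{key}="{escaped}"'


def format_source_feature(length, organism):
    """Build a source feature that spans the whole record and has a location."""
    location = f"1..{length}"
    return [
        "     " + "source".ljust(16) + location,
        format_qualifier("organism", organism),
    ]


def format_sequence_line(start, bases):
    """Build one sequence line: position right-aligned in 9 columns, blocks of 10 bases."""
    blocks = [bases[i:i + 10] for i in range(0, len(bases), 10)]
    return f"{start:>9} " + " ".join(blocks)


def ensure_terminator(record_text):
    """Make sure the record ends with the // terminator line."""
    lines = record_text.rstrip("\n").split("\n")
    if lines[-1].strip() != "//":
        lines.append("//")
    return "\n".join(lines) + "\n"


print("\n".join(format_source_feature(41458, "scf_19")))
```

A real writer would emit up to 60 bases per sequence line; the helper here formats whatever chunk it is given.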
