Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EpubReader throws Exception if OPF's <metadata> tag contains XML comments #105

Closed
wendong-li opened this issue Aug 22, 2016 · 5 comments
Closed

Comments

@wendong-li
Copy link

wendong-li commented Aug 22, 2016

Just wanted to share some information about an error I faced when parsing the metadata of an epub file using EpubReader.

I was using ebooklib v0.15. When the error happens, the library throws the following exception:

'cython_function_or_method' object has no attribute 'rfind'

Checked the source code, and confirmed that it comes from the following code:

class EpubReader(object):
    ...
    def _load_metadata(self):
        ...
        for t in metadata:
            if not etree.iselement(t):
                continue
            if t.tag == default_ns + 'meta':
                ...
            else:
                tag = t.tag[t.tag.rfind('}') + 1:]    // From here

The reason is that etree.iselement(t) seems to treat both comments and tags as elements, e.g. it returns true even if t is an XML comment.

I was using python 2.7.10 on Mac 10.11 with lxml v3.6.2.

I checked the issue history here but seems no one else met this issue before, is it only me because of my environment?

It seems to me that this is an issue of the lxml library. And I tried to upgrade my lxml to v3.6.3 but failed to do that (couldn't fix some error during the installation).

In case it's a common issue, it would be better that ebooklib handles it internally. That's why I created this ticket.

Thanks,
Wendong

@benjhastings
Copy link

I just hit the same issue. Really the best way to fix this would be to modify ebooklib.utils#parse_string so that it accepts some kwargs. Then the user can specify a parser if they require one.

I ended up patching my local install of ebooklib so that it used my version of the parse_string function that ebooklib calls in the _load_container() method.

My modified "parse_string" method is as follows:

def parse_string(s):
    parser = etree.XMLParser(remove_comments=True)
    try:
        tree = etree.parse(io.BytesIO(s.encode('utf-8')), parser=parser)
    except:
        tree = etree.parse(io.BytesIO(s), parser=parser)
    return tree

I then just patched my local copy so that instead of importing parse_string from ebooklib.utils I imported it from mymodule.utils

@benjhastings
Copy link

I just added a pull request in (#115) which should resolve this issue.

@aerkalov
Copy link
Owner

aerkalov commented Mar 9, 2017

It seems that this issue has been fixed on 2014-10-17 by Chris Grice and the fix is in the master branch but there has been no release since then. Last release was on 2014-05-15. Not so nice from my side. What I will do is fix couple of smaller issues and do the release ASAP.

@aerkalov
Copy link
Owner

There is a 0.16 release now which fixes this issue.

@wendong-li
Copy link
Author

Thanks @aerkalov. Confirmed it's not reproduced any more in v0.16.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants