
Problem converting .docx file with to_html method and example from documentation #251

Open
louloup22 opened this issue Jul 27, 2018 · 1 comment

Comments

@louloup22

Hello,

I followed the example from your documentation to convert my docx file to html:

from pydocx import PyDocX
# Pass in a path
html = PyDocX.to_html('file.docx')
# Pass in a file object
html = PyDocX.to_html(open('file.docx', 'rb'))
# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
    buf.write(f.read())
html = PyDocX.to_html(buf)

As I am using Python 3.6, I changed cStringIO to io. However, I always get the same error with my .docx file at the line buf.write(f.read()):

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-24-598c617210d8> in <module>()
     10 buf = StringIO()
     11 with open('file.docx') as f:
---> 12     buf.write(f.read())
     13 html = PyDocX.to_html(buf)

~/anaconda3/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 14: invalid start byte

This happens with every .docx file I tried. Can anybody suggest what is wrong?

@jlward
Contributor

jlward commented Jul 27, 2018

Have you tried encoding f.read().encode('utf8')?
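
For reference, the traceback in the report is raised inside f.read() itself, which suggests the file is being decoded as UTF-8 text before any later call could run. A .docx file is a ZIP archive, so on Python 3 it most likely needs to be opened in binary mode and copied into io.BytesIO rather than io.StringIO. The following is a minimal sketch under that assumption, not a confirmed fix from the maintainers. The helper name load_docx_bytes and the generated sample.docx are hypothetical, and the final PyDocX.to_html(buf) call is left commented out because it has not been verified here against a BytesIO buffer.

```python
import io
import os
import tempfile
import zipfile


def load_docx_bytes(path):
    """Read a .docx (a ZIP archive) into an in-memory binary buffer."""
    buf = io.BytesIO()
    with open(path, 'rb') as f:  # 'rb' skips text decoding entirely
        buf.write(f.read())
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf


# Stand-in for a real document: a tiny ZIP archive saved with a .docx name.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.docx')
with zipfile.ZipFile(sample_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('word/document.xml', '<w:document/>')

buf = load_docx_bytes(sample_path)
is_zip = zipfile.is_zipfile(buf)  # True when the bytes survived intact
buf.seek(0)

# Assumption: with PyDocX installed and a real document, the buffer
# would then be passed on as in the documentation example:
# from pydocx import PyDocX
# html = PyDocX.to_html(buf)
```

The seek(0) matters because writing leaves the buffer position at the end, and a reader starting there would see an empty stream.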
