Fail to convert non ASCII html #57

Open
itoyuyoti opened this issue Oct 16, 2017 · 3 comments
@itoyuyoti

pynliner fails to convert HTML when non-ASCII characters are combined with pseudo-class selectors.

Minimal code to reproduce

# -*- coding: utf-8 -*-
import pynliner

html = u'''
<style>
blockquote>:first-child { margin-top: 0; }
</style>
<ul>
   <li>
   あいうえお
   <ul>
      <li>
      </li>
   </ul>
   </li>
</ul>
'''

print pynliner.fromString(html)

Error output

Traceback (most recent call last):
  File "c:/central/md2email/poc.py", line 19, in <module>
    print pynliner.fromString(html)
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 318, in fromString
    return Pynliner(**kwargs).from_string(string).run()
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 139, in run
    self._apply_styles()
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 262, in _apply_styles
    for element in select(self.soup, selector.selectorText):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 174, in select
    context_matches = [el for el in context[0].find_all(tag, find_dict) if checker(el)]
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 98, in checker
    if not func(el):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 82, in <lambda>
    'first-child': lambda el: is_first_content_node(getattr(el, 'previousSibling', None)),
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 70, in is_first_content_node
    if is_white_space(el):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 50, in is_white_space
    if isinstance(el, bs4.NavigableString) and str(el).strip() == '':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-8: ordinal not in range(128)
exit status 1
>>> pynliner.__version__
'0.8.0'
python --version
Python 2.7.13 :: Anaconda, Inc.
@itoyuyoti

If we remove あいうえお (the non-ASCII characters) from the code, pynliner works as expected.
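
For anyone wondering why only the non-ASCII text triggers this: my reading of the traceback is that Python 2's str() implicitly encodes a unicode object with the ASCII codec, so any text node containing kana blows up. The sketch below makes that encode step explicit so it behaves the same on Python 2 and 3. The name ascii_failure_span is made up for illustration, and text_node is my assumption about what the parser hands to is_white_space for the outer <li>.

```python
# -*- coding: utf-8 -*-
# Assumed text node that follows the outer <li> in the reproduction:
# a newline, three spaces of indentation, the five kana, then the
# newline and indentation before the nested <ul>.
text_node = u'\n   あいうえお\n   '


def ascii_failure_span(text):
    """Return (first, last) index of the first run ASCII cannot encode, or None.

    Hypothetical helper, not part of pynliner.
    """
    try:
        # This is the conversion Python 2's str() attempts implicitly.
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        # exc.end is exclusive, so subtract one to get the last bad index.
        return (exc.start, exc.end - 1)
    return None


span = ascii_failure_span(text_node)
```

If the assumption about the text node holds, span covers indices 4 through 8, which lines up with "position 4-8" in the error message. A whitespace-only node returns None, which would explain why the HTML converts fine once the kana are removed.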

@itoyuyoti

This patch works for us:

--- a/pynliner/soupselect.py
+++ b/pynliner/soupselect.py
@@ -47,7 +47,7 @@ def get_attribute_checker(operator, attribute, value=''):


 def is_white_space(el):
-    if isinstance(el, bs4.NavigableString) and str(el).strip() == '':
+    if isinstance(el, bs4.NavigableString) and unicode(el).strip() == '':
         return True
     if isinstance(el, bs4.Comment):
         return True
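
Note that unicode() only exists on Python 2, so the patch as written would raise NameError on Python 3. If the same check needs to run on both, something along these lines might work. This is only a sketch: text_type and is_blank_text are names I made up, not anything in pynliner, and it takes any string-like value rather than a bs4 node so that it runs without bs4 installed.

```python
# -*- coding: utf-8 -*-
try:
    text_type = unicode  # Python 2: the type the patch above switches to
except NameError:
    text_type = str      # Python 3: str is already a Unicode type


def is_blank_text(value):
    """Hypothetical helper: True when value contains only whitespace.

    Converting with text_type avoids the implicit ASCII encode that
    str() performs on unicode objects under Python 2.
    """
    return text_type(value).strip() == u''
```

Inside is_white_space, the NavigableString branch could then presumably call is_blank_text(el) instead of comparing str(el).strip() to an empty string; I have not tested that against pynliner itself.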

@jhotujec

I'm having the same error.
