Fail to convert non ASCII html #57

itoyuyoti · 2017-10-16T07:29:58Z

pynliner fails to convert HTML when non-ASCII characters are combined with pseudo-classes.

Minimal code to reproduce

# -*- coding: utf-8 -*-
import pynliner

html = u'''
<style>
blockquote>:first-child { margin-top: 0; }
</style>
<ul>
   <li>
   あいうえお
   <ul>
      <li>
      </li>
   </ul>
   </li>
</ul>
'''

print pynliner.fromString(html)

Error output

Traceback (most recent call last):
  File "c:/central/md2email/poc.py", line 19, in <module>
    print pynliner.fromString(html)
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 318, in fromString
    return Pynliner(**kwargs).from_string(string).run()
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 139, in run
    self._apply_styles()
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 262, in _apply_styles
    for element in select(self.soup, selector.selectorText):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 174, in select
    context_matches = [el for el in context[0].find_all(tag, find_dict) if checker(el)]
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 98, in checker
    if not func(el):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 82, in <lambda>
    'first-child': lambda el: is_first_content_node(getattr(el, 'previousSibling', None)),
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 70, in is_first_content_node
    if is_white_space(el):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 50, in is_white_space
    if isinstance(el, bs4.NavigableString) and str(el).strip() == '':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-8: ordinal not in range(128)
exit status 1

>>> pynliner.__version__
'0.8.0'

python --version
Python 2.7.13 :: Anaconda, Inc.


itoyuyoti · 2017-10-16T07:38:44Z

If we remove あいうえお (the non-ASCII characters) from the code, pynliner works as expected.
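To illustrate why only the non-ASCII text triggers the failure, here is a minimal sketch that does not use pynliner or bs4 at all. It assumes that `str()` on a unicode object in Python 2 behaves like an implicit `.encode('ascii')`; the helper names `is_blank_via_ascii` and `is_blank` are made up for this example and are not part of pynliner.

```python
# -*- coding: utf-8 -*-
# Text shaped like the node that follows the outer <li> in the reproduction:
# a newline, three spaces, the five non-ASCII characters, then whitespace.
text = u"\n   あいうえお\n   "


def is_blank_via_ascii(value):
    # Assumption: stands in for Python 2's str(unicode_obj), which
    # implicitly encodes the text with the ASCII codec.
    return value.encode("ascii").strip() == b""


def is_blank(value):
    # Strips the text directly; no encoding step is involved.
    return value.strip() == u""


try:
    is_blank_via_ascii(text)
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error is not None)  # whether the ASCII round trip failed
print(is_blank(text))     # result of the text-level check
```

With pure ASCII input both helpers agree, which would be consistent with the observation that removing the characters makes the error go away.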

itoyuyoti · 2017-10-16T08:22:59Z

This patch works for us

--- a/pynliner/soupselect.py
+++ b/pynliner/soupselect.py
@@ -47,7 +47,7 @@ def get_attribute_checker(operator, attribute, value=''):


 def is_white_space(el):
-    if isinstance(el, bs4.NavigableString) and str(el).strip() == '':
+    if isinstance(el, bs4.NavigableString) and unicode(el).strip() == '':
         return True
     if isinstance(el, bs4.Comment):
         return True
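One caveat with the patch: `unicode` is a Python 2 builtin and does not exist on Python 3. A possible variant that runs on both is sketched below. It is only an illustration, not what pynliner ships: `text_type` and `is_blank_text` are hypothetical names, and the `bs4.NavigableString` type check from `is_white_space` is left out so the snippet runs without bs4 installed.

```python
# -*- coding: utf-8 -*-
import sys

# `unicode` exists only on Python 2; on Python 3, `str` is already the text
# type. The conditional expression evaluates only the branch that applies,
# so the name `unicode` is never looked up on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def is_blank_text(value):
    # Hypothetical helper: converts to the text type without an implicit
    # ASCII encode, then checks whether only whitespace remains.
    return text_type(value).strip() == u""


print(is_blank_text(u"\n   "))
print(is_blank_text(u"\n   あいうえお\n   "))
```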

jhotujec · 2018-06-19T14:01:17Z

I'm having the same error
