Unicode Error #34

Open
sjb554 opened this issue Nov 14, 2016 · 5 comments

Comments

@sjb554

sjb554 commented Nov 14, 2016

I am only getting this error once in a while, but it looks like this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcd in position 7: invalid continuation byte

Can this be solved by changing the requirements.txt file? Or, is some other solution appropriate here?

Thanks,
SJB

@evidens
Owner

evidens commented Nov 16, 2016

It sounds like the file in question might not be UTF-8. You say it happens once in a while: are the sources different?
When a file is in a different encoding, many text editors can detect the encoding and open it regardless. Some, like TextMate, let you 'Save As' UTF-8.
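
To illustrate the diagnosis above, here is a minimal sketch. It assumes the source data is Latin-1 (ISO-8859-1), which the thread does not confirm; in that encoding the byte 0xcd from the error message is an uppercase I with an acute accent. The sample bytes and the `try_decode` helper are hypothetical.

```python
from __future__ import print_function

# Hypothetical sample: JSON text saved as Latin-1, with byte 0xcd at index 7.
raw = b'{"a": "\xcdndice"}'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        print("cannot decode as %s: %s" % (encoding, err))
        return None


# As UTF-8, 0xcd announces a two-byte sequence, but the next byte ('n')
# is not a continuation byte, so decoding fails.
as_utf8 = try_decode(raw, "utf-8")

# Latin-1 maps every byte to a character, so this decode cannot fail,
# although the result is only meaningful if the data really is Latin-1.
as_latin1 = try_decode(raw, "latin-1")
```

If the second decode gives readable text for a failing file, the file is probably in a single-byte encoding rather than UTF-8.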


@sjb554
Author

sjb554 commented Nov 17, 2016

I discovered it was only letters such as 'Í' and 'Ó'. My files were large, but none so large that I couldn't manually go in and replace them (largest being a little over 1 GB).

I am thinking that if it comes up more frequently, I would need to make some kind of scrubbing program to change 'Ó' to 'O' and so on.
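
A scrubbing step like that could be sketched as follows, using the standard `unicodedata` module. The `strip_accents` name is hypothetical, and the function works on text that has already been decoded, so it does not by itself resolve the decode error; it is also lossy, since the accents are discarded.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


# U+00D3 and U+00CD are the two letters mentioned above.
cleaned = strip_accents(u"\u00d3 and \u00cd")
```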

Thanks for all the great answers,
SJB

@evidens
Owner

evidens commented Nov 17, 2016

I would look at how you're saving your files. If they're properly encoded in UTF-8, it should support extended character sets (I'm pretty sure I've tested it with French input in the past).


@sjb554
Author

sjb554 commented Nov 17, 2016

That makes sense. My download and save code is not very robust:

```python
def save_json(url):
    import os
    filename = url.replace('/','').replace(':','').replace('.','|').replace('|json','.json').replace('|JSON','.json').replace('Json','.json').replace('|','').replace('?','').replace('=','').replace('&','').replace('_','').replace('-','')
    path = "C:/xxx/json"
    fullpath = os.path.join(path, filename)
    import urllib2
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open(fullpath, 'w')
    f.write(webContent)
    f.close()

f = open('U:/xxx/url_list.txt')
p = f.read()
url_list = p.split('\n')  # '\n' is the line-break delimiter; it can be changed
for url in url_list:
    save_json(url)
```

@evidens
Owner

evidens commented Nov 17, 2016

Use io.open as in this example: http://stackoverflow.com/a/14870531. The files are then written out as UTF-8.
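
A minimal sketch of that suggestion, assuming the downloaded bytes are Latin-1 (the thread does not confirm the real encoding, so `source_encoding` is a guess to adjust). The `save_as_utf8` name and its parameters are hypothetical, not part of this project.

```python
import io


def save_as_utf8(raw_bytes, path, source_encoding="latin-1"):
    """Decode downloaded bytes and write them to `path` as UTF-8.

    source_encoding is an assumption; use whatever the server actually sends.
    """
    # Decode first, so the text is written back out in a known encoding.
    text = raw_bytes.decode(source_encoding)
    # io.open accepts an explicit encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

In `save_json` above, this would take the place of the plain `open(fullpath, 'w')` and `f.write(webContent)` calls, for example `save_as_utf8(webContent, fullpath)`.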

