What's That Noise?! [Ian Kallen's Weblog]

Monday April 16, 2007

Character Encoding Foibles in Python

I was recently stymied by an encoding error (the exception raised was a UnicodeError) on a web page that was detected as utf-8. The W3 Validator said it was utf-8, but in all my efforts to parse it with a class derived from python's SGMLParser, it consistently bombed out. I tried chardet:

>>> import chardet
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> chardet.detect(urlread(theurl))
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

...and yet the parser insisted that it had hit the "'ascii' codec can't decode byte XXXX in position YYYY: ordinal not in range(128)" error. WTF?!
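
For reference, that message is what Python reports whenever bytes outside the 7-bit ASCII range are decoded with the ascii codec. Here is a minimal sketch of the message itself, written for a current Python 3 interpreter rather than the Python 2 of the session above. The sample bytes and the try_decode helper are made up for illustration, and this is not a diagnosis of what SGMLParser was doing internally:

```python
# Bytes for the text "café" encoded as UTF-8; the é becomes two bytes, 0xc3 0xa9.
# (Sample data invented for this sketch, not taken from the page in question.)
raw = "caf\u00e9".encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return (text, None) on success, (None, message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Decoding as ascii fails on the first byte above 0x7f.
ascii_text, ascii_error = try_decode(raw, "ascii")

# Decoding the same bytes as utf-8 succeeds.
utf8_text, utf8_error = try_decode(raw, "utf-8")

print(ascii_error)
print(utf8_text)
```

So the bytes can be perfectly valid utf-8 and still trigger the error, as long as something along the way decodes them as ascii.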

On a hunch, I decided to try forcing it to be treated as utf-16 and then coercing it back to utf-8, like this:

parser.feed(pagedata.encode("utf-16", "replace").encode("utf-8"))

That worked!

I hate it when I follow an intuited hunch and it pans out, but I don't have any explanation as to why. I just don't know enough of the details of python's character encoding behaviors to debug this further; most of my work is in those Curly Bracket languages :)
If any python experts are having any "OMG don't do that, here's why..." reactions, please let me know!
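
For comparison, the pattern usually suggested for this kind of problem is to decode the downloaded bytes into text once, using the detected encoding, and hand the parser text only. What follows is a rough sketch under several assumptions, not a verified fix for the page above. It targets Python 3, where sgmllib no longer exists, so html.parser.HTMLParser stands in for SGMLParser. TitleParser, parse_title and the sample page bytes are all hypothetical names and data invented for the example, and the encoding argument is assumed to come from something like the chardet result:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical stand-in parser that collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.title += data


def parse_title(page_bytes, encoding):
    """Decode bytes to text explicitly, then feed the parser text only."""
    # "replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = page_bytes.decode(encoding, "replace")
    parser = TitleParser()
    parser.feed(text)
    parser.close()
    return parser.title


# Sample page invented for this sketch; a real one would come from the network.
page_bytes = "<html><head><title>Caf\u00e9 menu</title></head></html>".encode("utf-8")
title = parse_title(page_bytes, "utf-8")
print(title)
```

The point of the design is that the encoding is named exactly once, at the boundary where bytes become text, so nothing downstream has to guess.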

python utf8 character sets character encoding chardet sgmlparser

( Apr 16 2007, 11:28:31 AM PDT )
