What's That Noise?! [Ian Kallen's Weblog]


Monday April 16, 2007

Character Encoding Foibles in Python

I was recently stymied by an encoding error (the exception raised was a UnicodeError) on a web page that was detected as utf-8. The W3 Validator said it was utf-8 too, but in all my efforts to get a parser class derived from Python's SGMLParser to process it, it consistently bombed out. I tried chardet:

>>> import chardet
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> chardet.detect(urlread(theurl))
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
...and yet the parser insisted that it had hit the "'ascii' codec can't decode byte XXXX in position YYYY: ordinal not in range(128)" error. WTF?!
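For what it's worth, that message is what Python raises whenever bytes containing values above 127 get decoded with the ascii codec. In Python 2 that decode can happen implicitly when a byte string and a unicode string get mixed, which may or may not be what happened inside the parser here. A minimal sketch that reproduces the message, assuming a made-up sample string (the `try_ascii_decode` helper is hypothetical, not part of any library):

```python
# Hypothetical sample data: UTF-8 bytes for a string with one non-ASCII character.
raw = u"caf\u00e9".encode("utf-8")


def try_ascii_decode(data):
    """Return the error message raised when decoding bytes as ASCII, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = try_ascii_decode(raw)
print(message)
```

So a page can be perfectly valid utf-8 and still trigger this error if something along the way decodes it as ASCII.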

On a hunch, I decided to try forcing it to be treated as utf-16 and then coercing it back to utf-8, like this:

parser.feed(pagedata.encode("utf-16", "replace").encode("utf-8"))
That worked!
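A more conventional way to approach this kind of problem is to decode the raw bytes explicitly, using the encoding that was detected, before handing the data to anything else. The following is only a sketch under assumptions: `to_text` is a hypothetical helper, the sample bytes are made up, and nothing here confirms that it would have fixed this particular page.

```python
def to_text(raw, encoding="utf-8"):
    """Decode raw page bytes to text, replacing undecodable bytes with U+FFFD."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, "replace")
    return raw  # already text, nothing to decode


# Hypothetical sample: UTF-8 bytes as they might come back from a page fetch.
page_bytes = u"na\u00efve caf\u00e9".encode("utf-8")
text = to_text(page_bytes)
print(repr(text))
```

With something like this in place the call would look like `parser.feed(to_text(pagedata, "utf-8"))`, though whether a given SGMLParser subclass copes with unicode input depends on what its handlers do with the data.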

I hate it when I follow an intuited hunch and it pans out, but I don't have any explanation as to why. I just don't know the details of Python's character encoding behaviors well enough to debug this further; most of my work is in those Curly Bracket languages :)
If any Python experts are having any "OMG don't do that, here's why..." reactions, please let me know!

           

( Apr 16 2007, 11:28:31 AM PDT )