I try keep my ride on the cluetrain rolling by listening to what users of the services I help maintain have to say. The Technorati support forums have provided me with a great opportunity to hear what problems Technorati's members are experiencing. For the uninitiated, Technorati's crawler analyzes web pages to identify blog posts, make them searchable and identify links that measure what the blogosphere is paying attention to. There are a fair number of blogs that get caught in our automated blog flagging; the service processes several million pings per day and amidst that throughput, there are going to be mistakes in the flagging heuristics (flagged blogs are, naturally called "flogs", sometimes they end up demoted as "splogs" but others, turn out to be legit blogs). I'm trying to reduce the mistake rate; the indexing hazards that folks run into are a source of much grief (it doesn't take much to find folks who are very vocal about such lapses).
So, I've been on a tear over the last few weeks chasing down problems in Technorati's crawler and identifying its failure conditions. It's code that, until recently, I've not been too intimate with but inheriting responsibility for its functioning has forced me to study it more closely and grasp a firmer command of python programming. A peculiar failure case that had me puzzled for a while involved blogs that had (sufficiently) well formed pages and feeds, there didn't seem to be anything wrong with the data that'd prevent us from indexing them and yet they consistently failed to get indexed. I first became aware of it in this topic
The issue moved to a new topic where an initial diagnosis I offered (corrupted gzip encoding from Apache 2.2's mod_deflate, I thought) didn't quite pan out. But follow-ups from Technorati users KilRoY66 and wa7son helped clarify that the culprit was the gzip encoding that wordpress was configured to do. Apache 2.2/mod_deflate, you're off the hook. Their blogs (TNTVillage blog and justaddwater.dk | Instant Usability & Web Standards, respectively) both used Apache 2.2 but they both are also hosted on wordpress.org installations. For reasons yet to be explained, python's gzip library detects the encoding returned by wordpress as corrupted. Thank you, Technorati members, for helping identify this issue!
I'm going to patch the code (based on Mark Pilgrim's openanything) to recover from encoding errors and raise a proper exception if it's truly unrecoverable (as it is currently, the code catches any exceptions from decompressing the bytes, prints a message and moves along, essentially swallowing a fundamental error). In the meantime, if you're not getting indexed by Technorati and you have wordpress' compression on, try turning it off and see if that makes a difference.( Apr 21 2007, 02:15:01 PM PDT ) Permalink