What's That Noise?! [Ian Kallen's Weblog]

20080407 Monday April 07, 2008

The WordPress Security Cancer

The blogosphere has had its share of maladies before. Comment spam, trackback spam, splogs and link trading schemes are the colds and flus that we've come to know and groan about. But lately, a cancer has afflicted the ecosystem that has led us at Technorati to take some drastic measures. Thousands of WordPress installations out in the wilds of the web are vulnerable to security compromises; they are being actively exploited, and we're not going to index them until they're fixed.

We know about them at Technorati because part of what we do is count links. Compromised blogs have been coming to our attention because they have unusually high numbers of outbound links to spam destinations. The blog authors are usually unaware that they've been p0wned because the links are hidden with style attributes to obscure their visibility. Some bloggers only find out when they've been dropped by Google. As one WordPress user wrote:

My 2.2 installation was being hacked into and spam hidden links dumped into index.php. I didn't notice until google decided to ban me (they have now reincluded my site).
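To illustrate what "hidden with style attributes" can look like in practice, here is a minimal Python sketch that lists off-site links sitting inside inline-hidden markup. It is not Technorati's crawler logic: the names `HiddenLinkFinder` and `find_hidden_outbound_links` are hypothetical, and the two inline-style patterns it checks are assumed examples, not a complete list of hiding tricks (external stylesheets, off-screen positioning and the like are ignored).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Inline styles assumed (for this sketch) to mean "hidden from the reader".
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenLinkFinder(HTMLParser):
    """Collects hrefs of off-site <a> tags that are inside hidden markup."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.stack = []          # (tag, hidden?) for every open element
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        parent_hidden = self.stack[-1][1] if self.stack else False
        hidden = parent_hidden or bool(HIDDEN_STYLE.search(style))
        if tag == "a" and hidden:
            host = urlparse(attrs.get("href") or "").netloc
            if host and host != self.own_host:
                self.hidden_links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_hidden_outbound_links(html, own_host):
    finder = HiddenLinkFinder(own_host)
    finder.feed(html)
    return finder.hidden_links
```

Running something like this against a blog's own index page and finding any hits at all is a strong hint that a template has been tampered with.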

To their credit, the WordPress developers have been fixing the issues. They released v2.3.3 in February and patches for older releases to thwart this exploit. More recently, they released v2.5, which in addition to having the flawed XML-RPC code fixed, boasts a number of new features. But from what I can tell, despite brisk uptake many blogs remain obliviously vulnerable and the occurrence of compromised blogs seems to only be accelerating. As of today, here is the count of blogs running WordPress installs that have pinged Technorati in the last 90 days:

Version    Count (in thousands)
2.3.3      237
2.3.1      154
2.3.2      146
2.5        78
2.2.2      75
2.2.3      67
2.0.1      59
2.1.2      37
2.2.1      35
2.2        30
and it trails off with more point releases. So 2.3.3 and 2.5 have enjoyed rapid adoption but AFAICT, it ain't rapid enough -- there are still hundreds of thousands of vulnerable installations out there. Note: I didn't include the WordPress/MU installations out there; I'm not sure what, if any, vulnerabilities are on those sites and anyway, there's a long tail of splog sites running that shite already.
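As a rough tally of the counts above, this small Python sketch sums the listed releases older than 2.3.3. Treat the result as an upper bound for the listed releases only: some older installs may have had the patches applied, and the trailing point releases are not included. The helper names are made up for illustration.

```python
# Counts (in thousands) copied from the table of listed releases.
counts_in_thousands = {
    "2.3.3": 237, "2.3.1": 154, "2.3.2": 146, "2.5": 78, "2.2.2": 75,
    "2.2.3": 67, "2.0.1": 59, "2.1.2": 37, "2.2.1": 35, "2.2": 30,
}


def version_key(version):
    """Turn '2.3.1' into (2, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def installs_older_than(counts, fixed_version):
    """Sum the counts of every listed release older than fixed_version."""
    cutoff = version_key(fixed_version)
    return sum(n for v, n in counts.items() if version_key(v) < cutoff)


older = installs_older_than(counts_in_thousands, "2.3.3")
total = sum(counts_in_thousands.values())
print(f"{older}K of {total}K listed installs predate 2.3.3")
```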

So at Technorati today, I posted that we deployed an update to the crawlers to abort the crawl if the blog appears to have symptoms of being compromised. We'll probably rescind this measure when the number of vulnerable installations in the distribution above looks a little better (some of the false positives I've found are patched but still have unusual metrics associated with the crawl, so they look fishy). However for the time being, these are just creating a lot of noise and instability in our systems and enough is enough. If you're running an old WordPress installation and you're not getting indexed, stop what you're doing and upgrade. Just Do It. The docs on the WordPress site seem to cover what you need to know and the WordPress Forums should help fill in the gaps.

Digging through the lore, it looks like there has been a procession of security problems with WordPress installations:

wp-forum
There's the 'WP-Forum Plugin for WordPress "user" SQL Query Injection Vulnerability' advisory from French Security Incident Response Team in January.
theme distributors
WordPress theme author Derek Punsalan advised 'Do not download WordPress themes distributed by 3rd party sites' last November.

Using Technorati membership information, I have personally contacted several hundred bloggers about this issue. These have included blogs with no authority as well as blogs belonging to A-listers. Many have been grateful for the heads up but none (that I have spotted) have posted about this issue. The blogs that are unclaimed are SOL; I don't have any way to reach them (without groping around their site to find a contact email, though I've done a little of that too). Kevin Burton has made a public plea, Anyone Want to Help Fix these Compromised Wordpress Blogs? One blog that did break the silence (Deep Jive Interests) did so in response to tweets about the issue that Kevin's been facing on TailRank.

But is outreach to bloggers going to be enough to stop the spread of this cancer? Probably not. I think the best way to get the word out is to spread the word: tell bloggers you know to post about it. For their part, what I'd really like to see from the WordPress folks (and all blog CMS developers) are

  1. Automated updates -- I understand that automating upgrades may be problematic when there are database schema changes and such required but installing security patches should be an option in the administrative console
  2. Security check services -- Bloggers who are uncertain of their blog's vulnerability should be able to authenticate (via OpenID) that they are the author and have their blog sniffed for security holes. OK, this won't work for old versions that don't support OpenID or if, heaven forbid, the OpenID libraries themselves are compromised but I think you get the point. If it can be sanely checked, check it.
Ultimately, this issue may have to be resolved by Matt Cutts or maybe the official Google blog publicizing it -- the threat of being in Google's penalty box seems to be a sure way to get people riled up. I expect they'll be lining up for chemo-therapy in short order.

         

( Apr 07 2008, 10:23:44 PM PDT )


20080209 Saturday February 09, 2008

Building A Team of Rock Stars is Cheaper

Building a team of rock stars is cheaper than a team of lower-salaried, less experienced programmers. It's also harder. The notion that there is more economy in the enthusiasm of project contributors and having "more hands on deck", even if they're cheaper hands, is naive. As Martin Fowler puts it:

If the cost premium for a more productive developer is less than the higher productivity of that developer, then it's cheaper to hire the more expensive developer.
You might assume that there's a positive scaling effect with a larger team. Fowler continues:
The trouble is that that assumption assumes productivity scales linearly with team size, which again observation indicates isn't the case. Software development depends very much on communication between team members. The biggest issue on software teams is making sure everyone understands what everyone else is doing. As a result productivity scales a good bit less than linearly with team size. As usual we have no clear measure, but I'm inclined to guess at it being closer to the square root.
Keep reading the Cheaper Talent Hypothesis.
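To make Fowler's square-root guess concrete, here is a small Python sketch. It assumes team output is individual productivity times the square root of team size (Fowler's guess, not a measured law), and the salary and productivity figures are invented purely for illustration.

```python
import math


def team_output(size, individual_productivity):
    """Output under the assumed square-root scaling with team size."""
    return individual_productivity * math.sqrt(size)


def cost_per_unit_output(size, salary, individual_productivity):
    """Total payroll divided by what the team actually produces."""
    return (size * salary) / team_output(size, individual_productivity)


# Hypothetical numbers: 4 expensive, productive developers
# versus 16 cheaper, less productive ones.
small_team = cost_per_unit_output(size=4, salary=2.0, individual_productivity=3.0)
large_team = cost_per_unit_output(size=16, salary=1.0, individual_productivity=1.0)
print(f"small team: {small_team:.2f} per unit, large team: {large_team:.2f} per unit")
```

Under these assumptions, cost per unit of output works out to salary times the square root of team size divided by individual productivity, so every added body makes each unit of output more expensive.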

Trouble is, finding the highly capable and seasoned talent can be a long search. Weeding out the fakers is time consuming; finding the right fit for those who are for real takes longer. And so the search goes on. Technorati is searching; if you're the real deal, call us.

     

( Feb 09 2008, 08:48:53 AM PST )


20071204 Tuesday December 04, 2007

Percolation in the Blogosphere

I've worked on a number of different web service and enterprise software products before but never gave one its external name until today. Our release of the Technorati Percolator is the culmination of months of work to harness the vast flow of raw data coming through Technorati to distill a palatable data volume and it's named for the internal moniker I'd been using for it during its development (after all, names with "buzz" and "meme" in them just wouldn't do). While you're looking around at the things we've cooked up in the percolator, make sure you also check out rising links of the day on Blogger Central and today in photos. Today we released them and I mentioned a bit about what goes into them on the Technorati blog. What I didn't elaborate on is what this release means to me on a personal level.

I originally came to Technorati in 2004 after a conversation with Dave fired up my creative sparks about the blogosphere. He had all of these rich conceptualizations about the technology changes in our midst, the social significance of decentralized events, the basic human drives that motivate them, the power of the long tail and the peculiar phenomenon that when you work in the service of others you reap the rewards manifold. I knew I had to work with him to build the ultimate air-traffic-control radar, real time search and meta-CMS systems. The 2004 political season provided an opportunity to work on those problems; the zeitgeist applications that we built to work with CNN's election coverage were thrilling accomplishments.

Since then, Technorati has undergone tremendous growth (regularly chronicled in Dave's "State of the Blogosphere" posts) on the foundation of a search vertical that had no precedent: the real time search of distributed micropublished sources. A number of technology changes were necessitated to scale us up; those changes have been likened to rebuilding your jet aircraft's engines at 40,000 feet. A lot has happened since 2004 (the growing pains have been regularly chronicled by the blogosphere) but until now, few of our outward facing accomplishments have excited me as much as the percolator.

There are a lot of great sites out there using votes, comments, ratings and other explicit actions that are taken as representative of social gestures. There are also a lot of great sites that use implicit social gestures such as links to identify significant publishes; these are much closer to Technorati's heart. However, our aspirations are to look further along the long tail than most of these other sites can. Bloggers have said they want to see more than "all of the usual suspects". In an October 2007 post, ParisLemon said he wanted

a 'backpage' of sorts where some of us "B-listers" who are on ... everyday under the headlines, could have a chance to have some of our other tech stories showcased

Every day the percolator is surfacing thousands of things that the blogosphere is talking about: blog posts, news stories and other stuff. It's true, the "A-list" percolates more posts and they bubble up higher; this is basic social software physics and classic power law stuff. But we have put a stake in the ground; we're going to serve bloggers across the power curve spectrum who are producing quality posts and acquiring attention from other bloggers as well as identify where the other attention magnets are by enabling an application that highlights them. When you walk into a crowded party and there are a myriad of conversations going on, you want to find the conversations that are pertinent to your interests and who the thought leaders are in those conversations. For me, today's release marks a new beginning of Technorati playing the role of connector and catalyzer. I hope you enjoy it!

       

( Dec 04 2007, 11:45:40 PM PST )


20071029 Monday October 29, 2007

Addiction

Every family and every community has them. Addicts. Lives twisted by chemical dependency and the accompanying mental illnesses. Maybe I'll never fully understand how lives can wind down into oblivion in that way. Given the many opportunities to, I consider myself lucky to have never succumbed to such an existence.

From his sister, here's a short contribution to understanding the life, decline and death of one of my teenage cohorts Sherwood Brewer. For now and evermore, I imagine he's partying with Skitchie: boot-a-doot-doot!

   

( Oct 29 2007, 08:59:13 AM PST )


20071026 Friday October 26, 2007

memcached hacks

I needed to clear a cache entry from a memcached cluster of 5 instances. Since I didn't know which one the client had put it in, I concocted a command line cache entry purger. netcat AKA nc(1) is my friend.

Let's say the cache key is "shard:7517" and the memcached instances are running on hosts ghcache01, ghcache02, ghcache03, ghcache04 and ghcache05 on port 11111. The incantation to spray them all with a delete command is

$ for i in 1 2 3 4 5
> do echo $i && echo -e "delete shard:7517\r\nquit\r\n" | nc -i1 ghcache0$i 11111
> done
and the output looks like
1
NOT_FOUND
2
DELETED
3
NOT_FOUND
4
NOT_FOUND
5
NOT_FOUND
which indicates that the memcached instance on ghcache02 had the key and deleted it (note the memcached protocol response: DELETED), the rest didn't have it and returned NOT_FOUND.
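For anyone who would rather script this than lean on netcat, here is a minimal Python sketch of the same idea: send the text-protocol `delete` to every instance and report each reply. It uses only the standard library; the function names are hypothetical, and the hostnames and port in the usage note are just the ones from the example above.

```python
import socket


def build_delete_command(key):
    """The same bytes the echo | nc pipeline sends: a delete, then quit."""
    return f"delete {key}\r\nquit\r\n".encode("ascii")


def parse_delete_reply(raw):
    """Return the first reply line, e.g. DELETED or NOT_FOUND."""
    return raw.decode("ascii", errors="replace").split("\r\n", 1)[0]


def delete_everywhere(key, hosts, port=11111, timeout=2.0):
    """Ask every memcached instance to drop the key; map host -> reply."""
    results = {}
    for host in hosts:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(build_delete_command(key))
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:      # server closed the connection after quit
                    break
                chunks.append(data)
        results[host] = parse_delete_reply(b"".join(chunks))
    return results
```

Calling `delete_everywhere("shard:7517", [f"ghcache0{i}" for i in range(1, 6)])` should return a reply per host, much like the loop above prints one per instance.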

For more information on the memcached protocol, see the docs under source control.

     

( Oct 26 2007, 12:29:37 PM PDT )


20071014 Sunday October 14, 2007

Benchmarks Smenchmarks

I've been hearing about folks using the LiteSpeed web server instead of Apache for its supposed performance gains and ease of use. OK, so maybe if you're not familiar with the subtleties and madness of Apache, it can seem complicated. But the performance issues are often red herrings. Granted, it's been a few years since I've done any web server benchmarking but from my previous experience with these things, the details really matter for the outcomes and in the real world, the outcomes themselves matter very little.

The benchmark results published on the LiteSpeed Technologies web site raised a flag for me right away: their comparison to Apache 2.0 was with the prefork MPM instead of the worker MPM. Is it any wonder that the results are pretty close to those for Apache 1.3? Either they had no idea what they were doing when they performed this benchmark or they knew exactly what they were doing and were burying the superior scalability of the worker MPM. Pitting a threaded or event driven process model against a forked one is just stupid. However, the evidence leans more towards willful sloppiness or fraud than ignorance. For instance, they claim to have raised the concurrency on Apache above 10k connections ... but they link to an httpd.conf that has MaxClients set to 150. RTFM, that can't happen.

Why don't these things matter in the real world? In benchmark world, there aren't varying client latencies (slow WAN links, etc), varying database response times (for instance MySQL's response times are very spiky) or the vagaries of load balancers ebbing and flowing the load, and logging configurations aren't set up for log data management. In the real world, application design and these various externalities are the culprits in application performance, not CPU bottlenecks in the web server runtime. The PHP interpreter itself is likely not your bottleneck either. If you're writing crap-assed code that performs unnecessary loops or superfluous database calls, it's going to run like crap no matter what web server is driving it (I've had to pick through a lot of error-ridden PHP code in my day). With Apache's support for sendfile() static file serving and all of the flexibility you get from mod_proxy, mod_rewrite and the rest of the toolkit, I don't understand the appeal of products like LiteSpeed's.

       

( Oct 14 2007, 11:43:08 PM PDT )


20071013 Saturday October 13, 2007

When Is A Blog Not A Blog?

There are blogs that don't take comments (like this one: I don't have time to moderate spam). There are mainstream media sites that are adopting reader comments. There are blogs being published by independent companies with editorial staff. There are big media organizations publishing columns and event streams as blogs. So I'm finding myself asking some basic questions about blogging of late: Is it an indication of maturation or mutation of the blogosphere that there's quibbling about what's a blog and what isn't? Is main stream media's co-opting of blogospheric mores a harbinger of a Thermidor to some un-televised revolution? Has the little town become so much of a metropolis that twitter, facebook and other social media are the destinations of urban flight?

The basic existential questions of the blogosphere and where its boundaries reside have been open to consideration (and re-consideration) for quite some time. Not a day goes by on the Technorati support forums without a splogger showing up to complain that their spam isn't getting indexed (Note: I'm not saying everyone who has indexing problems is a spammer, I'm saying spammers come rolling in to complain about it). A few weeks ago, Scoble melodramatically lamented that the TechMeme leaderboard heralds "the death of blogging":

I was just looking at the TechMeme Top 100 List and noticed that it has very few bloggers on it. I can only see about 12 real blogs on that list. Blogging being defined as 'single voice of a person.' Most of the things on the list are now done by teams of journalists - that isn't blogging anymore in my book.
It's true, many of the successful blogs have a prolific editorial staff. But death? Really? Why is blogging as an individual practice more or less than blogging as part of a collaborative enterprise? The weblogsincs, gawkers and huffington posts of the world are manifestations of blogging as a format but, from what I can tell, are no less or more blogs than any others. New blogs continue to be created every second, and some of them will eventually develop thriving audiences.

The line between micropublishing and macropublishing is blurring. Reuters recently announced that they're taking comments on stories and Ally Insider's revelation that the New York Times is posting reader comments got a lot of play. In his post about Technorati rankings, Doug Karr doesn't feel that CNN Political Ticker should be considered a blog. So I'm asking myself, when is a blog not a blog?

Sometimes blogs (the narrower Scoble definition kind) provide the primary source for the facts of our times. Other times, it's main stream media that is bringing forth those facts. As the emergence of blogs that operate like main stream media continues and main stream media adopts blogging as a technology and practice, perhaps this is the ultimate outcome of a leveled publishing playing field: changes will flow along many vectors, cross bred practices are inevitable and Darwinistic rules will prevail such that a lot of things that you'd previously not have considered blogs are morphing into them.

       

( Oct 13 2007, 11:22:11 PM PDT )


20070925 Tuesday September 25, 2007

Word of the Day: schadendouche


schadendouche (SHOD-n-doosh)
A Darwin Award candidate whom you may take great pleasure in chuckling at.
What an unbelievably fortuitous advertisement for Flickrbooth! Apparently an alleged laptop thief (or the back-alley buyer or dope connection or whatever) took a self portrait with photobooth that unwittingly revealed his picture in the victim's flickr stream. I think this warrants a Bugs Bunny quote: "What a maroon!"

via boingboing
Listen up kids, crime doesn't pay.

           

( Sep 25 2007, 10:22:48 AM PDT )


20070819 Sunday August 19, 2007

BSODly errors trying to read Linus Torvalds

I was trying to follow a link to Linus Torvalds' railing against Subversion; the irony of getting this error heightens the humor:

Microsoft OLE DB Provider for ODBC Drivers error '80004005'

[Microsoft][ODBC SQL Server Driver][SQL Server]Transaction (Process ID 134) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

/efytimes/lefthome.asp, line 193 
Sure, database problems happen regardless of the enabling technology; this is not unique to Microsoft. However, I seem to run into completely fubarred application degradation like this (essentially a BSOD on the web) far more often with ASP and .Net based sites than those enabled by other technologies. Of course, architecting a site to require a database transaction to serve a content page (without any user data transaction) is a firing offense any place I'll ever work.

       

( Aug 19 2007, 09:46:22 AM PDT )


20070629 Friday June 29, 2007

Power Bet

Last night I was among an invited group that Powerset brought in to witness how their natural language search sausage is made. It was actually kinduva cold cut platter: not exactly a meal but an interesting variety was offered for consumption.

When I was a kid, I thought that by 2007 we'd all have flying cars, rocket packs and computers would be all-seeing/all-knowing accoutrements on our wrists. I think all of us who ever watched Scotty verbally ask the Enterprise questions and get responsive answers in English sentences have had hunger pangs for satisfying natural language search. Powerset is trying to advance human-computer interfaces a little closer to that satisfaction, leap frogging previous efforts, by licensing Xerox PARC's technology and hiring a buncha heavy hitters to make it real.

Powerset COO Steve Newcomb introduced some of the sluggers in their line-up, walked attendees through the thinking behind their PR and release strategy and provided a peek into their search capabilities.

Among the impressive powersetters are people who have been-there/done-that with scaled-up search such as x-Yahoo!'s Chad Walters and Tim Converse (read Tim's post the other day about term proximity and linguistics, great stuff), as well as experts in natural language search with backgrounds at PARC and Ask Jeeves. As a company, they're not just-another-web2.0 rails app built by 2 guys and trying to get to the next level. Powerset is more of a bold bottled-lightning science experiment embracing ruby n' rails as a way to get it in front of people.

Powerset has signed up 10K people since announcing the availability of updates and previews on PowerLabs a few weeks ago. Newcomb characterized their labs preview effort as a way to use social software to guide product management decisions, "a mashup of Digg, Facebook and Google apps." I'm a big fan of transparency and community inclusion, it will be interesting to see how inclusive/closed this effort is.

OK, so after all of that, the "Where's the beef?" moment arrived. A side-by-side comparison interface was demonstrated with Powerset results on the left and Google results on the right. Explaining that the test index was scoped to Wikipedia, the goog results were similarly scoped down. The Powerset use case was demonstrated with a query like "What politicians were killed by disease?" On goog, the results are matching terms (and variants on their stems), "politicians", "killed" and "disease". Powerset matches semantically similar tokens and their grammatical relationships.

So Powerset's top result for that query highlighted Sir Edward Heath died from pneumonia on Wikipedia's page for Edward Heath. Highlighting a completely different snippet (none of the query terms were matched but the semantics were) that accurately answers the query is very impressive. Powerset is using Freebase's ontology and WordNet's synonym mappings to connect indexed sentence structures to the query. They do all of this analysis and mapping at index time, which undoubtedly raises the cost of indexing tremendously. They're making a big bet that the raised search results quality will pay those costs back.

When asked about the computational horsepower required to index web documents with the sentence structure decomposition and semantics mappings, Newcomb hedged at first ("Barney's gonna kill me", referring to CEO Barney Pell). But alas, he convinced himself (or did a good job method-acting conviction) that it was safe to reveal that it takes them about a second to grammatically analyze and index a typical document. As he lamented his confession again, someone from the audience quipped the query, "Which CEO killed Steve Newcomb?" Yea, he didn't search their index for that.

On the subject of Google comparisons, Newcomb kinda squirmily described Powerset as reverent of ("not cocky about") what Google has accomplished but taking a different approach to web search. Doing side-by-side comparisons with Google as their demo does is pretty ballsy and it seems to get them in trouble; being positioned as a "Google killer" by their audience of search wonks and journalists when things are still very much at a proof-of-concept level seems rather premature. I think Powerset needs to reel that in lest they awaken a sleeping giant and fill him with a terrible resolve while they're still on the tarmac. If you've designed a new aircraft, you don't trumpet about revolutionizing aeronautics before the test pilots have taken off. Particularly if folks are proclaiming that Boeing is in trouble. When Powerset indexes a real web corpus, it will be interesting to see how successfully they can overlay web graph, clustering/disambiguation, time and other relevance components. I think that will provide a real moment-of-truth.

Powerset is making a big bet on natural language search as a transformative technology. They've got a lot of great people and a lot of great technology. All in all, the presentation felt a little dog-and-ponyish with the limited corpus but I'm looking forward to hearing more from them later this year when they release a major iteration.

           

( Jun 29 2007, 10:41:46 AM PDT )


20070616 Saturday June 16, 2007

Natural Hackasters

Hack Day: London, June 16/17 2007

I'm reading with amusement and wonder the events that unfolded at the Yahoo! Hackday in London. Apparently the Alexandra Palace main hall (the BBC's venue for this) has a roof that opens up. And it did. This was precipitated by a lightning strike on the building as a storm blew over (precipitated, storm: no pun left shall be unpunned). Yes, audience member laptops are open, PA system all set up... and it's raining inside the hall. Not to worry, all Londoners are equipped with umbrellas at all times. That's a fact. "I thought a bomb went off", sez Chad of the lightning strike when he was on IM a few hours later. Is the roof there like Chase Field where the Diamondbacks play baseball in Phoenix? I dunno, I'm checking out pictures of "Ally Pally" to assess. Anyway, power and wifi are back and the show goes on.

Follow along with Hackday London Lightning on Technorati's hackdaylondon tag stream.

( Jun 16 2007, 10:45:59 AM PDT )


20070609 Saturday June 09, 2007

Disappearance of the Desktop Interface

I was sick of various computer OS desktop metaphors 10-12 years ago. At the time, I thought virtual reality technologies were gonna take over (anybody else remember VRML?). I remember the Windows 95/98 releases, lauded by Microsoft as such great advancements, striking me as just laughable in their utter lack of imagination (even if they were big upgrades from the Windows 3.x mess). When that "innovation" made it to Windows XP, I realized that Microsoft was hopelessly lost as far as OS interface design.

Since then, I've seen a lot of technology changes that I view as the harbingers of the desktop metaphor's demise. Graphics card technology that was once only found on $15-50k SGI pizza box workstations is now cheap as pizza. Jeff Han's demonstration of high resolution multi-touch applications at eTech and TED last year was fantastic. At TED again this year, the photosynth demonstration got a big round of "oohs" and "aahs" from a rapt audience (you must see the detail zooming, also check out this photosynth demo reel).

So when are we gonna see these technologies in our everyday lives? Apparently, soon. It's funny how different Apple's and Microsoft's forays into this are. In a few weeks, Apple is coming out with a $500 phone (the multi-touch usage is demonstrated at 3:55 into this MacWorld TV report from last January). By the end of the year, we will reportedly see Microsoft's $10k coffee table appearing in hotel lobbies. Can't wait? Fishing in your pocket for an extra $10k? Into StarCraft? There are some folks working on a multi-touch DIY kit (Microsoft: 0, Hackers: 1).

Putting on my futurist hat: Five years from now, Intel's 80-cores-on-a-fingernail chip, voice recognition audio inputs and multi-touch screens on commodity devices will make the desktop metaphor seem like a quaint joke. Kids born today will shake their heads in disbelief that desktops were productive tools. I've yet to explain a command line interface to my kids, who are grade school age; as familiar and comfortable as those interfaces are to me, the youngins look at me typing in a shell window with puzzlement. In their youthful eyes, I may as well be composing Vulcan legal tracts (the reality is probably more frightful, it might really be Perl). Computing interfaces will fade away into our intuition.

I just wish the iPhone was coming out in time for father's day (yes, honey, that's a hint). In the meantime, I'm still putting up with Apple and Microsoft's OS interfaces, wincing at the trash cans, recycle bins, folder icons, etc. It'll be good riddance.

         

( Jun 09 2007, 10:23:46 AM PDT )


20070601 Friday June 01, 2007

Web Spam As Signs of the Times

There was a time not long ago when Findory offered a credible value proposition for participants and consumers of the blogosphere. The idea of a blog recommendation and reader personalization service is a good one. I guess things didn't work out as planned at Findory. Earlier this year, Greg Linden announced that Findory was riding into the sunset.

The old Findory blog (@ http://findory.blogspot.com/) has been dormant for some time (the last posts from Greg were in 2005), now it's been taken over by a splogger who has been grabbing abandoned blogspot URLs (this one has PageRank of 3) and posting link farm links and German keywords to them. Sad.

I'd recommend holding on to your blogspot URLs forever; even if you're not using 'em anymore it's better to maintain the museum piece than contribute to the web spam problem.

         

( Jun 01 2007, 12:55:10 PM PDT )


20070519 Saturday May 19, 2007

Ruby on Rails Ads

There's a series of "Mac vs. PC" ad knock-offs for Ruby on Rails on YouTube; they're really funny. I'm starting to use Ruby in favor of Perl (or trying to) for a lot of everyday duct-tape stuff; it's a great language. Some of the hyperbole around ruby and rails and peace-on-earth is a little amusing too but for now, laugh along and let 'em have their fun!

   

( May 19 2007, 06:59:38 AM PDT )


20070516 Wednesday May 16, 2007

PostgreSQL Quirk: invalid domains

I've had my fill of MySQL's quirks, so I thought I'd plumb for PostgreSQL's. So many things that MySQL is fast and loose about, PostgreSQL is strict and correct. However, I was fiddling around with PostgreSQL's equivalent to MySQL's enum and found what I would expect a strict RDBMS to be strict about... not so strict.

PostgreSQL does not have enum but there are a few different ways you can define your own data types and constraints and therefore prescribe your own constrained data type. This table definition will confine the values in 'selected' to 5 characters with the only options available being 'YES', 'NO' or 'MAYBE':

ikallen=# create table decision ( selected varchar(5) check (selected in ('YES','NO','MAYBE')) );
CREATE TABLE
ikallen=# insert into decision values ('DUH');
ERROR:  new row for relation "decision" violates check constraint "decision_selected_check"
ikallen=# insert into decision values ('CLUELESS');
ERROR:  value too long for type character varying(5)
ikallen=# insert into decision values ('MAYBE');
INSERT 0 1
I don't want to hear any whining about how diff-fi-cult constrained types are. Welcome to the NBA, where RDBMS' throw elbows. The flexibility you get from loosely constrained types will come back to bite you on your next programming lapse.

So what's wrong with this:

ikallen=# create table indecision ( selected varchar(5) check (selected in ('YES','NO','MAYBE SO')) );
CREATE TABLE
ikallen=# insert into indecision values ('MAYBE');
ERROR:  new row for relation "indecision" violates check constraint "indecision_selected_check"
ikallen=# insert into indecision values ('MAYBE SO');
ERROR:  value too long for type character varying(5)
ikallen=#
'MAYBE SO' is in my list of allowed values but violates the width constraint. Should this have ever been allowed? Shouldn't PostgreSQL have complained vigorously when a column was defined with varchar(5) check (selected in ('YES','NO','MAYBE SO'))? Yes? No? Maybe?

Well, I think so.
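A rough sketch of the sanity check one might wish the database ran at definition time: compare each value in the check list against the declared column width. The function name unreachable_values is made up for illustration; nothing like it is part of PostgreSQL.

```python
def unreachable_values(width, allowed):
    """Return the allowed values that can never fit in a varchar(width) column.

    Hypothetical helper for illustration only; PostgreSQL performs no such
    check when the table is created.
    """
    return [value for value in allowed if len(value) > width]


# The first definition from the post is consistent with its width...
print(unreachable_values(5, ["YES", "NO", "MAYBE"]))     # []
# ...the second lists a value the width constraint will always reject.
print(unreachable_values(5, ["YES", "NO", "MAYBE SO"]))  # ['MAYBE SO']
```

Running something like this over a schema before applying it would catch the 'MAYBE SO' contradiction before any insert fails.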

One of the cool things about PostgreSQL is the ability to define a constrained type and use it in your table definitions:

ikallen=# create domain ynm varchar(5) check (value in ('YES','NO','MAYBE'));
CREATE DOMAIN
ikallen=# create table coolness ( choices ynm );
CREATE TABLE
ikallen=# insert into coolness values ('nope');
ERROR:  value for domain ynm violates check constraint "ynm_check"
ikallen=# insert into coolness values ('YES');
INSERT 0 1
Coolness!

Contrast with MySQL's lax handling of what you'd expect to be a constraint violation:

mysql> create table decision ( choice enum('YES','NO','MAYBE') );
Query OK, 0 rows affected (0.01 sec)

mysql> insert into decision values ('ouch');
Query OK, 1 row affected, 1 warning (0.03 sec)

mysql> select * from decision;
+--------+
| choice |
+--------+
|        |
+--------+
1 row in set (0.00 sec)

mysql> select length(choice) from decision;
+----------------+
| length(choice) |
+----------------+
|              0 |
+----------------+
1 row in set (0.07 sec)

mysql> insert into decision values ('MAYBE');
Query OK, 1 row affected (0.00 sec)

mysql> select * from decision;
+--------+
| choice |
+--------+
|        |
| MAYBE  |
+--------+
2 rows in set (0.00 sec)

mysql> select length(choice) from decision;
+----------------+
| length(choice) |
+----------------+
|              0 |
|              5 |
+----------------+
2 rows in set (0.00 sec) 
Ouch, indeed. Wudz up wit dat?

There are a few things that MySQL is really good for but if you want a SQL implementation that does what you expect for data integrity, you should probably be looking elsewhere.

       

( May 16 2007, 07:33:00 PM PDT ) Permalink View blog reactions


20070509 Wednesday May 09, 2007

No splogs, ay

I had to take a few days off of work last week because of my aching back, it was really a fog-of-pain for a few days but this week I'm on the mend and in beautiful Banff for the WWW 2007 conference. Actually, I'm mostly here for the AIRweb workshop but staying a few extra days to hear what folks are thinking about regarding the future of the web, online information retrieval, humanity, and so on.

The AIRweb submissions included a lot of web graph related research. Some of it makes quite intuitive sense: web spammers will link to their spam sites as well as legitimate sites (camouflage) but legitimate sites don't link to web spam sites. So some of the talks discussed the underlying linear algebra of these phenomena (Anti-TrustRank and BadRank) or their inapplicability to identifying spam (TrustRank). The presentations about temporal patterns, spam term density, the effects of on-the-fly re-ranking and JavaScript redirection were quite interesting.

A lot of these rank-demotion and web graph heuristics aren't really central to the efforts we have at Technorati for thwarting splogs. We instrument the data streams for baseline behaviors of various features. It's more like an intrusion detection system because fundamentally, web spammers can't behave like "normal" publishers and still succeed; they have to compensate for their absence of popularity with all kinds of abnormal behaviors and those behaviors are quite intrusive if you're listening for them. And so we are. This is by no means perfect but we're doing way better than 80-20. It's my belief that as the web becomes more participatory and there are incentives and opportunities to inject junk into it, intrusion detection will be as vital a capability as search relevance rank demotion for maintaining a high quality experience. At the close of the workshop, I proposed that the web spam research community tell us what they want; what can we do to help? I can only imagine that Technorati's data streams could prove useful for the growing challenges of the participant-driven and temporally sensitive web.
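To make the "baseline behaviors" idea concrete, here is a toy sketch of flagging a feature value that strays far from its baseline. It is not Technorati's actual method; the feature (outbound links per post), the is_outlier name and the 3-standard-deviation threshold are all assumptions picked for illustration.

```python
from statistics import mean, stdev


def is_outlier(value, baseline, threshold=3.0):
    """Flag a value sitting more than `threshold` standard deviations
    above the baseline mean (illustrative heuristic only)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything above it counts as abnormal.
        return value > mu
    return (value - mu) / sigma > threshold


# Hypothetical outbound-link counts per post for "normal" publishers.
normal_link_counts = [3, 5, 4, 6, 5, 4, 5, 3]

print(is_outlier(5, normal_link_counts))    # False
print(is_outlier(250, normal_link_counts))  # True
```

A real system would presumably track many features at once and keep the baselines up to date; the point here is only that abnormal behavior stands out once you have something to compare it against.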

So that was yesterday.

This morning, Tim Berners-Lee kicked off with a keynote that touched on the successive innovations of email, the web, wikis and blogs. On the iterative nature of technological and social change, he drew a cycling diagram of the needs that emerge when changes occur and enjoy widespread adoption and the collaborative/creative forces that drive innovation. He laid out how the Semantic Web was the next iteration and complex meaning will be readily accessible on the web. OK, that's all well and good. However, I just don't buy this idea that the Semantic Web is ... the Web at all. We have a web for people (he acknowledged as much at the beginning of the talk) but the idea of having tons of detailed data representations for generalized browsers of really complex data... I just don't get why folks won't end up building domain specific apps anyway. Building UI's for "general data representation" means that you'll never really be able to represent the domain specific qualities within some part of The Ontology. At least, I've never seen those things work. Useful apps need domain experts (champions of the end-user e.g. product managers) and engineers to build something that works for that domain. Generic UI's break down when dealing with the nuances of specific domains. I want a data-rich web for humans that is machine consumable (microformats), not a parallel-universe web of machine-oriented RDF. Anyway, thanks for inventing the web TBL and good luck all you Semantic Webbers. I think you'll need it.

I almost fell out of my chair though when TBL said that blog spam isn't really a problem. I'll surmise that he has a set feed reader repertoire (or, old school bookmarks) and doesn't use blog search much. While I think we've done a pretty good job spam scrubbing Technorati, the fact remains that there is a veritable ocean of pinging rubbish mongers engaging in underhanded payola schemes, kleptotorial and other nefarious endeavors out there. What spam you do see on Technorati is the tip of the iceberg. Tim, use our site, despite the iceberg tip :)

Side notes: when in Canada going to "google.com" gets redirected to "google.ca" which includes a toggle to search "The Web"/"Pages from Canada" ... amusing, ergo the graphic in this post. Also, I can't believe how long the days are here; about 3 hours more daylight than the San Francisco bay area!

So thanks to Brian Davison, Carlos Castillo and Kumar Chellapilla for putting together a great AIRweb program, good work guys! I'm heading home tomorrow.

                                     

( May 09 2007, 09:44:35 PM PDT ) Permalink View blog reactions


20070430 Monday April 30, 2007

Blogging Upright

I've been asleep just about all day, the pain killers and muscle relaxants they gave me last night were that good.

It all started a few weeks ago when EBMUD sent me a water bill that indicated over three times our normal water usage (and three times the cost). Everything seemed fine with all of the household plumbing. I called for an inspection, their inspector didn't show up on the day I expected them. But we got a note left on the door saying that, while nobody is home, the water meter runs continuously and that our usage continues to be unusually high.

Over the weekend, I checked around the house more diligently. What I thought may have been a wet spot by the side of the garage (not far from a spigot) seemed like a good candidate, so I got the shovel and started digging. The soil didn't get much softer as I dug deeper. There was no specific motion or event that I recall being more vigorous than others but in the hours that followed, a pain in my lower back grew. And grew. And grew to a point of intensity that everything I did hurt in my lower back. Sitting down. Getting up from a sitting position. Lying down. Everything hurt, intensely! A doctor friend of mine told me that I musta skipped chapter 2 of the "You're over 40 now" manual where it is specified not to do any more shoveling. Doh!

At the emergency room, they gave me a cocktail of toradol, dilaudid and phenergan and a prescription for soma and percocet. The shot last night really knocked me out, I've been asleep off and on most of the day today. I'm gonna be doing a lot of lying down with ice on my back. A lot of walking around. But not a lot of sitting. So, I'm writing this post woozy from the drugs but standing upright with the pooter on the kitchen counter. Gonna go for a walk next. I need to resolve things with the water company and the plumbing on our premises.

           

( Apr 30 2007, 04:47:38 PM PDT ) Permalink View blog reactions


20070426 Thursday April 26, 2007

Temperature Swings At The Old Ball Game

The rhythm of baseball is always about hot streaks and cold streaks. In the 2006 season, the Giants couldn't put together any sustained hot streaks; it was a dark time for Giants fans -- I don't think they won more than 3 games in a row, and that they only did a few times. The first weeks of 2007 baseball were even darker; losing 7 out of the first 9 games disheartened a lot of fans. But what a difference now, the Giants have gone from a polar chill to an equatorial blaze in a matter of weeks; they've won 9 of their last 10!

Matt Cain finally has the victory he's been deserving; he's got a 1.55 ERA but what should be a 4-0 record is only at 1-1 so far. I think we're gonna see his W:L ratio shifting favorably in the weeks ahead. Barry Bonds is getting pitches, and smashing them. I'm sure soon enough competing team managers will get the message: the old Barry is back and crushinger than ever and we'll see lots of 4 finger calls. But for now, enjoy the ride.

Yesterday's victory came on the backs of a partial relief squad (Todd Linden and Lance Niekro) as Omar Vizquel and Dave Roberts took a rest (Roberts came on late in the game as a pinch runner and scored). Next up this evening, Russ Ortiz will duel against Brad Penny and I'm looking forward to an exciting game. Three words: beat sweep el aye!

       

( Apr 26 2007, 07:14:56 AM PDT ) Permalink View blog reactions


20070425 Wednesday April 25, 2007

Intel Migration Pain With Perl

There's a bunch of code that I haven't had to work on in months. Some of it predates my migration from PPC Powerbook to the Intel based MacBook Pro. Now that I'm dusting this stuff off, I'm running into binary incompatibilities that are messin' with my head. I recompiled my Apache 1.3/mod_perl installation just fine but doing a CVS up on the code I need to work on and updating the installation, there's a new CPAN dependency. No problem, use the CPAN shell. Oh, Class::Std::Utils depends on version.pm and it's ... the wrong architecture. Re-install version.pm. Next, XMLRPC::Lite is unhappy 'cause it depends on XML::Parser::Expat and it's ... the wrong architecture.

Aaaaugh!

The typical error looks like

mach-o, but wrong architecture at /System/Library/Perl/5.8.6/darwin-thread-multi-2level/DynaLoader.pm
I just said "screw it" and typed "cpan -r" ... which looks to be the moral equivalent of "make world" from back in my FreeBSD days. Everything that has an XS interface just needs to be recompiled.

Compiling... compiling... compiling. I guess that'll give me time to write a blog post about it. OK, that's done, seems to have fixed things: back to work.

                 

( Apr 25 2007, 05:19:37 PM PDT ) Permalink View blog reactions


20070423 Monday April 23, 2007

Simple is as simple ... dohs!

I was working on an Evil Plan (tm) to serialize python feedparser results with simplejson.

 parsedFeed = feedparser.parse(feedUrl)
 print simplejson.dumps(parsedFeed) 
Unfortunately, I'm hitting this:
TypeError: (2007, 4, 23, 16, 2, 7, 0, 113, 0) is not JSON serializable 
I'm suspecting there's a dictionary in there that has a tuple as key and that's not allowed in JSON-land. So much for simple! Looks like I'll be writing a custom serializer for this. I was just trying to write a proof-of-concept demo; what I've proven is that just 'cause "simple" is in the name, doesn't mean I'll be able to do everything I want with it very simply.

I've had a long day. A good night's sleep and fresh eyes on it tomorrow will probably get this done but if yer reading this tonight and you happen to have something crafty up your sleeve for extending simplejson for things like this, let me know!
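One way a custom serializer could go, sketched under a couple of assumptions: the nine-field tuple in the error has the shape of a time.struct_time (the kind of value feedparser exposes in its *_parsed fields), and Python's standard json module stands in for simplejson here. The helper name to_jsonable is made up for this example. Rather than teaching the encoder new types, it walks the parsed structure first and converts anything awkward into plain JSON-friendly values.

```python
import json
import time


def to_jsonable(value):
    """Recursively convert a parsed-feed-like structure into JSON-safe types."""
    # Check struct_time before the generic tuple case, since it is a tuple subclass.
    if isinstance(value, time.struct_time):
        # Assumes the timestamp is UTC, hence the trailing "Z".
        return time.strftime("%Y-%m-%dT%H:%M:%SZ", value)
    if isinstance(value, dict):
        # JSON object keys must be strings, so stringify whatever the key is.
        return {str(key): to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Last resort for unknown objects: fall back to their string form.
    return str(value)


# A stand-in for a parsed feed, using the tuple from the error message.
stamp = time.struct_time((2007, 4, 23, 16, 2, 7, 0, 113, 0))
parsed_feed = {"entries": [{"title": "hello", "updated_parsed": stamp}]}

print(json.dumps(to_jsonable(parsed_feed)))
```

The trade-off is that the date comes out as a string, so anything reading the JSON back has to parse it again if it wants a real timestamp.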

     

( Apr 23 2007, 10:50:21 PM PDT ) Permalink View blog reactions


20070422 Sunday April 22, 2007

Linux Virtual Memory versus Apache

I ran into a very peculiar case of an Apache 2.0.x installation with the worker MPM completely failing to spawn its configured thread pool. The hardware and kernel versions weren't significantly different from other systems running Apache with the same configuration. Here are the worker MPM params in use:

ServerLimit         40
StartServers        20
MaxClients        2000
MinSpareThreads     50
MaxSpareThreads   2000
ThreadsPerChild     50
MaxRequestsPerChild  0
But on this installation, same version of Apache and RedHat Enterprise Linux 4 like the rest, every time httpd started it would cap the number of threads spawned and leave these remarks in the error log:
[Fri Apr 20 22:54:24 2007] [alert] (12)Cannot allocate memory: apr_thread_create: unable to create worker thread 

It turns out that a virtual memory parameter had been adjusted, vm.overcommit_memory had been set to 2 instead of 0. Here's the explanation of the parameters I found:

overcommit_memory is a value which sets the general kernel policy toward granting memory allocations. If the value is 0, then the kernel checks to determine if there is enough memory free to grant a memory request to a malloc call from an application. If there is enough memory, then the request is granted. Otherwise, it is denied and an error code is returned to the application. If the setting in this file is 1, the kernel allows all memory allocations, regardless of the current memory allocation state. If the value is set to 2, then the kernel grants allocations above the amount of physical RAM and swap in the system as defined by the overcommit_ratio value. Enabling this feature can be somewhat helpful in environments which allocate large amounts of memory expecting worst case scenarios but do not use it all.
From Understanding Virtual Memory
The vm.overcommit_ratio value is set to 50 on all of our systems but rather than fiddling with that, setting vm.overcommit_memory to 0 had the intended effect; Apache started right up and readily stood-up to load testing.

So, if you're seeing these kinds of evil messages in your Apache error log, use sysctl and check out the vm parameters. I haven't dug further into why the worker MPM was conflicting with this memory allocation config; next time I run into Aaron, I'm sure he'll have an explanation in his back pocket.
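For checking those parameters from a script rather than by hand, here is a minimal sketch that reads them from /proc/sys/vm on Linux. The function names are made up for this example. The commit limit arithmetic follows the Linux kernel's overcommit accounting documentation, which describes mode 2 as capping total committed address space at swap plus overcommit_ratio percent of physical RAM (huge pages ignored here); that description comes from the kernel docs, not from the explanation quoted above.

```python
from pathlib import Path


def parse_vm_setting(raw_text):
    """Parse the text of a /proc/sys/vm file into an integer."""
    return int(raw_text.strip())


def read_vm_setting(name, proc_root="/proc/sys/vm"):
    """Read one vm sysctl, e.g. 'overcommit_memory' (Linux only)."""
    return parse_vm_setting((Path(proc_root) / name).read_text())


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Approximate commit limit under overcommit_memory=2, per the kernel docs."""
    return swap_kb + ram_kb * overcommit_ratio // 100


# Example with made-up sizes: 4 GB of RAM, 2 GB of swap, ratio of 50.
print(commit_limit_kb(4_000_000, 2_000_000, 50))

# On a Linux box, the live values could be read like this:
#   mode = read_vm_setting("overcommit_memory")
#   ratio = read_vm_setting("overcommit_ratio")
```

If the mode comes back as 2, comparing that limit against the Committed_AS figure in /proc/meminfo is one place to start looking.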

                 

( Apr 22 2007, 08:19:57 PM PDT ) Permalink View blog reactions


20070421 Saturday April 21, 2007

The Users of Your Service Are Your Best Friends

I try to keep my ride on the cluetrain rolling by listening to what users of the services I help maintain have to say. The Technorati support forums have provided me with a great opportunity to hear what problems Technorati's members are experiencing. For the uninitiated, Technorati's crawler analyzes web pages to identify blog posts, make them searchable and identify links that measure what the blogosphere is paying attention to. There are a fair number of blogs that get caught in our automated blog flagging; the service processes several million pings per day and amidst that throughput, there are going to be mistakes in the flagging heuristics (flagged blogs are naturally called "flogs"; sometimes they end up demoted as "splogs" but others turn out to be legit blogs). I'm trying to reduce the mistake rate; the indexing hazards that folks run into are a source of much grief (it doesn't take much to find folks who are very vocal about such lapses).

So, I've been on a tear over the last few weeks chasing down problems in Technorati's crawler and identifying its failure conditions. It's code that, until recently, I've not been too intimate with but inheriting responsibility for its functioning has forced me to study it more closely and grasp a firmer command of python programming. A peculiar failure case that had me puzzled for a while involved blogs that had (sufficiently) well formed pages and feeds, there didn't seem to be anything wrong with the data that'd prevent us from indexing them and yet they consistently failed to get indexed. I first became aware of it in this topic

The issue moved to a new topic where an initial diagnosis I offered (corrupted gzip encoding from Apache 2.2's mod_deflate, I thought) didn't quite pan out. But follow-ups from Technorati users KilRoY66 and wa7son helped clarify that the culprit was the gzip encoding that WordPress was configured to do. Apache 2.2/mod_deflate, you're off the hook. Their blogs (TNTVillage blog and justaddwater.dk | Instant Usability & Web Standards, respectively) both used Apache 2.2 but they both are also hosted on wordpress.org installations. For reasons yet to be explained, python's gzip library detects the encoding returned by WordPress as corrupted. Thank you, Technorati members, for helping identify this issue!

I'm going to patch the code (based on Mark Pilgrim's openanything) to recover from encoding errors and raise a proper exception if it's truly unrecoverable (as it is currently, the code catches any exceptions from decompressing the bytes, prints a message and moves along, essentially swallowing a fundamental error). In the meantime, if you're not getting indexed by Technorati and you have WordPress' compression on, try turning it off and see if that makes a difference.
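A sketch of what "recover if possible, otherwise raise a proper exception" could look like. This is not the crawler's code and not openanything; the names BodyDecodeError and decode_gzip_body are invented for illustration, and it uses modern Python's gzip and zlib modules. The idea is to try a strict decode first, then fall back to a streaming decompressor that can salvage output from a body whose gzip trailer is missing or damaged.

```python
import gzip
import zlib


class BodyDecodeError(Exception):
    """Raised when a gzip-encoded body cannot be decoded at all."""


def decode_gzip_body(raw):
    """Decode a gzip body, salvaging what we can from a damaged stream."""
    try:
        # Strict path: validates the header, CRC and length trailer.
        return gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        pass  # fall through to the lenient path

    # Lenient path: a streaming decompressor hands back whatever it managed
    # to inflate, even if the stream ends before the trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        salvaged = decompressor.decompress(raw)
    except zlib.error as exc:
        raise BodyDecodeError("gzip body is unrecoverable: %s" % exc) from exc
    if not salvaged:
        raise BodyDecodeError("gzip body yielded no data")
    return salvaged
```

The caller then gets either usable bytes or an exception it has to deal with, instead of a printed message and a silently skipped blog.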

                 

( Apr 21 2007, 02:15:01 PM PDT ) Permalink View blog reactions


20070420 Friday April 20, 2007

WiFi on the Train: Blogging From BART

I'm currently about 60 feet under Market Street in downtown San Francisco, inside a BART station. But I'm connected to the wifi_rail network with 5 bars. I haven't fired up any YouTube streams yet but for IM, twitter updates and ...blogging, this is groovy; I'll take it!

I haven't seen any official announcements about BART's wifi system but as a serendipitous user, I hope it's here to stay. In fact, I hope it's extended to cover the track between stations, the transbay tube and the east bay stations as well! Maybe I'm being a little over-appreciative (greedy).

               

( Apr 20 2007, 07:57:26 PM PDT ) Permalink View blog reactions


20070419 Thursday April 19, 2007

A Giant Turn Around?

Could the extra-inning push last night, kicked off by Barry Bonds' tying slash homer in the 8th be the harbinger of baseball to come? I'm quite impressed with how Armando Benitez and Jonathan Sanchez held back the Cardinals long enough for a 12th inning surge from Randy Winn, Omar Vizquel and Rich Aurilia. We're seeing real solid playing from those guys and Ray Durham. The pitching rotation is solid, the losses that Matt Cain has suffered... are really an injustice. The guy's pitched fantastic, if we see the run support turn on he'll be putting up the W's. I expect Barry Zito's shutout the other day to be the first of many. Noah Lowry, Matt Morris and Russ Ortiz get props too, those guys and much of the roster are pretty damned solid.

Today's 6-2 romp over the Cards has me thinking that the Giants won't be spending too much more time down there at the bottom of the division. I think the offensive slump from the season's start can be declared officially over. What remains to be seen is whether they can sustain this kind of solid play day in and day out. I have faith they will! Let's Go Giants!

Now if only the temperatures felt like baseball weather; it's cold!

( Apr 19 2007, 06:57:28 PM PDT ) Permalink View blog reactions


20070416 Monday April 16, 2007

Character Encoding Foibles in Python

I was recently stymied by an encoding error (the exception thrown was kicked off by UnicodeError) on a web page that was detected as utf-8. The W3 Validator said it was utf-8 too, but in all my efforts to get a parsing class derived from python's SGMLParser to handle it, it consistently bombed out. I tried chardet:

>>> import chardet
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> chardet.detect(urlread(theurl))
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
...and yet the parser insisted that it had hit the "'ascii' codec can't decode byte XXXX in position YYYY: ordinal not in range(128)" error. WTF?!

On a hunch, I decided to try forcing it to be treated as utf-16 and then coercing it back to utf-8, like this

parser.feed(pagedata.encode("utf-16", "replace").encode("utf-8"))
That worked!

I hate it when I follow an intuited hunch, it pans out but I don't have any explanation as to why. I just don't know the details of python's character encoding behaviors to debug this further, most of my work is in those Curly Bracket languages :)
If any python experts are having any "OMG don't do that, here's why..." reactions, please let me know!
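For what it's worth, a common source of "'ascii' codec can't decode" errors in Python 2 was implicit coercion when byte strings and unicode objects got mixed; whether that was the culprit here isn't established by anything above. A sketch of the explicit approach in present-day Python 3, where the raw bytes are decoded once with the detected encoding before the parser ever sees them. The LinkCollector class and parse_links helper are made up for this example, and html.parser stands in for the old SGMLParser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(raw_bytes, encoding="utf-8"):
    """Decode the page bytes explicitly, then feed text to the parser."""
    # errors="replace" keeps going past stray bytes instead of raising.
    text = raw_bytes.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


page = '<p>caf\u00e9 <a href="http://example.com/men\u00fc">menu</a></p>'.encode("utf-8")
print(parse_links(page))
```

The encoding argument is where a detector's answer (chardet's 'utf-8' in the session above) would be plugged in.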

           

( Apr 16 2007, 11:28:31 AM PDT ) Permalink View blog reactions

