What's That Noise?! [Ian Kallen's Weblog]


20070519 Saturday May 19, 2007

Ruby on Rails Ads

There's a series of "Mac vs. PC" ad knock-offs for Ruby on Rails on YouTube, and they're really funny. I'm starting to use Ruby in favor of Perl (or trying to) for a lot of everyday duct-tape stuff; it's a great language. Some of the hyperbole around Ruby and Rails and peace-on-earth is a little amusing too, but for now, laugh along and let 'em have their fun!

   

( May 19 2007, 06:59:38 AM PDT ) Permalink


20070516 Wednesday May 16, 2007

PostgreSQL Quirk: invalid domains

I've had my fill of MySQL's quirks, so I thought I'd plumb PostgreSQL's. So many of the things that MySQL is fast and loose about, PostgreSQL is strict and correct about. However, I was fiddling around with PostgreSQL's equivalent to MySQL's enum and found that something I would expect a strict RDBMS to be strict about is... not so strict.

PostgreSQL does not have enum, but there are a few different ways you can define your own data types and constraints and therefore prescribe your own constrained data type. This table definition confines the values in 'selected' to 5 characters, with the only options available being 'YES', 'NO' or 'MAYBE':

ikallen=# create table decision ( selected varchar(5) check (selected in ('YES','NO','MAYBE')) );
CREATE TABLE
ikallen=# insert into decision values ('DUH');
ERROR:  new row for relation "decision" violates check constraint "decision_selected_check"
ikallen=# insert into decision values ('CLUELESS');
ERROR:  value too long for type character varying(5)
ikallen=# insert into decision values ('MAYBE');
INSERT 0 1
I don't want to hear any whining about how diff-fi-cult constrained types are. Welcome to the NBA, where RDBMSes throw elbows. The flexibility you get from loosely constrained types will come back to bite you on your next programming lapse.

So what's wrong with this:

ikallen=# create table indecision ( selected varchar(5) check (selected in ('YES','NO','MAYBE SO')) );
CREATE TABLE
ikallen=# insert into indecision values ('MAYBE');
ERROR:  new row for relation "indecision" violates check constraint "indecision_selected_check"
ikallen=# insert into indecision values ('MAYBE SO');
ERROR:  value too long for type character varying(5)
ikallen=#
'MAYBE SO' is in my list of allowed values but violates the width constraint. Should this have ever been allowed? Shouldn't PostgreSQL have complained vigorously when a column was defined with varchar(5) check (selected in ('YES','NO','MAYBE SO'))? Yes? No? Maybe?

Well, I think so.
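
To make the complaint concrete, here is a minimal Python sketch of the kind of definition-time sanity check this post is asking for: given a column width and the values a check constraint allows, report the values that could never be stored. The function name `unreachable_values` is hypothetical and exists only for illustration; it is not something PostgreSQL provides.

```python
def unreachable_values(width, allowed):
    """Return the allowed values that cannot fit in a varchar(width) column."""
    # varchar(n) limits the number of characters, so len() is the right measure.
    return [value for value in allowed if len(value) > width]


# The two column definitions from this post.
print(unreachable_values(5, ["YES", "NO", "MAYBE"]))     # []
print(unreachable_values(5, ["YES", "NO", "MAYBE SO"]))  # ['MAYBE SO']
```

A non-empty result means the definition contradicts itself: the check constraint promises a value that the width constraint will always reject.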

One of the cool things about PostgreSQL is the ability to define a constrained type and use it in your table definitions:

ikallen=# create domain ynm varchar(5) check (value in ('YES','NO','MAYBE'));
CREATE DOMAIN
ikallen=# create table coolness ( choices ynm );
CREATE TABLE
ikallen=# insert into coolness values ('nope');
ERROR:  value for domain ynm violates check constraint "ynm_check"
ikallen=# insert into coolness values ('YES');
INSERT 0 1
Coolness!

Contrast with MySQL's sloppy handling of what you'd expect to be a constraint violation:

mysql> create table decision ( choice enum('YES','NO','MAYBE') );
Query OK, 0 rows affected (0.01 sec)

mysql> insert into decision values ('ouch');
Query OK, 1 row affected, 1 warning (0.03 sec)

mysql> select * from decision;
+--------+
| choice |
+--------+
|        |
+--------+
1 row in set (0.00 sec)

mysql> select length(choice) from decision;
+----------------+
| length(choice) |
+----------------+
|              0 |
+----------------+
1 row in set (0.07 sec)

mysql> insert into decision values ('MAYBE');
Query OK, 1 row affected (0.00 sec)

mysql> select * from decision;
+--------+
| choice |
+--------+
|        |
| MAYBE  |
+--------+
2 rows in set (0.00 sec)

mysql> select length(choice) from decision;
+----------------+
| length(choice) |
+----------------+
|              0 |
|              5 |
+----------------+
2 rows in set (0.00 sec) 
Ouch, indeed. Wudz up wit dat?
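
What the transcript shows is MySQL's non-strict behavior: an invalid enum value is stored as the empty string, with only a warning, whereas with strict SQL mode enabled the same insert is rejected with an error. The Python below is a toy model of those two outcomes, not MySQL's actual implementation; the function `store_enum` and its `strict` flag are made up for illustration, and the matching is deliberately simplified to exact, case-sensitive comparison.

```python
import warnings


def store_enum(value, allowed, strict=False):
    """Toy model of storing a value in an enum column (not MySQL's real code)."""
    # Simplification: exact, case-sensitive matching against the allowed list.
    if value in allowed:
        return value
    if strict:
        # Strict SQL mode: the insert is rejected outright.
        raise ValueError(f"invalid enum value: {value!r}")
    # Non-strict mode: warn, then store the empty string as an error value.
    warnings.warn(f"data truncated for value {value!r}")
    return ""


choices = ("YES", "NO", "MAYBE")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    rows = [store_enum("ouch", choices), store_enum("MAYBE", choices)]
print([len(row) for row in rows])  # [0, 5]
```

The two lengths match the `length(choice)` output above: the bad value silently became a zero-length string.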

There are a few things that MySQL is really good for, but if you want a SQL implementation that does what you expect for data integrity, you should probably be looking elsewhere.

       

( May 16 2007, 07:33:00 PM PDT ) Permalink


20070509 Wednesday May 09, 2007

No splogs, ay

I had to take a few days off work last week because of my aching back. It was really a fog-of-pain for a few days, but this week I'm on the mend and in beautiful Banff for the WWW 2007 conference. Actually, I'm mostly here for the AIRweb workshop, but I'm staying a few extra days to hear what folks are thinking about the future of the web, online information retrieval, humanity, and so on.

The AIRweb submissions included a lot of web graph related research. Some of it makes quite intuitive sense: web spammers will link to their spam sites as well as legitimate sites (camouflage), but legitimate sites don't link to web spam sites. So some of the talks discussed the underlying linear algebra of these phenomena (Anti-TrustRank and BadRank) or their inapplicability to identifying spam (TrustRank). The presentations about temporal patterns, spam term density, the effects of on-the-fly re-ranking and JavaScript redirection were quite interesting.
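
The link asymmetry described above (spam links to legitimate sites, but not the other way around) can be illustrated with a small Python sketch that pushes "distrust" backwards along links from known spam pages, in the spirit of Anti-TrustRank. This is a simplified illustration under assumptions, not the formulation from any of the workshop papers; the function name, the damping factor, and the toy graph are all hypothetical.

```python
def anti_trust_scores(links, spam_seeds, damping=0.85, iterations=50):
    """Propagate distrust backwards along links from known spam pages.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Reverse the graph: distrust flows from a page to the pages linking to it.
    inlinks = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inlinks[target].append(source)

    # All of the initial distrust sits on the seed spam pages.
    seed_weight = {
        p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages
    }
    scores = dict(seed_weight)
    for _ in range(iterations):
        new_scores = {p: (1 - damping) * seed_weight[p] for p in pages}
        for page in pages:
            sources = inlinks[page]
            if not sources:
                continue
            share = damping * scores[page] / len(sources)
            for source in sources:
                new_scores[source] += share
        scores = new_scores
    return scores


# Toy graph: "spam" camouflages itself by linking to "news".
links = {
    "spam": ["spam2", "news"],
    "spam2": ["spam"],
    "shill": ["spam", "news"],
    "news": ["blog"],
    "blog": ["news"],
}
scores = anti_trust_scores(links, spam_seeds={"spam"})
for page in sorted(scores):
    print(page, round(scores[page], 3))
```

In this toy graph the pages that link to spam ("spam2", "shill") pick up distrust, while "news" and "blog" stay at zero even though spam links to "news": being linked to by a spammer costs a legitimate site nothing here.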

A lot of these rank-demotion and web graph heuristics aren't really central to the efforts we have at Technorati for thwarting splogs. We instrument the data streams for baseline behaviors of various features. It's more like an intrusion detection system because, fundamentally, web spammers can't behave like "normal" publishers and still succeed; they have to compensate for their absence of popularity with all kinds of abnormal behaviors, and those behaviors are quite intrusive if you're listening for them. And so we are. This is by no means perfect, but we're doing way better than 80-20. It's my belief that as the web becomes more participatory and there are incentives and opportunities to inject junk into it, intrusion detection will be as vital a capability as search relevance rank demotion for maintaining a high quality experience. At the close of the workshop, I proposed that the web spam research community tell us what they want; what can we do to help? I can only imagine that Technorati's data streams could prove useful for the growing challenges of the participant-driven and temporally sensitive web.
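
As a rough illustration of the "baseline behavior" idea, and not a description of Technorati's actual system, the Python sketch below flags an observation that sits far outside what a baseline sample of ordinary publishers looks like. The feature (posts per day), the sample numbers, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different counts as abnormal.
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Hypothetical feature: posts per day across a sample of ordinary blogs.
baseline_posts_per_day = [1, 2, 0, 3, 1, 2, 4, 1, 2, 3]
print(is_anomalous(baseline_posts_per_day, 2))    # False
print(is_anomalous(baseline_posts_per_day, 400))  # True
```

A real detector would watch many features at once and learn its thresholds from the data, but the shape is the same: measure normal, then listen for the abnormal.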

So that was yesterday.

This morning, Tim Berners-Lee kicked off with a keynote that touched on the successive innovations of email, the web, wikis and blogs. On the iterative nature of technological and social change, he drew a cycling diagram of the needs that emerge when changes occur and enjoy widespread adoption, and of the collaborative/creative forces that drive innovation. He laid out how the Semantic Web is the next iteration, in which complex meaning will be readily accessible on the web. OK, that's all well and good. However, I just don't buy this idea that the Semantic Web is ... the Web at all. We have a web for people (he acknowledged as much at the beginning of the talk), but the idea of having tons of detailed data representations for generalized browsers of really complex data... I just don't get why folks won't end up building domain specific apps anyway. Building UIs for "general data representation" means that you'll never really be able to represent the domain specific qualities within some part of The Ontology. At least, I've never seen those things work. Useful apps need domain experts (champions of the end-user, e.g. product managers) and engineers to build something that works for that domain. Generic UIs break down when dealing with the nuances of specific domains. I want a data-rich web for humans that is machine consumable (microformats), not a parallel-universe web of machine-oriented RDF. Anyway, thanks for inventing the web, TBL, and good luck, all you Semantic Webbers. I think you'll need it.

I almost fell out of my chair, though, when TBL said that blog spam isn't really a problem. I'll surmise that he has a set feed reader repertoire (or old school bookmarks) and doesn't use blog search much. While I think we've done a pretty good job spam-scrubbing Technorati, the fact remains that there is a veritable ocean of pinging rubbish mongers engaging in underhanded payola schemes, kleptotorial and other nefarious endeavors out there. What spam you do see on Technorati is the tip of the iceberg. Tim, use our site, despite the iceberg tip :)

Side notes: when in Canada, going to "google.com" gets redirected to "google.ca", which includes a toggle to search "The Web"/"Pages from Canada" ... amusing, ergo the graphic in this post. Also, I can't believe how long the days are here; about 3 hours more daylight than the San Francisco Bay Area!

So thanks to Brian Davison, Carlos Castillo and Kumar Chellapilla for putting together a great AIRweb program, good work guys! I'm heading home tomorrow.

                                     

( May 09 2007, 09:44:35 PM PDT ) Permalink