What's That Noise?! [Ian Kallen's Weblog]


Tuesday August 29, 2006

Scammed By Kleptotorial

In this corner: Doc is going to attack kleptotorial splogs by employing cleaner living through better licensing (a creative commons flavor). And in this corner: Elliott Back says he is a victim. He has been slammed by Scoble (and Scoble was gracious enough to apologize). I have no sympathy for Elliott Back. Sure, he's just the gun maker, not the shooter. But weapon makers producing wares without safeties get sued for negligence. Basically, any tool that programmatically harvests and posts other people's feeds should at least have the common decency to not ping. If you re-inject something into the update stream that you've appropriated from someone else, you're scamming the update stream. This isn't about quoting or citing; this is about fraudulent pings, "I've updated my blog (never mind the fact it's with OPP)" -- keep your feed harvesting to yourself, please.

spam splogs creativecommons

( Aug 29 2006, 09:51:57 AM PDT ) Permalink

Monday August 28, 2006

Memcached In MySQL

The MySQL query cache has rarely been of much use to me since it's pretty much just an optimization for read-heavy data. Furthermore, if you have a pool of query hosts (e.g. you're using MySQL replication to provide a pool of slaves to select from), each with its own query cache in a local silo, there's no "network effect" of benefiting from a shared cache. MySQL's heap tables are a neat trick for keeping tabular data in RAM but they don't work well for large data sets and suffer from the same siloization as the query cache. The standard solution for this case is to use memcached as an object cache. The elevator pitch for memcached: it's a thin distributed hash table held in local RAM stores, accessible by a very lightweight network protocol and bereft of the featuritis that might make it slow; response times for reads and writes to memcached data stores typically clock in at single digits of milliseconds.
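To make the "shared cache" point concrete, here's a rough Python sketch of the idea. It is not memcached's actual client code: ToyCachePool, cached_lookup and the server names are made up for illustration, the "servers" are just in-process dicts, and the simple hash-modulo server selection is an assumption of this sketch.

```python
import hashlib


class ToyCachePool:
    """In-process stand-in for a pool of cache servers (hypothetical)."""

    def __init__(self, server_names):
        # One dict per "server"; a real pool would hold network connections.
        self.servers = {name: {} for name in server_names}
        self.names = sorted(server_names)

    def _server_for(self, key):
        # Every client hashes the key the same way, so all clients agree
        # on which server owns it. No per-host silo.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.names[int(digest, 16) % len(self.names)]

    def get(self, key):
        return self.servers[self._server_for(key)].get(key)

    def set(self, key, value):
        self.servers[self._server_for(key)][key] = value


def cached_lookup(pool, key, load_from_db):
    """Cache-aside read: try the cache, fall back to the database."""
    value = pool.get(key)
    if value is None:
        value = load_from_db(key)
        pool.set(key, value)
    return value
```

The second lookup for the same key never touches the database, and any client configured with the same server list lands on the same server for that key.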

RDBMS-based caches are often a glorified hash table: a primary key'd column and a value column. Using an RDBMS as a cache works but it's kinda overkill; you're not using the "R" in RDBMS. Anyway, transacting with a disk-based storage engine that's concerned with ACID bookkeeping isn't an efficient way to cache. MySQL has the peculiar property of supporting pluggable storage backends; MyISAM, InnoDB and HEAP are the most commonly used ones. Today, Brian Aker (of Slashdot and MySQL AB fame) announced the first-cut release of his memcache_engine backend.

Here's Brian's example usage:

mysql> INSTALL PLUGIN memcache SONAME 'libmemcache_engine.so';

mysql> create table foo1 (k varchar(128) NOT NULL, val blob, primary key(k)) ENGINE=memcache CONNECTION='localhost:6666';

mysql> insert into foo1 VALUES ("mine", "This is my dog");
Query OK, 1 row affected (0.01 sec)

mysql> select * from foo1 WHERE k="mine";
+------+----------------+
| k    | val            |
+------+----------------+
| mine | This is my dog |
+------+----------------+
1 row in set (0.01 sec)

mysql> delete  from foo1 WHERE k="mine";
Query OK, 1 row affected (0.00 sec)

mysql> select * from foo1 WHERE k="mine";
Empty set (0.01 sec)

Brian's release is labelled a pre-alpha, some limitations apply, your mileage may vary, prices do not include taxes, customs or agriculture inspection fees.

What works

SELECT, UPDATE, DELETE, INSERT
INSERT into foo SELECT ...

What doesn't work

Probably ORDER BY operations
REPLACE (I think)
IN ()
NULL
multiple memcache servers (this would be cake though to add)
table namespace, right now it treats the entire server as one big namespace

The memcached storage plugin runs against the bleeding edge MySQL (Brian sez, "You will want to use the latest 5.1 tree"). What's most exciting about this is using it in combination with MySQL 5.x's support for triggers. A cache entry stored from a query result can be invalidated by a trigger on the row that provides the cache entry data. AFAIK, that's exactly how folks have been using pgmemcache in PostgreSQL but I haven't had a chance to mess with that yet. Anyway, check out Brian's list announcement and post about it, kudos to him for hacking on this, I imagine this will add a lot of value to the MySQL user community.
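I haven't tried the trigger approach against Brian's engine, so here's only an analogy of the pattern, sketched in Python with SQLite standing in for MySQL and a plain dict standing in for memcached. The function name invalidate_cache, the trigger name foo1_changed and the read helper are all made up for this sketch.

```python
import sqlite3

cache = {}  # stand-in for memcached, keyed by primary key


def invalidate(key):
    # Called from inside the trigger; drops the stale cache entry.
    cache.pop(key, None)
    return 0


db = sqlite3.connect(":memory:")
# Expose the Python function to SQL so the trigger can call it.
db.create_function("invalidate_cache", 1, invalidate)
db.execute("CREATE TABLE foo1 (k TEXT PRIMARY KEY, val TEXT)")
db.execute("""
    CREATE TRIGGER foo1_changed AFTER UPDATE ON foo1
    BEGIN
        SELECT invalidate_cache(OLD.k);
    END
""")
db.execute("INSERT INTO foo1 VALUES ('mine', 'This is my dog')")


def read(key):
    """Cache-aside read; the trigger keeps the cache honest on updates."""
    if key not in cache:
        row = db.execute("SELECT val FROM foo1 WHERE k = ?", (key,)).fetchone()
        cache[key] = row[0] if row else None
    return cache[key]
```

After a read, the value sits in the cache; an UPDATE on that row fires the trigger, which evicts the entry so the next read goes back to the table. A DELETE trigger would be needed as well to cover removed rows.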

mysql memcached postgresql

( Aug 28 2006, 07:18:13 AM PDT ) Permalink

Sunday August 27, 2006

Stupid Object Tricks

When I wrote about OSCON last month, I mentioned Perrin Harkins's session on Low Maintenance Perl, which was a nice review of the do's and don'ts of programming with Perl, but I really didn't dig into the substance of his session. Citing Andy Hunt (from Practices of an Agile Developer):

When developing code you should always choose readability over convenience. Code will be read many, many more times than it is written. (see book site)

Perrin enumerated a lot of the basic rules of engagement for coding Perl that doesn't suck. Some of the do's and don'ts highlights:

Do's

use strict
use warnings
use source control
test early and often, specifically recommending Test::Class and smolder
follow conventions when you can

...mostly no brainers and yet a lot of Perl programmers are oblivious to basic best practices.

Don'ts

don't use formats (use sprintf!)
don't mess with UNIVERSAL (it's the space-time continuum of Perl objects)
don't define objects that aren't hashes ('cept inside outs)
don't rebless an existing object into a different package (if you describe that as polymorphism in a job interview, expect to be shown the door real quick)

And so on.

The sad fact is that there are many ways to write bad Perl. I was amused to see Damian Conway and Larry Wall sitting in the second row as Perrin read off the indictments that so many Perl programmers are guilty of. On that last point, I can't even figure out why anyone would ever want to do that or why Perl supports it at all. This is ridiculous:

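use strict;
use warnings;

package Foo;

# Constructor: bless a hash reference into the calling class.
sub new {
  my $class = shift;
  my $data = shift || {};
  return bless $data, $class;
}

package main;

my $foo = Foo->new;
print ref $foo, "\n";
bless $foo, 'Bar';   # rebless the same reference into another package
print ref $foo, "\n";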

For the non-Perl readers, create an instance of Foo ($foo), then change it to an instance of Bar, printing out the class names as you go. The output is:

Foo
Bar

Anyone caught doing this will certainly come back as a two headed cyclops in the next life.

I've been trying to increase my Python craftiness lately. I first used Python about 10 years ago (1996) at GameSpot, where we used it for our homebrewed ad rotation system. I fiddled with Python some more at Salon as part of the maintenance of our ultraseek search system. But basically, Python has always looked weird to me and I've avoided doing anything substantial with it. Well, my interest in it is renewed because there is a substantial amount of legacy code that I'm presently eyeballing and, anyway, I'm very intrigued by JVM scripting languages such as Jython (and JRuby). I'm looking for a best-of-both-worlds environment: things-are-what-you-expect static typing and compile time checking on the one hand and rapid development on the other. I was really astonished to learn that chameleon class assignment like Perl's is supported by Python. Python is strongly typed in that you have to explicitly cast and coerce to change types (very unlike Perl's squishy contextual operators, which do a lot of implicit magic). But Python is also dynamically typed; an object's type is a runtime assignment. This is gross:

# Python 2 syntax (print statement, classic classes)
class Foo:

  def print_type(self):
    print self.__class__

class Bar:

  def print_type(self):
    print self.__class__

if __name__ == "__main__":
  foo = Foo()
  foo.print_type()
  foo.__class__ = Bar  # swap the instance's class at runtime
  foo.print_type()

In English, create an instance of Foo (foo), then change it to an instance of Bar, printing out the class names as you go. The output is:

__main__.Foo
__main__.Bar

(Python prefixes the class name with the name of the current module, __main__.) Anyone caught doing this will certainly come back as a reptilian jackalope in the next life.
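For what it's worth, the same trick survives in Python 3. A minimal sketch with the same toy classes; it records type names instead of printing them so the result is easy to check, and the reclass_builtin_fails helper is made up here to show that at least built-in types refuse the swap.

```python
class Foo:
    def type_name(self):
        return type(self).__name__


class Bar:
    def type_name(self):
        return type(self).__name__


foo = Foo()
before = foo.type_name()
foo.__class__ = Bar  # reassign the instance's class at runtime
after = foo.type_name()


def reclass_builtin_fails():
    """Return True if Python refuses to re-class a built-in int."""
    number = 1
    try:
        number.__class__ = Bar
    except TypeError:
        return True
    return False
```

Here before is "Foo" and after is "Bar", and isinstance(foo, Bar) is true afterwards, so the object really has changed sides.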

Of course, Java doesn't tolerate any of these shenanigans. Compile time complaints of "what, are you crazy?!" would surely come hither from javac. There's no setClass(Class):void method in java.lang.Object, thank goodness, even though there is getClass():Class. One of the key characteristics of a language's usefulness for agile development has to be its minimization of astonishing results, quirky idioms and here-have-some-more-rope-to-hang-yourself behaviors. If you can't read your own code from last month without puzzling over it, how the hell are you going to refactor it quickly and easily next month? Will your collaborators have an easier time with it? Perl has rightly acquired the reputation of a "write once, puzzle forevermore" language. I haven't dug into whether Ruby permits runtime object type changing (that would be really disappointing). I'll dig into that next; clearly the Rails developers' emphasis on convention over configuration is aimed at reducing the surprises that coders can cook up. But that doesn't necessarily carry back to Ruby itself.

python perl ruby java code agile programming jython jruby oscon oscon06

( Aug 27 2006, 08:51:11 AM PDT ) Permalink

Wednesday August 02, 2006

I would pay Muni 10x the fare if...

...just once when a passenger wearing too much perfume or cologne boards the metro, it would prompt the driver (who would be Samuel L. Jackson) to stand up, turn to the passengers and demand, "Get those mother effin' stinks off this mother effin' train!"

Perhaps for once I'd get my money's worth from Muni.

san francisco muni samuel l jackson stinks on a train

( Aug 02 2006, 09:46:20 AM PDT ) Permalink

The 5-year forecast

Sam Ruby's Teenagers on the go slide deck is an interesting prognosis on the future impact of the protocols, formats and form factors in our midst on publishing, sharing and participating on the web.

( Aug 02 2006, 07:13:37 AM PDT ) Permalink

Sunday July 30, 2006

Participant Created Artifacts

Since the universally understood (at least among the intelligentsia) descriptor user generated content continues to nag at people (Tim raised it again during his OSCON session) and the alternatives have been difficult to pin down (was Tim suggesting people contributed experiences?), it's my caffeinated Sunday morning aspiration to consider the alternatives.

Having a label is important: we're making a distinction between published artifacts that are developed by editors and/or paid staff and the stuff created by Normals who are contributing the artifacts of their creative process to the web. Yes, the term user is definitely sterile, generated too mechanical and content seems so... vacuous. Does participant created artifacts work as a descriptor for all of the photos we're uploading, blog posts we're posting and so forth?

oscon oscon06 sunday coffee user generated content participant created artifacts

( Jul 30 2006, 08:44:40 AM PDT ) Permalink

Saturday July 29, 2006

OSCON Rocked

Had a great time at OSCON! Besides the previously noted keynotes and sessions, my faves were Perrin Harkins' Low-Maintenance Perl (a good discussion of best practices in Perl as the simple practices; Perl as the sole domain of wizards is so old school), Moazam Raja's Troubleshooting the JVM and the Applications That Run Within It (a good survey of the built-in runtime diagnostics available for Java), Tim Bray's The Atom Publishing Protocol as Universal Web Glue (a good example of using vi and curl for bare metal wire protocol demos as well as how slow JRuby's start-up time is!) and Damian Conway's Friday keynote, which was suitably humorous! If there was anything that I wish I coulda rearranged it was the time slots when there was more than one session I wanted to be in. But the time slots with nothing interesting going on were good opportunities for hallway conversations, which are often the most important activities at these events, so I won't complain vigorously.

Enjoyed hanging out Friday afternoon for "OSCON decompression" at Urban Grind ("Coffee should be black as night, hot as hell, and strong as love.") with James, David, Josh, David and Scott. Heh, I got PostGIS running on my powerbook, which gave me something to play with on the flight home!

Portland is a really nice town: the Disneyland-like light rail system (complete with automaton announcements in English and español), the neighborhood ambiance, the surrounding greenery... I dig it. For next year's OSCON trip, I'll be bringing the family along!

oscon oscon2006 oscon06 opensource portland perl java

( Jul 29 2006, 11:20:55 AM PDT ) Permalink

Thursday July 27, 2006

Google Code Project Hosting

At Greg Stein's talk, A Google Service for the Open Source Community, he outlined how Google's following up on its Summer of Code project with a new contribution to open source. No, it's not a dating service or personal trainer service for geeks (that'll be Google's 2007 contribution). And no, it's not source code search (does Krugle have that covered?). This is project hosting on Google Code Project Hosting.

Yes, there is nothing new about project hosting; there have long been things like SourceForge, Tigris, java.net and so forth. Things like SourceForge are doing a great job, but there are strengths that Google has that could be brought to bear on the project hosting space. Some of the unique features of Google's project hosting that Greg cited are

simplicity, scalability, reliability: The system is built with Google's trademark minimalist approach to user interfaces as well as leveraging Google's horizontal scaling and robust data center sauce.
Subversion on Bigtable: They've built a new backend for Subversion that, instead of using flat files or BerkeleyDB, uses Bigtable. Bigtable data is fast, highly available and replicated across data centers so that all Subversion stored resources will benefit from the Bigtable backend.
Complete re-think of issue tracking: Bugzilla, Trac, Jira and most other issue tracking systems are workflow heavy and laden with permissions and security provisions. The heavy stuff is often unnecessary; it just gets in the way of issue tracking for open source projects instead of facilitating the core use cases that people need. They've replaced all of the highly structured data model and query environment with labels (tags). Labels get attached to projects and issues and provide a minimalist structure that can be queried with full text queries.

Next came the demo.

Creating a new hosted project is simple enough: you fill out a form with the project name, a summary description and a full description, select a license (Apache, Artistic + GPL, GPL v2, LGPL, BSD, MIT and Mozilla are choices ... dual licensing is not permitted) and apply some labels (tags). If you don't have one yet, a Subversion password is created for you (it's *not* your Gmail password). Your project will have a tabbed interface for the main page, issues, browsing the source and an administrative page. Project creators and administrators must use a Gmail account. If not using Gmail, bug reporters must have some Google account (Picasa, Groups, etc). The "Issues" screen provides a tabular view of bugs; the columns are ajax enabled for parameterization. The neat thing is that instead of using a big form with tons of check boxes and selectors, the issue tracking uses query expressions to refine issue search results. The status field for a bug can be free text; while a static vocabulary is defined and selectable in an ajax drop down, the vocabulary is unconstrained. Status isn't the only metadata that's open-ended: instead of having "release version", "milestone", "component", etc, the system uses labels. The issue list column repertoire is adaptable so that you can select labels you've defined as listing criteria. All of the open-endedness may be an invitation to pandemonium but the focus is on having the user interface make it easy for the user to do the right thing.
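To get a feel for what labels plus query expressions buy you over a big structured form, here's a toy Python sketch. None of this is Google's code or query syntax: the Issue class, the label:/status: term prefixes and the sample labels are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    summary: str
    status: str  # free text, no fixed vocabulary
    labels: set = field(default_factory=set)


def matches(issue, term):
    """Check one query term against one issue."""
    kind, sep, value = term.partition(":")
    if sep and kind == "label":
        return value in issue.labels
    if sep and kind == "status":
        return issue.status.lower() == value.lower()
    # Anything else is a case-insensitive text match on the summary.
    return term.lower() in issue.summary.lower()


def search(issues, query):
    """Return the issues that satisfy every space-separated term."""
    terms = query.split()
    return [issue for issue in issues if all(matches(issue, t) for t in terms)]


issues = [
    Issue("Crash on startup", "New", {"Milestone-1.0", "Component-UI"}),
    Issue("Typo in docs", "Fixed", {"Component-Docs"}),
    Issue("Crash when saving", "Accepted", {"Milestone-1.1", "Component-UI"}),
]
```

A query like "crash label:Component-UI status:new" narrows the list without any schema for milestones or components; a new kind of metadata is just a new label.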

Some of the administratively defined aspects of a project include the issue creation template (defines the prompts that issue creators will see), project links, project discussion groups (using Google Groups), project blogs and activity notification email addresses. The system will support issue tracking feeds. Most of this metadata will be visible on the project summary page that newcomers to the project will see.

There's currently no "tarball download" service and integration with other Google services is in the works. For the time being, any downloads made available must be done within the limit of the quotas on your subversion repository (100 MB). Plans for importing and exporting, creating APIs and so forth are underway (the issue tracking seems like a natural fit for Atom and Atom Publishing Protocol).

Congrats to Greg and the Google Code team on a great launch!

oscon oscon2006 oscon06 google open source

( Jul 27 2006, 03:46:06 PM PDT ) Permalink

Wednesday July 26, 2006

OSCON Keynotes (July 26, 2006)

I missed the first keynotes (I just arrived in time for Tim O'Reilly's "what technologies are hot according to these slices on the data" bit that he does) but enjoyed Greenplum's Scott Yara's talk, School of Rock. He highlighted the parallels of open source development and rock and roll. I'll paraphrase his points.

Open source, like rock and roll, has flourished simply because people enjoyed it. Like rock and roll, money has jumped into open source and an industry has swelled around it. Like rock and roll, open source threatens the establishment but also mutually coopts and becomes the establishment. Yara showed a funny "twins separated at birth?" photo pairing of Rick Rubin and Richard Stallman! What will sustain open source's integrity (like rock and roll's) are the intangibles, the real emotions and inspirations that drive innovation. The popularity game isn't a measure of quality... just because it's widely downloaded doesn't mean it's good, just as Britney Spears' and N'Sync's sales success aren't validations of "good" music. So, beware of the vogue of open source; people are starting to believe that open source is better but don't let that undermine what's important. For those who are building their business on open source, go for the $$$ but keep your integrity. At that point Yara ran a little excerpt of Metallica goofing on a radio promo production (from Some Kind of Monster?); the ironies of choosing them as illustrations of how money changes everything, given how they coopted and have become the music establishment, were high humor for me. Nonetheless, Metallica, like a lot of successful open source software projects, have succeeded by being a little dangerous, by being genuine and not bothering with the constraints of the legacy establishment.

Anil Dash gave a talk about Trying to Suck Less: Making Web 2.0 Mean Something, basically outlining that beyond the technology stack (i.e. LAMP), there are higher level tools that developers can employ to suck less (yep, I confess, at Technorati when we can't quite kick the butt that we aspire to, we focus on sucking less). Citing the technologies that have grown out of SixApart's software plumbing, he highlighted that all successful Web 2.0 companies are using load balancing, messaging, caching, filesystems and other scalability and performance platform components. In SixApart's case, perlbal, memcached, mogilefs and djabberd are the core technologies that they build on ... and, so the pitch goes, so should you if you want to suck less.

Those are the high points of the morning (so far).

oscon oscon2006 metallica sixapart

( Jul 26 2006, 10:00:31 AM PDT ) Permalink

Tuesday July 25, 2006

Technorati's Extreme Makeover

In case you hadn't heard, we've had a lot of things cooking at Technorati. Besides the engaging new look, the new features and the complete overhaul of URL search and link counts, we've been making great strides in our blog spam mitigation (you wouldn't believe the stuff we catch ... and the sheer quantity of it!), our internal caching and messaging infrastructure and our data center network. Of course, there's still much to do but we've been heads down on it; if you haven't checked us out lately I think you'll find that our efforts to improve the front end, the back end and all of the cogs and pulleys in between have been moving forward.

I'm really proud of the team I work with at Technorati! If you'd like to join the team, we have a lot of innovation ahead. Grab me this week at OSCON and tell me about how you'd like to materialize the real time web! I'll also be moderating a Microformats BOF, this will be a good opportunity to talk about the implementations for producing and consuming microformats. See ya in Portland!

technorati oscon portland jobs microformats

( Jul 25 2006, 10:37:36 PM PDT ) Permalink

Saturday May 06, 2006

The Evils of Blogger's URL Recycling

Blog publishing services typically propagate updates about new posts from blogs (ergo, new blogs too) by pinging or publishing a changes.xml file. But what none of the services provide is an "un-ping" -- blog indexing services such as Technorati don't know when a blog has been deleted from a service. I noticed this today when I found http://blogtrarian.blogspot.com/ participating in a link farm infesting Blogger's service. This can happen because Google's Blogger recycles URLs; when a blog is removed from the system, the URL is freed for reuse.
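For readers who haven't looked at a ping on the wire, here's a small Python sketch that builds the XML-RPC payload for the conventional weblogUpdates.ping call without sending it anywhere. The blog name and URL are placeholders. The build_unping function is purely hypothetical: no such method exists in any service I know of, and the name hypothetical.unPing is invented here only to show what the missing counterpart might look like.

```python
import xmlrpc.client


def build_ping(blog_name, blog_url):
    """Serialize a weblogUpdates.ping call (not sent over the network)."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def build_unping(blog_url):
    """HYPOTHETICAL: an 'un-ping' announcing that a blog is gone.

    The method name is invented for this sketch; nothing implements it.
    """
    return xmlrpc.client.dumps((blog_url,),
                               methodname="hypothetical.unPing")


ping_xml = build_ping("Example Blog", "http://example.com/blog/")
unping_xml = build_unping("http://example.com/blog/")
```

The ping carries only a name and a URL, which is exactly why a recycled URL is indistinguishable from the original blog coming back to life.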

That particular URL is one that dates back to 2004; it was dormant for several months but just came to life recently with spam. The historic posts (until August 2005) look like normal blogging fare but the recent posts are clearly just splog content. We'll have to work on "un-pinging" so it's easier to distinguish between dormant blogs and dead ones.

spam splog web spam google blogger ping

( May 06 2006, 03:13:14 PM PDT ) Permalink

Friday May 05, 2006

Google Is Full?

So Google's CEO Eric Schmidt says his servers are full, hmm. Tying that to SEO'ers griping about their indexing, Andrew Orlowski speculates that it's web spam besetting big daddy. Could be, but the hard data isn't out in the wild. The numbers that we can see are that Google is spending several banana republics' worth of GDP on capital expenses:

Google continued to make substantial capital investments, mainly in computer servers, networking equipment and its data centers. It spent $345 million on such items in the first quarter, more than double the level of last year. Yahoo, its closest rival, spent $142 million on capital expenses in the first quarter.
Referring to the sheer volume of Web site information, video and e-mail that Google's servers hold, Schmidt said: "Those machines are full. We have a huge machine crisis." (read more)

If the problem is spam, then certainly it's Google's own doing. The elephant in the room is that the acceleration of web spam everyone's talking about is fueled by AdSense, often aided and abetted by Blogger splogs, Google Pages, Google Base, etc. The spam ecosystem is within Google's capacity to rein in but the don't-be-evil company is making too much money on click fraud with plausible deniability to do anything about it. Is Google having problems handling web spam and "filling up" their machines? Cry me a river, all the way to the bank.

spam google adsense splog web spam

( May 05 2006, 02:09:19 PM PDT ) Permalink

Thursday May 04, 2006

Thwarting Kleptotorial

When I read the words on

Microsoft yesterday reached a tentative $70 million deal to settle a California class-action antitrust lawsuit, according to a statement by the law firm representing the plaintiffs in the suit.

at http://www.satishlive.info/?p=27 I had the distinct sense of déjà vu. So I ran some queries against Technorati's index and sho-nuf, I found the exact same content had already been published by InfoWorld. Ah, there was an attribution at the bottom... but InfoWorld didn't publish under a Creative Commons license. Looks like blatant theft.

Then I checked the next post (http://www.satishlive.info/?p=28) on that blog and read:

I took a new blog search tool called Sphere for a little spin this morning and found it useful.

... hey, didn't I just see that somewhere else? Yep, this time it was PC World and no attribution.

It's safe to surmise that this is kleptotorial laden with AdSense and stuffed into the update stream. I've seen screenscrapes and feedscrapes on splogs before but they're usually easier to identify visually, I had to look more carefully at this to note its spamminess. Is there a market in alerting publishers to copyright infringement? Obviously this stuff should be removed from Technorati's index but is there a more valuable service to publishers that should be provided here? How much would you pay to find out about misappropriations of your content? Is there a market for Technorati to do something like Plagiarism.org to fingerprint blog content?
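As a thought experiment on the fingerprinting question, here's a minimal Python sketch of one well-known approach: hash overlapping word "shingles" and compare the resulting sets. This is not how Technorati or Plagiarism.org do it; the function names, the 4-word shingle size and the 16-character hash prefix are arbitrary choices for illustration.

```python
import hashlib
import re


def shingles(text, size=4):
    """Overlapping runs of `size` words, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size])
            for i in range(len(words) - size + 1)}


def fingerprint(text, size=4):
    """Set of short hashes, one per word shingle."""
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:16]
            for s in shingles(text, size)}


def resemblance(a, b, size=4):
    """Jaccard similarity of the two fingerprints, from 0.0 to 1.0."""
    fa, fb = fingerprint(a, size), fingerprint(b, size)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

A verbatim feedscrape scores at or near 1.0 against the original even if the case or punctuation changes, while unrelated posts score near 0.0; anything in between would need a human (or a smarter threshold) to judge.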

splog creativecommons copyright spam creative commons plagiarism adsense

( May 04 2006, 09:34:17 PM PDT ) Permalink

Tuesday May 02, 2006

The Colbert Smackdown

In case you've been hiding under a rock, the blogosphere is abuzz about Stephen Colbert's weekend dressing down of George Bush and just about everything else inside the beltway. If you haven't seen it, watch the C-SPAN vids or read the transcript.

The chatter (even artwork on flickr) about it is frantic. Thank You Stephen Colbert has 700 links right now (this is a blog that came into being less than 72 hours ago); it's getting about five or ten links per hour at the moment. The videos are the most linked-to youtube reels on Technorati. How wonderful it is to have an administration that is so bad that the opportunities for high humor are so many. Why did we invade Iraq?

stephencolbert colbert flickr youtube technorati bush cspan whitehousecorrespondentsdinner colbertreport comedycentral politicalhumor iraq blogs blogging

( May 02 2006, 09:27:30 PM PDT ) Permalink

Sunday April 30, 2006

Link Raising Won't Pay The Bills

I have developed a great deal of respect for those who do fund raising full time as a profession, it's a tough business. The Happy Valley Odyssey of the Mind teams are trying to raise money to send themselves (the kids) and their coaches to the World Finals and, so far, it's been tough moving that along. With basically three weeks left before the big trip to Ames, Iowa, the thermometer still has quite a ways to go. If you can't donate today, how about linking to their site? Sure links won't pay the bills directly but if getting the word out means that someone who can help with the bills will find out about it, maybe it can help indirectly.

Put a badge on your site with this code :

A study cited by the pros found that donors say they have more money than time. In this case, the teams are putting in all of the time (that's the point of Odyssey of the Mind, it's all of the kids' creativity and intellect applied to problem solving); now they just need to pay some bills. If you can't donate cash and donating your time won't impact their endeavor, what can you do? Donate attention! OK, admittedly badges aren't the most attractive things, but you can take this one down after the World competition. So for the month of May, if you can't send money, send 'em some links!

fundraising odysseyofthemind ames iowa fund raising

( Apr 30 2006, 09:42:34 PM PDT ) Permalink