What's That Noise?! [Ian Kallen's Weblog]


Sunday November 12, 2006

CardSpace and the Identity Ecosystem

An interesting introduction came over the transom recently. I've read Kim Cameron's blog before but the honest truth is: I've really been flummoxed by the wide range in the cast of characters and agendas in the identity fray. Some seem overly concerned with identity as a line of business, others concerned with seeing themselves at the center of the discussion. Meeting Kim was a treat; even though he had the cards stacked against him coming from Microsoft, we had a great conversation. When I think of Microsoft I think of the many aspersions: "the Borg", "the evil empire", "The Man", "the big cathedral", "stifling monopolists", "makers of the Blue Screen Of Death", "vendor lock-in creeps", "virus and security-hole mongering dumbos." OK, I'll stop. Of course the reality is that good people also show up in bad places and they make good things happen nonetheless. C# and the .Net framework do great stuff for developer productivity. There's a lot of innovation happening in Microsoft's search and online services divisions. To be fair, a lot of Microsoft bashing is another form of bigotry that we have to get beyond. Microsoft has a lot of great people and their executive leadership has done a lot of really bad things, so move along. The good guys inside the cathedral need constructive engagement lest they never prevail over the Matrix; more than anyone they (and Melinda) have the capacity to draw the Sith away from the Dark Side (re "constructive engagement": I'm thinking Clinton's Sino-American oppositional/collaborative stance that rides on the inevitable, not Reagan's failure vis-a-vis South Africa, which was wimpy coddling of the anti-divestment movement).

Speaking of the Jedi and Neo archetype, characters and ranches in Santa Barbara, endorsements from Doc Searls always get my attention:

When the conversation started to heat up after DIDW, the Neo role was being played by a character with the unlikely title of "Architect", working inside the most unlikely company of all: Microsoft. Kim Cameron is his name, and his architecture is the Identity Metasystem. Note that I don't say "Microsoft's Identity Metasystem". That's because Kim and Microsoft are going out of their way to be nonproprietary about it. They know they can't force an identity system on the world. They tried that already with Passport and failed miserably.

I prefer to think of the various roles of Identity Providers, Relying Parties and People as part of an ecosystem. But metasystem is fine, let's just stick to that vernacular. Kim is the author of Laws of Identity. Again citing the same article from Doc for a nice summarization:

User Control and Consent: digital identity systems must reveal information identifying a user only with the user's consent.

Limited Disclosure for Limited Use: the solution that discloses the least identifying information and best limits its use is the most stable, long-term solution.

The Law of Fewest Parties: digital identity systems must limit disclosure of identifying information to parties having a necessary and justifiable place in a given identity relationship.

Directed Identity: a universal identity metasystem must support both "omnidirectional" identifiers for use by public entities and "unidirectional" identifiers for private entities, thus facilitating discovery while preventing unnecessary release of correlation handles.

Pluralism of Operators and Technologies: a universal identity metasystem must channel and enable the interworking of multiple identity technologies run by multiple identity providers.

Human Integration: a unifying identity metasystem must define the human user as a component integrated through protected and unambiguous human-machine communications.

Consistent Experience across Contexts: a unifying identity metasystem must provide a simple consistent experience while enabling separation of contexts through multiple operators and technologies.

This is powerful stuff. I'm very pleased with our implementation of OpenID to support blog claiming but I know that this is the tip of the iceberg. There are people on the web who aren't authoring and sharing; they may neither have nor want a URL that they can use for their identity. So while I'm committed to extending our support for OpenID, I'm also looking beyond it. The Laws are exemplary guiding principles in my exploration of the topic. Kim and Doc joined Kristopher Tate (the Zooomr dude), Tantek and myself to talk about CardSpace, Microsoft's implementation of an identity metasystem. After discussing some of the high-level issues facing the web, the blogosphere and ~~user generated content~~ participant created artifacts in general, we dived deep on CardSpace. Since CardSpace will be shipping with Vista (as well as distributed for Windows XP), by my estimation the coming ubiquity of user-centric identity isn't something to ignore. As we worked through the CardSpace workflow with Kim, Tantek and I came up with this diagram (Glossary: "IDP" = "Identity Provider", "RP" = "Relying Party", CardSpace is a page embedded app so there's both interaction via the browser and directly in the OS). This is of course just Microsoft's implementation but the Good Thing is that they aren't clutching it tightly; folks working on open source implementations (keep an eye on the OSIS working group) will make sure that the identity metasystem isn't a Borg in sheep's clothing.

Identities on the contemporary web suffer from a lot of accountability, authenticity and siloization deficiencies. Pings, trackbacks and comments all suffer from these and in turn we all do in the form of web spam. Reputation systems (such as Technorati's authority ranking) mitigate some of these problems but there is still much to do. I'm really pleased to have met Kim, he's one of the good guys and I look forward to working with more folks pushing the online identity envelope. If you're going to be joining the Internet Identity Workshop coming up, I'll see you there!

identitymetasystem cardspace technorati openid microsoft

( Nov 12 2006, 01:25:24 PM PST ) Permalink

Sunday October 29, 2006

Hacking Into Movable Type

Everyone knows what a great product Movable Type is. But if you find yourself in care of a Movable Type deployment that nobody seems to be able to log in to with superuser privileges, it may seem pretty hopeless; if you need to perform privileged operations, especially if the installation is backended by a ~~sleepycat~~, er, Oracle BerkeleyDB database, the data is somewhat opaque. AFAIK, MT doesn't seem to ship with any "break glass with this little hammer if the superuser was hit by a bus" contingencies and with BerkeleyDB there's no SQL command prompt; in fact, the only way to dig into it is to write some code. So I was fiddling with just such a MT-3.33 installation; I had an account but not much in the way of privileges. After opening the BerkeleyDB files with DB_File, dumping contents with Data::Dumper and going through some of the MT libraries, I found what I was looking for. Here's the Perl I hacked up to grant myself superuser privileges:

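#!/usr/bin/perl

use strict;
use DB_File;
use lib qw( /path/to/MT-3.33/lib );
use MT;
use MT::Serialize;
use MT::ConfigMgr;

# Use the same serializer the MT installation is configured with
my $serializer = MT::Serialize->new(MT::ConfigMgr->instance->Serializer);
my %hash;
# Open the author database read/write; O_CREAT is left out so that a
# wrong path fails here instead of silently creating an empty file
tie %hash, 'DB_File', '/path/to/MT-3.33/author.db', O_RDWR, 0666, $DB_BTREE or die $!;
# Find my author record and remember the key it is stored under
my ($key, $data);
while (my($k,$v) = each %hash) {
    my $rec = $serializer->unserialize($v);
    if (${$rec}->{'name'} eq 'Ian Kallen') {
        ($key, $data) = ($k, ${$rec});
        last;
    }
}
die "author record not found\n" unless defined $data;
# Flip the superuser flag and write the record back under the same key
# (it was '12' in my installation)
$data->{'is_superuser'} = 1;
$hash{$key} = $serializer->serialize( \$data );
untie %hash;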

For other fixes to Movable Type installations, consider MT-Medic.

movabletype perl hack

( Oct 29 2006, 09:35:21 PM PST ) Permalink

Thursday October 19, 2006

OpenID on Technorati

As I announced on the Technorati Weblog, we rolled out support for blog claiming with OpenID. I'm really proud of the work that Chris and the team have done to make this a reality. If you're not familiar with OpenID, here is one good place to start. Sure, I'm well aware of the concerns about phishy user interface vulnerabilities. The idea of logging in without a password may seem weird.

One weird thing, for new users, is that instead of logging into an OpenID-using site (like Zooomr) with a user name and password, you just give it your personal OpenID URL -- and no password. Then your browser pops over to your authenticating site (like myopenid.com) to verify that you want to use your persona on the new site. This is bound to initially confuse people, and since users may not be asked for a password, it can also appear to be less secure, although it is not.
ZDNet: OpenID has a potential cure for Website password overload - Rafe Needleman

Frankly, I'm not certain what the best resolutions are for those concerns. However, I'm more comfortable with adopting OpenID "as-is" and evolving as the technology advances than sitting around waiting for it to be perfected. Welcome to now.

Distributed identity ideas have been gestating for a long time while identity cathedrals have been built and fallen. If your blog is your voice, your URL can be your identity.

openid technorati

( Oct 19 2006, 11:42:04 PM PDT ) Permalink

Thinking about linking

Whenever I look at page to page, post to post, blog to blog and domain to domain relationship statistics (and permutations across them) interesting things often emerge. Microsoft's Live Search recently released a linkfromdomain operator that can help dig into these linking relationships. For instance, linkfromdomain:arachna.com ruby returns the pages that I've linked to that have ruby in the text. Combined with the site operator, I can do a search of the pages I've linked to on Technorati with linkfromdomain:arachna.com site:technorati.com.
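If you want to generate these queries in bulk, the operators compose as plain space-separated tokens. Here's a minimal sketch in Python; the function name and parameters are my own invention for illustration, and it only builds the query string, it doesn't call any search API:

```python
def link_query(linkfromdomain, site=None, terms=""):
    """Compose a search query string from the operators described above."""
    # linkfromdomain: restricts results to pages linked from that domain
    parts = [f"linkfromdomain:{linkfromdomain}"]
    # site: optionally narrows the results to a single target site
    if site:
        parts.append(f"site:{site}")
    # free-text terms that must appear in the linked-to pages
    if terms:
        parts.append(terms)
    return " ".join(parts)


print(link_query("arachna.com", terms="ruby"))
print(link_query("arachna.com", site="technorati.com"))
```

The two calls reproduce the example queries from this post.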

Looks like the blogosphere is noticing, within the last two days Technorati has seen 57 links to the linkfromdomain announcement blog post. Kudos to MSN's search team for a cool innovation.

One apparent problem with their crawls is javascript/flash-plugin handling, the site:youtube.com linkfromdomain:technorati.com SERP shows pages referenced from Technorati's most linked-to YouTube videos, however all of the SERP items have the text

Hello, you either have JavaScript turned off or an old version of Macromedia's Flash Player. Click here to get the latest flash player.

heh!
Anyway, combine programmatic access (you can get a feed of that search with this link) with these link operators and Live Search is a very powerful and useful product. Read more about it on Live Search's WebLog

search livesearch msn technorati

( Oct 19 2006, 06:56:16 AM PDT ) Permalink

Wednesday October 18, 2006

Saturn Eclipse

This was on NASA's Astronomy Picture Of The Day site a few days ago, I haven't been able to close the browser tab with it... I just keep gazing at the surreality of it.

In the shadow of Saturn, unexpected wonders appear. The robotic Cassini spacecraft now orbiting Saturn recently drifted in giant planet's shadow for about 12 hours and looked back toward the eclipsed Sun. Cassini saw a view unlike any other. First, the night side of Saturn is seen to be partly lit by light reflected from its own majestic ring system. read on

NASA goes on to explain that the eclipse revealed newly detected strata of rings around Saturn.

saturn nasa cassini eclipse

( Oct 18 2006, 10:45:16 AM PDT ) Permalink

Tuesday October 17, 2006

More Greening at Google

Between Google's extensive use of employee shuttles, their green data centers proposal last month and yesterday's announcement Google to Convert HQ to Solar Power, I'm really impressed with the ecologically conscientious initiatives they're taking! Personal note: the solar installation will be led by Energy Innovations, EI president Andrew Beebe is a friend from years ago who I've long lost touch with but I was very pleased to see his name associated to this project.

google green solar

( Oct 17 2006, 06:52:10 AM PDT ) Permalink

Saturday October 07, 2006

Scaling Down

It's broadly appreciated how scaling up is usually driven by business demand, but the requirements for scaling down are rarely as appreciated. Questions about how web 2.0 businesses scale up abound these days. As the challenges of service growth and business plans stress technical infrastructure, startups try to squeeze everything they can out of their architecture with a number of widely accepted practices. However, scaling considerations for the other direction are oft neglected.

Why should you be thinking about scaling down?

Isolated functional testing to mitigate the riskiness of change
End-to-end testing that doesn't require duplication of production infrastructure is a strategic advantage. I know of a financial analytics system run by a large institution that is untestable. This system has cron jobs, data feeds and query systems built on top of Perl code going back at least a decade. The inputs and outputs are so convoluted that the system is untestable. So if this code is making the bank that owns it tens of millions of dollars every day (it is!), what's wrong with that? Well, it could probably be more profitable if it could be changed and optimized safely. As it stands, the folks maintaining the code don't really know what modifications might break the system and with income produced at that scale, who wants to risk it? So look at the systems you're working on now, think about the "scaling up" considerations you've made and ask yourself: Is a system testable in a developer's environment? Can they unit test? Can they perform functional tests? Do the tests require access to resources only available at the data center? Is "now" hardcoded to the present in your code? Using scaled down database, messaging, caching and application runtimes that have no dependencies on a connected network and production infrastructure should be considered up front in your design considerations.
Operational costs of vertical vs horizontal scaling
If a system makes assumptions about the process space it runs in that allows for functionality to be accessed from other runtimes, bravo: you may be headed in the right direction of service oriented architecture and horizontal scaling. But can the application stack be collapsed? This is like the OMG-moment when folks first started running J2EE application tiers over remote interfaces and realized that they've ended up with so much complexity and overhead, they have no choice but to scale up. That complexity can have all kinds of expensive side effects with how effectively systems can be triaged when they ail.
Business agility or just changing your mind
Businesses are run by people. People make mistakes. Wetware is imperfect. When you buy a long term commitment to a data center, you may be assuming liabilities that will outlive the business proof. Make sure the hardware footprint you're signing up for is one you can sustain or get out of. When you build gratuitous tiers, the costs of taking them out when it's time to consolidate functionality can be stifling. So ask yourself: If systems scaled up to meet business objectives that aren't met, can you "retreat" from the scale-up offensive?
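To make the 'Is "now" hardcoded to the present' question from the testing item concrete, here's a minimal sketch in Python of one common way around it: let the caller pass the clock in. The function name, the staleness rule and the dates are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age, now=None):
    """Return True when last_updated is older than max_age."""
    # `now` is injectable: production code omits it and gets the real
    # clock, while a test pins it to a fixed instant.
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_updated > max_age


# A test can freeze time without touching the system clock
fixed_now = datetime(2006, 10, 7, 12, 0, tzinfo=timezone.utc)
fresh = fixed_now - timedelta(minutes=5)
old = fixed_now - timedelta(days=2)

print(is_stale(fresh, timedelta(days=1), now=fixed_now))
print(is_stale(old, timedelta(days=1), now=fixed_now))
```

The same idea probably applies to the other questions: anything that reaches for the data center (database, message queue, cache) can be handed in as a parameter so a developer's environment can substitute a scaled-down stand-in.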

Every time I see a system that's hard to test, has overwhelmed sysadmins, or is not meeting business objectives and has to be reeled in, I'm reminded of the importance of thinking about scaling in both directions. No, I haven't read the book yet but as someone burdened with too much stuff at home, I've got it on my list.

web 2.0 unit testing functional testing technical operations system architecture software

( Oct 07 2006, 03:53:58 PM PDT ) Permalink

Saturday September 30, 2006

Publishing Little Bits: More than micropublishing, less than big bytes

I find it really fascinating to see the acceptance of a publishing paradigm that lies in between the micropublishing realm of blogging, posting podcasts and videos and "old school" megapublishing. There are of course magazines; your typical piece in the New Yorker is longer than a blog post but shorter than a traditional book. But there's something else on the spectrum, for lack of a better term I'll call it minipublishing.

If you want to access expertise on a narrow topic, wouldn't it be cool to just get that, nothing more, nothing less? For instance, if you want to learn about the user permissions on Mac OS X, buy Brian Tanaka's Take Control of Permissions in Mac OS X. TidBITS Publishing has a whole catalog of narrowly focused publications that are bigger than a magazine article but smaller than your typical book. O'Reilly has gotten into the act too with their Short Cuts series. You can buy just enough on Using Microformats to get started; for ten bucks you get 45 pages of focused discussion of what microformats are and how to use them. Nothing more, nothing less. That's cool!

What if you could buy books in part or in serial form? Buy the introductory part or a specific chapter, if it seems well written, buy more. Many of us who've bought technical books are familiar with publish bloat, dozens of chapters across hundreds of pages that you buy even though you were probably only interested in a few chapters. Sure, sometimes publishers put a few teaser chapters online hoping to entice you to buy the whole megilla. Works for me, I've definitely bought books after reading a downloaded PDF chapter. But I'm wondering now about buying just the chapters that I want.

publishing microformats macosx media micropublishing minipublishing

( Sep 30 2006, 07:04:31 PM PDT ) Permalink

Wednesday September 27, 2006

You Can't Handle The Truth

Colonel Jessup has assumed control of Newsweek:

Ignorance is bliss
How meta:

See ya at the gulag.

media newsweek ministry of truth afghanistan taliban bush

( Sep 27 2006, 04:16:31 PM PDT ) Permalink

Tuesday September 26, 2006

Green Data Centers

At today's Intel Developer Forum, Google is presenting a paper that argues that the power supply standards that are built into today's PCs are anachronistic, inefficient and costly. With the maturing of the PC industry and horizontal scaling becoming a standard practice in data center deployments, it's time to say good-bye to these standards from the 1980's.

John Markoff reported in the NY Times today

The Google white paper argues that the opportunity for power savings is immense, by deploying the new power supplies in 100 million desktop PC's running eight hours a day, it will be possible to save 40 billion kilowatt-hours over three years, or more than $5 billion at California's energy rates.
Google to Push for More Electrical Efficiency in PCs
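The quoted figures are easy to sanity-check. Here's a back-of-the-envelope calculation in Python, assuming (my assumption, not the paper's) that "eight hours a day" means every day of a 365-day year:

```python
# Figures taken from the quote above
pcs = 100_000_000       # desktop PCs
hours_per_day = 8
years = 3
kwh_saved = 40e9        # 40 billion kilowatt-hours
dollars_saved = 5e9     # "more than $5 billion", so treat this as a floor

# Total running time across all the PCs (assumes 365 days of use per year)
pc_hours = pcs * hours_per_day * 365 * years

# Implied average saving per PC while it is running, in watts
watts_per_pc = kwh_saved / pc_hours * 1000

# Implied electricity price, in dollars per kilowatt-hour
dollars_per_kwh = dollars_saved / kwh_saved

print(f"{watts_per_pc:.1f} W saved per running PC")
print(f"${dollars_per_kwh:.3f} per kWh")
```

Under those assumptions the claim works out to a saving in the neighborhood of 46 watts per running PC, priced at about 12.5 cents per kilowatt-hour or a bit more.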

Nice to see Google taking leadership on the inefficiencies of the PC commodity hardware architectures.

google pc green datacenters

( Sep 26 2006, 06:02:09 PM PDT ) Permalink

Monday September 25, 2006

Greater than the sum of its parts

The other week I reflected on the scaling-web-2.0 theme of The Future of Web Apps workshop. Another major theme there was how social software is different, how transformative architectures of participation are. There was one talk that stood out from Tom Coates, Greater than the sum of its parts. A few days ago the slides were posted; I poked through 'em since and they jogged some memories loose, I thought I'd share Tom's message, late though it is, and embellish with my spin.

Tom's basic thesis is that social software enables us to do "more together than we could apart" by "enhancing our social and collaborative abilities through structured mediation." Thinking about that, isn't web 1.0 about structured mediation? Centralized services, editors & producers, editorial staff & workflow, bean counting eyeballs, customer relationship management, demographic surveys and all of that crap? Yes, but what's different is that web 2.0 structured mediation is about bare sufficiency in that it's better to have too little than too much, the software should get out of the way of the user, make him/her a participant, not lead him/her around by the nose.

Next, Tom highlighted that valuable social software should serve

Individual Motives: An individual should get value from their contribution
Social Value: The individual's contributions should provide value to their peers as well
Business/Organizational Value: The organization that hosts the service should enable the user to create and share value and then derive aggregate value to expose this back to its users. I thought that was really well considered.

Tom outlined a spectrum of social software, on the one hand consensus focused and fact oriented where many contributions make one voice and, on the other hand, a social contribution focus and polyphony where many voices produce emergent order. Wikipedia, MusicBrainz and openstreetmap.org are illustrative of the former, Flickr, Plazes, YouTube and Last.fm the latter. Tom discussed the motives for contributing to the community:

anticipate reciprocity: by offering value, it's reasonable to expect others to contribute value as well
reputation: by showing off a little, highlighting something uniquely yours to contribute, you gain prestige
sense of efficacy: by being able to make an impact, a sense of worth is felt
identification with a group: be it for altruism or attachment, contributing to a group makes you part of it

Think about every mailing list you've been on, every online forum and simulated environment you've used and you know it's true, these are among the basic underpinnings of virtual community

Citing The Success of Open Source, he likened social software participants' motivations to this ranked list of open source contributors' motivations

learning to code
gaining reputation
scratching an itch
contributing to the commons
sticking it to Microsoft (well, probably no analog in that motive for participating in social software ...)

At a meta-level then, commoditization of memes is driven similarly to open source's commoditization of software capabilities. I think this analogy requires exploration, particularly now. While Mark Pilgrim counts all Non-Commercial-Use-Only licenses as overly restrictive, I disagree. I don't think we need to remove all encumbrances on our words in order to freely disseminate memes. On the contrary, if every ne'er-do-well kleptotorial spammer has free rein of your words, it seems more likely that your meaning's authenticity will get lost as it gets reposted on legions of AdSense-laden splogs. So while many of the motivations for contribution inspire analogy, the licensing ramifications are very different. I own my own words. Feel free to quote, excerpt or otherwise use them for non-commercial use. Everything else is a negotiation.

Here are some social software "best practices":

Expose every axis of data you can: every axis of data is an application opportunity
Give people a place to represent themselves
- these are my bookmarks
- these are my photos
- these are my videos
- here is my voice
Allow them to associate, connect and form relations with one another
Help them annotate, rate and comment... on Digg, every action is a form of self expression
Look for ways to expose this data back onto the site

And here's what to watch out for:

Be wary of how money changes everything; points, votes and competition can distort the social values as well
Be very careful of user expectations around how private or public their contribution is
Be wary of creating monocultures or echo chambers

So, what's the business? Where's the money?

Attention and advertising
Premium accounts
Building services around the data
Using user-generated annotations and contributions to improve your other services

Well, AFAICT, the business models still need to prove themselves. We've seen virtual communities become viral communities; driven by social networking, peer to peer technologies and other bindings but apart from Fox's MySpace acquisition, where's the big money? Hopefully we'll see "IPO 2.0" events, web 2.0 companies enjoying financial vigor and going public, in the next year or so. Ultimately, it's liquidity that will provide commercial validation. Anyway, you'll find a lot of this in Tom's slides but unfortunately, what's online is just a shadow of his live preso @ The Future of Web Apps.

futureofwebapps-sf06 social software virtual community

( Sep 25 2006, 10:24:40 AM PDT ) Permalink

Thursday September 21, 2006

Community Policing In The Blogosphere

I mused about people-powered topic classification for blogs after playing with the Google Image Labeller the other week. It seems like a doable feature for Technorati because the incentives to game topic classification are low.

That same week, Rafe posed a question about community driven spam classification:

Why couldn't Blogger or Six Apart or a firm like Technorati add all of the new blogs they register to a queue to be examined using Amazon's Mechanical Turk service? I'd love to see someone at least do an experiment in this vein. The only catch is that you'd want to have each blog checked more than once to prevent spiteful reviewers from disqualifying blogs that they didn't agree with.
(read the rest)

The catch indeed is that the incentive is high for a system like this to be gamed. Shortly after blogger implemented their flag, spammers ~~fired~~ laughed back with bloggerbowling:

"Bloggerbowling" - the practice of having robots flag multiple random blogs as splogs regardless of content to degrade the accuracy of the policing service.

As previously cited from Cory, all complex ecosystems have parasites. So I've been thinking about what it would take to do this effectively: what would it take to overcome the blogosphere parasites' bloggerbowling efforts? The things that come to mind for any system of community policing are about rewards and obstacles. For example

Leverage a user's reputation to weight the value of his/her vote, Technorati's authority ranking (based on the count of unique blogs linking to a blog over 180 days) would be an example of reputation
Raise the barrier for abuse by requiring participants to develop karma over time before they can vote
Create incentives for participation beyond answering "this search had a load of crap in it, how can I clear it out of the way?" (most Technorati users toggle the authority filter)
Instrument the system to ferret out the usage statistics, the actions of obvious 'bowlers would have to be automatically discarded
Support administrative intervention, staff would have to be watching the detectives
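Here's a rough Python sketch of how the reputation, karma and usage-statistics items above might combine into a single spam score. To be clear, this is a thought experiment, not anything Technorati runs: the thresholds, the field names and the log damping are all assumptions I made up for illustration.

```python
import math
from dataclasses import dataclass

# All thresholds below are made-up numbers for illustration only.
MIN_KARMA_DAYS = 30      # voters newer than this cannot vote yet
MAX_FLAGS_PER_DAY = 200  # more flags than this looks like a 'bowler robot
SPLOG_SCORE = 10.0       # total weighted flags needed to call a blog a splog


@dataclass
class Voter:
    authority: int    # unique blogs linking to the voter's blog
    karma_days: int   # days of good-standing participation
    flags_today: int  # flags this voter cast in the last day


def vote_weight(voter: Voter) -> float:
    """Weight one flag by reputation; zero it for new or abusive voters."""
    if voter.karma_days < MIN_KARMA_DAYS:
        return 0.0  # karma barrier: no vote until the account has history
    if voter.flags_today > MAX_FLAGS_PER_DAY:
        return 0.0  # usage statistics: discard obvious 'bowlers
    # log1p damps the effect of very large authority counts
    return math.log1p(voter.authority)


def is_splog(flaggers: list[Voter]) -> bool:
    """Sum the weighted flags and compare against the threshold."""
    return sum(vote_weight(v) for v in flaggers) >= SPLOG_SCORE


# Fifty throwaway robot accounts versus three established bloggers
robots = [Voter(authority=0, karma_days=1, flags_today=5000) for _ in range(50)]
regulars = [Voter(authority=100, karma_days=400, flags_today=3) for _ in range(3)]

print(is_splog(robots))
print(is_splog(regulars))
```

The incentive and administrative-intervention items don't reduce to a few lines of code, so they're left out of the sketch.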

I've participated in virtual communities of many flavors for years (in fact, Cory and Rafe are familiar faces from over a dozen years ago on The WeLL, back then I was a newbie amongst oldtimers). Virtual communities work well when there are social bonds, when there is accountability and reputational capital that gets put on the line. The stronger those factors, the greater the motivation for community policing. Who's motivated to police the blogosphere? Obviously, Technorati is motivated; if the up-to-the-minute is up-to-its-neck-in-crap, the value diminishes quickly. Another class of motivated users are folks like Doc, authors of narrative ripped off by kleptotorial sploggers. The last class of motivated ecosystem participants that comes to mind are the victims of click fraud, from what I've heard their outcomes to date have been lotsa free ads and their lawyers fetching fat fees.

At the end of the day, I don't have the answers. But I think Rafe, Doc and so many others concerned with splog proliferation are asking great questions. Technorati is currently keeping a tremendous volume of spam out of its search results but, at the end of the day, there's still much to do. And this post is the end of my day, today.

spam splog splogs technorati virtual community blogs web spam

( Sep 21 2006, 11:06:22 PM PDT ) Permalink

Wednesday September 13, 2006

Everybody Hurts, Sometimes

A few weeks ago, Adam mentioned some of the shuffling going on at Technorati's data centers. Yep, we've had our share of operational instability lately, when you have systems that expect consistent network topologies and that has to change, I suppose these things will happen. It seems a common theme I keep hearing in conversations and presentations about web based services: the growing pains.

This morning, Kevin Rose discussed The digg story: from one idea to nine million page views at The Future of Web Apps workshop. Digg has had to overcome a lot of the "normal" problems (MySQL concurrency, data set growth, etc) that growing web services face and has turned to some of the usual remedies: rethinking the data constructs (they hired DBA's) and memcached. This afternoon, Tantek was in fine form discussing web development practices with microformats where he announced updates to the search system Technorati's been cooking, again a growth induced revision. Shortly thereafter, I enjoyed the stats and facts that Steve Olechowski presented in his 10 things you didn't know about RSS talk. And so it goes, this evening it was Feedburner having an episode. "me" time -- heh, know how ya feel <g>

While Feedburner gets "me" time, Flickr gets massages when they have system troubles. Speaking of Flickr, I'm looking forward to Cal Henderson's talk, Taking Flickr from Beta to Gamma at tomorrow's session of The Future of Web Apps. I caught a bit of Scaling Fast and Cheap - How We Built Flickr last spring, Cal knows the business. I've been meaning to check out his book, Building Scalable Web Sites.

Perhaps everybody needs a therapeutic massage for the times of choppy seas. When Technorati hurts, it just seems to hurt. Should it be getting meditation and tiger balm (hrm, smelly)? Some tickling and laughter (don't operate heavy machinery)? Animal petting (could be smelly)? Aromatherapy (definitely smelly)? Data center feng shui? Gregorian chants? R.E.M. samples?

futureofwebapps-sf06 palaceoffinearts flickr feedburner digg technorati microformats memcached

( Sep 13 2006, 09:26:42 PM PDT ) Permalink

Monday September 04, 2006

Applying Security Tactics to Web Spam

Hey, I'm in Wired! The current Wired has an article about blog spam by Charles Mann that includes a little bit of my conversation with him. Spam + Blogs = Trouble covers a lot of the issues facing blog publishers (and in a broader sense, ~~user generated content~~ participant created artifacts in general). There are some particular challenges faced by services like Technorati that index these goods in real time; not only must our indices have very fast cycles, so must our abilities to keep the junk out. I was in good company amongst Mann's sources, he talked to a variety of folks from many sides of the blog spam problem: Dave Sifry, Jason Goldman, Anil Dash, Matt Mullenweg, Natalie Glance and even some blog spam perps.

I've also had a lot of conversations with Doc lately about blog spam and the problems he's been having with kleptotorial. A University of Maryland study of December 2005 pings on weblogs.com determined that 75% of the pings are spam AKA spings. By excluding the non-English speaking blogosphere and not taking into account the large portions of the blogosphere that don't ping weblogs.com, that study ignored a larger blogosphere but overall, that assessment of the ping stream coming from weblogs.com seemed pretty accurate. As Dave reported last month, by last July we were finding over 70% of the pings coming into Technorati to be spam.

Technorati has deployed a number of anti-spam measures (such as targeting specific Blogger profiles, as Mitesh Vasa has. Of course there's more that we've done but if I told you I'd have to kill you, sorry). There are popular theories in circulation on how to combat web spam involving blacklists of URLs and text analysis but those are just little pieces of the picture. Of the things I've seen from the anti-splog crusader websites, I think the fighting splog blog has hit one of the key vulnerabilities of splogs: they're just in it to get paid. So, hit 'em in the wallet. In particular, splog fighter's (who is that masked ranger?) targeting of AdSense's Terms of Service violators sounds most promising. Of course, there's more to blog spam than AdSense, Blogger and pings. The thing gnawing at me about all of these measures is their reactiveness. The web is a living organism of events, the tactics to keeping trashy intrusions out should be event driven too.

Intrusion detection is a proven tool in the computer security practice. System changes are a disturbance in the force, significant events that should trigger attention. Number one in the list of The Six Dumbest Ideas in Computer Security is "Default Permit." I remember the days when you'd take a host out of the box from Sun or SGI (uh, who?) and it would come up in "rape me" mode. Accounts with default passwords, vulnerability-laden printing daemons, rsh, telnet and FTP (this continued even long after the arrival of ssh and scp), all kinds of superfluous services in /etc/inetd.conf and so on. The first order of business was to "lock down" the host by overlaying a sensible configuration. The focus on selling big iron (well, bigger than a PC) into the enterprise prevented vendors from seeing the bigger opportunity in internet computing and the web. And so reads the epitaph of old-school Unix vendors (well, in Sun's case Jonathan Schwartz clearly gets it -- reckoning with the "adapt or die" options, he's made the obvious choice). Those of us building public facing internet services had to take the raw materials from the vendor and "fix them". The Unix vendors really blew it in so many ways, it's really too bad. The open source alternatives weren't necessarily doing it better, even the Linux distros of the day had a lot of stupid defaults. The BSDs did a better job but, unless you were Yahoo! or running an ISP, BSD didn't matter (well, I used FreeBSD very successfully in the 90's but then I do things differently). Turning on access to everything but keeping out the bad guys by selectively reacting to vulnerabilities is an unwinnable game. When it comes to security matters, the power of defaults can be the harbinger of doom.

The "Default Deny" approach is to explicitly prescribe what services to turn on. It's the obvious, sensible approach to putting hosts on a public network. By having very tightly defined criteria for what packets are allowed to pass, watching for adversarial connections is greatly simplified. I've been thinking a lot about how this could be applied to providing services such as web search while also keeping the bad guys (web spammers) out.

Amongst web indexers, the big search services try to cast the widest net to achieve the broadest coverage. Remember the "mine is bigger than yours" flap? Search indices seemingly follow a Default Permit policy. On the other extreme from "try to index everything" is "only index the things that I prescribe." This "size isn't everything" response is seen in services like Rollyo. You can even use Alexa Web Search Platform to cobble your own index. But unlike the case of computer security stances, with web search you want opportunities for serendipity; searching within a narrowly prescribed subset of the web greatly limits those opportunities. Administratively managed Default Deny policies will only get you so far. I suspect that in the future, effective web indexing is going to require more detailed classification: a Default Deny with algorithmic qualification to allow. Publishers will have to earn their way into the search indices through good behavior.

The blogosphere has thrived on openness and ease of entry but indeed, all complex ecosystems have parasites. So, while we're grateful to be in a successful ecosystem, we'd all agree that we have to be vigilant about keeping things tidy. The junk that the bad guys want to inject into the update stream has to be filtered out. I think the key to successful web indexing is to cast a wide net, keep tightly defined criteria for deciding what gets in, and use event driven qualification to match the criteria. The attention hijackers need to be suppressed and the content that would be misappropriated has to be respected. This can be done by deciding that whatever doesn't meet the criteria for indexing should be kept out. Not that we have to bid adieu to the yellow brick road of real time open content but perhaps we do have to set up checkpoints and rough up the hooligans who soil the vistas.
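To make "Default Deny with event driven qualification" a little more concrete, here's a rough sketch in Python. To be clear, this is not how Technorati does it; the `Ping` fields, the two criteria and their thresholds are all made up for illustration. The only point is the shape: each ping is judged as it arrives, and anything that doesn't affirmatively pass every criterion stays out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Ping:
    """One update event from the ping stream. All fields are hypothetical signals."""
    url: str
    copied_fraction: float  # share of the post's text found elsewhere, 0.0 to 1.0
    pings_last_hour: int    # how often this source has pinged recently


Criterion = Callable[[Ping], bool]


def mostly_original(ping: Ping) -> bool:
    # Hypothetical threshold: at least half of the text is the publisher's own.
    return ping.copied_fraction < 0.5


def sane_ping_rate(ping: Ping) -> bool:
    # Hypothetical threshold: no more than 20 pings in the past hour.
    return ping.pings_last_hour <= 20


def qualifies(ping: Ping, criteria: List[Criterion]) -> bool:
    # Default Deny: an empty rule set admits nothing, and every rule must pass.
    return bool(criteria) and all(check(ping) for check in criteria)


def admit(stream: Iterable[Ping], criteria: List[Criterion]) -> Iterator[Ping]:
    # Event driven: each ping is judged as it arrives, before it reaches the index.
    for ping in stream:
        if qualifies(ping, criteria):
            yield ping


criteria = [mostly_original, sane_ping_rate]
stream = [
    Ping("http://example.com/blog", copied_fraction=0.1, pings_last_hour=2),
    Ping("http://example.net/splog", copied_fraction=0.95, pings_last_hour=400),
]
indexed = [ping.url for ping in admit(stream, criteria)]
```

Note that the net is still cast wide: every ping gets looked at. What changes is the default answer, which is "no" until the publisher earns a "yes".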

spam web spam splog splogs adsense technorati wired

( Sep 04 2006, 11:10:15 PM PDT ) Permalink

Saturday September 02, 2006

Mechanical Turk Tagging

I spent way too much time last night giving Google some free labor. The Google Image Labeler is kinda fun, in a peculiar way. In 90-second stretches that AJAX-ishly link you to someone else out there in the ether, you are shown images and a text box to enter tags ("labels" is apparently Google's preferred term, whatever). Each time you get a match with your anonymous partner, you get 100 points. The points are like the ones on Whose Line Is It Anyway: they don't matter. And yet it was strangely fun. The most I ever got in any one 90-second session was 300 points. Network latency was the biggest constraint; sometimes Google's image loading was slow. Also, the images are way too small on my Powerbook ... this is the kinda thing you want a Cinema Display for (the holidays are coming, now you know what to get me).
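The matching mechanic is simple enough to sketch. This is just my guess at the logic in Python, not Google's code: the function names are invented, the normalization (ignore case and extra whitespace) is an assumption, and the only number taken from the game is the 100 points per match.

```python
from typing import Iterable, List, Set, Tuple


def normalize(label: str) -> str:
    # Assumed normalization: ignore case and extra whitespace.
    return " ".join(label.lower().split())


def agreed_labels(mine: Iterable[str], theirs: Iterable[str]) -> Set[str]:
    # Labels both partners entered for the same image, ignoring blank entries.
    mine_set = {normalize(label) for label in mine}
    theirs_set = {normalize(label) for label in theirs}
    return (mine_set & theirs_set) - {""}


def session_score(
    rounds: List[Tuple[List[str], List[str]]], points_per_match: int = 100
) -> int:
    # One image per round; a round scores if the partners share at least one label.
    return sum(
        points_per_match for mine, theirs in rounds if agreed_labels(mine, theirs)
    )


rounds = [
    (["Dog", "park"], ["dog"]),             # match on "dog"
    (["car"], ["truck"]),                   # no match
    (["red  barn"], ["Red Barn", "farm"]),  # match after normalization
]
score = session_score(rounds)
```

The agreed labels are the interesting byproduct; the score is only there to keep you playing.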

So what if Technorati did this? Suppose you and some anonymous cohort could be simultaneously shown a blog post and tag it. Most blogging platforms these days support categories. But there are a lot of blog posts out there that might benefit from further categorization. Authors are already tagging their posts and blog readers can already tag their favorite blogs but enabling an ESP game with blog posts sounds like an intriguing way to refine categorization of blogs and posts.

tagging esp game google image labeler mechanical turk

( Sep 02 2006, 12:31:26 PM PDT ) Permalink