What's That Noise?! [Ian Kallen's Weblog]

Wednesday January 25, 2006

HTML in the Real World

Google has published a study of how HTML is really being used out in the wild. They've posted their results as Web Authoring Statistics:

December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata. The Good, The Bad and The Ugly: it's all in there.

A billion documents sampled, nice!

I think this is a demonstration of Google's expanding interest in grokking the semantics that are latent in document structures. The results are broken down by:

Pages and elements
Elements and attributes
Classes (class="")
HTTP headers
Page headers, head element contents
Metadata, meta tag contents
The body element
Text elements
Table elements
Link relationships (rel/rev)
The a element
The img element
Scripting: The <script> element
Editors and their custom markup

That's pretty good coverage!
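
To make the kind of tallying behind those breakdowns concrete, here is a minimal sketch that counts element names, attribute names, and class names in a single document using Python's standard `html.parser`. The `MarkupTally` class, the `tally` helper, and the sample markup are all made up for illustration; this is not how Google ran its survey, just one plausible way to gather similar counts on a small scale.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupTally(HTMLParser):
    """Hypothetical helper: tallies elements, attributes, and class names."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.attributes = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.elements[tag] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            # class="" may hold several space-separated names, or be empty.
            if name == "class" and value:
                self.classes.update(value.split())


def tally(html_text):
    """Parse one document and return the populated tally object."""
    parser = MarkupTally()
    parser.feed(html_text)
    parser.close()
    return parser


# Made-up sample markup, only for demonstration.
sample = (
    '<p class="entry hot">'
    '<a href="/x" rel="tag" class="entry">x</a>'
    '<img src="a.png">'
    '</p>'
)
result = tally(sample)
print(result.elements.most_common())
print(result.classes.most_common())
```

Running the same tally over many documents and summing the counters would give rough popularity rankings like the ones in the study, though a real survey would also have to cope with encodings, broken markup, and sampling.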

I'd love to see how the data changes over time. I suspect parts of the web are becoming more orderly (web 2.0 applications are likely using well-formed document structures) while the web as a whole is probably atrophying (the vast installed base of crappy or misconfigured tools is likely the preponderant generator of markup). I'm anticipating a lot of interesting data emerging as the ascendance of microformats continues. Goog, looking forward to follow-up surveys!

google microformats html

( Jan 25 2006, 06:37:29 AM PST ) Permalink
