Props from Jeremy on our anti-blog-spam efforts are certainly appreciated. I know we don't have a spam-free index; however, the amount of spam we keep out of the index is truly astonishing. Our ping interface is deluged with a torrent of rubbish, but we do our best to scrub the nasty stuff out of our update stream. The problem defies conventional mail spam or even blog comment spam analytic techniques, as the structure of blog spam is very different. Deep examination of the content and structure across a pattern of web sites is often required to identify it as spam, but in the end, the indicators are there. Most spammers' publishing behaviors are statistical outliers by nature; the numbers speak for themselves.
We have a lot to do, on this and on many other fronts, but we try to pay attention to the gripes as a measure of priorities. The kudos are nice, too!
( Jan 08 2006, 08:29:31 PM PST )
The levers and dials of character set encoding can be overwhelming; just looking at the matrix supported by J2SE 1.4.2 gives me vertigo. Java's encoding conversion support is simple enough, if a little garrulous:
String iso88591String = request.getParameter("q");
// Recover the raw bytes that were decoded as ISO-8859-1, then decode them as UTF-8.
String utf8String = new String(iso88591String.getBytes("ISO-8859-1"), "UTF-8");

But what do you do if you don't know what encoding you're dealing with to begin with? It looks as though there are a couple of ways to do it:
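String q_unknown_japanese = request.getParameter("q");
// JISAutoDetect guesses among the common Japanese encodings; it only decodes, it cannot encode.
String q_unicode = new String(q_unknown_japanese.getBytes("ISO8859_1"), "JISAutoDetect");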
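Both snippets above lean on the same trick: ISO-8859-1 maps every byte value to exactly one character, so a string that was wrongly decoded as ISO-8859-1 can be turned back into its original bytes and decoded again. Here is a minimal, self-contained sketch of that round trip, plus one possible way to guess an encoding by trying candidates with a strict decoder. The class name EncodingSketch and the methods redecode and firstStrictMatch are made up for illustration, and the sketch assumes the original decode really was ISO-8859-1, which depends on your servlet container's configuration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSketch {

    // Undo a wrong ISO-8859-1 decode: recover the raw bytes,
    // then decode them with the charset they were really written in.
    static String redecode(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    // Return the name of the first candidate charset that decodes the bytes
    // without a malformed-input or unmappable-character error, or null if none does.
    // Candidates this JVM does not support are skipped.
    static String firstStrictMatch(byte[] raw, String... candidates) {
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue;
            }
            try {
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(raw));
                return name;
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; try the next one.
            }
        }
        return null;
    }
}
```

A call such as EncodingSketch.firstStrictMatch(bytes, "US-ASCII", "UTF-8", "ISO-8859-1") reports the first charset in which the bytes are well-formed. Order matters: ISO-8859-1 accepts any byte sequence, so it only makes sense as the last resort, and a clean decode is evidence rather than proof that the guess is right.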