What's That Noise?! [Ian Kallen's Weblog]


Tuesday February 15, 2005

A Java i18n Checklist

I've worked on Java projects with the human-language aspects abstracted out, but now that I've had to get down and dirty with real localization problems, I'm starting to take mental notes in the "if I had known that earlier it would've saved me a lot of grief" folder.

It seemed pretty straightforward going into the project that I'm working on:

  1. Text is stored as UTF-8 character data
  2. Use ResourceBundle property files to manage display strings
  3. Maintain the set of property keys in the properties file
  4. Let the browser's Accept-Language request headers drive what property file to prefer
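For what it's worth, here's a rough sketch of the last point in that list: picking a locale from an Accept-Language header value with plain JDK classes (Java 8 or later). The class name LocalePicker, the SUPPORTED list and the English fallback are all made up for illustration; in a servlet, request.getLocale() gives you the container's own answer to the same question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class LocalePicker {
    // Hypothetical: the locales this application ships property files for.
    static final List<Locale> SUPPORTED =
            Arrays.asList(Locale.ENGLISH, Locale.JAPANESE, Locale.FRENCH);

    // Hypothetical fallback when nothing in the header matches.
    static final Locale FALLBACK = Locale.ENGLISH;

    /** Picks the best supported locale for a raw Accept-Language header value. */
    static Locale pick(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return FALLBACK;
        }
        try {
            // Parses entries such as "ja,en-US;q=0.8" and orders them by weight.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            // Lookup tries each range in order, truncating "fr-CA" to "fr" and so on.
            Locale match = Locale.lookup(ranges, SUPPORTED);
            return match != null ? match : FALLBACK;
        } catch (IllegalArgumentException malformedHeader) {
            return FALLBACK;
        }
    }
}
```

The returned Locale would then go to something like ResourceBundle.getBundle("Messages", locale), where "Messages" is a hypothetical bundle base name.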
See, it's easy! Well, for simple proofs of concept with Western characters, it's just about that easy. When dealing with multibyte strings for Asian languages, there's a whole lot more to consider.
  1. Make sure the servlet container is handling UTF-8 appropriately
    For instance, if Tomcat is serving HTTP directly, edit server.xml and make sure the URIEncoding attribute (absent by default) is set on the connector:
        <Connector port="8080"
                   maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
                   enableLookups="false" redirectPort="8443" acceptCount="100"
                   debug="0" connectionTimeout="20000"
                   disableUploadTimeout="true"
                   URIEncoding="UTF-8"
        />
    
    The same holds true when letting Apache do the HTTP dirty work and connecting with mod_jk:
        <Connector port="8009"
                   enableLookups="false" redirectPort="8443" debug="0"
                   URIEncoding="UTF-8"
                   protocol="AJP/1.3" 
        />
    
    And, by the way, if you have static content served by an Apache server, you probably want this as well:
    AddDefaultCharset utf-8
    
  2. Wire up the native2ascii Ant task into the build system early in the project.
    Manually dealing with the ASCII escaping is a nuisance. If the conversion can't be transparent, at least automate it.
  3. Make sure the database connection drivers are being gentle with their data handling.
    In the case of MySQL, changing the JDBC URLs from this
    jdbc:mysql://localhost/fubar
    
    to this
    jdbc:mysql://localhost/fubar?useUnicode=true&characterEncoding=UTF-8
    
    made a world of difference.
  4. Check the HTTP response headers to ensure that the Content-type header value is appropriate
    If the charset isn't declared as UTF-8 when the content really is UTF-8, you could be confusing the client. This can be set in a servlet, in a JSP and, IIRC, struts-config.xml allows you to set it declaratively. You want to set the Content-type before obtaining the response object's PrintWriter, since the charset can no longer be changed after getWriter() has been called. Apparently if you have multibyte characters in your JSP page components, you need to set the pageEncoding too, i.e. in the JSP file itself, something like this:
    <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
    
    Though my whole motivation for using Java on this project was to have page components contain only markup and display code; all of the language is abstracted. Anyway, I'm preferring Velocity over JSP these days.
  5. Be prepared to convert request parameter values.
    In my experience, doing this
    request.setCharacterEncoding("UTF-8");
    
    before getting the parameter values has not been reliable (it could be Tomcat bugs, though). However, this appears to be a fairly standard idiom:
    String formValue = new String(request.getParameter("formParam").getBytes("ISO8859_1") /* bytes */, "UTF-8");
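To see why that re-decoding idiom works, here is a small self-contained sketch with no servlet container involved. It assumes the container decoded UTF-8 request bytes as ISO-8859-1; the ParamDecoder class and its method names are invented for the example, and the null check is my own addition since getParameter can return null.

```java
import java.nio.charset.StandardCharsets;

final class ParamDecoder {
    /** Simulates a container that reads UTF-8 request bytes as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] wireBytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(wireBytes, StandardCharsets.ISO_8859_1);
    }

    /**
     * The re-decoding idiom: recover the raw bytes through ISO-8859-1, which maps
     * every byte to exactly one char and back, then decode them as UTF-8.
     */
    static String redecode(String rawParameter) {
        if (rawParameter == null) {
            return null; // the parameter was absent from the request
        }
        byte[] wireBytes = rawParameter.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wireBytes, StandardCharsets.UTF_8);
    }
}
```

The catch is that this only helps when the value really was mis-decoded as ISO-8859-1. Running it over a string the container already decoded correctly will mangle any character outside Latin-1.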
    
There are other i18n traps to beware of; seems like every place data is passed from one subsystem to another there's an opportunity for the encoding to get mangled.
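One cheap way to spot where the mangling happens is to log strings in pure ASCII at each hand-off, using the same \uXXXX style of escaping that native2ascii writes into property files. This is a minimal sketch with a made-up class name, AsciiEscaper; it is not the native2ascii tool itself and may differ from it in details such as hex letter case.

```java
final class AsciiEscaper {
    /**
     * Returns the input with every non-ASCII UTF-16 code unit replaced by a
     * backslash-u escape of four hex digits, leaving ASCII untouched.
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // Characters outside the BMP come out as two escapes (a surrogate pair).
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }
}
```

If the escaped form is right going into a subsystem and wrong coming out, that's the boundary to look at.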

( Feb 15 2005, 10:17:29 PM PST )