May 02, 2005

Quick and Dirty Hack for UTF-8 Support in ResourceBundle

I just don't get why the Sun folks didn't fix this in J2SE 1.5. By specification, PropertyResourceBundles, or more exactly the Properties files behind them, are Latin-1 (i.e. ISO 8859-1) encoded: "When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings." However, since all Latin-1 characters sit at the same code positions in Unicode, which UTF-8 encodes, I don't see a reason why they couldn't have just added support for UTF-8 to the Properties class.
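To make the quoted rule concrete, here is a minimal sketch of what Properties.load(InputStream) does with a UTF-8 file versus a file that uses a Unicode escape. The class name Latin1LoadDemo, the helper loadValue, and the key "greeting" are made up for illustration; only Properties.load itself comes from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class Latin1LoadDemo {

    // Hypothetical helper: loads an in-memory properties file from raw bytes,
    // the same way Properties.load(InputStream) would read a file on disk.
    static String loadValue(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // e-acute typed directly and saved as UTF-8 becomes the two bytes C3 A9.
        byte[] utf8File = "greeting=caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // The same text with the character written as a six-character escape.
        byte[] escapedFile = "greeting=caf\\u00e9".getBytes(StandardCharsets.US_ASCII);

        // Each UTF-8 byte is read as its own Latin-1 character.
        System.out.println(loadValue(utf8File, "greeting").length());    // 5
        // The escape is decoded into the single intended character.
        System.out.println(loadValue(escapedFile, "greeting").length()); // 4
    }
}
```

The first value comes back as "caf" followed by two junk characters, which is exactly the garbage you see when a UTF-8 Properties file is loaded unmodified.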

While PropertyResourceBundle only has an implicit reference to the Properties class, the problem is the overall bad design of the ResourceBundle class hierarchy. The superclass ResourceBundle has two responsibilities: it acts both as a superclass and as a factory for loading ResourceBundles. ResourceBundle handles the loading of PropertyResourceBundles, which themselves inherit from ResourceBundle, and you can already smell a problem with this suspicious implementation. Generally, a superclass should never need to know anything about the child classes extending it. The getBundle() methods in it are defined as final, so there's no way to replace the default implementation of PropertyResourceBundle. Sun has two answers to this problem: either use the native2ascii tool to escape all non-Latin-1 characters in your Properties file, or implement your own ResourceBundle class.

Using native2ascii by hooking it up to your Ant build as a task is fine, but when you are developing and adding UTF-8 strings to your Properties file, it's just an extra burden to run native2ascii after every change. On Sun's forums, Craig McClanahan discusses how you could use your own ResourceBundle class instead of Properties files to resolve the encoding problem. But the issue with custom ResourceBundle classes is that they are inherently different from PropertyResourceBundle: you would need a custom class for each locale you are supporting. Since the ResourceBundle class handles the loading of PropertyResourceBundles and the methods are marked final, you are stuck with the Latin-1 encoding if you want to use Properties files.
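For a feel of what the native2ascii step does to your strings, here is a rough sketch of the same kind of escaping in plain Java. It is not the tool's actual implementation; the class EscapeNonAscii and the method escape are names invented for this example, and it simply escapes every UTF-16 code unit outside 7-bit ASCII.

```java
public class EscapeNonAscii {

    // Hypothetical helper: replaces every char outside 7-bit ASCII with a
    // backslash-u escape of four hex digits, leaving ASCII untouched.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One non-ASCII character (e-acute) grows into a six-character escape.
        String escaped = escape("greeting=caf\u00e9");
        System.out.println(escaped.length()); // 18
    }
}
```

The escaped file is pure ASCII, so it loads correctly as Latin-1, but it is no longer readable or editable as normal text, which is the burden described above.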

The whole problem is stupid. Properties files should have supported UTF-8 in the first place, but the change to support it could have been made at any time afterwards. Assuming UTF-8 as the encoding when reading a Latin-1 encoded file wouldn't have broken anything: this backwards compatibility is the basic reason why UTF-8 is so popular. All is not lost though; you can use your own ResourceBundle factory class for loading ResourceBundles and implement a wrapper class around PropertyResourceBundle for UTF-8 support. Here's a quick and dirty hack to do just that:


import java.io.UnsupportedEncodingException;
import java.util.Enumeration;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public abstract class Utf8ResourceBundle {

public static final ResourceBundle getBundle(String baseName) {
  ResourceBundle bundle = ResourceBundle.getBundle(baseName);
  return createUtf8PropertyResourceBundle(bundle);
}

public static final ResourceBundle getBundle(String baseName, Locale locale) {
  ResourceBundle bundle = ResourceBundle.getBundle(baseName, locale);
  return createUtf8PropertyResourceBundle(bundle);
}

public static ResourceBundle getBundle(String baseName, Locale locale, ClassLoader loader) {
  // Pass the loader through so bundles are found via a custom class loader.
  ResourceBundle bundle = ResourceBundle.getBundle(baseName, locale, loader);
  return createUtf8PropertyResourceBundle(bundle);
}

private static ResourceBundle createUtf8PropertyResourceBundle(ResourceBundle bundle) {
  if (!(bundle instanceof PropertyResourceBundle)) return bundle;

  return new Utf8PropertyResourceBundle((PropertyResourceBundle)bundle);
}

private static class Utf8PropertyResourceBundle extends ResourceBundle {
  PropertyResourceBundle bundle;

  private Utf8PropertyResourceBundle(PropertyResourceBundle bundle) {
    this.bundle = bundle;
  }
  /* (non-Javadoc)
   * @see java.util.ResourceBundle#getKeys()
   */
  public Enumeration getKeys() {
    return bundle.getKeys();
  }
  /* (non-Javadoc)
   * @see java.util.ResourceBundle#handleGetObject(java.lang.String)
   */
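  protected Object handleGetObject(String key) {
    String value;
    try {
      // getString, unlike handleGetObject, also searches the parent bundles.
      value = bundle.getString(key);
    } catch (java.util.MissingResourceException e) {
      return null; // key not found anywhere in the parent chain
    }
    try {
      // Recover the raw bytes that were read as Latin-1, then decode them as UTF-8.
      return new String(value.getBytes("ISO-8859-1"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // Shouldn't fail - but should we still add logging message?
      return null;
    }
  }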

}
}

Above, I've implemented Utf8PropertyResourceBundle as an inner class, but of course you could implement it as a public type if you wanted to use it explicitly. If you look at its handleGetObject method, the byte conversion to UTF-8 is really the only thing these classes are doing, and the thing that Sun missed in their implementation of PropertyResourceBundle.
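The conversion step can be tried out on its own. The following is a minimal sketch, assuming the value was stored as UTF-8 and misread as Latin-1; RedecodeDemo and latin1ToUtf8 are names made up here, and it uses StandardCharsets (Java 7 and later) instead of the charset names above, purely to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

public class RedecodeDemo {

    // Hypothetical helper: maps each misread char back to the byte it came
    // from (Latin-1 is a one-to-one char/byte mapping), then decodes the
    // resulting byte sequence as UTF-8.
    static String latin1ToUtf8(String misread) {
        byte[] rawBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What a Latin-1 reader produces for "caf" + e-acute stored as UTF-8:
        // the bytes C3 A9 turn into the two chars U+00C3 and U+00A9.
        String misread = "caf\u00c3\u00a9";

        String repaired = latin1ToUtf8(misread);
        System.out.println(repaired.length());           // 4
        System.out.println(repaired.equals("caf\u00e9")); // true
    }
}
```

This round trip only works because Latin-1 maps every byte value to exactly one character, so no information is lost when the file is first read.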

Posted by thoughts at May 2, 2005 11:37 AM
Comments

Great! I implemented RB myself once, but this looks much better, thanks!

Posted by: Marcin Cenkier at June 10, 2005 05:50 AM

Very nice, thank you very much.

Posted by: haso at December 11, 2005 07:31 PM

Very cool, however I found a bug in your impl. when asking for a key from a resource bundle with a parent. This is how I fixed it. Notice the call to getString, which makes sure to recursively go over the parent bundles. I'm also checking for null.

protected Object handleGetObject(String key) {
  String value = (String)bundle.getString(key);
  if (value==null) return null;
  try {
    return new String (value.getBytes("ISO-8859-1"),"UTF-8") ;
  } catch (UnsupportedEncodingException e) {
    // Shouldn't fail - but should we still add logging message?
    return null;
  }
}

And another small thing: not all ISO-8859-1 chars are a subset of UTF-8. The (C) sign seems to clash. If I put a \u00a9 in a UTF-8 file, it fails. But if I put the actual sign, it works. For some reason it is encoded as two bytes even though it is below 192 decimal.

Not a big deal though. Still very cool solution.

Posted by: Brian at December 27, 2005 06:28 PM

You are absolutely right on both points, Brian. A good catch on the NPE with parent bundles. About the other thing you mentioned, I've known about it for a long time, but I've been too lazy to update this entry and also wanted to see when somebody would report it. The thing is that even though ISO-8859-1 characters have the same code points in Unicode, all the 8-bit (above ASCII) characters are encoded as two bytes in UTF-8 (whereas all characters are encoded as a single byte, 7 or 8 bit, in ISO-8859-1). So, the above really works only if you know which character set you've stored your properties file in. Still (and this is why it may be better, depending on your application), *the rest* of your application code doesn't need to know about it.

Posted by: Alphageek at January 8, 2006 11:18 PM
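The clash with the (C) sign discussed above can be reproduced in a few lines. This is only a sketch of the conversion step in isolation; EscapeClashDemo and latin1ToUtf8 are made-up names, and the two input strings stand in for what a Latin-1 reader would hand back for a literal UTF-8 sign and for an escaped one.

```java
import java.nio.charset.StandardCharsets;

public class EscapeClashDemo {

    // Hypothetical helper mirroring the hack's conversion: take the chars as
    // Latin-1 bytes and decode those bytes as UTF-8.
    static String latin1ToUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The copyright sign U+00A9 takes two bytes (C2 A9) in UTF-8.
        System.out.println("\u00a9".getBytes(StandardCharsets.UTF_8).length); // 2

        // Literal sign in a UTF-8 file: read as the two chars U+00C2, U+00A9.
        String fromLiteral = "\u00c2\u00a9";
        // Escaped sign: already decoded into the single char U+00A9.
        String fromEscape = "\u00a9";

        // The literal round-trips to the real sign.
        System.out.println(latin1ToUtf8(fromLiteral).equals("\u00a9")); // true
        // The escape becomes the lone byte A9, which is malformed UTF-8 and
        // is decoded to the replacement character U+FFFD.
        System.out.println(latin1ToUtf8(fromEscape).equals("\uFFFD"));  // true
    }
}
```

So with this hack in place, characters above ASCII have to be typed literally in the UTF-8 file; escapes are only safe for code points below 128.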

Very cool. I've found a bug, too. If you have your own classloader to load the resources, the method


public static ResourceBundle getBundle(String baseName, Locale locale, ClassLoader loader) {
  ResourceBundle bundle = ResourceBundle.getBundle(baseName, locale);
  return createUtf8PropertyResourceBundle(bundle);
}


does not work.


Instead, you have to fix it to

public static ResourceBundle getBundle(String baseName, Locale locale, ClassLoader loader) {
  ResourceBundle bundle = ResourceBundle.getBundle(baseName, locale, loader);
  return createUtf8PropertyResourceBundle(bundle);
}

supplying the loader to the getBundle method. Now all works great as expected.

Posted by: Manfred Steinbach at April 24, 2007 07:46 AM