11 February 2009

The infamous Turkish locale bug

I discovered a quirky comment today in Confluence’s Permission.forName(String) method:

// use the english locale to avoid the infamous turkish locale bug
String upperName = permissionName.toUpperCase(Locale.ENGLISH);

Naturally the question popped into my mind: what is the ‘infamous Turkish locale bug’? Looking into the JIRA issues related to the commit (CONF-5931, CONF-7168), I found a link that Agnes had posted to an article about a common Java bug in the Turkish locale: Turkish Java Needs Special Brewing.

In the Turkish alphabet there are two letters for ‘i’, dotless and dotted. The problem is that the dotted ‘i’ in lowercase stays dotted in uppercase (‘İ’), and the dotless ‘ı’ in lowercase stays dotless in uppercase (‘I’). At first glance this wouldn’t appear to be a problem; however, the problem lies in what programmers do with upper- and lowercase in their code.

The two lowercase letters are \u0069 ‘i’ and \u0131 ‘ı’ (dotless ‘i’) and are totally unrelated. Their uppercase versions are, respectively, \u0130 ‘İ’ (capital letter ‘I’ with a dot above it) and \u0049 ‘I’. English does not behave this way: its single lowercase dotted ‘i’ becomes an uppercase dotless ‘I’.
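To make the mapping concrete, here is a small sketch that passes an explicit Locale to the case methods instead of changing the JVM default. The class name and the escape helper are made up for illustration, and Locale.forLanguageTag("tr-TR") is just one way to get a Turkish locale (it assumes Java 7 or later).

```java
import java.util.Locale;

// Hypothetical demo class, not part of Confluence.
final class TurkishCaseMapping {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Renders each char as U+XXXX so the dotted/dotless variants are unambiguous
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format(Locale.ROOT, "U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Turkish rules: dotted stays dotted, dotless stays dotless
        System.out.println(escape("i".toUpperCase(TURKISH)));        // U+0130
        System.out.println(escape("\u0131".toUpperCase(TURKISH)));   // U+0049
        System.out.println(escape("I".toLowerCase(TURKISH)));        // U+0131

        // English rules: dotted lowercase becomes dotless uppercase
        System.out.println(escape("i".toUpperCase(Locale.ENGLISH))); // U+0049
    }
}
```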

With String.toUpperCase(), most Java programmers are effectively trying to neutralize case. Consider a HashMap with string keys and a key that you want to look up. If you want to ignore case, you’ll probably uppercase everything going into the map as well as the string you’re doing the lookup with. This works fine for English, but not for Turkish, where the dotted ‘i’ uppercases to the dotted ‘İ’ rather than to ‘I’.
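A rough sketch of how such a lookup goes wrong. The map, its keys and its values are invented for illustration and are not Confluence’s actual permission table; the point is only that keys written as uppercase literals no longer match once the lookup string is uppercased with Turkish rules.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table keyed by uppercase names.
final class PermissionLookup {
    private static final Map<String, String> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("VIEW", "view pages");
        BY_NAME.put("EDIT", "edit pages");
    }

    // Uppercases the name with the given locale, then looks it up
    static String find(String name, Locale locale) {
        return BY_NAME.get(name.toUpperCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(find("edit", Locale.ENGLISH)); // edit pages
        System.out.println(find("edit", turkish));        // null, the key became "EDİT"
    }
}
```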

This is a nice example of where you need to be very careful how you handle upper- and lowercasing in your application. Changing the word ‘quit’ to uppercase in the Turkish locale will result in ‘QUİT’, not ‘QUIT’. I’ve heard of other examples too: the German ß (sharp ‘s’) doesn’t behave exactly as English speakers would expect either, since Java uppercases it to the two letters ‘SS’.
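Both cases are easy to check on whole words. This is only a sketch: the class and method names are hypothetical, and ‘straße’ is simply a convenient German word containing ß.

```java
import java.util.Locale;

// Hypothetical demo class.
final class WholeWordCasing {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        String turkishQuit = upper("quit", TURKISH);
        System.out.println(turkishQuit.equals("QUIT"));      // false
        System.out.println(turkishQuit.equals("QU\u0130T")); // true

        // The sharp s has no single-letter uppercase form here, so the string grows
        String strasse = upper("stra\u00DFe", Locale.GERMAN);
        System.out.println(strasse);          // STRASSE
        System.out.println(strasse.length()); // 7, one more than the 6-char input
    }
}
```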

There are two ways to properly perform a case-insensitive comparison of Strings in Java in any locale:

  • (preferred) use String.equalsIgnoreCase()
  • use a fixed locale (like Locale.ENGLISH) as an argument to String.toUpperCase(Locale) or String.toLowerCase(Locale).
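The two options above might look like this in practice. The class and method names are placeholders; main switches the default locale to Turkish only to simulate a JVM running in Turkey.

```java
import java.util.Locale;

// Hypothetical helper class.
final class LocaleSafeComparison {
    // Option 1: compare directly; equalsIgnoreCase does not consult the default locale
    static boolean sameName(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Option 2: normalize with a fixed locale, e.g. before using the value as a map key
    static String normalize(String name) {
        return name.toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.forLanguageTag("tr-TR")); // simulate a Turkish JVM

        System.out.println(sameName("quit", "QUIT"));            // true
        System.out.println(normalize("quit"));                   // QUIT
        System.out.println("quit".toUpperCase().equals("QUIT")); // false, uses the Turkish default
    }
}
```

The first option is enough for a one-off comparison; the second is the one to reach for when the normalized string has to be stored or hashed.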

You can also use Character.toLowerCase() or Character.toUpperCase() to derive a locale-independent case-insensitive String value. This was the solution used in a recent (and still unreleased) fix for the same problem in the Commons Collections CaseInsensitiveMap.
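A minimal sketch of that character-by-character approach follows. It is not the actual Commons Collections code; the class and method names are made up, and uppercasing before lowercasing is an assumption borrowed from the way String.equalsIgnoreCase compares characters.

```java
// Hypothetical helper class.
final class CharCaseFolding {
    // Builds a case-neutral key without consulting any locale
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Character.toUpperCase/toLowerCase use fixed Unicode mappings
            sb.append(Character.toLowerCase(Character.toUpperCase(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("QUIT"));                     // quit
        System.out.println(fold("Quit").equals(fold("qUIT"))); // true
    }
}
```

Because the Character methods take no locale, the folded key is the same whatever the JVM’s default locale happens to be.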