October 24, 2005

g11n gotchas

a couple-three emails i got recently prompted me to think (again) about what globalization means to the average coldfusion developer. coincidentally, mark davis, IBM's front man for g11n and president of the Unicode Consortium, is putting together a presentation for the next Unicode conference dealing with "Globalization Gotchas". i highly recommend that cf developers doing i18n/g11n work review these; it's certainly worth the effort.

among my favorites that apply in one way or another to coldfusion (i've yakked about these in various articles/books/blog entries but good stuff usually bears repeating):
  • Unicode encodes characters, not glyphs: U+0067 » ggggggg (one code point, no matter which font draws the g)
  • Unicode does not encode characters by language: French, German, English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai).
  • Length in bytes may not be N * length in characters
  • Not all text is correctly tagged with its charset, so character detection may be necessary. But remember, it's always a guess.
  • Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(), \p{Alphabetic} in regex
  • Transliteration (Ελληνικά ↔ Ellēniká) is not the same as Translation (Ελληνικά ↔ Greek)--users of my transliteration CFC please take note
  • Unicode ≠ Globalization. Unicode provides the basis for software globalization, but there's more work to be done...
  • Don't simply concatenate strings to make messages: the order of components differs by language. Use Java MessageFormat or equivalent. (like the rbJava or javaRB CFCs)
  • Don't put any translatable strings into your code; make sure those are separated into a resource file.
  • Don't assume everyone can read the Latin alphabet. Don't assume icons and symbols mean the same around the world.
  • Tag all data explicitly. Trying to algorithmically determine character encoding and language isn't easy, and can never be exact.
  • Formatting and parsing of dates, times, numbers, currencies, ... are locale-dependent. Use globalization APIs that use appropriate data.
  • If you heuristically compute territory IDs, timezone IDs, currency IDs, etc., make sure the user can override that and pick an explicit value. (i.e. be automagical about locale choice, etc., but allow the user to manually pick what they want)
  • Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
  • Java globalization support is pretty outdated: use ICU to supplement it. (cf developers should use ICU4J)
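
a few of these gotchas (bytes vs. characters, properties vs. hard-coded lists, MessageFormat vs. concatenation) are easy to see in plain java, which is what cf sits on anyway. what follows is only a rough sketch, not anything from mark's slides: the class name G11nGotchas, its helper methods and the sample message patterns are all made up for illustration. also note that java.util.regex spells the property \p{IsAlphabetic} (java 7 and up), while ICU accepts \p{Alphabetic}.

```java
import java.nio.charset.StandardCharsets;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.regex.Pattern;

// hypothetical helper class, for illustration only
final class G11nGotchas {

    // byte length depends on the encoding: UTF-8 uses 1 to 4 bytes per code point
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // count code points, not UTF-16 chars, so supplementary characters count once
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // use the Unicode Alphabetic property instead of a hard-coded [a-zA-Z] list
    private static final Pattern ALPHABETIC = Pattern.compile("\\p{IsAlphabetic}+");

    public static boolean isAllAlphabetic(String s) {
        return ALPHABETIC.matcher(s).matches();
    }

    // let the pattern (normally loaded from a resource bundle) decide where
    // each argument goes, rather than concatenating pieces in a fixed order
    public static String describe(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.ROOT).format(args);
    }

    public static void main(String[] args) {
        String da = "\u5927"; // 大
        System.out.println(utf8Bytes(da) + " bytes, " + codePoints(da) + " code point");
        System.out.println(isAllAlphabetic(da));       // a CJK ideograph is Alphabetic
        System.out.println(isAllAlphabetic("abc123")); // digits are not

        // same arguments, different word order per (made-up) pattern
        System.out.println(describe("{0} replied to {1}", "ana", "li"));
        System.out.println(describe("{1} got a reply from {0}", "ana", "li"));
    }
}
```

the point of describe() is that the translator, not the code, controls where {0} and {1} land, so the patterns belong in a resource file rather than in the source. for anything past these basics (collation, timezone names, fuller property support) ICU4J is still the better bet.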
