October 31, 2005

icu4j stealth upgrade

no idea why i missed this (i guess it was never publicly announced) but icu4j was recently updated to version 3.4.1 (a maintenance release of 3.4). not a whole lot of changes; i guess the most significant is a fix to the new CharsetDetector class.

anyway, grab it to keep in lock step w/IBM's ICU project.

October 25, 2005

language matters?

you bet it does. just ask the 20 poor slobs who had to cough up 100 new turkish lira (about $76US) each for using the letters "Q" and "W" in kurdish language placards in turkey. it seems that these letters aren't in the turkish alphabet and there's a 1928 law ("Law on the Adoption and Application of Turkish Letters") that requires all signs and whatnot to use only turkish letters. in case you don't already know, turkey moved from an arabic to a "modified" latin script in the 1920s, a pretty gutsy thing to do. i guess they needed tough laws to push this kind of reform through.

this is all news to me... i wonder how they advertise windows there? i know we have a good group of turkish coldfusion users, care to shed some light on this, guys?

from CNN.

October 24, 2005

g11n gotchas

a couple-three emails i got recently prompted me to think (again) about what globalization means to the average coldfusion developer. coincidentally mark davis, IBM's front man for g11n and president of the Unicode Consortium, is putting together a presentation for the next Unicode conference dealing with "Globalization Gotchas". i highly recommend that cf developers doing i18n/g11n work review these; it's certainly worth the effort.

among my favorites that apply in one way or another to coldfusion (i've yakked about these in various articles/books/blog entries but good stuff usually bears repeating):
  • Unicode encodes characters, not glyphs: U+0067 » ggggggg
  • Unicode does not encode characters by language: French, German, English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai).
  • Length in bytes may not be N * length in characters
  • Not all text is correctly tagged with its charset, so character detection may be necessary. But remember, it's always a guess.
  • Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(), \p{Alphabetic} in regex
  • Transliteration (Ελληνικά ↔ Ellēniká) is not the same as Translation (Ελληνικά ↔ Greek)--users of my transliteration CFC please take note
  • Unicode ≠ Globalization. Unicode provides the basis for software globalization, but there's more work to be done...
  • Don't simply concatenate strings to make messages: the order of components differs by language. Use Java MessageFormat or equivalent. (like the rbJava or javaRv CFCs)
  • Don't put any translatable strings into your code; make sure those are separated into a resource file.
  • Don't assume everyone can read the Latin alphabet. Don't assume icons and symbols mean the same around the world.
  • Tag all data explicitly. Trying to algorithmically determine character encoding and language isn't easy, and can never be exact.
  • Formatting and parsing of dates, times, numbers, currencies, ... are locale-dependent. Use globalization APIs that use appropriate data.
  • If you heuristically compute territory IDs, timezone IDs, currency IDs, etc., make sure the user can override that and pick an explicit value. (i.e. be automagical about locale choice, etc., but allow the user to manually pick what they want)
  • Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
  • Java globalization support is pretty outdated: use ICU to supplement it. (cf developers should use ICU4J)
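
to make a few of these concrete, here's a rough sketch in plain java (no ICU4J needed) covering three of the gotchas above: byte length vs. character length, using the Alphabetic property instead of a hard-coded letter list, and MessageFormat instead of string concatenation. the class name, method names and message patterns are all made up for illustration, and it assumes a modern JDK (the `\p{IsAlphabetic}` regex property and StandardCharsets need java 7 or later), so treat it as a starting point rather than the one true way.

```java
import java.nio.charset.StandardCharsets;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.regex.Pattern;

// hypothetical demo class; none of these names come from the original post
class G11nGotchas {

    // gotcha: length in bytes may not be N * length in characters.
    // returns the number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // gotcha: use properties such as Alphabetic, not hard-coded lists.
    // java spells the Unicode binary property as \p{IsAlphabetic}.
    static boolean allAlphabetic(String s) {
        return Pattern.matches("\\p{IsAlphabetic}+", s);
    }

    // gotcha: don't concatenate strings to build messages.
    // the pattern would normally come from a per-locale resource file,
    // so translators can reorder {0} and {1} as their language requires.
    static String message(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        String da = "\u5927"; // the han character 大
        System.out.println(da.length() + " char, " + utf8Length(da) + " bytes in UTF-8");

        String greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"; // Ελληνικά
        System.out.println(allAlphabetic(greek)); // letters outside A-Z still count
        System.out.println(allAlphabetic("q1"));  // the digit is not alphabetic

        // same arguments, different component order per pattern
        System.out.println(message("{0} has {1} new messages", Locale.ENGLISH, "Ana", 3));
        System.out.println(message("{1} new messages for {0}", Locale.ENGLISH, "Ana", 3));
    }
}
```

the point of the last two calls is that the code never changes when the word order does; only the pattern (which belongs in a resource file) changes. for the charset detection, timezone and CLDR items, the core JDK won't get you far and ICU4J is the place to look.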