May 31, 2004

i18n good practices: resource bundles

one of the dreariest bits of i18n work is dealing with strings, especially for retro-fitting existing apps. you'll have to comb thru the existing code substituting resource bundle (rb) keys for existing strings. while regex filters, etc. help, nothing beats a pair of "mark IV eyeballs". in order to keep this task within the bounds of tolerable cruelty, there are a few simple things you might keep in mind when developing cf applications:
  • case: not every language has case, Thai for instance doesn't, so PERMISSIONS, Permissions and permissions would be represented by the same string. in languages that do have case, those kinds of case permutations are plainly cosmetic (i was going to say cosmetic nonsense but thought better). if there's a real application need for this sort of thing, say to accent some heading, it should be handled via CSS and not hardcoded. hardcoded case strings make the difficult i18n process even more so. think twice before you get carried away with case, especially if you find yourself writing complex <cfif> blocks to handle it.
  • pluralization: not every language deals with plurals the same way English does. simply adding a letter ("s" for instance) hardly ever cuts it and in some instances the language structure is completely different (the English phrase "five wood blocks" becomes something like "block of wood five units" in Thai). while you can blow off quite a few CPU cycles with complicated logic to handle plurals, i contend that item(s) is just as understandable as <cfif someQ.recordCount GT 1>items<cfelse>item</cfif> and has the added benefit of i18n simplicity. otherwise you'll have to add another set of rb keys (plural forms vs singular forms) and logic to handle pluralization.
  • compound strings: compound strings are, besides being my pet peeve, strings that contain substituted values. for example, "You owe me #dollarFormat(amountDue)#. Please pay by #dateFormat(normalDueDate)# or I will be forced to shoot you with #numberFormat(budgetQ.bulletsPerDeadbeat)# bullets. Thank you." if you do much i18n research you'll often see folks recommending you avoid compound strings like the plague (for instance, the API for the MessageFormat java class comes right out and says this). why? because they're hard to handle. first you have to figure out the logic and in some cases it's not going to be trivial. then you have to rework the rb string to use placeholders for localization ("You owe me {1}. Please pay by {2} or I will be forced to shoot you with {3} bullets. Thank you."). finally you have to substitute the intended values at runtime--newer versions of my javaRB and RBjava CFC have methods for this. it's often much easier to simply rewrite the compound string.
  • floating prepositions: these are perhaps a form of compound string but often can't be handled like them. i sometimes encounter extremely complicated output logic/displays or HTML form elements separated by a preposition (usually "at", "by" or "in"). in its simplest form it might be "dateValue at timeValue" (which actually can be handled as a compound string) but more often than not it's much more complicated. if i can get my way, we normally send floating prepositions to the garbage dump, i mean most folks would have no problem understanding "dateValue timeValue".
i suppose many folks might find this trivial but it adds time and complexity to an already time-consuming and complicated process.
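
to make the compound string bullet concrete, here's a minimal java sketch of the runtime substitution step using the JDK's java.text.MessageFormat. note that MessageFormat itself numbers its placeholders from {0}. the class name, the fill() helper and the sample values are made up for illustration (this is not the code inside the javaRB or RBjava CFCs), and the pattern is hardcoded here where a real app would read it from a resource bundle.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class CompoundStringDemo {
    // hypothetical rb value; a real app would load this from a resource bundle
    static final String PATTERN =
        "You owe me {0}. Please pay by {1} or I will be forced to shoot you with {2} bullets. Thank you.";

    // substitutes the arguments into the pattern's numbered placeholders
    static String fill(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        // values are passed in already formatted, so the output does not depend on locale data
        System.out.println(fill(PATTERN, Locale.US, "$25.00", "June 15, 2004", "3"));
    }
}
```

because the placeholders are numbered, a translator can reorder them in the localized pattern (say "{1}" before "{0}") without anyone touching the calling code, which is about the only nice thing i can say about compound strings.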

May 26, 2004

icu4j beta/collation

ibm has released another beta version of its supercool icu4j. these betas are also released as an executable JAR (i only noticed this with the first beta for 3.0), so you can jump right into testing. while i was perusing the icu4j site i stumbled across this interesting page: collation performance comparison. wow! icu4j beats the snot out of the plain java JDK for collation over most locales (except for ja_JP and ko_KR locales, note that locales <> collation). i know that collation is of some interest to many i18n folks, so this is kind of interesting news.
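
for anybody wondering what collation actually buys you, here's a small sketch using the plain JDK java.text.Collator (the slower side of that comparison). the class and method names are my own for illustration. icu4j ships its own Collator class in com.ibm.icu.text, but i've stuck to the JDK one here so the example runs without any extra jars.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    // sorts a copy of the input using the JDK collation rules for the given locale
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Collator collator = Collator.getInstance(locale);
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};

        String[] binary = words.clone();
        Arrays.sort(binary); // plain code point order: uppercase sorts before lowercase
        System.out.println(Arrays.toString(binary));

        // collated order: case is only a minor difference, so "apple" comes first
        System.out.println(Arrays.toString(sortForLocale(words, Locale.ENGLISH)));
    }
}
```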

May 20, 2004

my tools too

a few days ago, sean c. blogged about the tools he was using, which finally prompted me to blog this "me too tools". the g11n world is slightly different in that a "tool" is more often than not a place to find information than a chunk of software. with that in mind here's my tool list too:
  • icu4j: i literally couldn't do g11n work without this java library. while much of its pioneering i18n functionality has been absorbed into the java core, it still offers hard/impossible-to-duplicate functionality like non-gregorian calendars, holidays & super-sized collations. it is the bee's knees of i18n s/w. and of course, it's free.
  • unicode: after watching folks' codepage encoding antics in the user forums, what can i say, just use unicode ©.
  • Common Locale Data Repository: while still in beta, the CLDR is going to be the locale reference. it was thought to be so important that its maintenance was handed off to the unicode organization by the openi18n org. need to know the currency used in Thailand? short weekday names used in Turkish? writing system direction in Afghanistan? this repository is the place to look first. all the info is contained in an XML file per locale (not that i enjoy parsing XML files but i can put up with that chore for the goldmine of locale info it provides).
  • rbManager: if you do g11n work, you build resource bundles (well you should be doing this anyway). if you build resource bundles (rb), then you need a tool. i've looked at and played around with a bunch of rb tools & still haven't found anything as easy to use or as sophisticated as rbManager. the price (free) is pretty good too. i18nEdit gets an honorable mention for its nifty unicode char picker for those days when you're too lazy to load another locale.
  • SC UniPad: need a unicode text editor that can handle inuktitut and braille at the same time? look no further than the plenty fine SC UniPad. i get a kick out of just using it. also a nice tool to double check rb files.
  • unifier: if you have to batch convert text/html docs from codepage encodings to unicode (and who doesn't) this will probably be the best 15 bucks you'll ever spend.
  • javaInetLocator: i built my geoLocator CFC around nigel wetter's javainetlocator class. if you need to know the country and locale of a user (well their IP anyway), this is probably the best non-commercial tool around (and i can say it's probably better than many commercial ones i've looked at). it's fast (i have another geoLocator tool built around db-based IP range queries and nigel's class beats the pants & socks off of it) and free.
  • iText: i've used this java library quite a bit to burn PDFs. it offers really fine control that we often need (municipal tax receipts for instance) & is a piece of cake to use.
  • cfstudio 5: what can i say, i'm old and in the way. while my colleagues laugh that i still use this "antique", i keep reminding them that muscle memory means more and more as you get older (i've literally pounded the alt f & s keys off of several keyboards over the years while i still have the same industrial-strength ms mouse for almost 10 years). and nope, no reference as i couldn't for the life of me tell you where to buy this these days. that said, i'm trying to give cfEclipse a fair trial (it would help a whole bunch though if it had better docs, hint hint spike).
  • java i18n forums: while i don't spend much time there these days, these forums are still a valuable i18n info source. if you do serious i18n work with cf, you know you have to dip down into java quite a bit and if you get stumped as much as i did, these forums are often a life saver. another good java library/info site is of course IBM's developerWorks. just a for instance, i wanted to learn how to do i18n string searches & "Efficient text searching in java" turns up (yes that article is a bit dated).
  • books-on-line (BoL): i do a lot of work with ms sql server (frankly i prefer it) and the BoL has come to be my constant companion (my cat neutron uses the pile of sql books i've bought over the years as a spot to cat nap--speaking of cats i still get a great kick out of the my cat hates you site). you really can't do better than this for an ms sql server reference.

May 12, 2004

oh how time flies...

IBM's mark davis has a proposal about "handling different binary formats of datetime". this is something i'd never given any thought to but one glance at table 1 (reproduced below) in the proposal makes me wonder why this hasn't come up before.

Table 1: Binary Time Scales

Source              Datatype   Unit                      Epoch
JAVA_TIME           int64      milliseconds              Jan 1, 1970
UNIX_TIME           int32      seconds                   Jan 1, 1970
ICU4C               double64   milliseconds              Jan 1, 1970
WINDOWS_FILE_TIME   int64      ticks (100 nanoseconds)   Jan 1, 1601
WINDOWS_DATE_TIME   int64      ticks (100 nanoseconds)   Jan 1, 0001
MAC_OLD_TIME        int32      seconds                   Jan 1, 1904
MAC_TIME            ?          seconds                   Jan 1, 2001
EXCEL_TIME          ?          days                      Dec 31, 1899
DB2_TIME            ?          days                      Dec 31, 1899


java and Unix while having the same epoch (origin) differ in datatype and units so they differ in accuracy and range. windows' time scales differ internally for OS vs file system (no snickering). at the current state of this proposal, he's chosen to use Windows datetime as a "universal 'pivot'". that gives a time scale range from 29,000 BC to 29,000 AD. i guess IBM really does take the long term view ;-)
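
to see what moving between these scales involves, here's a rough java sketch that converts JAVA_TIME to UNIX_TIME and to WINDOWS_FILE_TIME and back. the class and method names are mine, not anything from the proposal or from ICU. the only real ingredient is the gap between the two epochs: Jan 1, 1601 to Jan 1, 1970 is 134,774 days, or 11,644,473,600 seconds.

```java
public class TimeScales {
    // seconds between Jan 1, 1601 and Jan 1, 1970 (134,774 days * 86,400 seconds)
    static final long FILE_TIME_EPOCH_OFFSET_SECONDS = 11_644_473_600L;
    // one millisecond is 10,000 ticks of 100 nanoseconds
    static final long TICKS_PER_MILLISECOND = 10_000L;

    // JAVA_TIME (ms since 1970) to UNIX_TIME (s since 1970); sub-second precision is lost.
    // returned as a long; squeezing it into an int32 is what limits unix time's range.
    static long javaToUnix(long javaMillis) {
        return Math.floorDiv(javaMillis, 1000L);
    }

    // JAVA_TIME (ms since 1970) to WINDOWS_FILE_TIME (100ns ticks since 1601)
    static long javaToWindowsFileTime(long javaMillis) {
        return (javaMillis + FILE_TIME_EPOCH_OFFSET_SECONDS * 1000L) * TICKS_PER_MILLISECOND;
    }

    // WINDOWS_FILE_TIME back to JAVA_TIME; anything finer than a millisecond is dropped
    static long windowsFileTimeToJava(long ticks) {
        return Math.floorDiv(ticks, TICKS_PER_MILLISECOND) - FILE_TIME_EPOCH_OFFSET_SECONDS * 1000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("java:    " + now);
        System.out.println("unix:    " + javaToUnix(now));
        System.out.println("windows: " + javaToWindowsFileTime(now));
    }
}
```

the same shift-then-scale arithmetic covers the other rows of the table, you just swap in a different epoch offset and unit.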

if you want to provide feedback i guess you'll have to join the ICU mailing list.

so now you know.

May 11, 2004

three new papers on HTML/XHTML i18n

the GEO task force has published three "First Working Drafts" dealing with characters, encodings and the ever happy-go-lucky BIDI ;-)
http://www.w3.org/TR/i18n-html-tech-char/
http://www.w3.org/TR/i18n-html-tech-lang/
http://www.w3.org/TR/i18n-html-tech-bidi/
pretty good reading.

May 09, 2004

blogspot? bah humbug.

after another week of blogspot's problems (screwed up RSS feeds, appending "/" to file names, nobody i care about supporting atom feeds, etc.), i've finally bit the bullet and fully ported ray camden's cf blog to my website (and made it a bit more i18n in the process). now at least if there are any issues i know who to blame ;-) you can find the blog here. i'll be posting to that blog first & keep the blogspot one here as backup until i get bored with doing that.

May 05, 2004

bow wow

besides endlessly arguing about such things as "Arid Canaanite Wasteland" or "palaeo-Hebrew" the unicode folks also hand out awards for "outstanding personal contributions to the philosophy and dissemination of the Unicode Standard". they call that one the bulldog award. that reference is actually to thomas huxley's comment in the 1870's: "You know I have to take care of him [Darwin] -- in fact, I have always been Darwin's bull dog." well this year's award winner is none other than tex texin the debonair i18nguy about town. congratulations.

the MAT cometh...

microsoft, it seems, is lending i18n app developers a helping hand, at least for apps on xp and 2003 OS's. ms has just announced a beta for MAT. what's MAT? to quote the public site "Microsoft Application Translator (MAT) provides on-the-fly translation of applications' User Interface (UI) from one language to another. Using MAT, you can run applications in your preferred language". in simple terms this means if you develop a desktop app in thai, you could translate it to arabic with MAT. at first glance it looks like it just does text localization which, while not the only part of i18n work, is the dreariest. MAT also really won't help apps that aren't at least somewhat i18n (at least according to the public FAQ). from the public site, i'm not really sure if it does web apps. not sure what smaller localization shops will make of this. it might lose them their marginal/low end business. is nothing safe ;-) now if we could just get mm to provide native resourceBundle functionality....