last call: Character Model for the World Wide Web 1.0: Fundamentals
the W3C has released a "last call" on the working draft "Character Model for the World Wide Web 1.0: Fundamentals", next stop for this doc is "Recommendation". this "Architectural Specification" provides a "common reference for interoperable text manipulation on the World Wide Web" and of course recommends the use of Unicode. while it is a "work in progress" it's chock full of good information regarding chars, char encodings, collation, etc. worthwhile reading.
the W3C has just issued a new i18n faq on language negotiation. it discusses the near-absolute need for language negotiation on good i18n web sites, examining the old standby HTTP Accept-Language header (i use that in combination with the geoLocator CFC) as well as stressing the need for manual language swapping (couldn't agree more). another important but sometimes overlooked point is "navigation stickiness": remembering which language a user has selected (in cf via cookies or session vars) & always serving content in that language. another interesting point (to me anyway) was a trick to also look at the User-Agent header, which sometimes contains language info too (besides all that boring browser version stuff). cool. i'm going to look at adding that to the geoLocator CFC for when Accept-Language is empty.
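the header-parsing half of that negotiation is easy to sketch in java (the same machinery cfmx sits on). a minimal sketch, assuming java 8+ for Locale.LanguageRange; the class name and sample header are mine, not from the faq:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AcceptLanguageDemo {
    // parse an Accept-Language header into locales ordered by q-value (highest first)
    static List<Locale> preferredLocales(String header) {
        List<Locale> out = new ArrayList<>();
        // LanguageRange.parse() handles the q-weights and sorts descending
        for (Locale.LanguageRange r : Locale.LanguageRange.parse(header)) {
            out.add(Locale.forLanguageTag(r.getRange()));
        }
        return out;
    }

    public static void main(String[] args) {
        // a browser configured for swiss german, falling back to german then english
        System.out.println(preferredLocales("de-CH,de;q=0.9,en;q=0.5"));
        // prints [de_CH, de, en]
    }
}
```

if the returned list is empty (no Accept-Language at all), that's where the User-Agent trick would kick in as a fallback.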
so now you know.
i've been helping out a friend with a quick and dirty currency app. the bank supplying him with currency info jumped up and down on our toes by supplying currency symbols in codepage encodings (a boatload of them) rather than unicode--they were geared towards one feed per locale and made the silly codepage encoding choice based on that. this turned a reasonably simple app into a medium-sized encoding-management monster. this datafeed (i guess, since i don't have a lot of experience with these) also dropped the ball on us by not supplying more info about each currency. while there is such a thing as one-half (0.50) of a dollar, there is no such thing as one-half of a yen. when and where do we round? oh boy.
if you read this blog with any regularity, you know what's coming ;-) another dip in the java pool under cfmx. we built a quick and dirty (but hey it works) CFC that makes use of the locale currency info contained in java.util.Currency class. you can see it in action here.
i'd appreciate any feedback. note that this shouldn't be used to replace the currency formatting/parsing functions in the i18nFunction CFC; this CFC isolates the currency info for easier, specific access.
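the heart of the java side is java.util.Currency, which knows each ISO currency's default fraction digits--exactly the "where do we round?" info the feed left out. a minimal sketch (class name and amounts are mine, not from the CFC):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Currency;

public class CurrencyInfoDemo {
    public static void main(String[] args) {
        Currency usd = Currency.getInstance("USD");
        Currency jpy = Currency.getInstance("JPY");

        // getDefaultFractionDigits() answers "when and where do we round?"
        System.out.println(usd.getDefaultFractionDigits()); // 2 -- cents exist
        System.out.println(jpy.getDefaultFractionDigits()); // 0 -- no half-yen

        // round an amount to the currency's fraction digits
        BigDecimal amt = new BigDecimal("1234.567");
        BigDecimal rounded = amt.setScale(jpy.getDefaultFractionDigits(), RoundingMode.HALF_UP);
        System.out.println(rounded); // 1235
    }
}
```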
hiroshi okugawa (mm) and i were working on an issue last week in the forums where a user was having trouble sorting a list to german phonebook collation (string sort order), sometimes called DIN-2. the problem was that the listSort and arraySort cf functions sort based on straight-up unicode codepoint values. while this will work for most folks--after all, 'a' < 'b' is true in both lexicographical (dictionary) and unicode order--it won't work for folks with characters like the german umlauts ÄÖÜ, which have higher unicode values than the unadorned chars AOU, ie. ÄÖÜ will always sort as a group after AOU rather than the interleaved AÄOÖUÜ order which folks in that locale would expect. since i mainly use sql server as my db backend, which has a very nifty COLLATE clause that lets you cast your resultset to a specific collation, this came as a bit of a surprise to me.
the solution to this, as usual for i18n issues in cf, is to dip down into the underlying java functionality, specifically the java.text.Collator class which lets you "perform locale-sensitive String comparison". we developed a CFC, i18nSort, to wrap up this functionality. we also added a sort method based on IBM's ICU4J com.ibm.icu.text.Collator class. why? because ICU4J provides a much beefier set of collation locales (246 vs the java class's 134), including afrikaans, german phonebook, various european locales pre-euro (useful for historical data), persian (both iran and afghanistan), traditional thai, etc.
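the core of what the CFC wraps is small. a minimal sketch of the plain-java side, assuming a german locale (class name and word list are mine):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorSortDemo {
    public static void main(String[] args) {
        String[] words = { "Zebra", "Äpfel", "Apfel", "Banane" };

        // a straight codepoint sort would put Äpfel after Zebra (Ä is U+00C4 > Z);
        // a german Collator interleaves it right next to Apfel instead
        Arrays.sort(words, Collator.getInstance(Locale.GERMAN));

        System.out.println(Arrays.toString(words));
        // prints [Apfel, Äpfel, Banane, Zebra]
    }
}
```

the ICU4J version is nearly identical--swap in com.ibm.icu.text.Collator and you get the bigger locale set, including the phonebook variant.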
collation is a strange beast. it's pretty much a universal user requirement but is not consistent for the same chars (germans, french and swedes sort the same chars differently) nor within the same language (so-called phonebook collation vs dictionaries or book indices). and that's just the alphabet-based scripts--asian ideograph collation can be either phonetic or based on the appearance (strokes) of the character. then there are the special cases based on user preferences: ignore/consider punctuation, case ('A' before/after 'a'), etc. you're looking at thousands of years of people's collation baggage, so yes it's going to be complex. you can read more about unicode's take on collation here.
in java (both "plain" java and ICU4J) collation complexity is handled using three parameters: locale, strength, and decomposition. locale is obvious: a specific locale's collation data is used to order sorts (and searches). strength is used across locales (though exact strength assignments vary from locale to locale) and determines the level of difference considered significant in a comparison. there are four basic strengths (ICU4J adds a fifth, QUATERNARY, which distinguishes words with/without punctuation):
- PRIMARY: significant for base letter differences, e.g. 'a' vs 'b'.
- SECONDARY: significant for different accented forms of the same base letter ('o' vs 'ô').
- TERTIARY: significant for case differences such as 'a' vs 'A' (but again differs locale to locale).
- IDENTICAL: all differences are considered significant during comparison (control chars, pre-composed and combining accents, etc.).
taking an example from the java docs: in czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences, and "e" and "e" are identical.
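you can watch the strengths kick in one at a time with a plain-java Collator. a minimal sketch using a french locale (class name and test words are mine):

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);

        // PRIMARY: only base-letter differences count; accents & case ignored
        c.setStrength(Collator.PRIMARY);
        System.out.println(c.compare("côte", "cote") == 0); // true

        // SECONDARY: accents now count, case still ignored
        c.setStrength(Collator.SECONDARY);
        System.out.println(c.compare("côte", "cote") == 0); // false
        System.out.println(c.compare("Cote", "cote") == 0); // true

        // TERTIARY (the default): case differences count too
        c.setStrength(Collator.TERTIARY);
        System.out.println(c.compare("Cote", "cote") == 0); // false
    }
}
```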
decomposition is just that: chars are decomposed for comparison. there are three basic decompositions (only two in ICU4J):
- NO_DECOMPOSITION: chars are not decomposed, so precomposed accented chars and their combining-mark equivalents won't compare as equal. this is the fastest collation but will only work correctly for languages without accented, etc. chars.
- CANONICAL_DECOMPOSITION: chars that are canonical variants are decomposed for collation, ie. accents are handled.
- FULL_DECOMPOSITION: not only accented chars, but also chars that have special formats are decomposed (this decomposition doesn't exist in ICU4J, CANONICAL_DECOMPOSITION is used instead). basically un-normalized text is properly handled.
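the decomposition setting is what makes the two unicode spellings of the same accented char collate as equal. a minimal sketch (class name is mine; the strings are the precomposed and decomposed forms of "é"):

```java
import java.text.Collator;
import java.util.Locale;

public class DecompositionDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.GERMAN);

        // "é" can arrive precomposed (U+00E9) or as 'e' + combining acute (U+0301)
        String composed = "\u00e9";
        String decomposed = "e\u0301";

        // with CANONICAL_DECOMPOSITION both forms collate as equal
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println(c.compare(composed, decomposed) == 0); // true
    }
}
```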
so now you know.
i'm in bangkok (thailand) watching the "international" broadcast of superbowl XXXVIII (38) via a live feed from a sports network that starts with an "E" and ends with an "N", which this year is sort of unusual (as far as i can recall). the two american announcers are explaining everything, and i mean everything: what a punt is, how to get a first down, what zone defense is, what "play action" is, why the players wear helmets and pads (yes, really), etc. i suppose if this were the very first football or superbowl game being broadcast internationally that might be appropriate, but since my neighbors & i got up at 4:00am to watch this game, maybe we know a thing or two already? i'll guess this is the situation in many places around the world.
they are also converting measurements into the SI (metric) system; one of my Thai neighbors laughingly asked me "when was the last time you heard an NFL linebacker referred to in kilograms and meters?" these guys are also peppering their announcing with references to that other football (soccer to us Americans) and even referring to this as "American" football. the local (Thai language) announcers are ignoring all that goop and announcing the game knowing their audience. there's a lesson here i guess.
one of the interesting things about watching sports "overseas" is that many of the NFL games we get here are raw live feeds. these are really raw, stripped-down broadcasts without the special features (sideline interviews, half-time reports, etc.) you'd get from normal network broadcasts. the plus side to this is that we get to see the producer/director shots & hear live mics when they break for commercials (there are no ads permitted on our local cable TV) and during half-time. we'll see the cameras zooming in on hotties in the stands, preview in-game presentations (the replays, analysis, highlights, etc.) and hear what the announcers really think of the game, officiating, etc. (which can sometimes be exactly opposite of what they say when they're "officially live"), and every once in a while hear some announcer going berserk (once heard one former QB announcer doing an expletive-laden tirade at somebody over the phone). now that's good TV ;-)