hiroshi okugawa (mm) and i were working on an issue in the forums last week where a user was having trouble sorting a list into german phonebook (sometimes called DIN-2) collation, ie. string sort order. the problem is that the listSort and arraySort cf functions sort on raw unicode codepoint values. while this works for most folks--after all, 'a' < 'b' is true in both lexicographical (dictionary) and unicode order--it breaks for locales with characters like the german umlauts ÄËÜ, which have higher unicode values than the unadorned AEU. that means ÄËÜ will always sort as a group after AEU rather than in the AÄEËUÜ order folks in that locale would expect. since i mainly use sql server as my db backend, which has a very nifty COLLATE clause that lets you cast your resultset to a specific collation, this came as a bit of a surprise to me.
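to see the problem in plain java (which is what cf's sort functions boil down to), here's a minimal sketch--the class name is just for illustration:

```java
import java.util.Arrays;

public class CodepointSortDemo {
    public static void main(String[] args) {
        // listSort/arraySort compare raw unicode codepoints, the same
        // ordering String.compareTo uses in java
        String[] names = {"Ä", "A", "E", "Ë", "U", "Ü"};
        Arrays.sort(names); // natural (codepoint) order

        // the umlauted chars (Ä=U+00C4, Ë=U+00CB, Ü=U+00DC) land after
        // all the plain A-Z letters instead of next to their base letters
        System.out.println(Arrays.toString(names));
        // → [A, E, U, Ä, Ë, Ü]
    }
}
```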
the solution, as usual for i18n issues in cf, is to dip down into the underlying java functionality, specifically the java.text.Collator class, which lets you "perform locale-sensitive String comparison". we developed a CFC, i18nSort, to wrap up this functionality. we also added a sort method based on IBM's ICU4J com.ibm.icu.text.Collator class. why? because ICU4J provides a much beefier set of collation locales (246 vs the java class's 134), including afrikaans, german phonebook, various european locales pre-euro (which would be useful for historical data), persian (both iran and afghanistan), traditional thai, etc.
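here's a sketch of the java side of what an i18nSort-style wrapper ends up doing (the class name is illustrative; the decomposition setting is explained further down):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorSortDemo {
    public static void main(String[] args) {
        String[] names = {"Ä", "A", "E", "Ë", "U", "Ü"};

        // a locale-sensitive comparator; Collator implements
        // java.util.Comparator, so it plugs straight into Arrays.sort
        Collator german = Collator.getInstance(Locale.GERMAN);
        // make sure precomposed accented chars are decomposed for comparison
        german.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        Arrays.sort(names, german);
        // umlauts now sort right after their base letters
        System.out.println(Arrays.toString(names));
        // → [A, Ä, E, Ë, U, Ü]
    }
}
```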
collation is a strange beast. it's pretty much a universal user requirement but is not consistent for the same chars (germans, french and swedes sort the same chars differently) nor within the same language (so-called phonebook collation vs dictionaries or book indices). and that's just the alphabet-based scripts--asian ideograph collation can be either phonetic or based on the appearance (strokes) of the character. then there's the special cases based on user preferences: ignore/consider punctuation, case ('A' before/after 'a'), etc. you're looking at thousands of years of people's collation baggage, so yes it's going to be complex. you can read more about unicode's take on collation here.
in java (both "plain" java and ICU4J) collation complexity is handled using three parameters: locale, strength, and decomposition. locale is the obvious one: a specific locale's collation data is used to order sorts (and searches). strength is used across locales (though exact strength assignments vary from locale to locale) and determines the level of difference considered significant in comparisons. there are four basic strengths (ICU4J adds a fifth, QUATERNARY, which distinguishes words with/without punctuation):
- PRIMARY: significant for base letter differences 'a' vs 'b'.
- SECONDARY: significant for different accented forms of the same base letter ('o' vs 'ô').
- TERTIARY: significant for case differences such as 'a' vs 'A' (but again, this differs from locale to locale).
- IDENTICAL: all differences are considered significant during comparison (control chars, pre-composed and combining accents, etc.).
taking an example from the java docs: in czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical.
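the strength levels are easy to see in code. a sketch using the US english collator (decomposition is set explicitly so the precomposed 'é' is handled--more on that below):

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.US);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        // PRIMARY: only base-letter differences count
        c.setStrength(Collator.PRIMARY);
        System.out.println(c.compare("a", "b") != 0); // base letters differ
        System.out.println(c.compare("e", "é") == 0); // accent ignored
        System.out.println(c.compare("e", "E") == 0); // case ignored

        // SECONDARY: accents now matter, case still doesn't
        c.setStrength(Collator.SECONDARY);
        System.out.println(c.compare("e", "é") != 0);
        System.out.println(c.compare("e", "E") == 0);

        // TERTIARY (the java default): case matters too
        c.setStrength(Collator.TERTIARY);
        System.out.println(c.compare("e", "E") != 0);
    }
}
```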
decomposition is just that, chars are decomposed for comparison. there are three basic decompositions (only two for ICU4J):
- NO_DECOMPOSITION: chars are not decomposed; accented chars are compared as-is. this is the fastest collation but will only work correctly for languages without accented (or otherwise adorned) chars.
- CANONICAL_DECOMPOSITION: chars that are canonical variants are decomposed for collation, ie. accents are handled.
- FULL_DECOMPOSITION: not only accented chars but also chars that have special formats (compatibility variants such as full-width/half-width forms) are decomposed (this decomposition doesn't exist in ICU4J; CANONICAL_DECOMPOSITION is used instead). basically, un-normalized text is properly handled.
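why decomposition matters: unicode lets you write 'é' either as one precomposed char or as 'e' plus a combining accent. they look identical to a reader but are different codepoint sequences, and only a decomposing collator treats them as equal. a sketch:

```java
import java.text.Collator;
import java.util.Locale;

public class DecompositionDemo {
    public static void main(String[] args) {
        // "é" two ways: precomposed (U+00E9) vs "e" + combining acute (U+0301)
        String precomposed = "\u00e9";
        String combining = "e\u0301";

        Collator c = Collator.getInstance(Locale.FRENCH);

        // CANONICAL_DECOMPOSITION normalizes canonical variants before
        // comparison, so the two spellings collate as equal.
        // (FULL_DECOMPOSITION would additionally fold compatibility
        // variants like half-width katakana.)
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println(c.compare(precomposed, combining) == 0); // true
    }
}
```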
so now you know.