June 29, 2003

msn messenger transcripts

not really G11N stuff and maybe mildly interesting. i recently downloaded and installed the preview for msn messenger 6.0. some of the new stuff is sort of interesting (webcam, though why on earth would you want to play charades with one is beyond my brain's simple imagination), some of it a little too cute for one of my advanced years (photo display, emoticons) but ms seems to have slipped in an interesting upgrade in the way chat transcripts are stored. chat transcripts (logs) are now XML (i'm really not sure if this was always the case with previous versions but i just in fact noticed). which i think makes them a whole boatload more useful.

June 28, 2003

shoe sizes

shoe sizes? yes this has something to do with globalizing cf. so your cf application sells shoes on the internet--and from a quickie google search for ".cfm and shoe" which turned up 136,000++ pages, i'd guess a whole lot of folks do (and when customers are paying 500 bucks for a pair of "mountain" shoes, it seems that it's fairly profitable). did you ever stop to think about who's buying these products? from what i've seen in a rather feeble sampling i did, i'd say "no, not much". well you see shoe sizes are not the same everywhere in the world and are in fact wildly different: just for instance, a size 10.5 (men's) in the US is a size 10 in the UK and Australia, a size 9 in Mexico, a size 28.5 in Japan and a size 44 in Thailand. granted, walking in off the street this wouldn't be such a big deal but hey, these sites were doing business on the internet. none of the 2 dozen or so sites i sampled offered any way to convert their products' sizes to other systems, in fact none of these actually even mentioned what shoe size system they were using, though i guess you could infer that from where the shops were physically located. i get giddy thinking about all the international product returns this probably generates or the look on people's faces when they get microscopic shoes delivered to them ;-) as tex texin says, companies are losing money on the internet over things like this, "this" being not properly globalizing your application. anyway something to think about the next time you go shoe shopping.

June 26, 2003

boy howdy! locale data markup language specification released

hot off the locales mailing list (courtesy of tex texin), the Free Standards Group Open Internationalization Initiative (OpenI18N) announced the release of the locale data markup language specification LDML, Version 1.0. the full announcement can be seen here. this is sort of a big deal, as there is no real standard for a locale data repository.

blog language stats: verified

i had a nice discussion with Maciej Ceglowski concerning the blog language stats i was on about (22-jun post), turns out they have had students manually verifying the language guesser's stats and came back with a 95% correct score. so the blogging world does indeed appear to be dominated by english, portuguese, polish, and farsi. wow. since geography is something i eat, sleep and drink, i downloaded and somewhat cleaned up the blog geography data from NITLE Blog Census and put together a static map of blog geography. there are also instructions on that page on how to find your blog's location and how to add an ICBM meta tag (yes that's right ICBM as in missile tag, geography is so cool ;-) so you too can be "found".

June 25, 2003

jre version craziness

i've just now officially lost track of the number of G11N "quirks" related to jre versions and java implementations (ie OS). it seems many encoding, timezones, etc. "quirks" are related to jre version and/or OS. being a novice with java, it's one of the last things i look at, hey java's monolithic and portable right? for instance (just because it's fresh) one quick way to enable a particular encoding server-wide is supposed to be via jvm arguments "-Dfile.encoding=xxxx" where xxxx is the encoding you want to use. officially (ie via sun bug parade) this isn't supposed to work, as file.encoding is supposed to be read/only. that said, it's an often suggested workaround on the java i18n forums for this sort of stuff (or where a library doesn't support encoding switches like some JDBC drivers). in practice though, it's a bit hit and miss (or trial and error), depending on jre and os. for example, for jdk1.3.1_07 file.encoding is read/only; j2sdk1.4.1_02 is read/write; jre1.3 on linux doesn't work; jre1.3 works fine. so if you run into any sort of voodoo-like G11N issue, perhaps the jre version for your server's OS is the real cause. so how did i figure this all out? i didn't. it was those bug hounds from hell at mm who did (damon cooper, hiroshi okugawa, jim schley et al). my g11n hat's off to them.
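
since the behavior varies by jre and os, one way to take the guesswork out is to ask the running jvm what it actually ended up with, and to name the charset explicitly wherever the api lets you. the snippet below is only a rough sketch of that idea: the class and method names (EncodingCheck, describeDefaults, toUtf8) are made up for illustration, and it assumes a much newer jre than the ones listed above (java 7 or later, for StandardCharsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// hypothetical helper, not part of cf or any library mentioned above
public class EncodingCheck {

    // report what this jvm ended up with, whatever -Dfile.encoding asked for
    public static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
             + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    // sidestep the platform default altogether by naming the charset explicitly
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
        // \u0E01 is the thai letter ko kai
        System.out.println(toUtf8("\u0E01").length + " bytes for one thai char in utf-8");
    }
}
```

if the two values printed by describeDefaults() disagree with what you passed on the command line, that's a decent hint the jvm argument isn't being honored on that particular jre/os combination.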

June 24, 2003

language negotiation

one important CGI variable from the G11N perspective is HTTP_ACCEPT_LANGUAGE. why? because it represents what language/locale the user wants as opposed to what cf might be able to deliver (via setEncoding(), cfProcessingDirective, cfcontent and the actual dynamic content). matching what the user wants and what your app can deliver is often called "language negotiation". while HTTP_ACCEPT_LANGUAGE is usually a single locale or language (th-th or th for example) it can often be a list of languages/locales (especially w/Macs, some of the longest HTTP_ACCEPT_LANGUAGE lists i've ever seen came from Mac browsers though browsers in internet cafes at major tourist destinations can get pretty long as well). language preferences are usually listed (comma delimited) in order, with most preferred first and may contain a quality (q) value that represents an estimate of the user's preference for that language range. for instance, "en-us,ko;q=0.5" means i prefer US english but will also accept Korean. whether a value for HTTP_ACCEPT_LANGUAGE exists depends on the browser age and whether a user has set it (for IE that would be via tools, internet options, languages), it also may only contain a language (en) rather than a full locale (en-ca) and we all know how important locale is ;-) because of this i use geoLocator (which determines locale from a user's IP) along with HTTP_ACCEPT_LANGUAGE to find and fix a user's locale. more info on HTTP_ACCEPT_LANGUAGE can be found here.
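
to make the negotiation step a bit more concrete, here's a minimal java sketch of matching an HTTP_ACCEPT_LANGUAGE value against the locales an app can actually deliver. it leans on java.util.Locale.LanguageRange and Locale.lookup, which only exist in java 8 and later (so well after the jres discussed on this blog); the class name, the negotiate() method and the fallback parameter are all invented for illustration, and this is not how geoLocator or cf does it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// hypothetical helper for illustration only
public class LanguageNegotiation {

    // pick the best supported locale for an Accept-Language style value,
    // or hand back the fallback when nothing usable was sent
    public static Locale negotiate(String acceptLanguage, List<Locale> supported, Locale fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback; // old browser, or the user never set a preference
        }
        List<Locale.LanguageRange> wanted;
        try {
            // parse() orders the ranges by q value, most preferred first
            wanted = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException e) {
            return fallback; // malformed header
        }
        Locale match = Locale.lookup(wanted, supported);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(
                Locale.forLanguageTag("th-TH"), Locale.forLanguageTag("ko"));
        System.out.println(negotiate("en-us,ko;q=0.5", supported, Locale.forLanguageTag("th-TH")));
    }
}
```

one design note: lookup-style matching trims the user's range down (en-us, then en) until it hits something the app supports, but as far as i can tell it won't stretch a bare language like "th" up to a supported "th-TH". that's the language-only problem described above, so either list the bare language among the supported locales or fill in the region some other way (an ip lookup, for instance).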

June 22, 2003

remarkable blog language stats

remarkable, maybe even unbelievable, blog language stats published by the NITLE Blog Census. english first, yeah ok but portuguese, polish, AND farsi (persian) in the top four! the language classifications are based on the textcat language guesser. if the stats actually pan out, i'll look into adding its algorithm into making uBlocks.cfc (18-jun post) better at language guessing too ;-) though my preliminary testing shows it pukes on mixed languages (mixed thai and english are guessed as "estonian").

ancient history

wow, i always knew BYTE magazine was ahead of its time but here's an article that deals with G11N issues, note that date on this thing. much of it is still relevant, even though it was written a dog's age ago.

unicode/xml bar brawl: no injuries/no arrests

john dowdell's blog entry Unicode/XML warning was actually resolved amicably, mainly in favor of XML (markup/tags win out over by-the-book char encoding):
  • unicode's line and paragraph separator are now discouraged, use markup tags instead.
  • language tags should replace unicode language tag codepoints
  • not that i care a whole lot but musical notation will be replaced by yet another customized XML language
  • the fraction slash is maintained, but may be handled better by the MathML markup language (yet another customized XML language)
  • superscripts and subscripts are retained, but could be replaced by markup tags (could??, i can see this will be fun)
the unicode side of this can be found in the full technical report. by and large this won't affect most cf folks but as JD points out it's something to keep your eyes peeled for with legacy content (though my experience is that very little of that is actually in unicode).

June 19, 2003

some metatags from the G11N perspective

here's a couple of maybe useful metatags from the G11N perspective. the CONTENT-TYPE might be getting ignored these days with mx (since mx itself ignores it) but probably shouldn't:
  • CONTENT-LANGUAGE <META HTTP-EQUIV="CONTENT-LANGUAGE" CONTENT="en-US,th-TH,fr"> primary human language (or languages) of this document. search engines may use this tag to categorize pages by language.
  • CONTENT-TYPE <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=UTF-8"> many folks recommend to always use this tag (for content type) and to specify the charset. while mx will ignore whatever values you have for this tag, i still include it out of habit. turns out some search engines may also examine this tag's values.

June 18, 2003

determine input language: pure cf solution

as mentioned earlier (june15th posting), i finally got my act somewhat together and created a pure coldfusion solution uBlocks.cfc to the problem of determining input language (or should i say unicode block). this CFC tries to determine unicode block (or subrange) from a sample of text. it returns a cf query with unicodeblock stats for the test phrase (ie. basic latin x%, CJK y%, etc.). you can try it out here. it will eventually bubble up on the devnet gallery. i'd appreciate any feedback. on the 'to do' list for this thing:
  • combine unicode block information into scripts and then into individual languages (chances of this: slim or next to none) or at least language clusters
  • return array of unicodeblocks corresponding to position within text sample, ie. map unicodeblock *positions* within text sample. probably useful for parsing "tower of babel" text.
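
for anyone curious what "unicodeblock stats" boils down to, here's a rough java sketch of the same idea: walk the sample, ask java which unicode block each code point falls in, and report percentages. this is not the uBlocks.cfc code (that's pure cf and returns a query); the class and method names are made up, skipping whitespace is my own assumption, and it needs java 8 or later for Map.merge.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// hypothetical illustration, not the uBlocks.cfc implementation
public class BlockStats {

    // percentage of code points per unicode block, in order of first appearance
    public static Map<String, Double> blockPercentages(String sample) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (int i = 0; i < sample.length(); ) {
            int cp = sample.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // assumption: spaces would only inflate basic latin
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            String name = (block == null) ? "UNASSIGNED" : block.toString();
            counts.merge(name, 1, Integer::sum);
            total++;
        }
        Map<String, Double> percentages = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            percentages.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return percentages;
    }

    public static void main(String[] args) {
        // "hello" followed by a thai greeting, written with escapes to dodge source encoding issues
        System.out.println(blockPercentages("hello \u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35"));
    }
}
```

keeping the counts in first-appearance order is a small nod to the second 'to do' item: it's nowhere near a position map, but it does show which block turned up first in the sample.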

some i18n resources

i was wandering around on tex texin's website (yes that is his name), i18nguy looking for telephone number info (i want to test country address vs country phone numbers) and found a good resource at the world telephone numbering guide. the wtng is a very rich site (though in my personal opinion, it's seriously ugly). while you're on tex's site don't forget to download his web internationalization tutorial. you can apply quite a bit of this to coldfusion (and i guess flash) app development. during my wandering i also found a bit of country/timezone info from of all places the world wide construction site. i guess construction gangs are i18n now too. speaking of timezones, here's a tip for getting client timezone info. it makes use of dan switzer's client/server gateway JSAPI and a few lines of coldfusion. and finally, this is a spatial view of the world's UTM and timezones--it's something i did at my old job but never tweaked much, so it's kind of slow (especially at startup). if you look closely you can see why gathering timezone info is sort of voodoo-like ;-)

June 17, 2003

more geoLocation: geoClip

well not really geoLocation and perhaps not exactly g11n but Geoclip is probably one of the best flash apps i've ever seen. try the "Travel through France and its 36.557 municipalities" app. the first thing you might notice is that the flash app isn't loading the whole spatial database in one go (ie data and flash app are separate). if you drag the green rectangle around the reference map and then click "extraction by municipality" the flash app runs back to the database (mySQL) and pulls the spatial data within that area. this allows for flash to front-end some really monster thumping spatial databases (and these things tend to get very big). eric's even got an SVG version.

June 16, 2003

geoLocation

location matters (being a GIS practitioner you can bet i think it matters). i've researched and used several geoLocation services, some of which are listed here on i18ngurus (and some of those have since gone belly up). while most of these services accurately show i'm coming from bangkok (the infosplit service, though, pinpointed bangkok as being in northwestern china near the mongolian border) none can really compare with nigel wetters' inetaddress locator. it's free, it's fast and works a treat with coldfusion. once you know where somebody's located you can serve them content based on their locale, price products in local currencies, price products based on geographical location (within Thailand, free, everybody else pays double ;-), target ads, analyze web traffic (ala farcry), etc.

machine translations? bah humbug.

i'm not a huge fan of machine translations, especially the free online ones. from my experience they are mostly good for a laugh. they will almost never do as well as a human being (context for instance, "i'm mad at you" gets translated into "i'm crazy" in many of the machine translators listed below, the Thai "parsit" just cracks me up). just to prove a point, here's a list of some that i've stumbled across over time. bah, humbug.

June 15, 2003

determine input language

finding a specific language from input text is rather hard but you can come "close" by determining the UnicodeBlock a particular char falls in. it won't help much with larger clumps of languages such as latin-1 (western european languages) or CJK (chinese, japanese, korean). this java code snippet is a simple wrapper class that returns the UnicodeBlock of a given char.
import java.lang.Character.UnicodeBlock;

// simple wrapper so cf can ask which unicode block a char falls in
public class determineLanguage {
	// returns the block name (THAI, BASIC_LATIN, etc.) for a single char
	public final static String whatLanguage(char aChar){
		UnicodeBlock aBlock = UnicodeBlock.of(aChar);
		return String.valueOf(aBlock);
	}
}
this cfmx code snippet illustrates how to use the wrapper:
<cfsilent>
<cfprocessingdirective pageencoding="utf-8"> <!--- remove for cf5 --->
<cfcontent type="text/html; charset=utf-8">  <!--- remove for cf5 --->
<cfscript>
	if (isDefined("form.testLanguage") and len(trim(form.testLanguage))) {
		determineLanguage = createobject("java","determineLanguage");
		test=asc(form.testLanguage);
		thisLang = determineLanguage.whatLanguage(test);
	}
</cfscript>
</cfsilent>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
	<title>Test Language Determination</title>
	<style type="text/css">
        BODY {
	    font-size : 85%;
       	font-family : "Arial Unicode MS";
        }
        INPUT {
       	font-family : "Arial Unicode MS";
        }
	</style>    
	
</head>

<body>
<form action="testLang.cfm" method="post">
test language: <input type="text" name="testLanguage" size="50"> <input type="submit" value="try">
</form>
<cfif isDefined("variables.thisLang")>
<b>unicode subset</b>: <cfoutput>#thisLang#</cfoutput>
</cfif>
</body>
</html>
as soon as i can get my unicode char db sorted out i suppose we can dispense with the java class altogether and just use cf (mx) code.

opensource i18n CMS? have you heard of farcry?

if you're in need of a CMS that's i18n, you might give farcry a look. it now fully supports unicode and is well on its way to being 100% i18n functional (in a practical sense it is now). it's highly customizable, rather well documented, and a breeze to get set up and running.

get your dose of engrish here

this is just too funny to pass by. perhaps it's a bit cruel but most humor is.

June 14, 2003

new icu4j version

ibm's released another version (2.6) of its oh so cool ICU4J library. it's now unicode 4.0 compliant, much reduced in size (jar file is now about 2.3mb), modularized (you can mix and match the stuff you need) and you can now "cast" currency formatting from any locale. download here. you can see it in action (with simple how-to example for thai, islamic, chinese, hebrew, japanese & georgian calendars) at icu4jcalendars.