June 15, 2003

determine input language

pinning down the specific language of input text is rather hard, but you can come "close" by determining the UnicodeBlock a particular char falls in. it won't help much with blocks shared by many languages, such as latin-1 (western european languages) or the CJK ideographs (chinese, japanese, korean), since the block identifies the script rather than the language. this java code snippet is a simple wrapper class that returns the name of the UnicodeBlock of a given char.
import java.lang.Character.UnicodeBlock;

public class determineLanguage {
	// returns the name of the unicode block aChar falls in, e.g. "CYRILLIC"
	public final static String whatLanguage(char aChar){
		UnicodeBlock aBlock = UnicodeBlock.of(aChar);
		// String.valueOf() yields "null" when the char belongs to no defined block
		return String.valueOf(aBlock);
	}
}
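the wrapper above looks at one char at a time. as a rough sketch of how the same idea might stretch to a whole string, the following tallies the UnicodeBlock of every code point and reports the most common one. the class name BlockTally and its methods tally and dominant are made up for this illustration (they are not part of the wrapper), and it assumes java 8 or later for Map.merge and UnicodeBlock.of(int).

```java
import java.lang.Character.UnicodeBlock;
import java.util.LinkedHashMap;
import java.util.Map;

// hypothetical helper, not part of the original wrapper
final class BlockTally {

    // count how many non-whitespace code points fall in each unicode block
    static Map<UnicodeBlock, Integer> tally(String text) {
        Map<UnicodeBlock, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            UnicodeBlock block = UnicodeBlock.of(cp); // null if no block is defined
            if (block != null) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts;
    }

    // block with the highest count (first seen wins a tie), or null if none
    static UnicodeBlock dominant(String text) {
        UnicodeBlock best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeBlock, Integer> e : tally(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // escapes keep the source file ascii-only
        System.out.println(dominant("\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440")); // russian sample, prints CYRILLIC
        System.out.println(dominant("\u65e5\u672c\u8a9e")); // japanese sample, prints CJK_UNIFIED_IDEOGRAPHS
        System.out.println(dominant("\u4e2d\u6587"));       // chinese sample, prints CJK_UNIFIED_IDEOGRAPHS
    }
}
```

note that the japanese and chinese samples both come back as CJK_UNIFIED_IDEOGRAPHS. that is the "larger clumps" limitation mentioned earlier: the block tells you the script, not the language.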
this cfmx code snippet illustrates how to use the determineLanguage wrapper:
<cfsilent>
<cfprocessingdirective pageencoding="utf-8"> <!--- remove for cf5 --->
<cfcontent type="text/html; charset=utf-8">  <!--- remove for cf5 --->
<cfscript>
	if (isDefined("form.testLanguage") and len(trim(form.testLanguage))) {
		determineLanguage = createobject("java","determineLanguage");
		// asc() returns the code of the first character only
		test=asc(trim(form.testLanguage));
		thisLang = determineLanguage.whatLanguage(test);
	}
</cfscript>
</cfsilent>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
	<title>Test Language Determination</title>
	<style type="text/css">
        BODY {
	    font-size : 85%;
       	font-family : "Arial Unicode MS";
        }
        INPUT {
       	font-family : "Arial Unicode MS";
        }
	</style>    
	
</head>

<body>
<form action="testLang.cfm" method="post">
test language: <input type="text" name="testLanguage" size="50"> <input type="submit" value="try">
</form>
<cfif isDefined("variables.thisLang")>
<b>unicode subset</b>: <cfoutput>#thisLang#</cfoutput>
</cfif>
</body>
</html>
as soon as i can get my unicode char db sorted out, i suppose we can dispense with the java class altogether and just use cf (mx) code.