Thaana conversions class for PHP 5 - v0.2

Here is an update to the Thaana conversions class I released in Nov 2007. This new version 0.2 release expands the varieties of conversions available and should be more than adequate for almost all uses. This version, most importantly, adds solid UTF-8 conversion functions allowing for more flexibility in PHP-based Unicode/UTF-8 Thaana handling. Further, the class is now licensed under the pretty liberal Open Source MIT License. The code still relies solely on core PHP 5 functions and does not demand any extra PHP extensions to be installed.

Functions exposed by the class

- convertUtf8ToUnicodeIntegers()
Convert UTF-8 data to Unicode character integer representations

- convertUtf8ToAscii()
Convert UTF-8 data to Ascii

- convertEntitiesToUnicodeIntegers()
Convert HTML Unicode entitied string to Unicode Integer characters array

- convertEntitiesToUtf8
Convert HTML Unicode entities to UTF-8

- convertEntitiesToAscii()
Convert HTML Unicode entities to Dhivehi Ascii equivalents

- convertUnicodeIntegersToUtf8()
Convert Unicode Integer array to UTF

- convertUnicodeIntegersToEntities()
Convert Unicode char integers to HTML entities

- convertUnicodeIntegersToAscii()
Convert Unicode char integers to Ascii

- convertAsciiToUtf8()
Convert Ascii Thaana to UTf-8

- convertAsciiToUnicodeEntities()
Convert Ascii Thaana to Unicode HTML entities

- convertAsciiToUnicodeIntegers()
Convert Ascii Thaana to an array of Unicode integers

Usage

<?php
// Load class and initialize object
require 'thaana_conversions.obj.php';
$thaana = new Thaana_Conversions();

echo $thaana->convertEntitiesToAscii('ދިވެހި');
echo $thaana->convertAsciiToUtf8('rWacje');
?>

Download:

- Thaana_Conversions.zip (v0.2, 3KB)

Drop me a line if you run into trouble with any of the functionality or have comments/queries. Enjoy :-)

Update (11-Sep-2008): This version is now superseded by the v0.3 release.

Javascript Unicode Keyboard Handler for Thaana

Here's something that is probably going to be very useful to the Maldivian web developers working on Unicode-based Thaana web pages. It is a Javascript utility function that translates keystrokes into the appropriate Unicode Thaana characters. Hence, it makes it possible for HTML text input and textarea fields (and similar) to accept Thaana without having to require the user to switch the keyboard language on their computer. Such a feature contributes for a better user experience as the user can simply enter Dhivehi without the extra hassle. The code has been tested with no problems found on Firefox 1/2/3 and Internet Explorer 5/6/7.

If you would like a demo, I recommend you check out the text entry box at Radheef.com and see the HTML behind it. A few developers seem to have already adopted my code as at Radheef.com and utilized it in their work - haamadaily.com, sangudaily.com and jazeera.com.mv and haveeru.com.mv is using the code far as I know.

I originally wrote this around 2002 while experimenting with different methods of Thaana entry for the web. The version I'm releasing here, marked as version 2.0, is a modified version from 2006. It is being released under the MIT License.

- Download unicodehandler-2.0.js

Usage

1. Link the file in the HEAD section of the page:
<script type="text/javascript" src="/unicodehandler.js"></script>

2. Attach the handler to any text INPUT, TEXTAREA or editable DIV tag:
<textarea rows="1" onkeypress="return juk_HandleKeyPress(event);"></textarea>

3. Set any Unicode-compatible Dhivehi font to be used for the field using CSS.

4. That's it!

Drop a line here if you use it and/or have problems. Enjoy.

Update (16-Aug-2008): This version is now superseded by the new and improved v3.0.

Dhivehi Radheef application for Facebook

I am launching a new Maldivian Facebook application which I had planned out as part of a scheduled series of feature updates to Radheef.com. This new Facebook application displays a random word, and its associated meaning, from the Dhivehi Radheef on your Facebook profile. Words are automatically updated everyday so as to keep your profile fresh and, perhaps even, educational.

Give it a go if it catches your fancy.

- View the application's About page on Facebook


Towards a (true) Dhivehi search engine

As much as I would like the Dhivehi language to die and rot away, it seems it won't happen, atleast for a while. The (relatively) newly minted freedom to publish newspapers and the growth of web-based news sites may have poised Dhivehi for a serious revival of the language. The revival probably isn't so much in terms of improvements in the vocabulary or other more linguistics related changes but rather a revival in terms of the amount of information now being pumped out in Dhivehi - and in my opinion, that's a great start.

A (if not THE) point worth noting here is that much of this new information is being produced - and published - by digital means. Most government authorities now have web portals and an increasing number of them maintain them diligently. Most, if not all, newspapers and magazines also seem to maintain web portals with their content being made available online on the web. This modern revival thus presents a very interesting and a very much modern set of problems (to geeks like me atleast :-P) :- accessing it. It is probably the first time in Maldivian history that a "dhivehi search engine" makes practical sense.

Now, I am aware that Google and other search engines can be used to search for Dhivehi and I'm also aware that there are a few local operations that purport/aspire to be Maldivian search engines but they all share important shortcomings. These shortcomings are mostly inherent to the various methods of writing Thaana as used on the World Wide Web.

Say you want to search for the word "rayyithunge". Typing that into a search engine would bring an entirely different set of results from typing in "rwacyituncge" or "ރައްޔިތުންގެ" - both of which are alternative forms of representing the same thing in Dhivehi. The different set of results arise because of the differences in the representation schemes used on the different sites. A search with the phrase "rayyithunge" would bring in results with pages that seem to mostly contain English and that's because "rayyithunge" is Dhivehi "Latin"ised into English so that we could use standard English characters to write Dhivehi words. People commonly use such Latinised Dhivehi when writing emails or chatting - say "haalu kihineththa" etc. Meanwhile, a search with the phrase "rwacyituncge" results in a listing of content from sites like Haveeru and Miadhu who use standard ASCII coupled with custom Dhivehi fonts with the characters mapped. If you try copy-pasting something written on the Haveeru page you'd see that it comes out as a seemingly meaningless jumble of letters. Lastly, a search with the phrase "ރައްޔިތުންގެ would bring in results from sites like Minivan Daily and Sangu Daily who use Unicode to display Dhivehi. Anyway, the technical explanations aside, the point is that Dhivehi search is (currently) a messy enterprise.

The solution to this problem can (seem to) be pretty simple. A custom search interface could be made to simply take the search query from a user and convert it into the three different representation schemes and then spawn search a search for each representation phrase on any of the existing search engines. This would work just fine... until you run into peculiar problems related to Latinised and ASCII Dhivehi schemes. Take for example the word "ފަލަ" Latinised into "fala" - a search on the word would result in almost entirely non-Dhivehi results totally unrelated to what we really want. Similarly, a search on the ASCII'ed phrase "Oled" (which is the word "ދެލޯ") would result in a large number of non-Dhivehi results with no bearing on what we wanted. These problems occur because Latinised and ASCII Dhivehi representations can result in text that have meaning in English as well - such as the case of "Oled" as above which happens to be a popular technical term in English.

A more sophisticated approach to the search problem probably could successfully iron out (most of) these quirks. An ideal solution would be to do away with the existing search engines such as Google, despite their awesomeness, and develop a custom search engine. A custom engine would allow for the recognition of the various representation schemes used and the subtle differences between them. A search phrase entered on such an engine would perhaps standardize the phrase and search through a standardized index to return results that are a better mirror of the Dhivehi content that is out there. Such a custom search engine could bundle in extra Dhivehi-related facilities such as conversions to allow for lack of (particular) fonts as used on sites and spelling correction among others.

So, perhaps the question now is, is there a real need for a Dhivehi search engine yet? When should a Maldivian "Google" be born?

Thaana Unicode<->Ascii conversions PHP class

Here is something that would probably be very handy to Maldivian web developers dabbling with Dhivehi sites. This PHP class addresses the need for converting text to and from Thaana in Ascii and Thaana in Unicode.

The class makes it easy to standardize text into one format irrespective of how it was/is written. This means that you can take text written in Accent, MS Word 97 (and prior) or written using Unicode as featured on recent MS Word editions and use the class to present output in the format of your choice without the need for imposing restrictions on the people who write the text. The class comes in even more handy when you have a form submission that takes input in Unicode but needs to be stored in the database or presented later as Ascii, or vice versa.

The class was something I originally wrote around 2001 and was used in the free Online Document Converter that featured on maldivianunderground.net. I rewrote it for PHP 5 recently for use in a project I am working on. The original class had support for Letin dhivehi -> Unicode/Ascii conversions as well which I haven't included in this release but will add it a future update.

Usage should be pretty straightforward but here is an example just to illustrate:

Example:
<?php
require 'thaana_conversions.obj.php';
$thaana = new Thaana_Conversions();

echo $thaana->convertUnicodeToAscii('&#1931;&#1960;&#1928;&#1964;&#1920;&#1960;');
echo $thaana->convertAsciiToUnicode('rWacje');
?>

Download:
- Thaana_Conversions.zip (v0.1, 2KB)

Enjoy :-)

Update (7-May-2008): This version is now superseded by v0.2.

Guide to using Thaana on the WWW

Developing Dhivehi web pages is pretty easy and there are quite a few methods to do it. However, information on how to go about it seems to be lacking, leaving newbies stumped. Here is a general overview on the various methods for displaying Thaana on the WWW and should contain enough information to help anyone, designer or programmer, get started.

1. CSS: rtl + bidi-override

This method is applicable only to non-Unicode text. It works on all modern browsers but requires for the user to have atleast one of the fonts specified in the page - otherwise the text would be displayed as a mostly meaningless jumble of English letters.

This is the least-effort route to getting any non-Unicode Thaana text (such as those written using MS Word 97/2000, Accent Express, MLS or Faseyha Thaana) on to the web. The websites of Haveeru and Miadhu currently take this approach.

Usage:
To use this method, apply the following CSS to any HTML elements that contain Thaana text. You may use inline style attributes or CSS class/ids to achieve this. You may change the font names to suit your needs but make sure you list several popular fonts and that the fonts specified are all non-Unicode fonts. You could, of course, also add further CSS styling (font size, font color, line height etc) but the following are the required minimum.
font-family: A_Ilham, A_Randhoo, A_Faruma, A_Waheed;
direction: rtl;
unicode-bidi: bidi-override;

Demo:
View example



2. Unicode Dhivehi

This method is applicable to text in Unicode. It works well on all modern browsers but requires for the user to have atleast one Unicode Thaana font - and unlike method (1) the system defaults to a Thaana font it does have if it cannot find any of the fonts named in the page.

This is the best method for any new and modern Thaana-based website. It is used in the online Radheef, Jazeera Daily and Haama Daily.

Usage:
To use this method, first add the following to the page's HTML HEAD section.


Next, apply the following CSS to any HTML elements that contain Thaana text. You may use inline style attributes or CSS class/ids to achieve this. You may change the font names to suit your needs but make sure that the fonts specified are all Unicode fonts. You could, of course, also add further CSS styling (font size, font color, line height etc) but the following are the required minimum.
font-family: Faruma, "MV Elaaf Normal";
direction: rtl;
text-align: right;

Demo:
View example



3. Image

This approach basically renders the Dhivehi text as an image. This is perhaps the most obvious and was the only method available early on. However, this method is still a pretty lucrative solution especially given that many computers just don't have the required fonts available. Using an image for the text rids the requirement on the client browser/computer to have the proper fonts available.

The basic approach of rendering the text into an image using Photoshop, MS Word etc is pretty tedious as the process is entirely manual. However, there is a more sophisticated approach that renders the text into Dhivehi on-the-fly on the web server side (perhaps coupled with caching to reduce load). A server-side scripting language such as PHP can be used to render text into an image using any font of choice by the designer/programmer. The rendered images (typically PNGs) are of very small size and hence have a negligible effect on the page load time in most cases.

Refer to the imagettftext function for details on how to do it in PHP.



4. Flash

This method uses text loaded in Macromedia Flash with the required font(s) being embedded in the Flash clip. ActionScript and/or Flash variables are used to load the text into text areas in the Flash file. This method has the advantage that it works whether the client computer/browser has Dhivehi font available or not but then again it does require the client to have Flash installed and enabled. If you are only seeking to have nice one-line headline sort of text in Dhivehi then you might consider using sIFR.

Refer to Font Embedding help page at Adobe LiveDocs for details on font embedding in Flash.



5. WEFT

Web Embedding Fonts Tools is a Internet Explorer only solution offered by Microsoft. It involves using the Windows-only WEFT utility to create font "objects" that can then be placed on web pages. This method is not recommended unless the target only involves use of Internet Explorer.

Refer to Microsoft WEFT page for more information.



6. TrueDoc

TrueDoc is a solution offered by Bitstream Inc. It is a solution similar to Microsoft's WEFT in that TrueDoc solutions create a embeddable font resource called a Portable Font Resource. Any font (ie. Dhivehi font) can be loaded once users install a custom font "viewer" (called the Character Shape Player by the company). This solution is NOT free and requires the purchase of special software from BitStream to produce the custom embeddable font packages.

Refer to the TrueDoc site for more information.



Good luck ;-)

Update (24-Nov-2008): Method 1 and 2 rewritten for clarity and demos added.

Dawiyani bas

I was quite perplexed when one of my kid cousins asked me recently if I knew "Dawiyani bas" (literal translation: dawiyani language). Dawiyani, being one of the letters of the Maldivian alphabet and never ever having heard of anything remotely close to such a dialect of Maldivian language, I answered with a puzzled "no!". The cousin then tickled my curiosity of this "bas" by telling me a little about how it is spoken.

Apparently, "dawiyani bas" is a method of speaking that inserts the letter "dawiyani" in between every letter in the normal Dhivehi language speech/text. Though this sounds utterly simple, I remained puzzled till she amazed me by successfully engaging me in meaningful replies to what I say with continuous speech in the said "dawiyani bas". So, to answer me for a question like "gadin kihaa ireh?", she would reply me immediately with almost a machine-gun fire reply of "midahaadaruda gadadinda edagaadarada jedahyda". The resulting speech comes out as meaningless gibberish and I was amused by the speed and ease with she was talking!

A little digging up about the origins of this neat speech trick turned up that this isn't something new. My cousin had learnt it from her mother, who in turn had learnt from her mother, who in turn still narrate vivid memories of how the children of her time engaged in this past time. It was a cute way of talking that amused children and children often found in it a practical method to speak to each other without being understood by the adults and/or other children. In fact, "dawiyani" was just one of the letters used. The letters "gaafu" and "tawiyani" is also said to have been a popular choice of letter to fill up the alternate character spaces - under the same principle with which the "dawiyani bas" operates.

Unless one has practiced and gotten used to this speech trick, decoding the "dawiyani bas" as it is spoken would rattle one's brain cells to and fro fast enough to result in utter confusion. It seriously is quite a tough operation. Anyway, maybe it is time to introduce a bit of ROT13 in the mix as a route towards an even more cryptic speech? :-P