Fingerprinting Thaana

What is the frequency of characters in typical Dhivehi writing? What is the most commonly used Thaana akuru/fili in Dhivehi? Is there a general pattern of akuru and fili to be expected in any given Dhivehi document?

These questions, and especially the last, kindled my curiosity yesterday and had me off to explore a little bit. Although seemingly trivial and of no practical use, these are serious questions that probe into the finer details of Dhivehi and help produce computational models of Dhivehi - which have practical applications. Even the generalizations and patterns that result from the simplest statistical analysis transcend the (quirks of) individual writing and give a broader picture of what a language is really like. For example, I'm employing a statistical fingerprint of Dhivehi that was generated during this little exercise as part of an experimental procedure that identifies (the presence of) Dhivehi content in web pages. It takes advantage of the fact that the fingerprint for Dhivehi and that for English are dramatically different, thus allowing a computer program to discern the type of content it is dealing with - all without really "understanding" a language.
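To give a rough idea of how such a fingerprint comparison could work, here is a minimal sketch. It is not the actual procedure I use: the function names, the choice of cosine similarity and the tiny reference samples are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def fingerprint(text):
    """Relative frequency of each non-whitespace character in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprints (0.0 if either is empty)."""
    dot = sum(weight * b.get(ch, 0.0) for ch, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_language(text, references):
    """Name of the reference fingerprint most similar to the text."""
    fp = fingerprint(text)
    return max(references, key=lambda name: cosine_similarity(fp, references[name]))


# Toy reference samples (assumption: a real fingerprint would be built
# from thousands of articles, not one word and one sentence).
REFERENCES = {
    "dhivehi": fingerprint(
        "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac"
    ),
    "english": fingerprint("the quick brown fox jumps over the lazy dog"),
}

print(closest_language("hello there", REFERENCES))
print(closest_language("\u078b\u07ac\u078d\u07af", REFERENCES))
```

Nothing here needs to know what any word means; the character distribution alone does the work.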

I conducted the analysis on a dataset consisting of ~5000 Dhivehi articles from Haveeru Daily and ~7000 Dhivehi articles from Jazeera Daily. They may not represent the whole variety of Dhivehi literature available but I think they are a very good approximation - especially of Dhivehi web content, which is what I was mostly interested in. My focus was on the individual character level and I ran basic mean, mode, variance, standard deviation and frequency calculations with a further character correlation analysis. Despite these being quite simple analyses, I don't think anyone's ever explored as much before and hence the following should make for (exciting!) new information.
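For the curious, the per-character part of such an analysis can be sketched in a few lines. This is an illustration only, not the script I ran: it assumes the articles are already Unicode text, treats U+07A6 to U+07B0 as fili (sukun included) and the remaining letters of the Thaana block as akuru, and the function names are made up.

```python
import statistics
from collections import Counter


def is_fili(ch):
    # Thaana vowel signs, abafili (U+07A6) through sukun (U+07B0)
    return "\u07a6" <= ch <= "\u07b0"


def is_akuru(ch):
    # Thaana letters: U+0780..U+07A5, plus naa (U+07B1)
    return "\u0780" <= ch <= "\u07a5" or ch == "\u07b1"


def character_shares(article):
    """Share of each Thaana character among all Thaana characters in one article."""
    thaana = [ch for ch in article if is_akuru(ch) or is_fili(ch)]
    if not thaana:
        return {}
    counts = Counter(thaana)
    return {ch: n / len(thaana) for ch, n in counts.items()}


def summarise(articles):
    """Map each character to (mean share, population std deviation) across articles."""
    shares = [character_shares(article) for article in articles]
    alphabet = sorted({ch for share in shares for ch in share})
    summary = {}
    for ch in alphabet:
        values = [share.get(ch, 0.0) for share in shares]
        summary[ch] = (statistics.mean(values), statistics.pstdev(values))
    return summary


# Two tiny stand-in "articles" (assumption: real input would be full articles).
sample = ["\u0783\u07a6", "\u0783\u07a6\u0783\u07a6"]
for ch, (mean, stdev) in summarise(sample).items():
    kind = "fili" if is_fili(ch) else "akuru"
    print(f"U+{ord(ch):04X} {kind} mean={mean:.2f} stdev={stdev:.2f}")
```

Working with shares rather than raw counts keeps long and short articles comparable.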

Enjoy :-)

Mean fili usage in Dhivehi writing

Mean akuru usage in Dhivehi writing

Thaana character frequencies

Ovvalhu(gondi): an African game?!

I was watching a presentation tonight titled "African fractals, in buildings and braids" (on TEDTalks) and was totally absorbed in it when an African game board shown in it caught my eye - the board looked eerily similar to something I knew: an Ovvalhugondi. I had always been under the impression that "Ovvalhu" was a distinctly Maldivian game but, just like many other supposedly Maldivian games such as "Koraa" and "Baibalaa", I wondered if Ovvalhu too was just another foreign game that had been absorbed into our culture. Anyway, I was compelled to look up more about the game of African origin that was mentioned, called "Mancala".

Finding out more about Mancala was a much easier task than I thought. Rather than being an obscure game played in a lone part of Africa with little mention in any literature, Mancala turned out to be something of a global phenomenon: it is widely written about, played all over the world and sold by dozens of online shops. There even were online versions of the game! Mancala, or Manqala in Arabic, basically refers to a class of games that all have similar game play - the objective always being to capture more "stones" than the opponent. There are a number of variants (see Wikipedia's list), as adopted by different countries or areas, that differ in the finer details of how it's played. Apparent differences obviously include the number of pits in the board and the number of "stones" used in play.

I might be wrong (very wrong, in fact) but from what I read I suspect that Ovvalhu takes after the version played in South India or possibly the version played in Ghana. Ovvalhu may not be our national game but Ghana's Mancala variant called "Oware" is supposed to be their national game. I found it amusing that the names sounded similar but that might just be mere coincidence(?!). A similarity that certainly is not a coincidence is that Maldivians also used to play Ovvalhugondi with "Laagulha" (picture), i.e. the seed of "Kashikumburu", which is what Oware is supposedly played with (in the Caribbean at least).

What was even more interesting, to me, was to learn that Mancala (or at least some variants of it) had been analysed using combinatorial game theory. The game of Awari was tackled by two Dutch scientists who generated the entire state-space for the game - amounting to almost 900 billion positions - and cracked the perfect play for the game (here's their paper). Perfect play is a game theory term for a strategy or set of moves that guarantees a certain outcome in a game - a win or a draw at the least - if the game allows so, mathematically. I have no interest in Ovvalhu but I find such computational challenges almost erotic. I'm very much tempted to attempt analyzing Ovvalhu for a perfect play as well, so I've added it to my list of future boredom-killer projects.
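To make the idea of perfect play a little more concrete, here is a toy sketch. It has nothing to do with Awari or Ovvalhu rules: I'm assuming a much simpler game (one pile of stones, each player removes 1 to 3 per turn, whoever takes the last stone wins) just to show how exhaustively walking the state-space settles the outcome of every position.

```python
from functools import lru_cache

# Assumed toy rules: a move removes 1, 2 or 3 stones from a single pile.
MOVES = (1, 2, 3)


@lru_cache(maxsize=None)
def mover_wins(stones):
    """True if the player about to move can force a win from this position."""
    # With no stones left there is no legal move: the previous player
    # took the last stone, so the player to move has lost.
    return any(not mover_wins(stones - m) for m in MOVES if m <= stones)


def best_move(stones):
    """A move that leaves the opponent in a lost position, or None if none exists."""
    for m in MOVES:
        if m <= stones and not mover_wins(stones - m):
            return m
    return None


for pile in range(1, 9):
    print(pile, mover_wins(pile), best_move(pile))
```

The same principle applies to the real thing, only with a state-space far too large to walk this naively.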

Anyway, though it is pretty conclusive that Ovvalhu is not a Dhivehi game, I think it is interesting to learn that it certainly is one with an exciting history and background!

Towards a (true) Dhivehi search engine

As much as I would like the Dhivehi language to die and rot away, it seems it won't happen, at least for a while. The (relatively) newly minted freedom to publish newspapers and the growth of web-based news sites may have poised Dhivehi for a serious revival. The revival probably isn't so much in terms of improvements in the vocabulary or other more linguistics-related changes but rather a revival in terms of the amount of information now being pumped out in Dhivehi - and in my opinion, that's a great start.

A (if not THE) point worth noting here is that much of this new information is being produced - and published - by digital means. Most government authorities now have web portals and an increasing number of them maintain them diligently. Most, if not all, newspapers and magazines also seem to maintain web portals with their content being made available online. This modern revival thus presents a very interesting and very modern problem (to geeks like me at least :-P): accessing it. It is probably the first time in Maldivian history that a "Dhivehi search engine" makes practical sense.

Now, I am aware that Google and other search engines can be used to search for Dhivehi and I'm also aware that there are a few local operations that purport/aspire to be Maldivian search engines but they all share important shortcomings. These shortcomings are mostly inherent to the various methods of writing Thaana as used on the World Wide Web.

Say you want to search for the word "rayyithunge". Typing that into a search engine would bring an entirely different set of results from typing in "rwacyituncge" or "ރައްޔިތުންގެ" - both of which are alternative forms of representing the same thing in Dhivehi. The different sets of results arise because of the differences in the representation schemes used on the different sites. A search with the phrase "rayyithunge" would bring in results with pages that seem to mostly contain English and that's because "rayyithunge" is Dhivehi "Latin"ised into English so that we could use standard English characters to write Dhivehi words. People commonly use such Latinised Dhivehi when writing emails or chatting - say "haalu kihineththa" etc. Meanwhile, a search with the phrase "rwacyituncge" results in a listing of content from sites like Haveeru and Miadhu, which use standard ASCII coupled with custom Dhivehi fonts with the characters mapped. If you try copy-pasting something written on the Haveeru page you'd see that it comes out as a seemingly meaningless jumble of letters. Lastly, a search with the phrase "ރައްޔިތުންގެ" would bring in results from sites like Minivan Daily and Sangu Daily, which use Unicode to display Dhivehi. Anyway, the technical explanations aside, the point is that Dhivehi search is (currently) a messy enterprise.

The solution to this problem can (seem to) be pretty simple. A custom search interface could be made to simply take the search query from a user, convert it into the three different representation schemes and then spawn a search for each representation on any of the existing search engines. This would work just fine... until you run into peculiar problems related to Latinised and ASCII Dhivehi schemes. Take for example the word "ފަލަ" Latinised into "fala" - a search on the word would result in almost entirely non-Dhivehi results totally unrelated to what we really want. Similarly, a search on the ASCII'ed phrase "Oled" (which is the word "ދެލޯ") would result in a large number of non-Dhivehi results with no bearing on what we wanted. These problems occur because Latinised and ASCII Dhivehi representations can result in text that has meaning in English as well - such as the case of "Oled" as above, which happens to be a popular technical term in English.
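A minimal sketch of the conversion step might look like the following. Treat it as an illustration under loud assumptions: the mapping table is partial and is derived only from the "rwacyituncge" example above, the function names are hypothetical, and the Latinised form is passed in by hand because I haven't spelled out any transliteration rules here. Note also that the "Oled" example suggests some ASCII text is stored in reverse (visual) order, which this sketch ignores.

```python
# Partial ASCII-font-to-Unicode table (assumption: covers only the characters
# that appear in "rwacyituncge"; a real table would cover the full font).
ASCII_TO_THAANA = {
    "r": "\u0783",  # raa
    "w": "\u07a6",  # abafili
    "a": "\u0787",  # alifu
    "c": "\u07b0",  # sukun
    "y": "\u0794",  # yaa
    "i": "\u07a8",  # ibifili
    "t": "\u078c",  # thaa
    "u": "\u07aa",  # ubufili
    "n": "\u0782",  # noonu
    "g": "\u078e",  # gaafu
    "e": "\u07ac",  # ebefili
}
THAANA_TO_ASCII = {thaana: ascii_ch for ascii_ch, thaana in ASCII_TO_THAANA.items()}


def ascii_to_unicode(text):
    """Convert ASCII-font Dhivehi to Unicode Thaana; unmapped characters pass through."""
    return "".join(ASCII_TO_THAANA.get(ch, ch) for ch in text)


def unicode_to_ascii(text):
    """Convert Unicode Thaana to the ASCII-font form; unmapped characters pass through."""
    return "".join(THAANA_TO_ASCII.get(ch, ch) for ch in text)


def expand_query(unicode_query, latin_form=None):
    """List the representations of a query to send to an existing search engine."""
    variants = [unicode_query, unicode_to_ascii(unicode_query)]
    if latin_form:
        variants.append(latin_form)
    return variants


query = ascii_to_unicode("rwacyituncge")
print(expand_query(query, latin_form="rayyithunge"))
```

The conversion itself is the easy part; the ambiguity of the resulting ASCII and Latin strings is where the trouble starts.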

A more sophisticated approach to the search problem probably could successfully iron out (most of) these quirks. An ideal solution would be to do away with the existing search engines such as Google, despite their awesomeness, and develop a custom search engine. A custom engine would allow for the recognition of the various representation schemes used and the subtle differences between them. A search phrase entered on such an engine would perhaps standardize the phrase and search through a standardized index to return results that are a better mirror of the Dhivehi content that is out there. Such a custom search engine could bundle in extra Dhivehi-related facilities such as conversions to allow for lack of (particular) fonts as used on sites and spelling correction among others.

So, perhaps the question now is, is there a real need for a Dhivehi search engine yet? When should a Maldivian "Google" be born?

Back from pause

It's been quite a while since I last made a post here on my blog - more than a month to be more precise - but the lack of posts wasn't all due to forgetful negligence or busy schedules. Rather, it was mostly due to deliberate inaction while I contemplated some things regarding blogging.

Simply said, I came to the decision to halt blogging because I was quite intimidated by the effects that my blog was having on my personal life lately. Never had I thought, when I started blogging over two years ago, that the things I publish in the virtual world would lead to alarming consequences for myself in the real world.

One important such consequence is one that is (and should be) reasonably expected: exposure. This is especially relevant for bloggers like me who are upfront about who they are and choose not to hide behind pseudonyms and veils of secrecy. What I say on the blog then becomes directly attributable to me as a person and I am held accountable for what I say rather than all of it being chalked up to some anonymous pseudonym. Further, blog posts - explicitly or implicitly - reveal a lot more about the intellectual dispositions of the author, including ideological stances and beliefs. It is important to note that these tend to have direct detrimental effects on the blogger's individual privacy and anonymity. I find it startling when random people whom I've never met or heard of strike up a conversation with an air of supposed familiarity with me, or when people talk of me with a sort of conviction that I'm this or that - all based on my blog posts. Exposure is good for aspiring politicians and performance artists, neither of which I have the slightest inclination towards becoming. The balance between anonymity and exposure is a tricky one, especially for someone who wants to stay pretty much under the radar.

Another important consequence of blogging relates to the response it evokes in readers. Personal blogs like mine often contain the ideas, thoughts and ramblings of the author - all of which have the potential of being controversial and disagreeable to some. Sadly, this sometimes goes to the extent that some choose to take extreme offence and retaliate with the vilest of language and threats. The comment system on my blog had always been very open and mostly unmoderated, yet I had to switch to active comment moderation in the latter part of last year due to the growing use of vile language, mindless insults and threats against me and my family. While I appreciate all types of comments and welcome disagreement and discussion, I don't think death threats and chants wishing ill things on me are really warranted. Freedom of expression is a great thing but not all people seem to appreciate it with civility and restraint...

These aren't things that would normally bother me and they hadn't until I met a couple of eager fanatics whose unreserved and unashamed drive to show their disagreement through violence - something that I had the (dis)pleasure of experiencing in the months I was in Male' last year - gave me cause for serious concern. Anyway, I hope to resume blogging regularly again - in spite of the above-mentioned consequences...