Fingerprinting Thaana

What is the frequency of characters in typical Dhivehi writing? What is the most commonly used Thaana akuru/fili in Dhivehi? Is there a general pattern of akuru and fili to be expected in any given Dhivehi document?

These questions, especially the last, kindled my curiosity yesterday and sent me off to explore a little. Although they may seem trivial and of no practical use, these are serious questions that probe the finer details of Dhivehi and help produce computational models of the language, which do have practical applications. Even the generalizations and patterns that result from the simplest statistical analysis transcend the quirks of individual writing and give a broader picture of what a language is really like. For example, I'm using a statistical fingerprint of Dhivehi, generated during this little exercise, as part of an experimental procedure that detects the presence of Dhivehi content in web pages. It takes advantage of the fact that the fingerprints for Dhivehi and English are dramatically different, which lets a computer program discern the type of content it is dealing with, all without really "understanding" either language.
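To give a feel for how such a fingerprint comparison could work, here is a minimal Python sketch. It is not the procedure I actually use: the function names, the cosine-similarity measure and the tiny sample strings are all assumptions made for illustration, and a real reference profile would be built from thousands of articles rather than a couple of words.

```python
from collections import Counter
from math import sqrt


def profile(text):
    """Relative frequency of each non-whitespace character in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(ch, 0.0) for ch, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def guess_language(text, references):
    """Label of the most similar reference profile, or None if nothing matches."""
    p = profile(text)
    best = max(references, key=lambda label: cosine(p, references[label]))
    return best if cosine(p, references[best]) > 0 else None


# Toy reference samples (assumptions for illustration only).
DHIVEHI_SAMPLE = "\u078b\u07a8\u0788\u07ac\u0780\u07a8 \u0784\u07a6\u0790\u07b0"  # "Dhivehi bas"
ENGLISH_SAMPLE = "the quick brown fox jumps over the lazy dog"
REFERENCES = {
    "dhivehi": profile(DHIVEHI_SAMPLE),
    "english": profile(ENGLISH_SAMPLE),
}
```

Calling `guess_language(page_text, REFERENCES)` returns whichever label has the closest profile. Because Thaana and Latin characters do not overlap at all, even a crude measure like this separates Dhivehi from English cleanly; telling apart two languages that share a script would need much better reference profiles.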

I ran the analysis on a dataset of ~5000 Dhivehi articles from Haveeru Daily and ~7000 Dhivehi articles from Jazeera Daily. They may not represent the full variety of Dhivehi literature available, but I think they are a very good approximation, especially of Dhivehi web content, which is what I was mostly interested in. I focused on individual characters and ran basic mean, mode, variance, standard deviation and frequency calculations, followed by a character correlation analysis. Although these are quite simple analyses, I don't think anyone has explored this much before, so the following should make for (exciting!) new information.
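For anyone who wants to try something similar, the sketch below shows one way the character-level calculations could be set up in Python. It is an illustration rather than my actual script: the function names are made up, and it assumes the articles are already loaded as plain Unicode strings. It relies on the layout of the Thaana Unicode block, where akuru sit at U+0780 to U+07A5 (plus naa at U+07B1) and the fili, including sukun, at U+07A6 to U+07B0.

```python
from collections import Counter
from statistics import mean, pstdev


def is_fili(ch):
    """True for Thaana vowel signs, U+07A6..U+07B0 (sukun included)."""
    return "\u07a6" <= ch <= "\u07b0"


def is_akuru(ch):
    """True for Thaana consonants, U+0780..U+07A5 plus naa at U+07B1."""
    return "\u0780" <= ch <= "\u07a5" or ch == "\u07b1"


def char_frequencies(articles):
    """Relative frequency of each Thaana character, most common first."""
    counts = Counter(
        ch for text in articles for ch in text if is_akuru(ch) or is_fili(ch)
    )
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.most_common()} if total else {}


def fili_share_stats(articles):
    """Mean and population std dev of the per-article share of fili."""
    shares = []
    for text in articles:
        thaana = [ch for ch in text if is_akuru(ch) or is_fili(ch)]
        if thaana:  # skip articles with no Thaana characters at all
            shares.append(sum(is_fili(ch) for ch in thaana) / len(thaana))
    if not shares:
        return 0.0, 0.0
    return mean(shares), pstdev(shares)
```

`char_frequencies` pools every article into one frequency table, while `fili_share_stats` computes a value per article first and then summarizes across articles, which is what makes a variance or standard deviation meaningful.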

Enjoy :-)


Mean fili usage in Dhivehi writing


Mean akuru usage in Dhivehi writing


Thaana character frequencies

Comments

  1. Maldiveshealth says:

    hmm. never thought of that. Interesting.

  2. Frozen Solid says:

    Interesting.

  3. Shahdy says:

    very interesting. great work jaa!

  4. jaa says:

    thanks Shahdy :-)

  5. ANON says:

    dsnt make any sense...output wud b different if the original input data differs

  6. jaa says:

    The statistical information that is derived provides generalizations and patterns that transcend the individual writing. It doesn't matter whether the individual writings differ in content and/or length because statistical fingerprints, like the character frequency map above, still hold. That is the very beauty of these things.

    The character frequency map for English is very different and each language has more or less a unique such fingerprint. Hope that helps make sense :-)

  7. bulhaa says:

    u really have too much free time O.o

  8. nass says:

    Cool!
    We really don't use the letters with the dots, do we..

  9. Simon says:

    Very interesting work man!

  10. Jangi prophet says:

    The high incidence of meemu and vaavu is not the most encouraging news for an aspiring ventriloquist. Oh well *SIGH*

  11. s. says:

    Similar analysis was made and documented twice before.

    Once was by the crew who developed the Thaana Typewriter in the 1980s. They used the data to arrange the keys on the keyboard.

    Second one was by Dr. Hassan Hameed using a more comprehensive set of data (using a software) in the 90s. His paper was published as well. He used it to fine tune the keyboard layout, to arrange the keys in a way that Thaana can be typed faster.

    I think you could publish your raw data online, it might become a good new contribution.

  12. jaa says:

    That's good to know. Thanks..

  13. Anonymous says:

    nice .. never seen anybody did such a research ... especially on Dhivehi language ..

