Towards a (true) Dhivehi search engine

As much as I would like the Dhivehi language to die and rot away, it seems that won't happen, at least for a while. The (relatively) newly minted freedom to publish newspapers and the growth of web-based news sites may have poised Dhivehi for a serious revival. The revival probably has less to do with improvements in vocabulary or other linguistic changes than with the sheer amount of information now being pumped out in Dhivehi - and in my opinion, that's a great start.

A (if not THE) point worth noting here is that much of this new information is being produced - and published - by digital means. Most government authorities now have web portals, and an increasing number of them maintain them diligently. Most, if not all, newspapers and magazines also seem to maintain web portals and make their content available online. This modern revival thus presents a very interesting and very modern problem (to geeks like me at least :-P): accessing all of it. It is probably the first time in Maldivian history that a "Dhivehi search engine" makes practical sense.

Now, I am aware that Google and other search engines can be used to search for Dhivehi content, and I'm also aware that there are a few local operations that purport/aspire to be Maldivian search engines, but they all share important shortcomings. These shortcomings are mostly inherent to the various methods of writing Thaana used on the World Wide Web.

Say you want to search for the word "rayyithunge". Typing that into a search engine brings up an entirely different set of results from typing in "rwacyituncge" or "ރައްޔިތުންގެ" - both of which are alternative ways of representing the same word in Dhivehi. The results differ because different sites use different representation schemes. A search for "rayyithunge" brings up pages that seem to contain mostly English, because "rayyithunge" is Dhivehi "Latin"ised so that standard English characters can be used to write Dhivehi words. People commonly use such Latinised Dhivehi when writing emails or chatting - say "haalu kihineththa" etc. Meanwhile, a search for "rwacyituncge" lists content from sites like Haveeru and Miadhu, which use standard ASCII text coupled with custom Dhivehi fonts that map Latin characters to Thaana glyphs. If you try copy-pasting something written on the Haveeru page, you'd see that it comes out as a seemingly meaningless jumble of letters. Lastly, a search for "ރައްޔިތުންގެ" brings up results from sites like Minivan Daily and Sangu Daily, which use Unicode to display Dhivehi. Anyway, technical explanations aside, the point is that Dhivehi search is (currently) a messy enterprise.
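
To make the relationship between the ASCII-font form and the Unicode form concrete, here is a minimal Python sketch. The letter table is partial and is inferred only from the two ASCII examples in this post ("rwacyituncge" and "Oled"), so treat it as an assumption rather than any font's actual specification. The function names and the `reversed_storage` flag are made up for illustration. The flag is there because "Oled" reads as the reverse of the logical letter order of its Thaana word, which suggests that some text is stored back-to-front.

```python
# Partial ASCII-font -> Unicode Thaana table.
# ASSUMPTION: inferred from this post's examples only, not from a font specification.
ASCII_TO_THAANA = {
    "r": "\u0783",  # raa
    "w": "\u07a6",  # abafili (short a)
    "a": "\u0787",  # alifu
    "c": "\u07b0",  # sukun
    "y": "\u0794",  # yaa
    "i": "\u07a8",  # ibifili (short i)
    "t": "\u078c",  # thaa
    "u": "\u07aa",  # ubufili (short u)
    "n": "\u0782",  # noonu
    "g": "\u078e",  # gaafu
    "e": "\u07ac",  # ebefili (short e)
    "d": "\u078b",  # dhaalu
    "l": "\u078d",  # laamu
    "O": "\u07af",  # oaboafili (long o)
}


def ascii_to_thaana(text, reversed_storage=False):
    """Convert ASCII-font text to Unicode Thaana; unknown characters pass through."""
    if reversed_storage:
        # Hypothetical flag: undo back-to-front storage before mapping letters.
        text = text[::-1]
    return "".join(ASCII_TO_THAANA.get(ch, ch) for ch in text)


def contains_thaana(text):
    """True if any character falls in the Unicode Thaana block (U+0780..U+07BF)."""
    return any("\u0780" <= ch <= "\u07bf" for ch in text)


print(ascii_to_thaana("rwacyituncge"))
print(ascii_to_thaana("Oled", reversed_storage=True))
print(contains_thaana("rayyithunge"))
```

The last call shows why the Latinised form is the hardest of the three to detect: it contains no Thaana characters at all, so nothing in the text itself marks it as Dhivehi.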

The solution to this problem can (seem to) be pretty simple. A custom search interface could take the search query from a user, convert it into the three different representation schemes and then spawn a search for each representation on any of the existing search engines. This would work just fine... until you run into peculiar problems related to the Latinised and ASCII Dhivehi schemes. Take for example the word "ފަލަ", Latinised into "fala" - a search on the word would return almost entirely non-Dhivehi results, totally unrelated to what we really want. Similarly, a search on the ASCII'ed phrase "Oled" (which is the word "ދެލޯ") would return a large number of non-Dhivehi results with no bearing on what we wanted. These problems occur because Latinised and ASCII Dhivehi representations can produce text that has a meaning in English as well - as in the case of "Oled" above, which happens to be a popular technical term in English.
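
The simple fan-out approach described above could be sketched roughly as follows in Python. Everything here is illustrative: the letter tables cover only the letters needed for this post's examples, the romanisation rules are a guess at the informal spelling used in this post ("th" for thaa, a doubled consonant for alifu + sukun), and `https://search.example/search` is a placeholder, not a real search endpoint.

```python
from urllib.parse import urlencode

ALIFU, SUKUN = "\u0787", "\u07b0"

# ASSUMPTION: partial tables, inferred from the examples in this post.
THAANA_TO_ASCII = {
    "\u0783": "r", "\u07a6": "w", ALIFU: "a", SUKUN: "c", "\u0794": "y",
    "\u07a8": "i", "\u078c": "t", "\u07aa": "u", "\u0782": "n",
    "\u078e": "g", "\u07ac": "e",
}
CONSONANT_TO_LATIN = {
    "\u0783": "r", "\u0794": "y", "\u078c": "th", "\u0782": "n",
    "\u078e": "g", "\u078a": "f", "\u078d": "l", ALIFU: "",
}
VOWEL_TO_LATIN = {"\u07a6": "a", "\u07a8": "i", "\u07aa": "u", "\u07ac": "e"}


def to_ascii_font(word):
    """Unicode Thaana -> ASCII-font keystrokes; unknown characters pass through."""
    return "".join(THAANA_TO_ASCII.get(ch, ch) for ch in word)


def latinise(word):
    """Very rough romanisation of Unicode Thaana (illustrative rules only)."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ALIFU and nxt == SUKUN:
            # Alifu + sukun doubles the following consonant.
            # ASSUMPTION: written as "h" when no consonant follows.
            following = word[i + 2] if i + 2 < len(word) else ""
            out.append(CONSONANT_TO_LATIN.get(following, "h"))
            i += 2
            continue
        if ch == SUKUN:
            # A sukun on any other consonant just marks "no vowel".
            i += 1
            continue
        out.append(CONSONANT_TO_LATIN.get(ch, VOWEL_TO_LATIN.get(ch, ch)))
        i += 1
    return "".join(out)


def query_variants(thaana_query):
    """One query, three representations."""
    return {
        "unicode": thaana_query,
        "ascii": to_ascii_font(thaana_query),
        "latin": latinise(thaana_query),
    }


def search_urls(thaana_query, base_url="https://search.example/search"):
    """Build one search URL per representation (placeholder endpoint)."""
    variants = query_variants(thaana_query)
    return [f"{base_url}?{urlencode({'q': v})}" for v in variants.values()]


rayyithunge = "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac"
for url in search_urls(rayyithunge):
    print(url)
```

Note that nothing in this sketch addresses the collision problem: the Latin variant of "ފަލަ" is just the string "fala", and an existing search engine has no way to know that a Dhivehi word was meant.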

A more sophisticated approach to the search problem could probably iron out (most of) these quirks. An ideal solution would be to do away with existing search engines such as Google, despite their awesomeness, and develop a custom search engine. A custom engine could recognise the various representation schemes in use and the subtle differences between them. A search phrase entered on such an engine would be converted to one standard form and matched against an index built in that same standard form, returning results that better mirror the Dhivehi content that is out there. Such a custom search engine could also bundle in extra Dhivehi-related facilities, such as conversions for readers who lack the (particular) fonts used on some sites, and spelling correction, among others.
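
As a rough illustration of the "standardize, then index" idea, here is a minimal Python sketch of an inverted index that converts everything to Unicode Thaana before storing or searching. It is a toy under stated assumptions: the ASCII table is partial and inferred from this post's examples, the class and function names and the page IDs are hypothetical, each page's scheme is assumed to be known in advance, and Latinised text is left out because turning it back into Thaana is ambiguous.

```python
from collections import defaultdict

# ASSUMPTION: partial ASCII-font -> Unicode Thaana table from this post's examples.
ASCII_TO_THAANA = {
    "r": "\u0783", "w": "\u07a6", "a": "\u0787", "c": "\u07b0",
    "y": "\u0794", "i": "\u07a8", "t": "\u078c", "u": "\u07aa",
    "n": "\u0782", "g": "\u078e", "e": "\u07ac",
}


def normalise(text, scheme):
    """Bring text into the standard form (Unicode Thaana)."""
    if scheme == "ascii":
        return "".join(ASCII_TO_THAANA.get(ch, ch) for ch in text)
    return text  # "unicode" text is already in the standard form


class NormalisedIndex:
    """Toy inverted index keyed on normalised, whitespace-separated tokens."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text, scheme):
        for token in normalise(text, scheme).split():
            self._postings[token].add(doc_id)

    def search(self, query, scheme):
        """Return the IDs of documents containing every query token."""
        tokens = normalise(query, scheme).split()
        if not tokens:
            return set()
        results = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            results &= self._postings.get(token, set())
        return results


index = NormalisedIndex()
# Hypothetical pages: one published with an ASCII font, one in Unicode.
index.add("ascii-page", "rwacyituncge", "ascii")
index.add("unicode-page",
          "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac",
          "unicode")

print(sorted(index.search("rwacyituncge", "ascii")))
```

Because both the pages and the query pass through the same `normalise` step, a query typed in either scheme finds pages published in either scheme, and an English page containing "Oled" would never be crawled into a Dhivehi-only index in the first place.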

So perhaps the question now is: is there a real need for a Dhivehi search engine yet? When should a Maldivian "Google" be born?

Comments

  1. Mr.Blogged says:

    i agree jaa.. a dhivehi Google needs to be born.

  2. Raf says:

    Hi Jaa,

    I have been reading your blogs for a while but this is my first comment to write on your blog as I found that this must not be left out in the history of maldivians.

    i believe the methodology will work and we need support. Support from all of us.

    Therefore I would like to show my interest that I would like to support the project if it happen to be born soon.

    :-)

  3. Jaa says:

    Thanks for the comment. Hopefully the search engine will be born within the next 9 months! :-P

  4. ajaaibu says:

    nice to see that. We really need search engine, dhivehi based websites are increasing day by day.

  5. Raf says:

    great! Let me know where I can help you to support this project. I look forward to communicate with you regarding this project.

  6. subcorpus says:

    i dont think there is pressing need for a dhivehi search engine yet ...
    and i dont think the i'd have to use it that often ...
    dont most website have english equivalents ... ???
    i thought they did ...

  7. cleft says:

    theres no real need for a dhivehi search engine. one of the main reasons being that whatever thats needed that is written in dhivehi can be found in one single book thats only around 500 pages or less.

  8. moyameehaa says:

    i think it is important.9 months? lol.i will visit it at the hospital maybe.
    i use google for searching unicode stuff, but use websites like haveeru and jazeera to search stuff on them. or maybe you can, search like (cnUmuawm site:http://www.haveeru.com.mv) at haveeru.

    but however i believe it is important.subcorpus, i think 'equivalents' for some dhivehi articles dont contain the same information always.and there are some info that does not come in english.

  9. Ahmed says:

    i think the actual problem here is the tendency of most maldivians to always trying to come up with their custom versions of whatever they want to develop....

    and as u said, as more Dhivehi content is made available online, chances are that some clever developer is gonna come up with an even more clever way of representing the electronic dhivehi content...

    i think, in the long run the most beneficial way would be to define a standard way of representing Dhivehi content on the web and try and promote it and encourage people to use it... and if the standardized approach has benefits such as compatibility with Google search (which most of us are familiar with) then i believe they will also be driven in the direction to standardize their content... I know that we do have unicode and the like, but no one actually talks of a standard way of going about representing dhivehi content...

    i believe this is a good topic to discuss, but imho if we go towards customized solutions, we will always be in a never ending race to keep it customized with every new approach somebody puts out there... :-P

    cheers

  10. jaa says:

    Hello,

    Yes, I partially agree with you. Unicode, technically, is and should be considered the standard way to display Dhivehi for now and probably in the future as well. After all, Unicode was specifically designed to alleviate the sort of problems that we Maldivians face(d) when trying to represent Dhivehi. I also agree that an effort to try to get everyone to switch to Unicode would be a great step towards making Dhivehi usage less of a problem for all involved. As I recall, this was something that was discussed eagerly a few years ago in the few hopeful meetings held towards establishing a Maldivian Computer Society.

    Anyway, not everyone wants to switch to Unicode - not just yet. And most of the time, it is hardly the fault of developers - the decision usually comes from clients who want to stick to the older methods of display because they like a particular font better or use a particular software package etc.

    However, Unicode usage in Dhivehi sites IS increasing by the day. Until the day comes when all sites use Unicode (or another agreed standard method), we will have a consolidation problem on our hands...

  11. muna mohamed says:

    coment may not be directly related but.

    the blogger is now available in three more languages. i am just wondering whether any of you guys are working to add dhivehi to its language. currently everything is in boli the ugly font. i want to write in dhivehi. but boli puts me off.

  12. jaa says:

    Sorry about the late reply... this slipped past me. Anyway, I haven't had a look at Blogger in a long time but as far as I know Blogger does allow posting in Dhivehi (or any language for that matter). The news about Blogger adding more languages has more to do with them providing an interface labeled in those particular languages. I doubt Google will see a need to provide us a Dhivehified interface anytime soon :-P

    As for the MV Boli font, it is something you can change when you are making a post. I will post an article on it soon after I fiddle with Blogger a bit... :-)

