Stochastic Diffusion Search for Real-Time Web Search

My third year project at uni (this year) involved investigating the application of Stochastic Diffusion Search to the problem of real-time web search by designing and implementing such a search engine in software. It certainly was interesting work, and though the project was completed successfully, there are a few things I wish I had explored and experimented with a bit more. I have another major project to tackle for the last year of my M.Eng degree in AI and Cybernetics, which begins next month.

Anyway, here is a short presentation I gave, based on a research paper I had to submit on my work as part of the course. I hope you find it interesting...


Towards a (true) Dhivehi search engine

As much as I would like the Dhivehi language to die and rot away, it seems it won't happen, at least for a while. The (relatively) newly minted freedom to publish newspapers and the growth of web-based news sites may have poised Dhivehi for a serious revival. The revival probably isn't so much in terms of improvements in vocabulary or other linguistic changes but rather in terms of the amount of information now being pumped out in Dhivehi - and in my opinion, that's a great start.

A (if not THE) point worth noting here is that much of this new information is being produced - and published - by digital means. Most government authorities now have web portals and an increasing number of them maintain them diligently. Most, if not all, newspapers and magazines also seem to maintain web portals and make their content available online. This modern revival thus presents a very interesting and very modern problem (to geeks like me at least :-P): accessing all of that information. It is probably the first time in Maldivian history that a "Dhivehi search engine" makes practical sense.

Now, I am aware that Google and other search engines can be used to search for Dhivehi content, and I'm also aware that there are a few local operations that purport or aspire to be Maldivian search engines, but they all share important shortcomings. These shortcomings are mostly inherent to the various methods of writing Thaana (the Dhivehi script) on the World Wide Web.

Say you want to search for the word "rayyithunge". Typing that into a search engine would bring an entirely different set of results from typing in "rwacyituncge" or "ރައްޔިތުންގެ" - both of which are alternative ways of representing the same word in Dhivehi. The different sets of results arise because of the differences in the representation schemes used on the different sites. A search with the phrase "rayyithunge" would bring in pages that seem to mostly contain English, and that's because "rayyithunge" is Dhivehi "Latin"ised so that we can use standard English characters to write Dhivehi words. People commonly use such Latinised Dhivehi when writing emails or chatting - say "haalu kihineththa" etc. Meanwhile, a search with the phrase "rwacyituncge" results in a listing of content from sites like Haveeru and Miadhu, who use standard ASCII coupled with custom Dhivehi fonts in which each ASCII character is mapped to a Thaana glyph. If you try copy-pasting something written on the Haveeru page you'd see that it comes out as a seemingly meaningless jumble of letters. Lastly, a search with the phrase "ރައްޔިތުންގެ" would bring in results from sites like Minivan Daily and Sangu Daily, who use Unicode to display Dhivehi. Anyway, the technical explanations aside, the point is that Dhivehi search is (currently) a messy enterprise.
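To make the font-mapped ASCII scheme a bit more concrete, here is a minimal Python sketch. The character table is derived only from the single word pair above ("rwacyituncge" and its Unicode form), so it covers just those eleven characters; a real converter would need the full table for each site's font, which I am not assuming here. The function names are my own, purely for illustration.

```python
# The two spellings of the same word, taken from the example in the text.
ASCII_FORM = "rwacyituncge"
UNICODE_FORM = "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac"

# Partial table: pair each ASCII character with the Thaana character
# at the same position. Only the characters in this one word are covered.
ASCII_TO_THAANA = dict(zip(ASCII_FORM, UNICODE_FORM))
THAANA_TO_ASCII = {thaana: ascii_ch for ascii_ch, thaana in ASCII_TO_THAANA.items()}


def is_thaana(text):
    """True if the text contains any character from the Unicode Thaana block."""
    return any("\u0780" <= ch <= "\u07bf" for ch in text)


def ascii_to_unicode(text):
    """Convert font-mapped ASCII to Unicode Thaana; unknown characters pass through."""
    return "".join(ASCII_TO_THAANA.get(ch, ch) for ch in text)


def unicode_to_ascii(text):
    """Convert Unicode Thaana to font-mapped ASCII; unknown characters pass through."""
    return "".join(THAANA_TO_ASCII.get(ch, ch) for ch in text)
```

Note that is_thaana only spots the Unicode scheme. Telling Latinised Dhivehi apart from font-mapped ASCII (or from plain English) can't be done by looking at the character set alone, since all three use the same letters.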

The solution to this problem can (seem to) be pretty simple. A custom search interface could be made to simply take the search query from a user, convert it into the three different representation schemes and then spawn a search for each representation on any of the existing search engines. This would work just fine... until you run into peculiar problems related to the Latinised and ASCII Dhivehi schemes. Take for example the word "ފަލަ" Latinised into "fala" - a search on the word would return almost entirely non-Dhivehi results totally unrelated to what we really want. Similarly, a search on the ASCII'ed phrase "Oled" (which is the word "ދެލޯ") would return a large number of non-Dhivehi results with no bearing on what we wanted. These problems occur because Latinised and ASCII Dhivehi representations can produce text that has meaning in English as well - such as the case of "Oled" above, which happens to be a popular technical term in English.
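The "simple" approach could be sketched roughly as below. This is only an illustration: SEARCH_ENDPOINT is a made-up placeholder rather than any real engine's API, fan_out is a name I picked, and the three variants are assumed to have been converted already (the conversion itself is the hard part and isn't shown).

```python
from urllib.parse import urlencode

# Hypothetical endpoint, standing in for whichever existing engine is used.
SEARCH_ENDPOINT = "https://search.example/search"


def fan_out(variants):
    """Build one search URL per representation scheme.

    variants maps a scheme name to the query written in that scheme.
    """
    return {
        scheme: f"{SEARCH_ENDPOINT}?{urlencode({'q': text})}"
        for scheme, text in variants.items()
    }


urls = fan_out({
    "latin": "rayyithunge",
    "ascii": "rwacyituncge",
    "unicode": "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac",
})
for scheme, url in urls.items():
    print(scheme, url)
```

The sketch also shows where the trouble comes from: the "latin" and "ascii" URLs carry nothing that tells the engine the query is Dhivehi, so a variant like "fala" or "Oled" is treated as any other string of English letters.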

A more sophisticated approach to the search problem could probably iron out (most of) these quirks. An ideal solution would be to do away with the existing search engines such as Google, despite their awesomeness, and develop a custom search engine. A custom engine would allow for the recognition of the various representation schemes used and the subtle differences between them. Such an engine could standardize the search phrase a user enters and look it up in an equally standardized index, returning results that better mirror the Dhivehi content that is out there. It could also bundle in extra Dhivehi-related facilities, such as conversions to make up for (particular) fonts missing on the user's machine, and spelling correction, among others.
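Here is a minimal sketch of what a "standardized index" might look like, under some loud assumptions: every page is labelled with the scheme its site uses (the labels "font_ascii" and "unicode" are my own), everything is standardized to Unicode Thaana, the ASCII table is the partial one derived from the "rwacyituncge" example, and Latinised text is not handled at all since that would need a proper transliterator. The page ids are invented.

```python
from collections import defaultdict

# Partial font-mapped ASCII table, derived from the one example word only.
ASCII_TO_THAANA = dict(zip(
    "rwacyituncge",
    "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac",
))


def standardize(text, scheme):
    """Rewrite text into Unicode Thaana according to the scheme label."""
    if scheme == "font_ascii":
        return "".join(ASCII_TO_THAANA.get(ch, ch) for ch in text)
    return text  # already Unicode (or plain English): leave untouched


def build_index(pages):
    """pages maps a page id to (scheme, text); returns word -> set of page ids."""
    index = defaultdict(set)
    for page_id, (scheme, text) in pages.items():
        for word in standardize(text, scheme).split():
            index[word].add(page_id)
    return index


def search(index, query, scheme):
    """Standardize the query the same way as the pages, then look it up."""
    return sorted(index.get(standardize(query, scheme), set()))


pages = {
    "ascii-site-1": ("font_ascii", "rwacyituncge"),
    "unicode-site-1": ("unicode", "\u0783\u07a6\u0787\u07b0\u0794\u07a8\u078c\u07aa\u0782\u07b0\u078e\u07ac"),
    "english-site-1": ("unicode", "Oled display"),
}
index = build_index(pages)
print(search(index, "rwacyituncge", "font_ascii"))
```

Because the scheme is recorded per page, the English page's "Oled" is never run through the font table, so it stays an English word in the index instead of colliding with a Dhivehi one. How to assign those scheme labels reliably is the real open problem, and the sketch simply assumes it away.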

So, perhaps the question now is, is there a real need for a Dhivehi search engine yet? When should a Maldivian "Google" be born?