Vijay Krishna's Notes
http://vijaykrishna.posterous.com
Most of my notes as a student of computer software and everything around it.
Thu, 14 Apr 2011 06:30:09 -0700

Inverted Indices
http://vijaykrishna.posterous.com/inverted-indices

The first time I came across this data structure, I was quite ashamed that I had not known of it earlier. This seemingly simple concept makes the complicated process of search far easier. Most of what I know about it is really just a whiff of the top layer of this amazing heap of brilliance. I got most of my introductory notes on the concept from its Wikipedia page.

Now, to understand this data structure better, let's first have a look at forward indexing. Let us assume that there are three files, t1, t2 and t3, with the following data:

t1 - "an apple"
t2 - "a day"
t3 - "an apple a day"

The forward index of these files would look like this:

T1 = {"an", "apple"}
T2 = {"a", "day"}
T3 = {"an", "apple", "a", "day"}

Thus, to search for a word like "a", you have to scan the word lists of all three files, one after another. This approach, seemingly simple mind you, becomes time consuming as the data set gets bigger. Here we have just 3 text files with no more than 4 words per file, and a search query that is a single one-letter word. Things are bound to get slow and difficult when you are looking at, say, a million files with at least 10,000 words in each and a full sentence as the search query. And imagine doing this again and again.

Now, let us look at the inverted index:

I(an) = {1,3}
I(apple) = {1,3}
I(a) = {2,3}
I(day) = {2,3}

Here I(a) is the index entry for the word "a", and its value is the set of file numbers in which the word can be found.
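As a rough sketch of the two structures above, here is how they might be built in Python. The names `files`, `build_forward_index`, `build_inverted_index` and `search_forward` are made up for this illustration, and splitting the text on whitespace is only an assumption about how words are extracted.

```python
from collections import defaultdict

# The three example files from the text, keyed by file number.
files = {1: "an apple", 2: "a day", 3: "an apple a day"}

def build_forward_index(files):
    """Map each file number to the list of words it contains."""
    return {file_no: text.split() for file_no, text in files.items()}

def build_inverted_index(files):
    """Map each word to the set of file numbers it appears in."""
    inverted = defaultdict(set)
    for file_no, text in files.items():
        for word in text.split():
            inverted[word].add(file_no)
    return dict(inverted)

def search_forward(forward, word):
    """Sequential search: check every file's word list in turn."""
    return {file_no for file_no, words in forward.items() if word in words}

forward = build_forward_index(files)
inverted = build_inverted_index(files)

print(search_forward(forward, "a"))  # has to visit all three files
print(inverted["a"])                 # a single dictionary lookup
```

Both calls report files 2 and 3, but the forward search touches every file while the inverted index answers with one lookup.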
So, to find the word "a", you simply go to I(a), which gives you 2 and 3, the files in which the word is located. It gets better when you have to search for several words together, say "day apple". You take the intersection of I(day) and I(apple):

{2,3} ∩ {1,3} = {3}

Hence 3 is the only text file in which these two words occur together. This makes search fast and efficient, something we have always been trying to achieve.

Let us take this a step further. What if we stored the position of each word within the text file, along with the file's number, in the inverted index? The inverted index would then look something like this:

I(an) = {(1,1), (3,1)}
I(apple) = {(1,2), (3,2)}
I(a) = {(2,1), (3,3)}
I(day) = {(2,2), (3,4)}

where (3,1) means file 3, word 1. So now when you search for "an apple", you can match not only the words but also the order in which they appear in the query string against their order in the file:

I(an) = {(1,1), (3,1)}
I(apple) = {(1,2), (3,2)}

In both file 1 and file 3, "apple" sits at the position right after "an", so both files contain the phrase "an apple".

So this elegant data structure makes lookups very fast and has been instrumental in all kinds of information retrieval systems, right from search engines to DNA matching algorithms. Tell me, did you know of this one beforehand?
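If you want to try the two queries above yourself, here is a minimal sketch in Python. The function names (`build_positional_index`, `files_with_all`, `files_with_phrase`) are hypothetical, words are assumed to be separated by whitespace, and positions are counted from 1 as in the example.

```python
from collections import defaultdict

# The three example files from the text, keyed by file number.
files = {1: "an apple", 2: "a day", 3: "an apple a day"}

def build_positional_index(files):
    """Map each word to a set of (file number, word position) pairs."""
    index = defaultdict(set)
    for file_no, text in files.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].add((file_no, position))
    return dict(index)

def files_with_all(index, query):
    """Files containing every query word, in any order (set intersection)."""
    per_word = [{file_no for file_no, _ in index.get(word, set())}
                for word in query.split()]
    return set.intersection(*per_word) if per_word else set()

def files_with_phrase(index, query):
    """Files where the query words appear consecutively and in order."""
    words = query.split()
    if not words:
        return set()
    # Candidate start positions: every occurrence of the first word.
    candidates = index.get(words[0], set())
    for offset, word in enumerate(words[1:], start=1):
        postings = index.get(word, set())
        # Keep a start only if this word sits `offset` positions after it.
        candidates = {(f, p) for f, p in candidates if (f, p + offset) in postings}
    return {f for f, _ in candidates}

index = build_positional_index(files)
print(files_with_all(index, "day apple"))    # {3}
print(files_with_phrase(index, "an apple"))  # {1, 3}
print(files_with_phrase(index, "apple an"))  # set()
```

The last query shows what the positions buy you: "apple an" uses the same words as "an apple", but no file has them in that order, so nothing matches.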


Vijay Krishna Palepu (vpalepu)