Inverted Index versus Forward Index

You might be curious why it was called inverted and how the normal index looks like. This article will try to answer that question.

Posted by Darwin Biler on May 1, 2016

What makes search engine so fast in giving the results despite the enormous amount of data it needs to search is because of inverted index. To better understand how it works and why it is much better, let us take a look at the "normal" or "forward" index

How Forward Index works?

Let's say you have a website and would like to have so that each page can be easily search via keywords that is present on each page. Doing a forward Index involves the following steps:

  1. Go to the first page and gather all the keywords
  2. Append the keywords in the index entry for that particular document
  3. Go to the next page and repeat step 1 until all pages is indexed

When all the pages is indexed, you will end up with a list of all pages with associated keywords on it. So whenever you need to find all pages that matches a certain keyword, you'll just search the index itself and you will come up with the list of pages associated with it.

Page Keywords
http://www.example.com/page1 hello, world, hey, wait
http://www.example.com/page2 greetings, world, tube, come-on
http://www.example.com/page3 mr, darwin, cpu, world

Pretty simple eh! but as people who actualy built a search engine (Google/Yahoo/Bing)figured out, that is not how simple things are. This process comes with these problems:

  • The index can grew pretty large! since the indexing process records a lot of duplicate keywords in the index. For example the keyword "world" is stored multiple times
  • Though indexing is fast, since it only appends things in the index as it move "forwards", querying this huge index is inefficient since it has to look at every contents of index just to retrieve all pages that is related to it.

This problem is the very same reason why there is a need for Inverted Index

How Inverted Index works?

  1. Go to the first page and gather all the keywords
  2. For each keyword, check if the keyword is already present in the index
  3. If its present already, add the reference of that page to that index
  4. if not present, create a new entry for that keyword
  5. add additional info such as mow many times it appears in the document, location within the document and etc.
  6. Go to the next page and repeat step 1 until all pages is indexed
  7. Sort the keywords

If you notice, it prevents duplicate keywords from being stored. It is being checked first if the keyword is already present and also records some information. This makes the indexing a bit slower, since for each keyword, it has to process and recheck the previous indexes. But on the other side, this makes querying VERY fast. Since the search engine only need to look up a single record, then all pages is there already!. This is how the result of inverted index looks like:

Keywords Pages
come-on http://www.example.com/page2
cpu http://www.example.com/page3
darwin http://www.example.com/page3
greetings http://www.example.com/page2
hello http://www.example.com/page1
hey http://www.example.com/page1
mr http://www.example.com/page3
wait http://www.example.com/page1
world http://www.example.com/page1
http://www.example.com/page2
http://www.example.com/page3
tube http://www.example.com/page2

If you observe above, when I need to know all pages that contains the word "world", it can quickly return all pages, and since the keywords itself is also sorted, finding it among the keywords is extremely fast.

And so that is what the difference of Forward vs Inverted indexes, it's all about providing a much more efficient querying process. Forward index is easier to do in indexing proces, but sucks in querying, while Inverted index is complicated during indexing time, but extremely fast in querying time.