Inverted Index versus Forward Index

What makes search engine so fast in giving the results despite the enormous amount of data it needs to search is because of inverted index. To better understand how it works and why it is much better, let us take a look at the "normal" or "forward" index

How Forward Index works?

Let's say you have a website and would like to have so that each page can be easily search via keywords that is present on each page. Doing a forward Index involves the following steps:

Go to the first page and gather all the keywords
Append the keywords in the index entry for that particular document
Go to the next page and repeat step 1 until all pages is indexed

When all the pages is indexed, you will end up with a list of all pages with associated keywords on it. So whenever you need to find all pages that matches a certain keyword, you'll just search the index itself and you will come up with the list of pages associated with it.

Page	Keywords
http://www.example.com/page1	hello, world, hey, wait
http://www.example.com/page2	greetings, world, tube, come-on
http://www.example.com/page3	mr, darwin, cpu, world

Pretty simple eh! but as people who actualy built a search engine (Google/Yahoo/Bing)figured out, that is not how simple things are. This process comes with these problems:

The index can grew pretty large! since the indexing process records a lot of duplicate keywords in the index. For example the keyword "world" is stored multiple times
Though indexing is fast, since it only appends things in the index as it move "forwards", querying this huge index is inefficient since it has to look at every contents of index just to retrieve all pages that is related to it.

This problem is the very same reason why there is a need for Inverted Index

How Inverted Index works?

Go to the first page and gather all the keywords
For each keyword, check if the keyword is already present in the index
If its present already, add the reference of that page to that index
if not present, create a new entry for that keyword
add additional info such as mow many times it appears in the document, location within the document and etc.
Go to the next page and repeat step 1 until all pages is indexed
Sort the keywords

If you notice, it prevents duplicate keywords from being stored. It is being checked first if the keyword is already present and also records some information. This makes the indexing a bit slower, since for each keyword, it has to process and recheck the previous indexes. But on the other side, this makes querying VERY fast. Since the search engine only need to look up a single record, then all pages is there already!. This is how the result of inverted index looks like:

Keywords	Pages
come-on	http://www.example.com/page2
cpu	http://www.example.com/page3
darwin	http://www.example.com/page3
greetings	http://www.example.com/page2
hello	http://www.example.com/page1
hey	http://www.example.com/page1
mr	http://www.example.com/page3
wait	http://www.example.com/page1
world	http://www.example.com/page1 http://www.example.com/page2 http://www.example.com/page3
tube	http://www.example.com/page2

If you observe above, when I need to know all pages that contains the word "world", it can quickly return all pages, and since the keywords itself is also sorted, finding it among the keywords is extremely fast.

And so that is what the difference of Forward vs Inverted indexes, it's all about providing a much more efficient querying process. Forward index is easier to do in indexing proces, but sucks in querying, while Inverted index is complicated during indexing time, but extremely fast in querying time.

Did you find this useful?

I'm always happy to help! You can show your support and appreciation by Buying me a coffee (I love coffee!).