It is a data structure that stores a mapping from words to a document or set of documents, i.e. it directs you from a word to the documents that contain it.
Or in other words:
- For each term t, we must store a list of all documents that contain t.
- Identify each document by a docID, a document serial number.
- Fetch all documents and gather the words from each document
- Process each word (e.g. convert to lowercase)
- For each word, check whether it is already in the index: if it is, add a reference to the document to its entry; otherwise create a new entry in the index for that word.
- Repeat the above steps for all documents and sort the words.
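
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `build_inverted_index`, the `docs` sample data, and the choice to split on whitespace are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of docIDs whose text contains it.

    `documents` is assumed to be a dict of {docID: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Gather the words of the document (naive whitespace split).
        for word in text.split():
            term = word.lower()  # process each word
            # Existing term: add the docID; new term: defaultdict creates the entry.
            index[term].add(doc_id)
    # Sort the terms, and sort each posting list by docID.
    return {term: sorted(doc_ids) for term, doc_ids in sorted(index.items())}


# Hypothetical sample data for illustration only.
docs = {1: "Apple pie recipe", 2: "apple tart", 3: "Castle on the hill"}
index = build_inverted_index(docs)
print(index["apple"])
```

Looking up `index["apple"]` returns the docIDs 1 and 2, which is exactly the word-to-documents direction described above.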
- Convert plural to singular
- apples to apple
- Convert to lowercase
- Castle to castle
- Tokenization
- Cut character sequence into word tokens
- Deal with cases such as "John's" and "a state-of-the-art solution"
- Normalization
- Map text and query term to same form
- You want U.S.A. and USA to match
- Stemming
- We may wish different forms of a root to match
- authorize, authorization
- Stop words
- We may omit very common words (or not)
- the, a, to, of
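
The preprocessing steps listed above (tokenization, normalization, stemming, stop-word removal) could be chained roughly as follows. This is only a sketch under stated assumptions: the helper names, the regular expression, the tiny suffix list, and the stop-word set are all hypothetical and chosen to cover the examples in this list. The `stem` function is a naive suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

# Assumed stop-word list, taken from the examples above.
STOP_WORDS = {"the", "a", "to", "of"}

# Toy suffix list; longer suffixes are tried first.
SUFFIXES = ("ization", "ize", "s")


def tokenize(text):
    """Cut a character sequence into word tokens.

    Apostrophes, periods, and hyphens are kept only inside a token,
    so "John's", "U.S.A" and "state-of-the-art" each stay one token.
    """
    return re.findall(r"\w+(?:['.\-]\w+)*", text)


def normalize(token):
    """Map text and query terms to the same form."""
    token = token.lower()
    if token.endswith("'s"):  # drop the possessive: john's -> john
        token = token[:-2]
    return token.replace(".", "")  # u.s.a -> usa


def stem(term):
    """Naively strip one known suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text, drop_stop_words=True):
    """Run the full pipeline; stop-word removal is optional."""
    terms = []
    for token in tokenize(text):
        term = normalize(token)
        if drop_stop_words and term in STOP_WORDS:
            continue
        terms.append(stem(term))
    return terms


print(preprocess("The U.S.A. and USA"))
print(preprocess("John's apples"))
```

The same `preprocess` function would be applied to both documents and queries, so that a query for "USA" reaches documents that spell it "U.S.A.". Making stop-word removal a flag reflects the "(or not)" above: some systems keep common words in the index.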