To create the index, we first split the content field of each document into separate words (which we call terms or tokens), create a sorted list of all the unique terms, and then list the documents in which each term appears. The result looks something like this:
"quick brown"we just need to find the documents in which each term appears:
"quick"appear as separate terms, while the user probably thinks of them as the same word.
"foxes"are pretty similar, as are
"dogs"-- they share the same root word.
"leap", while not from the same root word, are similar in meaning -- they are synonyms.
"+Quick +fox"wouldn't match any documents. (Remember, a preceding
+means that the word must be present). Both the term
"Quick"and the term
"fox"have to be in the same document in order to satisfy the query, but the first doc contains
"quick fox"and the second doc contains
"Quick"can be lowercased to become
"foxes"can be stemmed -- reduced to its root form -- to become
"dogs"could be stemmed to
"leap"are synonyms and can be indexed as just the single term
"+Quick +fox"would still fail, because we no longer have the exact term
"Quick"in our index. However, if we apply the same normalization rules that we used on the
contentfield to our query string, it would become a query for
"+quick +fox", which would match both documents!