Re: Why are you keeping so many words?

Permalink Submitted by Toke Eskildsen (not verified) on October 13, 2010

There's an underlying premise to my suggestion: Out of the 2 billion+ words that you index, most are OCR garbage. If the word "elephant" is indexed as "eleplant", a search for "elephant" does not find it. That means a huge number of false negatives.

I am suggesting keeping the bad words out of the index in the first place, combined with a query parser that applies the same correction. Let me try to describe an extreme and probably unusable solution:

Take a dictionary of current-use English words only. Force a suggestion for every OCR'ed word, even when the probability of a match is insanely low. Index the words that the dictionary suggests. Do the same check for the query before a search is performed.

You say that you do not want to eliminate good words, but the point is that they are not eliminated. If "mxyzptlk" is OCR'ed properly, but corrected to "myrtle" by the spell checker, the record will still be found, as the query parser also corrects "mxyzptlk" to "myrtle". For display, the original OCR'ed text is used.
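
To make the idea concrete, here is a rough Python sketch of the mirrored correction. It is only an illustration: the four-word dictionary, the sample records and the function names are made up, and difflib from the standard library stands in for whatever spell checker would really be used.

```python
import difflib
from collections import defaultdict

# Hypothetical, tiny stand-in for a dictionary of current-use English words.
DICTIONARY = ["elephant", "myrtle", "garden", "window"]

def normalize(word):
    """Force a dictionary suggestion for any word, however poor the match."""
    # cutoff=0.0 means even a very unlikely candidate is accepted.
    return difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=0.0)[0]

def build_index(docs):
    """Index the suggested words; the original text is kept only for display."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index

def search(index, docs, term):
    """Correct the query term the same way, then return the original texts."""
    return [docs[doc_id] for doc_id in sorted(index.get(normalize(term), set()))]

# Made-up OCR output: one misread word, one odd but correctly read word.
docs = {1: "the eleplant walked", 2: "mxyzptlk appeared"}
index = build_index(docs)

print(search(index, docs, "elephant"))
print(search(index, docs, "mxyzptlk"))
```

Both searches find their record, because the index and the query go through the same normalize step. The words "the", "walked" and "appeared" are also forced onto some dictionary entry, so the results may include records that should not match, which is exactly the false-positive problem.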

This extreme solution probably introduces so many false positives that it is unusable. However, it can be improved gradually by adding dictionaries: The more dictionaries, the fewer false positives, while the number of false negatives will always be (a lot) lower than with the current solution.
