Available Indexes

There's an underlying premise to my suggestion: Out of the 2 billion+ words that you index, most are OCR garbage. If the word "elephant" is indexed as "eleplant", a search for "elephant" will not find it. The result is a huge number of false negatives.

I am suggesting that you keep the bad words out of the index in the first place, and that the query parser mirror the same correction. Let me try to describe an extreme and probably unusable solution:

Take a dictionary of current-use English words only. Force a suggestion for every OCR word, even when the probability of a match is insanely low. Index the words that the dictionary suggests. Run the same check on the query before a search is performed.

You say that you do not want to eliminate good words, but the point is that they are not eliminated. If "mxyzptlk" is OCR'ed properly but corrected to "myrtle" by the spell checker, the record will still be found, because the query parser also corrects "mxyzptlk" to "myrtle". For display, the original OCR'ed text is used.
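To make the idea concrete, here is a rough Python sketch of what I mean. It is only an illustration, not a proposal for the actual indexer: the four-word dictionary, the two sample documents, and the function names (force_suggestion, build_index, search) are all made up for the example, and Python's difflib stands in for whatever spell checker would really be used.

```python
import difflib
from collections import defaultdict

# Hypothetical toy dictionary; a real one would hold current-use English words.
DICTIONARY = ["elephant", "myrtle", "library", "index"]


def force_suggestion(word, dictionary=DICTIONARY):
    """Return the closest dictionary word, however poor the match is."""
    word = word.lower()
    if word in dictionary:
        return word
    # cutoff=0.0 accepts any similarity, so a suggestion is always returned.
    return difflib.get_close_matches(word, dictionary, n=1, cutoff=0.0)[0]


def build_index(documents, dictionary=DICTIONARY):
    """Index the forced suggestions; the original text is kept for display."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[force_suggestion(word, dictionary)].add(doc_id)
    return index


def search(query, index, dictionary=DICTIONARY):
    """Apply the same forced correction to the query, then look it up."""
    hits = None
    for word in query.split():
        ids = index.get(force_suggestion(word, dictionary), set())
        hits = ids if hits is None else hits & ids
    return hits or set()


# Made-up OCR output: one misread word and one correctly read rare word.
documents = {1: "eleplant", 2: "mxyzptlk"}
index = build_index(documents)

print(search("elephant", index))   # the misread "eleplant" record is found
print(search("mxyzptlk", index))   # the rare word is still found via its correction
```

The rare word is still found because the query goes through the same correction as the indexed text. The cost shows up when searching for the dictionary word that the rare word was mapped to: that query returns the rare-word record as well, which is the false-positive problem.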

This extreme solution probably introduces so many false positives that it is unusable. However, it can be improved gradually by adding dictionaries: the more dictionaries, the fewer false positives, while the number of false negatives will always be (a lot) lower than with the current solution.

You are browsing an archive of the HathiTrust website. In July 2023, we launched a new site at www.hathitrust.org.