Available Indexes


I might be jumping the gun on your next blog post here. I tried searching for words made of 5 random characters (hhjrt, wnttl, ghhhd and so on) and got a fair number of hits each time. I did this 10 times and so far haven't found any combination that did not result in a match.

I know that the source of the many terms is OCR errors and that it is very problematic to perform correct spell checking on them. Nevertheless, all the nonsense words damage the search, as (guessing here) no one tries to search for "wntt" when they need the word "want". Wouldn't a bad spelling correction be better than no correction?

Suppose you created a merged dictionary for all the languages, applied it to words of (guessing again) length 6 or less, and forced the corrector to give some suggestion, perhaps together with the original term if the probability of a correct guess is below a certain threshold. That should give a significantly reduced number of unique terms and an increase in correct matches, at the cost of a significant increase in false positives.

As false positives can be reduced by adding more search terms, while the correct matches missed because of OCR errors are harder to dig out, wouldn't this be a fair trade-off?
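
To make the idea concrete, here is a minimal sketch in Python of the forced correction described above. It is only an illustration under stated assumptions: the tiny word list, the names `MERGED_DICTIONARY`, `MAX_LENGTH`, `CONFIDENCE_THRESHOLD` and `index_terms` are all hypothetical, the threshold value is a guess, and the similarity ratio from the standard `difflib` module stands in for a real probability of a correct guess.

```python
import difflib

# Hypothetical merged dictionary. A real one would combine word lists
# for every language in the collection.
MERGED_DICTIONARY = ["want", "water", "winter", "house", "library"]

MAX_LENGTH = 6               # only correct short tokens (a guess, as in the comment)
CONFIDENCE_THRESHOLD = 0.8   # assumed cut-off below which the original term is kept too


def index_terms(token, dictionary=MERGED_DICTIONARY):
    """Return the term(s) that would be indexed for one OCR token."""
    # Known words and long tokens are left untouched.
    if token in dictionary or len(token) > MAX_LENGTH:
        return [token]

    # cutoff=0.0 forces a suggestion, however poor the best candidate is.
    best = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.0)
    if not best:
        return [token]  # empty dictionary: nothing to suggest

    suggestion = best[0]
    # Similarity ratio in [0, 1], used here as a stand-in for a probability.
    score = difflib.SequenceMatcher(None, token, suggestion).ratio()

    if score >= CONFIDENCE_THRESHOLD:
        return [suggestion]
    # Low confidence: index the suggestion together with the original term.
    return [suggestion, token]
```

Calling `index_terms("wntt")` with this word list would index the nearest dictionary word alongside the original token, because the match is weak. A closer misspelling would be replaced outright, which is where both the reduction in unique terms and the extra false positives come from.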

You are browsing an archive of the HathiTrust website. In July 2023, we launched a new site at www.hathitrust.org.