Available Indexes

Hi Toke,

Are you suggesting a query-time fix to try to improve search results or some kind of analysis fix that would prevent "bad" words from getting into the index in the first place?

As for preventing "bad" words from getting into the index, that is a really hard problem with 400+ languages. We don't want to accidentally remove a "good" word, so a dictionary-based approach would require very comprehensive dictionaries. The difficulties are that our works span a wide variety of disciplines (so we would need academic and technical dictionaries) and time periods (the variations in English alone between 1400 and now are quite large), and that many works contain proper names and place names, either in the vernacular or transliterated.
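To make the concern concrete, here is a minimal sketch of a dictionary-based filter. It is not how our indexing works. The function name `filter_tokens`, the toy dictionary, and the sample tokens are all made up for illustration.

```python
def filter_tokens(tokens, dictionary):
    """Split tokens into (kept, dropped) by case-insensitive dictionary lookup."""
    kept, dropped = [], []
    for token in tokens:
        if token.lower() in dictionary:
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped


# Hypothetical toy dictionary of modern English words (an assumption, not real data).
dictionary = {"the", "king", "of", "was", "crowned", "in"}

# Invented sample: one OCR error ("crovvned") mixed with an older spelling
# ("kynge") and two proper names ("Wessex", "Winchester").
tokens = ["The", "kynge", "of", "Wessex", "was", "crovvned", "in", "Winchester"]

kept, dropped = filter_tokens(tokens, dictionary)
print("kept:   ", kept)
print("dropped:", dropped)
```

In this toy case the filter does catch the OCR error, but it also drops the older spelling and both proper names, which are exactly the "good" words we would not want to lose.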

I do plan to go into much more detail in a future blog post. In the meantime, we are always looking for ideas on how to reduce the amount of dirty OCR in the index.

Tom

You are browsing an archive of the HathiTrust website. In July 2023, we launched a new site at www.hathitrust.org.