re:Hapax!

Permalink Submitted by Tom Burton-West on February 18, 2011

Thanks for the suggestion, Lance.

I assume facet count=1 is the same as all tokens in the OCR field that occur only once (tf=1, df=1). We were thinking about looking at the Lucene index pruning contribution and removing hapax. However, the problem with actually removing the hapax is that in large corpora about 50% of the unique tokens are hapax, so we would remove lots of real words. Because these words have a very high idf, a query containing one of them would bring the document containing that word to the top, so removing them really hurts retrieval for queries that include a hapax. As you suggest, a large proportion of the OCR errors will occur only once, so in some ways this might be a good way to train a classifier. However, there are still many, many OCR errors that occur more than once. The promised blog post about OCR errors is partially written, but I keep getting diverted by other issues.
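To make the tf=1, df=1 point concrete, here is a minimal Python sketch on a made-up three-document corpus. The function name `hapax_report`, the toy tokens, and the textbook formula idf = log(N/df) are all assumptions for illustration; this is not Lucene's actual scoring formula or the index pruning contribution, just a rough picture of why hapax get the highest idf and why dropping them also drops real words.

```python
import math
from collections import Counter

def hapax_report(docs):
    """docs is a list of token lists (hypothetical toy input).

    Returns (hapax, idf): the set of tokens that occur exactly once in
    the whole collection, and a simplified idf = log(N / df) per token.
    """
    cf = Counter()  # collection frequency: total occurrences of each token
    df = Counter()  # document frequency: number of docs containing the token
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))
    hapax = {t for t, n in cf.items() if n == 1}
    n_docs = len(docs)
    idf = {t: math.log(n_docs / df[t]) for t in df}
    return hapax, idf

# Toy corpus: "vvhale" stands in for an OCR error, "leviathan" for a rare real word.
docs = [
    ["the", "whale", "swam"],
    ["the", "vvhale", "swam"],
    ["the", "leviathan", "rose"],
]
hapax, idf = hapax_report(docs)
print(sorted(hapax))
print(sorted(idf.items(), key=lambda kv: -kv[1]))
```

In this toy case the OCR error and the rare real words all land in the hapax set with the same maximal idf, so frequency alone cannot tell them apart.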
