Submitted by Lance Norskog (not verified) on February 17, 2011
('Hapax legomenon' means 'a word that appears only once'. I love the phrase.) A common strategy for finding misspellings is to pull all facets with count=1. In your case, this is also a good winnow for OCR-burps.
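A minimal sketch of the count=1 idea in plain Python, assuming the text is already tokenized into a list of strings. The function name `hapax_legomena` and the sample tokens are made up for illustration; a search engine's faceting would do the counting for you on a real index.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(tokens)
    # Counter keeps insertion order, so iteration follows first appearance.
    return [tok for tok in counts if counts[tok] == 1]

# Hypothetical sample: "qu1ck" stands in for an OCR-burp of "quick".
sample = "the quick fox saw the quick dog and the qu1ck cat".split()
candidates = hapax_legomena(sample)
print(candidates)
```

Note that the list mixes genuine rare words with the burp; count=1 is only a first winnow, not a verdict.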
If you OCR something twice, do you get the same OCR-burps? Why do this? Because you can build a large database of such burps, which gives you a training set for a classifier! Once you have that, you can start attacking single-count facets. If nobody has done this before, it would make a good grad student project. Also, doing the OCR twice is a good experimental design.
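One way the OCR-twice experiment could be sketched, assuming each run is available as a list of tokens from the same source pages. The names `singletons` and `compare_ocr_runs` are hypothetical, and splitting the results into three buckets is just one guess at how to organize them.

```python
from collections import Counter

def singletons(tokens):
    """Set of tokens that occur exactly once in one OCR run."""
    return {tok for tok, n in Counter(tokens).items() if n == 1}

def compare_ocr_runs(run_a, run_b):
    """Split single-count tokens by whether the other run reproduces them."""
    single_a, single_b = singletons(run_a), singletons(run_b)
    vocab_a, vocab_b = set(run_a), set(run_b)
    return {
        # Single-count in both runs: rare real words or repeatable burps.
        "both": single_a & single_b,
        # Single-count in one run and absent from the other: unstable burps.
        "only_a": {t for t in single_a if t not in vocab_b},
        "only_b": {t for t in single_b if t not in vocab_a},
    }

# Hypothetical output of two OCR passes over the same line.
run_a = "the cat sat on the rnat".split()
run_b = "the cat sat on the mat".split()
result = compare_ocr_runs(run_a, run_b)
print(result)
```

The "only" buckets are candidate examples for a burp database; they would still need labeling before being used to train a classifier.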