Permalink Submitted by Tom Burton-West on March 24, 2010
Re: solr.PunctuationFilterFactory
Hi David,
We wrote a custom PunctuationFilter and PunctuationFilterFactory, put them in the Solr "plugin" directory, and then added the corresponding entry to schema.xml. The filter is relatively brain-dead, but it should work correctly for any language that Java's Character.isLetterOrDigit handles.
The filter replaces all punctuation, defined as any character for which Java's Character.isLetterOrDigit returns false, with white space, and reduces each run of multiple whitespace characters to a single one. We went this way because we have so many languages that it was easier to use the Unicode-compliant Java function than to determine which particular characters/code points we might not want to strip.
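To make the behaviour concrete, here is a minimal sketch of that logic in plain Java. It is not our actual PunctuationFilter source: the class name PunctuationStripper and the method stripPunctuation are made up for illustration, and it works on a whole String rather than on a Lucene token stream. It also assumes code-point-aware iteration so that characters outside the Basic Multilingual Plane are tested as a unit.

```java
// Hypothetical illustration only; not the actual PunctuationFilter code.
class PunctuationStripper {

    // Replaces every code point that is not a letter or digit with a space,
    // emitting at most one space for each run of such code points.
    static String stripPunctuation(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean lastWasSpace = false;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 chars
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
                lastWasSpace = false;
            } else if (!lastWasSpace) {
                // First non-letter/digit in a run: emit a single space.
                out.append(' ');
                lastWasSpace = true;
            }
            // Further non-letter/digit code points in the same run are dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "Hello world It s 2010 " (note the trailing space).
        System.out.println(stripPunctuation("Hello, world!  It's 2010..."));
    }
}
```

Existing whitespace is itself not a letter or digit, so it falls into the same branch as punctuation, which is what collapses mixed runs into one space. The sketch does not trim leading or trailing spaces.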
Thanks for your suggestion about using a CharFilter. CharFilter wasn't around when we implemented our filter, but I'm interested in revisiting it. What would be the advantage of using a CharFilter?
Tom