Before we implemented the CommonGrams index, our slowest query against the standard index was "the lives and literature of the beat generation", which took about 2 minutes against the 500,000 volume index. With the CommonGrams index, the same query took only 3.6 seconds.
After we implemented the CommonGrams index, we looked at the 10 slowest queries for the new index. These queries are different from the slowest queries for the standard index, because queries containing common words are much faster with CommonGrams. The slowest query for the CommonGrams index, at about 9 seconds, was "histoire de l'art" (entered without the quotes), which Solr treats as a Boolean query: "histoire" AND "de" AND "l'art".
One of the words in this query, "de", is not common in English but is very common in German, Spanish, French, Dutch, and a number of other languages. The word "de" occurred in about 462,000 of the 500,000 documents in the index. The list of common words we used to create the CommonGrams index contained 32 common English words, and "de" was not on it. HathiTrust has content in over 200 languages, with 7 non-English languages having over 70,000 volumes each and 40 languages having more than 1,000 volumes. (See: Chart of languages for public domain HathiTrust content.) This indicates a need to consider adding common words in languages other than English to the list of common words for the CommonGrams index.
Solr tools for analysis and performance tuning
Solr has a number of tools that help determine why queries are slow. We started with the Solr administrative tool's search/query interface and selected "Debug: enable," which runs a Solr debug query. The debug query response shows how the query is parsed and how it is scored.
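The same debug output can also be requested outside the administrative tool by appending the debugQuery parameter to an ordinary select request. The host, port, and handler path below are just the Solr defaults and may differ in your installation:

http://localhost:8983/solr/select?q=histoire+AND+de+AND+l%27art&debugQuery=on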
Solr debug query results for "histoire de l'art"
One key bit of information in the debug output is found by comparing the query as it was entered:
<str name="rawquerystring">histoire AND de AND l'art</str>
and the "parsedquery" which shows how the Solr query parser parses the query:
<str name="parsedquery">+ocr:histoire +ocr:de +PhraseQuery(ocr:"l art")</str>
What we discovered is that the word "l'art" was being searched as the phrase query "l art". Phrase queries are much slower than Boolean queries because the search engine has to read the positions index for the words in the phrase into memory, and because there is more processing involved. (See Slow Queries and Common Words for more details.)
In order to estimate how much work Solr has to do to process the phrase query for “l art”, we first did a Boolean query for “l AND art” and discovered that those two words occur in about 370,000 out of the 500,000 documents. We then used a tool we developed to determine the size of the positions list for the word “l” and the word “art”. The word “l” occurs about 224 million times in the index and the word “art” occurs about 14 million times. Estimating 1 byte per position entry, this means that the Solr search engine has to read about 238 MB to process the phrase.
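In round numbers:

224,000,000 positions for "l" + 14,000,000 positions for "art" = 238,000,000 position entries ≈ 238 MB at 1 byte per entry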
In order to determine why the word “l’art” was being turned into a phrase query for “l art”, we used the Solr Administrative tool’s Analysis panel.
Analysis of query "histoire de l'art"
Selecting verbose output gives more information about each stage in the filter chain:
Analysis of query "histoire de l'art" (detailed view)
What we discovered was that one part of our filter chain, the WordDelimiterFilter, was splitting "l'art" into two words, "l" and "art". We also discovered that when the filter chain splits a token in a Boolean clause into more than one word, the constituent words get searched as a phrase. This makes sense, but it also slows things down. (Phrase queries require use of the positions list, while Boolean queries without phrases do not.)
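Schematically, the chain of events for the clause "l'art" looks like this:

input clause:               l'art
after WordDelimiterFilter:  "l", "art" (two tokens at consecutive positions)
resulting query clause:     PhraseQuery(ocr:"l art")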
We looked through the other slow queries and discovered several other Boolean queries where one of the words ended up triggering a phrase search. For example, the second slowest query was "7th International" (without the quotes), which gets searched as the Boolean query "7th AND International". However, the WordDelimiterFilter breaks the token "7th" into two tokens, "7" and "th", and this gets searched as a phrase query.[i]
We took a closer look at what the WordDelimiterFilter was doing and discovered that it was creating phrase queries for many words. (More details are in the Appendix to this article.)
We decided to replace the WordDelimiterFilter with a "punctuation filter" that simply replaces punctuation with spaces. For example, "l'art" would be tokenized as a single token, "l art". This avoids the problem of tokens containing punctuation being split and triggering a phrase query.
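For illustration, here is a minimal sketch of such a filter written against the current Lucene TokenFilter and CharTermAttribute API; the class name and details are illustrative rather than our production code, which targeted the older TokenStream API.

// Illustrative sketch, not our production code: replaces punctuation inside a
// token with spaces, keeping the token as a single Boolean clause.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class PunctuationReplaceFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PunctuationReplaceFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Replace every non-letter, non-digit character with a space and collapse
    // runs of spaces, so "l'art" becomes the single token "l art".
    char[] buffer = termAtt.buffer();
    int length = termAtt.length();
    StringBuilder cleaned = new StringBuilder(length);
    boolean lastWasSpace = true;            // also trims leading punctuation
    for (int i = 0; i < length; i++) {
      char c = buffer[i];
      if (Character.isLetterOrDigit(c)) {
        cleaned.append(c);
        lastWasSpace = false;
      } else if (!lastWasSpace) {
        cleaned.append(' ');
        lastWasSpace = true;
      }
    }
    // Drop a trailing space left by trailing punctuation.
    int end = cleaned.length();
    while (end > 0 && cleaned.charAt(end - 1) == ' ') {
      end--;
    }
    termAtt.setEmpty().append(cleaned, 0, end);
    return true;
  }
}

In practice the filter is wrapped in a factory class and registered in schema.xml, as described in the comments below.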
We also decided to add more words to the list of common words. To come up with candidate words, we analyzed the 2,500 most frequent words in the 1 million document standard index and our query logs for the beta Large Scale Search. Using the Perl Lingua::StopWords module, we looked for stopwords in any of the languages it covers[ii]. We discovered there were 192 unique stopwords among the 2,500 most frequent words in the 1 million volume standard index, and about 179 unique stopwords in the 11,000 unique queries in our query log. We combined these two lists of stopwords with the 200 most frequent terms in the 1 million volume standard index and removed any duplicates from the resulting combined list (a rough sketch of this selection process follows the results table below). The resulting list of about 400 words became the list of common words used to create a new CommonGrams index. The table below compares response times on a 500,000 volume index for the standard index, the CommonGrams index with 32 English common words, and the new CommonGrams index with 400 common words and the punctuation filter.
Response time in milliseconds for 500,000 volume index
Index | Average | Median | 90th percentile | 99th percentile |
Standard Index | 424 | 34 | 142 | 6,300 |
CommonGrams (32 words) | 140 | 35 | 160 | 3,670 |
CommonGrams (400 words)* | 87 | 35 | 157 | 1,200 |
*The 400-word CommonGrams index also used the punctuation filter in place of the WordDelimiterFilter.
Adding more common words and switching to a filter that strips punctuation without triggering a phrase query for words containing internal punctuation reduced the average response time by nearly 40% (140 ms to 87 ms) and the response time for the slowest 1% of queries by about two thirds (3,670 ms to 1,200 ms).
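For readers who want the gist of how the 400-word list was assembled, here is a rough sketch in Java rather than the Perl and Lingua::StopWords scripts we actually used; the input file names are hypothetical.

// Illustrative sketch of building the common-word list; file names are hypothetical.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class BuildCommonWords {
  static Set<String> load(String path) throws IOException {
    return new LinkedHashSet<>(Files.readAllLines(Paths.get(path)));
  }

  public static void main(String[] args) throws IOException {
    // Union of the Snowball stopword lists for all languages we considered.
    Set<String> multiLangStopwords = load("snowball_stopwords_all_languages.txt");

    // Stopwords appearing among the 2,500 most frequent index terms (~192 words).
    Set<String> frequentIndexStopwords = load("top2500_index_terms.txt");
    frequentIndexStopwords.retainAll(multiLangStopwords);

    // Stopwords appearing in the query log (~179 words).
    Set<String> queryLogStopwords = load("query_log_terms.txt");
    queryLogStopwords.retainAll(multiLangStopwords);

    // Combine both stopword lists with the 200 most frequent index terms;
    // the Set removes duplicates, yielding roughly 400 common words.
    Set<String> commonWords = new TreeSet<>(frequentIndexStopwords);
    commonWords.addAll(queryLogStopwords);
    commonWords.addAll(load("top200_index_terms.txt"));

    Files.write(Paths.get("commonwords.txt"), commonWords);
  }
}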
We plan to continue to improve performance through an iterative process: using the Solr tools to analyze the slowest queries, making an appropriate change to our indexing process, and then using the Solr tools to examine the slowest queries in the new index.
Endnotes
[i] The WordDelimiterFilter by default splits words containing numbers; for example, "p2p" gets split into three tokens: "p", "2", and "p". Until a patch in December 2008, there was no way to turn this off. See https://issues.apache.org/jira/browse/SOLR-876.
[ii] The stopword lists were from the Perl module Lingua::StopWords (http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/lib/Lingua/StopWords.pm) and are based on the stopword lists of the Snowball project: http://snowball.tartarus.org/. The languages are: English, French, German, Spanish, Italian, Dutch, Russian, Portuguese, Hungarian, Finnish, Danish, Norwegian, and Swedish.
Comments
Schema and stopwords.txt
Re: Schema and stopwords.txt
Hi Gabrial,
The relevant portion of the schema.xml is here: http://www.hathitrust.org/node/181. I added a link to the list of 400 words to the post, but it's also available here: http://www.hathitrust.org/node/180. Please note that the list is tailored to our content, as explained in the post.
After we initially ported the CommonGrams filter from Nutch to Solr, several people contributed extensive improvements, and it was committed and became an official part of Solr in September 2009; it now ships with Solr 1.4. If you are planning to use CommonGrams, you should consider the recent patch by one of the Solr committers, Robert Muir, who did extensive work to adapt it to the new Lucene TokenStream API and to make the underlying code more efficient. The patch is available at https://issues.apache.org/jira/browse/SOLR-1657
Tom
PositionFilterFactory
Re: PositionFilterFactory
Hi Robert,
Thanks for the suggestion. I wasn't aware of PositionFilterFactory at the time we implemented our custom punctuation filter. I also had some problems understanding the behavior of the WDF with various combinations of flags, and at that time it did not have enough unit tests to make it easy to understand :). I'll take a look at PositionFilterFactory for when we reindex.
Tom
Whither solr.PunctuationFilterFactory
Re: solr.PunctuationFilterFactory
Hi David,
We wrote a custom PunctuationFilter and PunctuationFilterFactory, put them in the Solr "plugin" directory, and then made the proper entry in schema.xml. The filter is relatively brain-dead, but it should work correctly for any language that Java's Character.isLetterOrDigit works with.
The filter replaces all punctuation as defined by Java (!Character.isLetterOrDigit) with white space and reduces all runs of multiple whitespace characters to a single space. We went this way because we have so many languages that it was easier to use the Unicode-compliant Java function than to determine which particular characters/code points we might not want to strip.
Thanks for your suggestion about using a CharFilter. The CharFilter wasn't around when we implemented our filter, but I'm interested in revisiting it. What would be the advantage of using a CharFilter?
Tom
Re: PatternTokenizerFactory
Thanks for the feedback, David.
Sorry I didn't explain the algorithm very well; I probably should have put a code snippet in my previous response. We aren't splitting on punctuation; we are constructing tokens in which white space replaces punctuation.
Examples:
"l'art"=>"l art"
"can't"=>"can t".
Our problem was that the WDF was splitting on punctuation and therefore turning "l'art" into two tokens, which resulted in a phrase query for the token "l" followed by the token "art". Our filter instead produces a query (Boolean clause) for the token "l art" rather than the token "l'art".
Tom
Japanese
Re: Japanese
Hello Tomás,
Yes, we are indexing Japanese content with Solr. CJK processing in Solr is complicated, and Japanese in particular is interesting due to its multiple scripts (Kanji, Hiragana, Katakana, and Romaji).
Until fairly recently there was a bug/feature in the Lucene query parser that caused CJK text to be searched as a phrase regardless of the tokenizer used. (See: https://issues.apache.org/jira/browse/LUCENE-2458)
We index materials in over 400 languages, so indexing each language separately is not scalable. For this reason we couldn't use the CJK Tokenizer out of the box. If I am reading the code correctly, the current CJK Tokenizer also only distinguishes between Latin and non-Latin scripts, so if you gave it Thai, Devanagari, or Arabic it would generate overlapping bigrams for those scripts as well, which is probably not desirable.
For a number of reasons we are currently using the ICUTokenizer. Unfortunately the ICUTokenizer currently tokenizes Kanji into unigrams. This results in lots of false drops. (See this for an interesting example of the problem: http://www.basistech.com/knowledge-center/products/n-gram-vs-morphologic...)
We have opened an issue to provide an option for the ICUTokenizer to output bigrams but have not had time to work on it. Robert Muir did a bunch of work on it, and I believe he will be doing additional work on it in the near future, but I haven't checked in with him for a while (https://issues.apache.org/jira/browse/LUCENE-2906).
There are also some facilities for mapping Hiragana to Katakana (https://issues.apache.org/jira/browse/SOLR-814 and https://issues.apache.org/jira/browse/SOLR-822), but I don't yet sufficiently understand the implications to know whether this is a good idea. (See these two articles on orthographic variation in Japanese: Halpern: http://www.kanji.org/cjk/reference/japvar.htm and Kummer, Womser-Hacker, and Kando: http://www.nii.ac.jp/TechReports/05-011E.pdf)
With our current configuration, to get decent results for Japanese the user has to manually segment Kanji with spaces and put quotes around multi-character words. We hope to change this in a future upgrade.
Tom
Multi-language searching
Multi-language searching involves many complex trade-offs. It's probably best to ask questions about your specific use case on the Solr user list.
Here is our configuration. Note that we set autoGeneratePhraseQueries="false".
Tom