
Tuning search performance

Before we implemented the CommonGrams index, our slowest query with the standard index was "the lives and literature of the beat generation", which took about 2 minutes against the 500,000 volume index. With the CommonGrams index, that query took only 3.6 seconds.

After we implemented the CommonGrams index, we looked at the 10 slowest queries for the new index. These queries are different from the slowest queries for the standard index, because queries containing common words are much faster with CommonGrams. The slowest query for the CommonGrams index, which took about 9 seconds, was "histoire de l'art" (entered without the quotes), which Solr treats as a Boolean query: "histoire" AND "de" AND "l'art".

One of the words in this query, "de", is not common in English but is very common in German, Spanish, French, Dutch, and a number of other languages. The word "de" occurred in about 462,000 of the 500,000 documents in the index. The list of common words we used to create the CommonGrams index contained 32 common English words, and the word "de" was not on the list. HathiTrust has content in over 200 languages, with 7 non-English languages having over 70,000 volumes each and 40 languages having more than 1,000 volumes. (See: chart of languages for public domain HathiTrust content.) This indicates a need to consider adding common words in languages other than English to the list of common words for the CommonGrams index.
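For readers who want to see how CommonGrams fits into a Solr schema, the sketch below shows roughly how the filter is wired in, using the CommonGramsFilterFactory that later became an official part of Solr 1.4 (see the comments below). The field name, word-list file name, and surrounding filter chain here are illustrative, not our actual configuration (the relevant portion of our schema.xml is linked in the comments):

    <fieldType name="CommonGramsText" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- at index time, adds bigrams such as "of_the" for words listed in commonwords.txt -->
        <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- at query time, uses the bigrams instead of the individual common-word terms -->
        <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>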

Solr tools for analysis and performance tuning

Solr has a number of tools that help determine why queries are slow. We started with the Solr Administrative tool's search/query interface and selected "Debug: enable", which runs a Solr debug query. The debug query response shows how the query is parsed and how it is scored.
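The same debug information can also be requested outside the admin screens by adding the debugQuery=true parameter to an ordinary search request. For example (host, port, and request handler are illustrative, not our actual setup):

    http://localhost:8983/solr/select?q=histoire+AND+de+AND+l%27art&debugQuery=true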

Solr debug query results for "histoire de l'art"

One key bit of information in the debug output is found by comparing the query as it was entered:

                  <str name="rawquerystring">histoire AND de AND l'art</str>

and the "parsedquery", which shows how the Solr query parser parsed the query:

                  <str name="parsedquery">+ocr:histoire +ocr:de +PhraseQuery(ocr:"l art")</str>

What we discovered is that the word "l'art" was being searched as the phrase query "l art". Phrase queries are much slower than Boolean queries because the search engine has to read the positions index for the words in the phrase into memory, and because there is more processing involved (see Slow Queries and Common Words for more details).

In order to estimate how much work Solr has to do to process the phrase query for “l art”, we first did a Boolean query for “l AND art” and discovered that those two words occur in about 370,000 out of the 500,000 documents.  We then used a tool we developed to determine the size of the positions list for the word “l” and the word “art”.  The word “l” occurs about 224 million times in the index and the word “art” occurs about 14 million times.  Estimating 1 byte per position entry, this means that the Solr search engine has to read about 238 MB to process the phrase.
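The tool itself does not need to be much code. The sketch below is a rough illustration of the approach using the Lucene APIs of that era (Lucene 2.9/3.x); the class name, field name, and details are assumptions and differ from our actual tool:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;
    import org.apache.lucene.store.FSDirectory;

    // Hypothetical sketch: totals the number of position entries stored for a term.
    // Usage: java PositionsCounter /path/to/index ocr l
    public class PositionsCounter {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
            try {
                Term term = new Term(args[1], args[2]);
                TermPositions positions = reader.termPositions(term);
                long total = 0;
                while (positions.next()) {
                    total += positions.freq(); // positions stored for this term in this document
                }
                positions.close();
                // At roughly one byte per position entry, this total approximates how much
                // positions data a phrase query involving this term has to read.
                System.out.println(term.text() + ": " + total + " position entries");
            } finally {
                reader.close();
            }
        }
    }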

In order to determine why the word "l'art" was being turned into a phrase query for "l art", we used the Solr Administrative tool's Analysis panel.

Analysis of query "histoire de l'art"

Selecting verbose output gives more information about each stage in the filter chain:

Analysis of query "histoire de l'art" (detailed view)

What we discovered was that one part of our filter chain, the WordDelimiterFilter, was splitting "l'art" into two words, "l" and "art". We also discovered that when the filter chain splits a token in a Boolean clause into more than one word, the constituent words get searched as a phrase. This makes sense, but it also slows things down. (Phrase queries require use of the positions list, while Boolean queries without phrases do not.)

We looked through the other slow queries and discovered several other Boolean queries where one of the words ended up triggering a phrase search. For example, the second slowest query was "7th International" (without the quotes), which gets searched as the Boolean query "7th AND International". However, the WordDelimiterFilter breaks the token "7th" into two tokens, "7" and "th", and this gets searched as a phrase query.[i]
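Following the pattern of the parsed-query output shown earlier, the debug output for this query would look roughly like the following (the exact form depends on the analyzer chain):

    <str name="parsedquery">+PhraseQuery(ocr:"7 th") +ocr:international</str>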

We took a closer look at what the WordDelimiterFilter was doing and discovered that it was creating phrase queries for many words. (More details are in the Appendix to this article.)

We decided to replace the WordDelimiterFilter with a "punctuation filter" that simply replaces punctuation with spaces. For example, "l'art" is tokenized as a single token: "l art". This avoids the problem of tokens containing punctuation being split and triggering a phrase query.

We also decided to add more words to the list of common words. To come up with candidate words, we analyzed the 2,500 most frequent words in the 1 million document standard index and our query logs for the beta Large Scale Search. We used the Perl Lingua::StopWords module and looked for stopwords in any of the languages it covers[ii]. We found 192 unique stopwords among the 2,500 most frequent words in the 1 million volume standard index, and about 179 unique stopwords in the 11,000 unique queries in our query log. We combined these two lists of stopwords with the 200 most frequent terms in the 1 million volume standard index and removed duplicates from the combined list. The resulting list consisted of about 400 words, which we then used as the list of common words to create a new CommonGrams index. The table below compares response times on a 500,000 volume index for the standard index, the CommonGrams index with 32 English common words, and the new CommonGrams index with 400 common words and the punctuation filter.

Response time in milliseconds for 500,000 volume index

Index                    | Average | Median | 90th percentile | 99th percentile
Standard Index           | 424     | 34     | 142             | 6,300
CommonGrams (32 words)   | 140     | 35     | 160             | 3,670
CommonGrams (400 words)* | 87      | 35     | 157             | 1,200

*The CommonGrams (400 words) index also used the punctuation filter instead of the WordDelimiterFilter.

Adding more common words, and using a filter that strips punctuation without triggering a phrase query when a word has punctuation within it, reduced the average response time by nearly 50% and the response time for the slowest 1% of queries by about two-thirds.

We plan to continue to improve performance through an iterative process: using the Solr tools to analyze the slowest queries, making an appropriate change to our indexing process, and then using the Solr tools to examine the slowest queries in the new index.

 


Endnotes

[i] The WordDelimiterFilter by default splits words containing numbers; for example, "p2p" gets split into three words: "p", "2", "p". Until a patch in December 2008, there was no way to turn this off. See https://issues.apache.org/jira/browse/SOLR-876.

[ii] The stopword lists were from the Perl module Lingua::StopWords (http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/lib/Lingua/StopWords.pm) and are based on the stopword lists of the Snowball project: http://snowball.tartarus.org/. The languages are: English, French, German, Spanish, Italian, Dutch, Russian, Portuguese, Hungarian, Finnish, Danish, Norwegian, and Swedish.

Comments

Thank you for posting practical, real-world information about using Solr - useful and much appreciated.

I have a Solr instance running for bespoke data already on Drupal using the Solr module. What I would now like to do is replace core search with Solr. The current behaviour is: all content types via Drupal native search, and the 'shopping' content type via Solr. The Solr index is built offline rather than using the Drupal cron, and for the Drupal native search index for all other content types I am not able to run the cron, as there are millions of nodes from the 'shopping' content type I need to exclude. I thought one way to do this is to run a multi-core Solr within the *same* Drupal site, so that one core serves 'shopping' content and the other serves all other content types on Drupal. Any thoughts? Thank you!

As the Andy before me said, this is a nice real world example of Solr. One thing that always seems to snag querying operations is punctuation, nice catch with that sample query.

Did you know that November 2009 saw the release of Solr 1.4? This version introduces enhancements in indexing, searching, and faceting, along with many other improvements such as rich document processing (PDF, Word, HTML), search results clustering, and improved database integration. The release also features many additional plug-ins.

Would you post your commongrams schema implementation, including your stopwords file? Thanks!

Hi Gabrial,

The relevant portion of the schema.xml is here: http://www.hathitrust.org/node/181. I added a link to the list of 400 words to the post, but it's also available here: http://www.hathitrust.org/node/180. Please note that the list is tailored to our content, as explained in the post.

After we initially ported the CommonGrams filter from Nutch to Solr, several people contributed extensive improvements, and it was committed and became an official part of Solr in September 2009. It now comes with Solr 1.4. If you are planning to use CommonGrams, you should consider the recent patch by one of the Solr committers, Robert Muir, who did extensive work to make it work with the new Lucene TokenStream API and to make the underlying code more efficient. The patch is available at https://issues.apache.org/jira/browse/SOLR-1657

Tom

To keep this interesting and realistic, it uses a large open source set of metadata about artists, releases, and tracks, courtesy of the MusicBrainz.org project. I have learned how to search this data in a myriad of ways, including Solr's rich query syntax, "boosting" match scores based on record data and other means, searching across multiple fields with different boosts, getting facets on the results, auto-completing user queries, spell-correcting searches, highlighting queried text in search results, and so on. Your post is a great practical example to go along with the theoretical material and the Solr manual.

Hi, did you try using PositionFilter after WordDelimiterFilter in your query analyzer to prevent these phrase queries? I'm using this to prevent the phrase queries you are seeing.
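For context, the approach Robert describes adds a PositionFilterFactory at the end of the query-time analyzer, roughly like this (the tokenizer and other filters shown are placeholders, not HathiTrust's actual chain):

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PositionFilterFactory"/>
    </analyzer>

PositionFilter sets the position increment of every token after the first to zero, so when the WordDelimiterFilter splits a term into several tokens, the query parser treats them as alternatives at the same position rather than generating a phrase query.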

Hi Robert,

Thanks for the suggestion. I wasn't aware of PositionFilterFactory at the time we implemented our custom punctuation filter. I also had some problems understanding the behavior of the WDF with various combinations of flags, and at that time it did not have enough unit tests to make it easy to understand :). I'll take a look at PositionFilterFactory for when we reindex.

Tom

 

>> I also had some problems understanding the behavior of the WDF with various combinations of flags.

You aren't the only one! I found there are 512 combinations of these boolean flags while testing them, and some of the flags have bugs if used together. https://issues.apache.org/jira/browse/SOLR-1710 has the background and fixes, and is quite a bit faster than the trunk version if you are concerned with indexing speed... but it is uncommitted and not thoroughly vetted, so be cautious! Keep up the good work, cool application!

Where is solr.PunctuationFilterFactory defined? Did you add it to Solr? It seems that with Solr 1.4 and beyond, you could do this with a CharFilter.

Hi David,

We wrote a custom PunctuationFilter and PunctuationFilterFactory, put them in the Solr "plugin" directory, and then just made the proper listing in schema.xml. The filter is relatively brain-dead, but should work correctly for any language that the Java Character.isLetterOrDigit works with.

The filter replaces all punctuation as defined by Java (!Character.isLetterOrDigit) with white space and reduces all runs of multiple whitespace to one whitespace. We went this way because we have so many languages that it was easier to use the Unicode-compliant Java function than to determine which particular characters/code points we might not want to strip.

Thanks for your suggestion about using a CharFilter. The CharFilter wasn't around when we implemented our filter but I'm interested in revisiting the filter. What would be the advantage of using a CharFilter?

Tom

I don't know if it's faster to do this in a CharFilter or not. It just seemed to align better with what you are doing. Since your punctuation filter is basically splitting on non-alphanumeric characters, instead of writing code you could very well have used PatternTokenizerFactory with pattern="[^\p{L}\p{N}]+". I love that in Solr you can do so much without writing code (though ironically I love writing code).
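In schema.xml terms, the alternative David describes would look roughly like this (shown without the enclosing fieldType); when no group attribute is given, PatternTokenizerFactory splits the input wherever the pattern matches, much like String.split:

    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\p{L}\p{N}]+"/>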

Thanks for the feedback David,

Sorry I didn't explain the algorithm very well. I should probably have put a code snippet in my previous response. We aren't splitting on punctuation; we are constructing tokens where white space replaces punctuation.
Examples:
"l'art"=>"l art"
"can't"=>"can t".

Our problem was that the WDF was splitting on punctuation and therefore making "l'art" into two tokens, which resulted in a phrase query for the token "l" followed by the token "art". Our filter instead makes it a query (Boolean clause) for the token "l art" rather than the token "l'art".
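A minimal sketch of that replacement logic (illustrative only; this is not the actual HathiTrust PunctuationFilter, and a real Solr filter would also implement the Lucene TokenStream API and handle supplementary code points):

    // Hypothetical illustration of the normalization described above:
    // non-letter/digit characters become spaces, and runs of whitespace collapse to one.
    public class PunctuationNormalizer {
        public static String normalize(String token) {
            StringBuilder out = new StringBuilder(token.length());
            boolean lastWasSpace = false;
            for (int i = 0; i < token.length(); i++) {
                char c = token.charAt(i);
                if (Character.isLetterOrDigit(c)) {
                    out.append(c);
                    lastWasSpace = false;
                } else if (!lastWasSpace) {
                    out.append(' ');
                    lastWasSpace = true;
                }
            }
            return out.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(normalize("l'art")); // prints: l art
            System.out.println(normalize("can't")); // prints: can t
        }
    }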

Tom

Hi, you are indexing Japanese content with Solr, right? Are you using the out-of-the-box CJK Tokenizer or a custom one? Did it perform as expected in terms of "findability"? Thanks

Hello Tomás,

Yes, we are indexing Japanese content with Solr. CJK processing in Solr is complicated, and Japanese in particular is interesting due to the multiple scripts (Kanji, Hiragana, Katakana, and Romaji).

Until fairly recently there was a bug/feature in the Lucene Query parser that caused CJK to be searched as a phrase regardless of the tokenizer used (See: https://issues.apache.org/jira/browse/LUCENE-2458)

We index materials in over 400 languages, so indexing each language separately is not scalable. For this reason we couldn't use the CJK Tokenizer out of the box. If I am reading the code correctly, it also looks like the current CJK Tokenizer only distinguishes between Latin and non-Latin scripts, so if you gave it Thai, Devanagari, or Arabic, it would also generate overlapping bigrams for those scripts, which is probably not desirable.

For a number of reasons we are currently using the ICUTokenizer. Unfortunately the ICUTokenizer currently tokenizes Kanji into unigrams. This results in lots of false drops. (See this for an interesting example of the problem: http://www.basistech.com/knowledge-center/products/n-gram-vs-morphologic...)

We have opened an issue to provide an option for the ICUTokenizer to output bigrams but have not had time to work on it. Robert Muir did a bunch of work on it and I believe he will be doing additional work on it in the near future, but haven't checked in with him for a while (https://issues.apache.org/jira/browse/LUCENE-2906).

There are also some facilities for mapping Hiragana to Katakana (https://issues.apache.org/jira/browse/SOLR-814 and https://issues.apache.org/jira/browse/SOLR-822), but I don't yet sufficiently understand the implications to know whether this is a good idea. (See these two articles on orthographic variation in Japanese: Halpern: http://www.kanji.org/cjk/reference/japvar.htm
Kummer, Womser-Hacker and Kando: http://www.nii.ac.jp/TechReports/05-011E.pdf)

With our current configuration to get decent results for Japanese, the user has to manually segment Kanji with spaces and put quotes around multi-character words. We hope to change this in a future upgrade.

Tom

Hi Tom, Is it possible for you to share the relevant schema portion pertaining to your implementation of the ICUTokenizer? I am a bit lost in interpreting the documentation to implement multi-language searching, and an example would greatly help. Thanks, Young

Multi-language searching involves many complex trade-offs. It's probably best to ask questions about your specific use case on the Solr user list.

Here is our configuration. Note that we turn autoGeneratePhraseQueries="false".
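A sketch of what such a field type might look like, based only on details mentioned in this post; the field name, word-list file name, and exact filter chain are assumptions rather than the actual HathiTrust schema, and solr.ICUTokenizerFactory requires the Solr analysis-extras contrib:

    <fieldType name="FullText" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>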

Tom
