Submitted by Salman (not verified) on January 7, 2011
Thanks for the quick response.
Our 200k-document index is just 2.9 GB without CommonGrams and 7 GB with a 1000-word CommonGrams list. It's a little more than 1% of the total size, so CommonGrams should most probably be faster with a big index (currently, I think, the index without CommonGrams is so small and has so few unique terms that its results are slightly better).
It's not very slow right now (even 1-2 seconds is acceptable), but the concern is whether, once all documents are indexed, it will be faster than our normal index. So far we have tested only simple queries: either the times are quite close or, where there is a difference, the non-CommonGrams index takes a few hundred milliseconds and the CommonGrams index takes 1-2 seconds. We are searching on combinations of words from our 1000-word list, so these should definitely be fast, right?
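To make concrete what "combinations of words from our list" turn into, here is a minimal Python sketch of the token streams as I understand them. It is a toy approximation, not Solr's actual code: the function names `common_grams_index` and `common_grams_query`, the `_` separator, and the two-word common list are assumptions for illustration only.

```python
def common_grams_index(tokens, common):
    """Index side (simplified): keep every single term and add a
    bigram whenever either word of an adjacent pair is common."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + "_" + nxt)
    return out


def common_grams_query(tokens, common):
    """Query side (simplified): emit the bigrams and keep a single
    term only if no bigram covers it."""
    covered = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        if tokens[i] in common or tokens[i + 1] in common:
            covered[i] = covered[i + 1] = True
    out = []
    for i, tok in enumerate(tokens):
        if not covered[i]:
            out.append(tok)
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(tok + "_" + tokens[i + 1])
    return out


# Hypothetical two-word common list; ours has 1000 entries.
common = {"the", "in"}
phrase = "the rain in spain falls mainly".split()
print(common_grams_index(phrase, common))
print(common_grams_query(phrase, common))
```

If this picture is right, a phrase made of list words is matched against the rarer bigram terms rather than the very frequent single words, which is why I expect those queries to read far less position data.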
When you say it should be faster for simple phrase queries, what about the complex ones? Logically, even if the difference is not that big, they should still be faster than with non-CommonGrams because there are fewer permutations, shouldn't they?
The PRX file seems to be around 60% of the total index size.
Proximity and wildcard searches are really important for our system. We ran some proximity test searches, and the results seem to match those from the non-CommonGrams index. Are you sure there can be issues with proximity?
Index size is not a big concern for us (even if it doubles), but for testing purposes we currently took the 1000 words with the highest frequencies (as shown in Luke). We would discard a few of them, but is that the right strategy? I think Luke shows the number of documents a term occurs in, NOT its total number of occurrences.
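The distinction I mean can be shown with a small Python sketch on made-up data. The helper name `frequency_stats` and the three toy documents are hypothetical; this is not how Luke computes its numbers, only an illustration of why the two rankings can differ.

```python
from collections import Counter


def frequency_stats(docs):
    """Return (document_frequency, total_frequency) for tokenized docs."""
    doc_freq = Counter()    # number of documents containing the term
    total_freq = Counter()  # number of occurrences across all documents
    for tokens in docs:
        total_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return doc_freq, total_freq


# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat near the door".split(),
    "a cat and a dog".split(),
    "the dog".split(),
]
doc_freq, total_freq = frequency_stats(docs)
for term in ("the", "cat", "a"):
    print(term, "docs:", doc_freq[term], "occurrences:", total_freq[term])
```

Here "the" and "cat" appear in the same number of documents, yet "the" occurs twice as often overall. Since the size of the position data depends on total occurrences, a list chosen by document count alone might not pick the words that cost the most.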
We have only one resource working on Solr, so we have to try things out one at a time. From our research, CommonGrams seemed to be the MOST helpful option for phrase queries (even sharding didn't seem to help much in our scenario, and it comes with a few limitations of its own). Do you think we should try something else first for performance? And how can we make sure that our real issue is an I/O bottleneck?
Also, how much do you think an SSD will help? (Should it decrease the search time by at least 3-4 times?)
Thanks for the quick