Available Indexes


Hi Salman,

Would it be all right with you to move this discussion to the Solr list (solr-user@lucene.apache.org)? That way you can get feedback from other CommonGrams users who may have more experience with complex queries and proximity queries. Also, if we move to the list, Toke Eskildsen, who has done extensive experiments with SSDs, can help answer your questions about the benefit of SSDs.

In our experience, experiments on small indexes (10GB or less) do not reliably predict the performance of larger indexes (100GB or more). In particular, with large indexes the impact of disk I/O is much more apparent.

As for complex queries, could you please give some examples? Wildcard queries could be slower with CommonGrams because of the increase in the number of unique terms in the index.
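To make the point about unique terms concrete, here is a rough sketch of how adding common-word bigrams grows the term dictionary. It is a simplified model, not the actual Solr filter: the function name `common_grams`, the underscore separator, the common-word list, and the sample documents are all assumptions made up for illustration.

```python
def common_grams(tokens, common_words):
    """Simplified model: keep every unigram and add a bigram for each
    adjacent pair in which at least one token is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(tok + "_" + nxt)
    return out


# Hypothetical common-word list and toy documents.
common = {"the", "of"}
docs = ["the rain in spain", "the king of spain", "rain of terror"]

plain_terms = set()
gram_terms = set()
for doc in docs:
    tokens = doc.split()
    plain_terms.update(tokens)
    gram_terms.update(common_grams(tokens, common))

print(len(plain_terms), "unique terms without common-word bigrams")
print(len(gram_terms), "unique terms with common-word bigrams")
```

A wildcard query has to be expanded against the term dictionary, so a larger dictionary generally means more candidate terms to scan.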

>>either the times are quite close or in case of a difference the non-CommonGrams would take few hundred milliseconds and CommonGrams 1-2 sec.
Can you give an example query where the CommonGrams version is significantly slower? The debug/explain output for that query might also help show whether there are any interesting analyzer issues.

>>currently for testing purposes we got 1000 words with highest frequencies (what shows up in Luke). We would be discarding few of them but is that the right strategy? as I think Luke shows the occurrence in no. of documents NOT the total occurrences?

Yes, Luke shows the terms with the highest document frequency, not the highest number of total occurrences. We contributed a patch to org.apache.lucene.misc.HighFreqTerms (LUCENE-2393) that will list the terms with the highest total occurrences. If you are using a 3.x version of Lucene, you can use the -t flag. However, in order to limit the search space, it starts with the top terms by document frequency. So to get the top 1,000 terms by occurrence frequency, you can ask for the top 10,000 with the -t flag and take the first 1,000 from the resulting list. (This assumes that you don't have a few documents that repeat a term a zillion times.)

java org.apache.lucene.misc.HighFreqTerms <index dir> [-t][number_terms] [field]

( http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contri...)
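The two-step selection described above (narrow the candidates by document frequency, then rank them by total occurrences) can be sketched roughly as follows. This is only an illustration of the procedure, not the HighFreqTerms implementation; the function name, the tuple layout, and the sample numbers are all made up.

```python
def top_by_total_occurrences(term_stats, candidates, keep):
    """term_stats: iterable of (term, doc_freq, total_term_freq) tuples.

    Step 1: keep the `candidates` terms with the highest document frequency.
    Step 2: re-rank those by total occurrences and keep the first `keep`.
    """
    by_doc_freq = sorted(term_stats, key=lambda s: s[1], reverse=True)[:candidates]
    by_total = sorted(by_doc_freq, key=lambda s: s[2], reverse=True)
    return by_total[:keep]


# Hypothetical statistics: (term, document frequency, total occurrences).
stats = [
    ("the", 100, 5000),
    ("of", 95, 3000),
    ("page", 90, 120),
    ("and", 80, 2500),
    ("zzz", 2, 9000),  # rare term repeated many times in only two documents
]

top = top_by_total_occurrences(stats, candidates=4, keep=3)
for term, doc_freq, total in top:
    print(term, doc_freq, total)
```

Note that "zzz" never makes it into the candidate set even though it has the most total occurrences. That is the caveat mentioned above: a term repeated a huge number of times in only a few documents is missed by this shortcut, so the candidate pool should be comfortably larger than the final list.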

Let me know if it's OK to move this to the list and I'll start a thread.

Tom
