Available Indexes


Hi guys, your blog has been really helpful! We are also using Solr, to index around 15 million documents, and our total index size is around 250GB. The index holds the document contents (text indexed only, for searching) and other metadata that users can search on. Text search on it is really slow in many cases (some queries take 1-2 minutes). Due to business requirements we can't ignore stop words. Searches range from simple Boolean queries to long phrases with proximity operators, wildcards, etc.

For performance optimization we initially planned to start with sharding, but because of budget constraints we can't use more than 2 hard drives right now, so the index would be split into two shards of 125GB each. After looking at sharding's limitations and the way it works, we don't think it would make a drastic improvement in our scenario, so we decided to go with CommonGrams. Also, your individual shards are bigger than our current index, so the lack of sharding shouldn't be the problem.

As an initial test we indexed 200k documents three ways: without CommonGrams, with CommonGrams using the 500 most common words, and with CommonGrams using the 1000 most common words. The surprising thing is that the index with 500 words is around 2.24 times bigger than the one without CommonGrams (2.91GB), and the index with 1000 words is almost 2.4 times bigger. (Isn't that too much, keeping in mind that the maximum should be 3 times if we use all words?)

Although it's a very small data set, almost all of the queries seem faster on the index without CommonGrams. We understand that the results should change as the data size increases, but even on such a small data set many queries on the CommonGrams indexes still take 2-3 seconds. These queries take 4-5 times longer on our current index, but the test data is not even 2% of the total, so once the CommonGrams index contains the complete data it seems the query time will end up close to what we have now.
My understanding is that phrases of more than 2 words won't see as big an improvement, but should still be much faster than on a normal index.

We are using 2 quad-core Xeon E5520 CPUs (2.27GHz) with 32GB RAM on 64-bit Windows Server 2008. The hard drive is a 10K RPM SAS drive. We are running Solr 1.4.1 on Apache Tomcat with the JRockit JVM. Are we missing something?

Note: We don't have much load on the servers right now, so the issue is single-query time rather than throughput. We have another index with around 120 million rows that is 850GB (contents stored), but that's not an issue since text search on it is always done on a very limited data set.
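
For readers following the index-size question above, here is a minimal sketch of why an index built with CommonGrams grows. It assumes the usual index-time behavior (every original word is kept, and a two-word token is added whenever either word of an adjacent pair is on the common-word list). The function name `common_grams`, the sample sentence, and the two-word list are made up for illustration, and it counts tokens only, so it says nothing about actual on-disk size, which also depends on positions and term dictionary growth.

```python
def common_grams(tokens, common_words, sep="_"):
    """Illustrative index-time CommonGrams: keep every unigram and add a
    bigram for each adjacent pair in which either word is common."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)  # the original word is always indexed
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(tok + sep + nxt)  # extra bigram token
    return out


# Hypothetical sample text and common-word list, for illustration only.
tokens = "the rain in spain stays mainly in the plain".split()
grams = common_grams(tokens, {"the", "in"})
all_common = common_grams(tokens, set(tokens))

print(grams)
print("token growth with 2 common words:", len(grams) / len(tokens))
print("token growth with all words common:", len(all_common) / len(tokens))
```

In this sketch the token count can at most roughly double (n words plus n-1 bigrams), so token counts alone do not tell you how large the index files will be.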
You are browsing an archive of the HathiTrust website. In July 2023, we launched a new site at www.hathitrust.org.