Large-scale Search https://www.hathitrust.org/blogs/large-scale-search/1000common.txt en Challenges for HathiTrust full-text search https://www.hathitrust.org/blogs/large-scale-search/challenges <p>The HathiTrust <a href="https://www.hathitrust.org/">full-text search application</a> provides search services over the full text of more than 16 million volumes.  This is about 5 billion pages and about 16 terabytes (TB) of text created by <a href="https://en.wikipedia.org/wiki/Optical_character_recognition" title="optical character recognition">optical character recognition</a> (OCR).  There are a number of characteristics of the data and metadata in the HathiTrust repository that make providing good search services difficult.</p> Thu, 05 Jul 2018 21:19:42 +0000 Tom Burton-West 2677 at https://www.hathitrust.org Practical Relevance Ranking for 11 Million Books, Part 3: Document Length Normalization. https://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-3-document-length-normali <p>In <a href="http://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-2-document-length-and-rel">Part 2</a> we argued that most relevance ranking algorithms used for ranking text documents are based on three fundamental features:</p> Thu, 20 Nov 2014 23:21:33 +0000 Tom Burton-West 1808 at https://www.hathitrust.org Practical Relevance Ranking for 11 Million Books, Part 2: Document Length and Relevance Ranking https://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-2-document-length-and-rel <p><strong>Document Length and Relevance Ranking</strong></p> <p><span style="line-height:1.3em">In</span><a href="http://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-1" style="line-height: 1.3em;"> Part 1</a><span style="line-height:1.3em">, we made the argument that the one to two orders of magnitude of difference in 
document length between </span>HathiTrust<span style="line-height:1.3em"> books and the documents used in standard test collections affects all aspects of relevance ranking. </span></p> Thu, 12 Jun 2014 16:15:10 +0000 Tom Burton-West 1596 at https://www.hathitrust.org Practical Relevance Ranking for 11 Million Books, Part 1 https://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-1 <p><strong>Practical Relevance Ranking for 11 Million Books, Part 1</strong></p> <p>This is the first in a series of posts about our work towards practical relevance ranking for the 11 million books in the <a href="http://www.hathitrust.org/">HathiTrust full-text search application</a>.</p> Wed, 21 May 2014 17:35:46 +0000 Tom Burton-West 1532 at https://www.hathitrust.org A Tale of Two Solrs https://www.hathitrust.org/blogs/large-scale-search/tale-two-solrs-0 <p>When we first started working on large scale search we confronted the issue of whether to index pages or complete books as our fundamental unit of indexing.<a href="#_edn1" name="_ednref1" title="" id="_ednref1">[i]</a>   We had some concerns about indexing on the page level.  We knew we would need to scale to 10-20 million books and at an average of 300 pages per book that comes out to about 6 billion pages.  At that time we did not think that Solr would scale to 6 billion pages.<a href="#_edn2" name="_ednref2" title="" id="_ednref2">[ii]</a>  If we indexed by page, we also wanted to be able </p> Fri, 04 Oct 2013 19:58:49 +0000 Tom Burton-West 1380 at https://www.hathitrust.org Multilingual Issues Part 1: Word Segmentation https://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation <p>At the core of the Solr/Lucene search engine is an inverted index.  The inverted index has a list of tokens and a list of the documents that contain those tokens. 
In order to index text, Solr needs to break strings of text into “tokens.”  In English and Western European languages spaces are used to separate words, so Solr uses whitespace to determine what is a token for indexing.   In a number of languages the words are not separated by spaces.</p> Thu, 08 Dec 2011 23:43:36 +0000 Tom Burton-West 768 at https://www.hathitrust.org Forty Days and Forty Nights: Re-indexing 7+ million books (part 1) https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1 <p><strong>Forty days forty nights: Re-indexing 7+ million books (part 1) </strong></p><p>Forty days and forty nights: that’s how long we estimated it would take to re-index all 7+ million volumes in HathiTrust. Because of this forty-day turnaround time, when we found a problem with our current indexing, we were reluctant to do a complete re-index. Whenever feasible we would just re-index the affected materials.</p> Sat, 21 May 2011 00:53:33 +0000 Tom Burton-West 537 at https://www.hathitrust.org Too Many Words Again! https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again <p>After Mike McCandless increased the limit of unique words in a Lucene/Solr index segment from 2.4 billion words to around 274 billion words, we thought we didn't need to worry about having too many words (see <a href="http://www.hathitrust.org/blogs/large-scale-search/too-many-words">http://www.hathitrust.org/blogs/large-scale-search/too-many-words</a>). We recently discovered that we were wrong!</p> Wed, 06 Oct 2010 00:55:52 +0000 Tom Burton-West 382 at https://www.hathitrust.org Making personal collections from Large Scale Search Results https://www.hathitrust.org/blogs/large-scale-search/making-personal-collections-from-large-scale-search-results <p>We just released a new feature in our full-text <a href="http://catalog.hathitrust.org/" title="Large Scale Search">Large Scale Search</a>. 
When you do a search, you will see check boxes next to each search result. You can select items you want from the search results and create a personal collection. This should make it much easier to do repeated searches and explore a targeted subset of the HathiTrust volumes. If you are not logged in, the collection will be temporary. If you log in you can save the collection permanently.</p> Tue, 06 Jul 2010 20:44:19 +0000 Tom Burton-West 253 at https://www.hathitrust.org Too Many Words! https://www.hathitrust.org/blogs/large-scale-search/too-many-words <p>When we read that the Lucene index format used by Solr has a <a href="http://lucene.apache.org/java/3_0_0/fileformats.html#Limitations">limit of 2.1 billion unique words per index</a> segment, we didn't think we had to worry.  However, a couple of weeks ago, after we optimized our indexes on each shard to one segment, we started seeing Java "ArrayIndexOutOfBounds" exceptions in our logs.  After a bit of investigation we determined that indeed, most of our index shards contained over 2.1 billion unique words and some queries were triggering these exceptions.  Currently ea</p> Fri, 19 Feb 2010 22:59:20 +0000 Tom Burton-West 192 at https://www.hathitrust.org Performance at 5 million volumes https://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes <p>On November 19, 2009, we put <a href="http://www.hathitrust.org/blogs/large-scale-search/new-hardware-searching-5-million-volumes-full-text">new hardware</a> into production to provide full-text searching against about 4.6 million volumes.  Currently we have about 5.3 million volumes.  
The average response time is about 3 seconds. 90% of queries take under 4 seconds, 9% of queries take between 4 seconds and 24 seconds, and 1% of queries take longer than 24 seconds.</p> Thu, 18 Feb 2010 23:11:15 +0000 Tom Burton-West 174 at https://www.hathitrust.org Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond https://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond <p>To scale up from 500,000 volumes of full-text to 5 million, we decided to use <a href="http://wiki.apache.org/solr/DistributedSearch" title="Solr's distributed search">Solr’s distributed search</a> feature, which allows us to split up an index into a number of separate indexes (called “shards”).  Solr's distributed search feature allows the indexes to be searched in parallel and the results then aggregated, so performance is better than having a very large single index.</p><p><strong>Sizing the shards</strong></p> Mon, 01 Feb 2010 20:56:48 +0000 Tom Burton-West 175 at https://www.hathitrust.org Common Word list for CommonGrams https://www.hathitrust.org/blogs/large-scale-search/common-word-list-commongrams <p>This is the list of 400 (actually 415) common words used for our current Solr configuration as described in <a href="http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance">Tuning Search Performance</a>.</p> Thu, 21 Jan 2010 22:15:11 +0000 Tom Burton-West 385 at https://www.hathitrust.org New Hardware for searching 5 million+ volumes of full-text https://www.hathitrust.org/blogs/large-scale-search/new-hardware-searching-5-million-volumes-full-text <p>On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes.  Currently we have about 5.3 million volumes indexed.  Below is a brief description of our current production hardware.  
Future posts will give  details about performance and background on our experiments with different system architectures and configurations.</p><p><strong>Hardware details</strong></p><p><strong><em>Solr Server configuration</em><br /></strong></p> Thu, 07 Jan 2010 22:43:24 +0000 Tom Burton-West 173 at https://www.hathitrust.org Tuning search performance https://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance <p>Before we implemented the <a href="http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2">CommonGrams Index,</a> our slowest query with the standard index was “the lives and literature of the beat generation” which took about 2 minutes  for the 500,000 volume index.  When we implemented the CommonGrams index, that query took only 3.6 seconds. </p> Fri, 28 Aug 2009 22:44:31 +0000 Tom Burton-West 159 at https://www.hathitrust.org Slow Queries and Common Words (Part 2) https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 <p>In<a href="http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1"> part 1</a> we talked about why some queries are slow and the effect of these slow queries on overall performance. The slowest queries are phrase queries containing common words.  These queries are slow because the size of the positions index for common terms on disk is very large and disk seeks are slow.  
These long positions index entries cause three problems relating to overall response time:</p> Mon, 27 Jul 2009 21:18:14 +0000 Tom Burton-West 146 at https://www.hathitrust.org Current Hardware Used for Testing https://www.hathitrust.org/blogs/large-scale-search/current-hardware-used-testing <p>This is a brief note on the current hardware and software environment we are using for Solr testing.</p><p><strong>Solr Servers</strong></p><ul><li>Two Dell PowerEdge 1950 blades</li><li>2 x Dual Core Intel Xeon 3.0 GHz 5160 Processors</li><li>8GB - 32GB RAM depending on the test configuration</li><li>Red Hat Enterprise Linux 5.3 (kernel: 2.6.18 PAE)</li><li>Java(TM) SE Runtime Environment (build: 1.6.0_11-b03)</li><li>Solr 1.3</li><li>Tomcat 5.5.26</li></ul><p><strong>Storag</strong><strong>e Server</strong></p> Fri, 24 Jul 2009 22:41:11 +0000 Tom Burton-West 149 at https://www.hathitrust.org Slow Queries and Common Words (Part 1) https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1 <p><strong>All Queries are not created equal</strong></p> Thu, 23 Jul 2009 20:19:04 +0000 Tom Burton-West 138 at https://www.hathitrust.org Update on Testing (Memory and Load tests) https://www.hathitrust.org/blogs/large-scale-search/update-on-testing-memory-and-load-tests <p>Since we finished the work described in the <a href="http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf" target="_blank" title="Large Scale Search Report">Large Scale Search Report</a> we have made some changes to our test protocol and upgraded our Solr implementations to Solr 1.3. 
We  have completed some testing with increased memory and some preliminary load testing.</p><p>The new test protocol has these features</p> Wed, 15 Jul 2009 17:56:17 +0000 Tom Burton-West 120 at https://www.hathitrust.org Large-scale Full-text Indexing with Solr https://www.hathitrust.org/blogs/large-scale-search/large-scale-full-text-indexing-solr <p>[Copied from the <a href="http://mblog.lib.umich.edu/blt/">Blog for Library Technology</a>]</p><p>A recent blog pointed out that search is hard when there are many indexes to search because results must be combined. Search is hard for us in DLPS for a different reason. Our problem is the size of the data.</p> Mon, 08 Jun 2009 18:17:20 +0000 pfarber 121 at https://www.hathitrust.org