Available Indexes

Challenges for HathiTrust full-text search

The HathiTrust full-text search application provides search services over the full text of more than 16 million volumes. This is about 5 billion pages and about 16 terabytes (TB) of text created by optical character recognition (OCR). A number of characteristics of the data and metadata in the HathiTrust repository make providing good search services difficult.

The HathiTrust repository includes materials in over 400 languages[1] with variable OCR quality.[2] Metadata on language and date is of variable quality, with missing or ambiguous dates for over half a million volumes. Feature metadata, such as where chapters or articles begin and end, is based on a combination of heuristics, machine learning, and human tagging and is not particularly accurate.[3]

Scaling and performance issues

The current index is about 13 TB in size. Commercial entities that work with indexes of this scale, such as the library vendors that include HathiTrust volumes in their search products, Etsy, Biz360, CareerBuilder.com, and the Internet Archive, would spread an index of this size over 40-100 machines.[4] We currently serve the index on 6 machines.[5] Spreading the index over 6 machines means that each machine indexes over 2.7 million volumes and holds indexes totaling 2.2 TB.

Compared to commercial sites, we have a relatively low query load, and under this load 6 machines provide decent performance.[6] However, a number of search features that are feasible on a large number of machines (or with smaller indexes) are not feasible on HathiTrust's current hardware due to unacceptable performance. In some cases this is due to the total size of the index (i.e., all 13 TB), and in some cases it is due to the large size of the index on each machine.[7]

Complicating the issues related to the size of the index are issues related to indexing materials in multiple languages and reliance on OCR for the text. The combination of texts in many languages and dirty OCR results in an average of over 3 billion unique terms per index. This combination of large indexes and large numbers of unique terms makes some desirable Solr/Lucene features infeasible due to performance and scalability issues.[8] The large number of terms has also meant repeatedly running into limitations in Solr/Lucene that have required patches to the Lucene codebase.[9]

Within the technical community (Solr/Lucene and Code4lib), HathiTrust is recognized as a leader in finding workable solutions to scalability issues with large indexes.[10] Most of our efforts to date have gone toward solving performance issues. Of necessity, we have implemented measures for performance monitoring and testing. Indeed, a large part of our ongoing work is testing new versions of Solr, or new features of Solr, to determine whether they will provide acceptable performance on our large indexes.[11]

Impact on search and relevance ranking

When searching over the full text of 16 million volumes it is easy to get large result sets listing tens of thousands or hundreds of thousands of volumes. Two features that are necessary to help users deal with potentially very large result lists are good relevance ranking and easy-to-use methods for interactively refining queries.[12] The characteristics of the materials mentioned above lead to various issues with relevance ranking, some of which are listed below:

  1. We have evidence that the default Solr relevance ranking algorithm (we are currently using Solr 4 with the Lucene tf*idf default) ranks short documents too high. Potentially thousands of relevant documents are not seen by users because less relevant short documents are pushed to the top of the search result list.[13] Solr makes available implementations of a number of more modern ranking algorithms with better length normalization that are unlikely to have this problem. However, they all require tuning. We need a testing framework to choose which algorithm to use and to tune the algorithms' length normalization parameters to work on book-length documents.[14]
  2. When only part of a HathiTrust volume is relevant, such as a chapter in a book or an article in a bound journal, the volume is not ranked appropriately because scoring is currently based on the entire volume. In addition to books where only one chapter may be relevant, this is likely to affect most of the 3 million serial volumes in HathiTrust as well as hundreds of thousands of reference works such as encyclopedias and dictionaries. We believe we have found a practical approach to this problem.[15] However, we need a testing framework to determine the best way to break documents into parts and the best way to rank those parts, or to combine part-level scores into a document-level ranking.
  3. Because we have both OCR and high-quality MARC metadata, our current relevance ranking in production combines scores from the OCR field and various MARC fields using Solr/Lucene's boosting capability to weight fields. Our current weights were determined by trial and error. We need a testing framework to provide a systematic way to determine the optimum relative weights of the OCR and MARC fields (see the sketch following this list).[16]
  4. Most techniques to improve information retrieval in the presence of OCR errors do not scale to collections of this size or complexity.[17]
  5. Language-specific processing is necessary for good information retrieval in some languages.[18]
    1. Most approaches to dealing with the indexing and searching of multiple languages will not scale to 400 languages.[19]
    2. Language-independent approaches such as n-gram-based retrieval also have scalability issues.[20]
    3. General techniques used to improve recall, such as the use of stemming or synonyms, are language-specific and domain-specific. The out-of-the-box implementations of these techniques provided with Solr do not scale to collections of this size and complexity.[21]
    4. Languages that do not use spaces to separate words, such as Chinese and Japanese, require special processing for search in those languages to work at all.[22]
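
To make the field-weighting question in item 3 concrete, below is a minimal sketch of what a systematic weight sweep might look like. It is only an illustration: the Solr URL, core name, field names (ocr, title, author), sample queries, and relevance judgments are hypothetical placeholders, and precision at 10 stands in for whatever measure a real testing framework would use. Only the edismax parameters (defType, qf) are standard Solr.

```python
"""Illustrative sweep over OCR vs. MARC field weights using Solr's edismax parser.

Hypothetical sketch: the Solr endpoint, field names, queries, and judgments are
placeholders; a real framework would use a proper test collection and metric."""
import itertools
import requests

SOLR_SELECT = "http://localhost:8983/solr/books/select"  # hypothetical core

# Hypothetical judged queries: query text -> ids of volumes judged relevant.
JUDGMENTS = {
    "history of the crusades": {"mdp.001", "mdp.007", "uc1.042"},
    "theory of elasticity": {"mdp.310", "inu.220"},
}

def precision_at_10(query, qf):
    """Run one edismax query and score the top 10 results against the judgments."""
    params = {
        "q": query,
        "defType": "edismax",  # standard Solr extended dismax parser
        "qf": qf,              # e.g. "ocr^1.0 title^4.0 author^2.0"
        "fl": "id",
        "rows": 10,
        "wt": "json",
    }
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    relevant = JUDGMENTS[query]
    return sum(1 for d in docs if d["id"] in relevant) / 10.0

def sweep():
    """Grid-search a few candidate boosts for the MARC title and author fields."""
    best = (None, -1.0)
    for title_boost, author_boost in itertools.product([1, 2, 4, 8], [1, 2, 4]):
        qf = f"ocr^1.0 title^{title_boost} author^{author_boost}"
        score = sum(precision_at_10(q, qf) for q in JUDGMENTS) / len(JUDGMENTS)
        if score > best[1]:
            best = (qf, score)
    return best

if __name__ == "__main__":
    qf, score = sweep()
    print(f"best weights: {qf}  (mean P@10 = {score:.2f})")
```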

In summary, we face an industrial-scale search and relevance ranking problem with the comparatively small resources of an academic library.[23]

 

Towards practical solutions

We’ve been working on these issues for several years.  In contrast with academic research, we are not trying to find the ideal solution to the various search issues.  Instead we look for practical solutions that significantly improve the user’s search experience with a reasonable amount of implementation effort.  We try to leverage work in the open source community whenever possible.[24]

We’ve talked with a number of information retrieval experts in industry and academia at various conferences over the last few years.[25] The consensus among academic experts is that while web search and TREC-type newswire search are well understood, full-text book retrieval is under-researched.[26] Generally, when we ask whether a particular technique from the research literature or industry practice will work for our large collection of books, or ask which of several algorithms we should use, we are told by both academic and industry experts that we need to run tests to determine whether the techniques will work for our users and our collection.[27]

In order to create a test collection or testing strategy, we need to define user tasks and types of queries. When discussing possible testing strategies with the experts, two questions that invariably come up are:

  1. What kinds of queries are your users issuing in HathiTrust search?
  2. What are the tasks your users are trying to accomplish?

The answers to these questions are needed to design a test collection or testing plan as well as to provide direction for possible feature improvements.  These questions can best be answered by a combination of query log analysis and user studies.

Towards Industry Best Practices for HathiTrust Search 

We believe that, to the extent our abilities and resources allow, HathiTrust needs to adopt the following industry best practices for search:[28]

  1. User needs analysis and user studies
    1. Log analysis to classify types of queries/users/uses/tasks
    2. Task analysis (based on log analysis and user research)
    3. Market segmentation and use case analysis, i.e., identifying the different user groups, their needs and use cases, and the organization’s priorities for meeting those needs.
  2. A principled method for testing and tuning relevance ranking[29]
    1. Test collection based evaluation for getting algorithms in the right ballpark
    2. Use of click logs and A/B testing on the production system for verifying results of test collection tests with real users, and for fine tuning of algorithms.
  3. Usability testing: iterative usability testing to determine whether features and their implementations meet users’ needs and are easy to learn and use.
  4. Performance monitoring and testing
    1. Load monitoring to make sure search response times are acceptable
    2. Testing to make sure proposed features are efficient enough to keep response times acceptable 

 

 

 


References

Michael Bendersky and Oren Kurland. 2010. Utilizing passage-based language models for ad hoc document retrieval. Inf. Retr. 13, 2 (April 2010), 157-187. DOI=10.1007/s10791-009-9118-8 http://dx.doi.org/10.1007/s10791-009-9118-8 (abstract here: http://dl.acm.org/citation.cfm?id=177351)

Tom Burton-West (2013). Towards Practical Relevance Ranking for 10 Million Books. Code4Lib 2013 (see abstract and video). (February 2013)

Tom Burton-West (2012). "Practical Relevance Ranking for 10 Million Books" In CLEF (Conference and Labs of the Evaluation Forum) 2012 Working Notes, Rome, Italy, September 17-20, 2012, ed. Pamela Forner, Jussi Karlgren, Christa Womser-Hacker and Nicola Ferro, September 2012.

Tom Burton-West (2012). HathiTrust Large Scale Search: Scalability meets Usability. Code4Lib 2012.

Tom Burton-West (2010)  HathiTrust Large Scale Search. Lucene Revolution 2010. (October 2010)

Tom Burton-West (2010) Big, Bigger, Biggest. Large-scale Issues: Phrase Queries and Common Words OCR. Lucid Imagination Webinar: Practical Search with Solr: Beyond Just Looking It Up.

Tom Burton-West (2008)  Breakout Session: “Solr in Libraries” Code4lib 2008 http://code4lib.org/conference/2008/breakout

Ben Carterette, Evangelos Kanoulas, and Emine Yilmaz. 2010. Low cost evaluation in information retrieval. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 903-903. DOI=10.1145/1835449.1835675 http://doi.acm.org/10.1145/1835449.1835675

Raman Chandrasekar and Susan Price (2013a), “Actions Speak Louder than Words: Analyzing large-scale query logs to improve the research experience.” Code4Lib Journal, Issue 21, 2013-07-15.

Raman Chandrasekar and Susan Price (2013b), “Actions Speak Louder than Words: Analyzing large-scale query logs to improve the research experience.” Code4lib 2013. http://code4lib.org/conference/2013/chandrasekar-diamond

Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Trans. Inf. Syst. 30, 1, Article 6 (March 2012), 41 pages. DOI=10.1145/2094072.2094078 http://doi.acm.org/10.1145/2094072.2094078

Trey Grainger and Timothy Potter (2014).  Solr in Action.  Manning Publications 2014.

Grant Ingersoll. 2009. “Debugging Search Application Relevance Issues”: https://lucidworks.com/blog/debugging-search-application-relevance-issues/

Timothy Jones, Andrew Turpin, Stefano Mizzaro, Falk Scholer, and Mark Sanderson. 2014. Size and Source Matter: Understanding Inconsistencies in Test Collection-Based Evaluation. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). ACM, New York, NY, USA, 1843-1846. DOI=10.1145/2661829.2661945 http://doi.acm.org/10.1145/2661829.2661945

Kazai G., Koolen M., Kamps J., Doucet A., Landoni M. (2011) Overview of the INEX 2010 Book Track: Scaling Up the Evaluation Using Crowdsourcing. In: Geva S., Kamps J., Schenkel R., Trotman A. (eds) Comparative Evaluation of Focused Retrieval. INEX 2010. Lecture Notes in Computer Science, vol 6932. Springer, Berlin, Heidelberg . https://doi.org/10.1007/978-3-642-23577-1_9

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing morphological variation in alphabetic languages. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '09). ACM, New York, NY, USA, 75-82. DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957

Fuchun Peng, Nawaaz Ahmed, Xin Li, and Yumao Lu. 2007. Context sensitive stemming for web search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '07). ACM, New York, NY, USA, 639-646. DOI=10.1145/1277741.1277851 http://doi.acm.org/10.1145/1277741.1277851

Ivan Provalov (2011) “Relevance Improvements at Cengage” Apache Lucene Eurocon 2011.   http://www.slideshare.net/lucenerevolution/provalov-cengage-relevancecengageenglishnonenglish1eurocon2011

Tamar Sadeh (2012)  Relevance Ranking in the Scholarly Domain. http://code4lib.org/conference/2012/sadeh

Mike Schultz (2012). “Search Engine Relevancy Tuning - A Static Rank Framework for Solr/Lucene.” Code4lib 2012. http://code4lib.org/files/code4libSchultz.pptx, http://code4lib.org/conference/2012/schultz

Willis, C., Efron, M. (2013). "Finding information in books: characteristics of full-text searches in a collection of 10 million books." Proceedings of the Association of Information Science and Technology.  


[1] Providing language-specific processing on this scale is not practical. Since we index all the languages in one index, we have to be careful not to implement features that would improve searching for English-language materials but hurt searching in other languages. Because stop words in one language are content words in another language, we needed to implement an alternative to using stop words for dealing efficiently with queries containing common words. We ported the Nutch “CommonGrams” approach to Solr and contributed it to the Solr community. See: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
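
To illustrate the idea behind CommonGrams independently of the Solr internals, the toy sketch below shows the token-stream transformation in plain Python: whenever a term on the common-word list is adjacent to another term, a combined bigram token is also emitted, so a phrase query such as "to be or not to be" can be answered from the much rarer bigram terms instead of the enormous postings lists for "to" and "be". The word list and whitespace tokenizer are placeholders; the real implementation is the CommonGrams filter contributed to Solr.

```python
"""Toy illustration of the CommonGrams idea: emit word_word bigram tokens
whenever a common word is adjacent to another term, so phrase queries that
contain very common words can be answered from rare bigram postings.
The word list and whitespace tokenizer are placeholders, not the Solr code."""

COMMON = {"the", "of", "to", "be", "or", "not", "and", "a", "in"}

def common_grams(text):
    tokens = text.lower().split()  # stand-in for a real tokenizer
    output = []
    for i, tok in enumerate(tokens):
        output.append(tok)  # always keep the original unigram
        if i + 1 < len(tokens) and (tok in COMMON or tokens[i + 1] in COMMON):
            output.append(f"{tok}_{tokens[i + 1]}")  # combined bigram token
    return output

if __name__ == "__main__":
    print(common_grams("to be or not to be"))
    # ['to', 'to_be', 'be', 'be_or', 'or', 'or_not', 'not', 'not_to',
    #  'to', 'to_be', 'be']
```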

The content covers a wide range of time-periods and fields of learning so even for one language the vocabulary includes both historical word usage and spelling variants, and domain-specific terms.  This makes the use of dictionary-based natural language processing methods and standard stemming or synonym lists impractical.

[2] The full-text content consists of OCR of varying quality, which has numerous adverse effects on retrieval and relevance ranking. The generally accepted information retrieval techniques for dealing with OCR errors do not scale to large collections or to multilingual collections. Determining what approaches might work with collections like the HathiTrust collection is an open problem.

[3]  The unit of ingest is a bound volume; a volume might consist of a number of journal issues bound together, or two or more monographs bound together.  Because metadata indicating where chapters or articles begin and end is not very accurate we can’t do specific indexing or ranking on the chapter or article level.  Note that due to the size of the corpus and the lack of available ground truth, estimating the extent of errors in metadata indicating chapter or article beginnings and endings would be a research project in itself.

[4] One of the reasons for spreading a large index over a large number of machines is to be able to get the whole index either into memory (as Google does) or onto solid state drives. The other reason is to increase the available CPU cycles. Sources: personal conversations with vendors using the HathiTrust index and the following: "Making the Internet Archive's full text search faster", http://2010.lucene-eurocon.org/slides/Scaling-Solr-to-3Billion-Documents_Jason-Rutherglen.pdf.

Etsy has 20 million items and uses 2 clusters of 20 machines each.  CareerBuilder has about 1 billion documents indexed in thousands of indexes. 

[5] HathiTrust actually runs a full copy of the index at two sites, one in Michigan and one in Indiana. Each site serves its copy of the index from 6 servers. Due to our relatively low query load, budget concerns, and concerns about system administration resources and power use, we decided early in the development process to go with a small number of high-end machines rather than a large number of commodity machines.

[6] We receive between ½ and 2 queries per second. Average response time is under a second. Our performance goals are to keep response times for all but the 10% slowest queries under 10 seconds, and response times for the 1% slowest queries under 30 seconds. Commercial services serve queries in the range of hundreds to hundreds of thousands of queries per second, with response times measured in milliseconds.

[7] For example, we would like to add a feature that ranks documents where all the query terms occur on the same page higher than documents where some of the query words are on one page and the rest are on another page. However, on current hardware, adding this feature would cause unacceptable response times. As another example, a commonly used approach to adding features to Solr is to copy the contents of a field to a new field for special processing. Each use of this approach for the OCR field would double the size of the 12 TB index. Providing both case-sensitive and case-insensitive search is one feature normally implemented using this approach. Doubling the size of the index has performance implications as well as cost implications. Implementing Solr's "more-like-this" feature would also require a significant increase in index size. See HathiTrust Large Scale Search: Scalability meets Usability, Code4Lib 2012 (February 2012) for more examples of features affected by scalability issues. We also had to radically change the default indexing settings to be able to index 16 TB of OCR in a reasonable time. See: http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1

[8] For example, early tests of the Solr/Lucene wildcard feature indicated that wildcard searches could take up to ten minutes and have significant adverse effects on every other query that hit Solr during those ten minutes. Similar problems exist for Solr's fuzzy queries and other features that use multi-term queries. We plan to test whether the new FST-based indexing improves performance to the point that we can enable these features.

[9] See: http://www.hathitrust.org/blogs/large-scale-search/too-many-words, LUCENE-2257, http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again, LUCENE-4635, LUCENE-6192, and Mike McCandless' January 21, 2015 Google+ post: https://plus.google.com/+MichaelMcCandless/posts/RQiF1J2XAzV

[10] In addition to blog posts and presentations, contributions include our early work on performance benchmarking (Large-Scale Search Benchmarking, December 2008, Tom Burton-West and Jeremy York), work on CommonGrams (https://issues.apache.org/jira/browse/SOLR-908), contribution of a tool to help determine common words in a Solr/Lucene index (https://issues.apache.org/jira/browse/LUCENE-2393), work to improve CJK processing (LUCENE-2906, SOLR-3589), and implementation of an improvement to the BM25 ranking algorithm to deal with long documents (https://issues.apache.org/jira/browse/LUCENE-5175). Our INEX paper (Burton-West 2012) was used by one of the INEX Book Track participants, Professor Michael Preminger, as an example of "practice based research" in a seminar he was teaching.

[11] Our early performance and scalability tests confirmed that we would need to modify Solr in order to scale up to 5 million volumes. Our tests showed that, without modification, queries containing common words would have unacceptable response times (1-2 minutes). We contributed the Solr CommonGrams code, which solved our problem and is used by a large number of organizations using Solr. We continue to test Solr features for scalability and performance. Some of our more recent tests looked at the performance of Solr faceting and grouping when we indexed one million documents at the page level (300 million pages) and grouped by document.

[12] Facets and advanced search options are the present mechanisms for interactively refining queries. Both of these could benefit from usability studies. We also plan to implement "relevant facets". See Burton-West (2012), HathiTrust Large Scale Search: Scalability meets Usability, for more details.

[13] We have several examples of queries where very short documents are obviously ranked too high. We also argue that Lucene's default algorithm has the same bias towards short documents that was discovered in the vector space algorithm on which Lucene's default algorithm is based. See: http://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-3-document-length-normali
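
As background for the length-normalization discussion, here is a minimal sketch of the Okapi BM25 term-scoring formula (the formula behind Lucene's BM25Similarity), showing where the length-normalization parameter b enters. The collection statistics in the example are invented; the point is only that tuning b (and k1) for book-length documents is exactly the kind of adjustment a testing framework would have to evaluate.

```python
"""Minimal sketch of Okapi BM25 term scoring, to show where the
length-normalization parameter b enters. The statistics below are invented."""
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Score one query term in one document.

    tf    -- term frequency in the document
    df    -- number of documents containing the term
    N     -- number of documents in the collection
    dl    -- length of this document (in terms)
    avgdl -- average document length in the collection
    k1    -- term-frequency saturation parameter
    b     -- length-normalization parameter (0 = none, 1 = full)
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * dl / avgdl)
    return idf * tf * (k1 + 1) / (tf + length_norm)

if __name__ == "__main__":
    # Same tf and idf for a short pamphlet and a long monograph: with b > 0,
    # the same term frequency counts for more in the shorter document.
    print(bm25_term_score(tf=5, df=1000, N=16_000_000, dl=2_000, avgdl=100_000))
    print(bm25_term_score(tf=5, df=1000, N=16_000_000, dl=400_000, avgdl=100_000))
```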

[14] See: https://www.hathitrust.org/blogs/large-scale-search/practical-relevance-ranking-11-million-books-part-2-document-length-and-rel. Extensive investigation of the INEX Book Track test collections has revealed that the INEX collections are not appropriate for relevance testing of HathiTrust. The 2007 queries are underspecified, and both the 2007 and subsequent 2008-2010 collections have issues with relevant unjudged documents. See Kazai (2010) for discussion of issues with collecting enough relevance judgments for the INEX collection.

[15] We believe we have found a "good-enough" approach to a ranking algorithm that would solve this problem and significantly improve relevance ranking for HathiTrust search. Our proposed approach is based on an approach suggested by Bendersky and Kurland (2010) for ranking documents based on relevance evidence from the documents' constituent passages. The approach involves breaking HathiTrust volumes into sub-documents such as chapters, groups of pages, or pages, and using Solr's grouping facility to rank volumes/documents based on a function of the scores of the sub-documents. More details will be available in a forthcoming blog post.
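
As an illustration of what such a grouping query might look like, the sketch below issues a page-level search and asks Solr to group hits by a volume-identifier field, ranking each volume by its best-scoring page. The endpoint and the field names (vol_id, ocr_page) are hypothetical; only group, group.field, and group.limit are standard Solr result-grouping parameters, and the "score the volume by its best page" rule is just the simplest of the score combinations Bendersky and Kurland (2010) discuss, not necessarily the one we will adopt.

```python
"""Sketch of ranking volumes from page-level sub-documents with Solr result
grouping. The Solr URL and field names are hypothetical placeholders."""
import requests

SOLR_SELECT = "http://localhost:8983/solr/pages/select"  # hypothetical page-level core

def rank_volumes(query, rows=20):
    """Return (volume id, score) pairs ordered by each volume's best-matching page."""
    params = {
        "q": query,
        "df": "ocr_page",         # hypothetical page-text field
        "fl": "id,score",
        "group": "true",          # standard Solr result grouping
        "group.field": "vol_id",  # hypothetical volume-identifier field
        "group.limit": 3,         # keep a few top pages per volume
        "rows": rows,
        "wt": "json",
    }
    response = requests.get(SOLR_SELECT, params=params).json()
    groups = response["grouped"]["vol_id"]["groups"]
    ranked = []
    for group in groups:
        pages = group["doclist"]["docs"]
        # Simplest combination: a volume scores as well as its best page.
        ranked.append((group["groupValue"], max(p["score"] for p in pages)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for vol, score in rank_volumes("survey of medieval manuscripts"):
        print(f"{vol}\t{score:.3f}")
```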

[16] We also suspect that the optimum weight depends on query intent. In particular, known-item searches would probably benefit more from higher weights for the MARC metadata. We hope to be able to gather information on known-item searches and ranking when we implement click logging.

[17] Techniques such as n-grams or fuzzy matching will match too many volumes, matching words in other languages as well as OCR errors. Techniques based on error correction or language models do not scale to a collection of this size or to a multilingual collection that spans multiple technical domains as well as multiple time periods.

[18] For example, Arabic, Hebrew, Chinese, and Japanese.

[19] In general, the same language-specific processing must be applied to both the query and the document for retrieval to work well. While language identification of long documents is feasible, language identification of queries is an open problem in the research literature. Experiments with several state-of-the-art language identifier programs have confirmed that accurate identification of the language of HathiTrust queries is an open problem. Even if we could accurately identify both query and document language, indexing each of 400 languages in a separate index is not feasible, and dealing with multilingual queries or queries for proper names would require combining searches and scores across indexes. See Grainger and Potter (2014, Chapter 14) for an overview of current best practices for multilingual searching with Solr. We currently use the ICUTokenizer and ICU filters, which were designed by a group of multilingual Unicode experts to work reasonably well with a wide variety of languages and scripts. We continue to investigate approaches that might scale to improve searching for perhaps the top 10 or 20 languages, as well as any approaches that improve searching for a particular language without adversely impacting searching in other languages.
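
For a concrete sense of what script-aware segmentation does for languages written without spaces, the rough sketch below uses PyICU's word BreakIterator, which exposes the same ICU segmentation rules (including dictionary-based segmentation for CJK scripts) that underlie Lucene's ICUTokenizer. Treat it as an analogue, not the actual Solr analysis chain, and it assumes the PyICU package is installed.

```python
"""Rough analogue of script-aware word segmentation using ICU's word
BreakIterator (the rules underlying Lucene's ICUTokenizer). Assumes the
PyICU package is installed; this is not the actual Solr analysis chain."""
from icu import BreakIterator, Locale

def icu_tokens(text, locale=None):
    """Yield word tokens using ICU's script-aware word boundaries.

    ICU applies dictionary-based segmentation to CJK text regardless of the
    locale passed in, so a generic locale is fine for this illustration."""
    bi = BreakIterator.createWordInstance(locale or Locale("en"))
    bi.setText(text)
    start = bi.first()
    for end in bi:
        token = text[start:end]
        start = end
        if any(ch.isalnum() for ch in token):  # drop whitespace/punctuation pieces
            yield token

if __name__ == "__main__":
    # Whitespace-free Japanese text is still segmented into dictionary words.
    print(list(icu_tokens("東京大学の図書館")))
    print(list(icu_tokens("HathiTrust full-text search")))
```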

[20] Character n-grams such as those used by McNamee and Mayfield (2009) take up to 6 times the space of regular word-based indexing.  Such an approach would require an index in excess of 60 TB for the HathiTrust index.  The truncation approach used by McNamee and Mayfield is unlikely to work with the long documents in the HathiTrust corpus due to problems with very high recall, and would not be applicable to a number of languages in HathiTrust including Arabic.

[21] Assuming the issues with identifying the language of queries were solved, the stemming and synonym features provided with Solr would not work well due to retrieval of inappropriate documents caused by large numbers of false matches. These matches are due to overstemming (i.e., a search for "marinated vegetables" retrieves documents about "marine vegetation" because both phrases stem to "marin veget") or to synonyms that match the wrong word sense. Techniques used in large web search engines, such as context-sensitive stemming, would be more appropriate (Peng et al. 2007). Issues to be investigated include whether the amount of feature engineering and training for the machine learning algorithms needed to implement context-sensitive stemming would require more than the resources currently available for working on HathiTrust search, and whether appropriate techniques can be found to deal with data sparsity. It is an open question whether the size of the query and click logs available from HathiTrust search would be sufficient to train context-sensitive algorithms. (Blog post in progress.)

[23] An additional issue not addressed here is that HathiTrust search consists of four different applications (with indexes at different scales). These applications must appear to the user to work together seamlessly, and there are many complex UX issues involved. The applications are:

  1. HathiTrust full-text search
  2. HathiTrust collection builder search
  3. HathiTrust “search within a book”
  4. HathiTrust catalog.

[24] In general we try to leverage features already available in the open-source Solr/Lucene software. When necessary, we work with others to make contributions to the software (for example, our work with CommonGrams). We worked extensively with Lucene/Solr committer Robert Muir, and with Naomi Dushay at Stanford, on CJK issues. See the blog post cited above, LUCENE-2906, SOLR-3589, and http://discovery-grindstone.blogspot.com/2014/01/searching-in-solr-analyzing-results-and.html

[25] Many of the Solr committers, in particular Robert Muir, Mike McCandless, Chris Hostetter, Andrzej Bialecki, and Erik Hatcher, have been particularly helpful. Trey Grainger, Gheorghe Muresan, and other Solr/Lucene users at the Lucene Revolution conferences have also been helpful, as have Marijn Koolen and other participants in the INEX Book Track. Thomas Emerson, an expert on CJK issues, now with EBSCO, provided helpful comments on our approach to CJK issues. The IR evaluation team at the Royal Melbourne Institute of Technology, Professors Mark Sanderson and Falk Scholer and Research Fellow Timothy Jones, gave helpful feedback on plans for click logging and the creation of a test collection. Ian Soboroff of the Retrieval Group at NIST and James Allan of the Center for Intelligent Information Retrieval provided helpful advice on determining user tasks for creating a test collection.

[26] There is a large literature on searching book metadata using library catalogs, but not much on searching the full text of books. See Willis and Efron (2013) for a discussion of the book search literature. Professor James Allan of CIIR confirmed that full-text book search is under-researched (personal conversation at SIGIR, July 2018).

[27] Results of tests on one collection cannot easily be extrapolated to other collections with different characteristics, such as different types of materials, different collection sizes, or different user tasks. See Jones et al. 2014.

[28] We have implemented performance monitoring and testing. We are currently doing A/B testing using interleaving in production. See Chapelle et al. 2012 for details on interleaving. More details on interleaving are forthcoming in a future blog post.
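
For readers unfamiliar with interleaving, here is a minimal generic sketch of the team-draft variant analyzed by Chapelle et al. (2012). It is an illustration of the technique, not the code running in production: the two rankers' lists are merged by alternating "draft picks", the team that supplied each result is remembered, and clicks are credited to whichever ranker contributed the clicked result.

```python
"""Minimal sketch of team-draft interleaving (see Chapelle et al. 2012) for
comparing two rankers with real user clicks. Illustrative only; this is not
the production implementation."""
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two ranked lists; return (interleaved list, doc -> team assignment).

    Stops early if either list runs out; a fuller version would fall back to
    the remaining list."""
    interleaved, team, seen = [], {}, set()
    iter_a, iter_b = iter(ranking_a), iter(ranking_b)
    picks_a = picks_b = 0
    while len(interleaved) < k:
        # The team with fewer picks goes next; ties are broken randomly.
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        source, label = (iter_a, "A") if a_turn else (iter_b, "B")
        doc = next((d for d in source if d not in seen), None)
        if doc is None:
            break
        seen.add(doc)
        interleaved.append(doc)
        team[doc] = label
        if label == "A":
            picks_a += 1
        else:
            picks_b += 1
    return interleaved, team

def credit_clicks(team, clicked_docs):
    """Count clicks per ranker; the ranker credited with more clicks wins the query."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

if __name__ == "__main__":
    ranking_a = ["d1", "d2", "d3", "d4"]
    ranking_b = ["d3", "d1", "d5", "d6"]
    merged, team = team_draft_interleave(ranking_a, ranking_b, k=6)
    print(merged, credit_clicks(team, clicked_docs=["d3", "d5"]))
```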

[29] We investigated various methods for low-cost creation of test collections, including crowdsourcing (see Carterette et al. 2010 and Kazai 2010). We also investigated A/B testing using interleaving (Chapelle et al. 2012). See Ingersoll (2009) for an overview of various approaches used primarily in e-commerce. I was particularly influenced by conversations with Ivan Provalov about his experience at Cengage (http://www.slideshare.net/lucenerevolution/provalov-cengage-relevancecengageenglishnonenglish1eurocon2011). Ivan started by tuning algorithms on a TREC collection that was similar to Cengage's content and then went on to develop a relevance testing framework to further tune the algorithms. Conversations in sessions on relevance ranking and user testing at various Lucene/Solr and code4lib meetings through the years have been helpful, as have conversations with presenters on relevance ranking issues at code4lib. Some notable presentations are those from Tamar Sadeh of Ex Libris (http://code4lib.org/files/relevance%20ranking%20Code4Lib%20f.pptx), Raman Chandrasekar and Susan Price of Serials Solutions (http://code4lib.org/conference/2013/chandrasekar-diamond, http://journal.code4lib.org/articles/8693), and Mike Schultz, formerly Summon architect (http://code4lib.org/files/code4libSchultz.pptx).

 
