Practical Relevance Ranking for 11 Million Books, Part 1

This is the first in a series of posts about our work towards practical relevance ranking for the 11 million books in the HathiTrust full-text search application.

Relevance is a complex concept that reflects aspects of a query, a document, and the user, as well as contextual factors. It depends on many things, including the user's preferences, task, stage in the information-seeking process, domain knowledge, and intent, and the context of a particular search.[i]

While many different kinds of relevance have been discussed in the literature, topical relevance is the one most often used in testing relevance ranking algorithms. Topical relevance is a measure of “aboutness”: how much a document is about the topic of a user’s query.

At its core, relevance ranking depends on an algorithm that uses term statistics, such as the number of times a query term appears in a document, to provide a topical relevance score. Other ranking features that try to take into account more complex aspects of relevance are built on top of this basic ranking algorithm. [ii]
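
As a rough illustration of this kind of term-statistics scoring (a simplified sketch, not Lucene's actual formula, and with made-up counts), a bare-bones tf-idf score might look like this:

```python
import math

def tf_idf_score(query_terms, doc_term_counts, doc_freq, num_docs):
    """Toy topical relevance score: sum of tf * idf over the query terms.

    doc_term_counts: {term: count in this document}
    doc_freq:        {term: number of documents containing the term}
    num_docs:        total number of documents in the collection
    """
    score = 0.0
    for term in query_terms:
        tf = doc_term_counts.get(term, 0)
        if tf == 0:
            continue
        # Rare terms carry more weight than common ones.
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        score += tf * idf
    return score

# Example with invented statistics for an 11-million-document collection.
print(tf_idf_score(
    ["whaling", "voyage"],
    {"whaling": 40, "voyage": 12, "the": 5000},
    {"whaling": 300, "voyage": 2500, "the": 10_000_000},
    11_000_000))
```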

In many types of search, such as e-commerce or news search, factors other than topical relevance (based on the words in the document) are important. For example, an e-commerce search engine might have facets such as price, color, size, and availability that matter as much as how well the user’s query terms match the text describing a product. In news retrieval, recency[iii] and the location of the user might be factored into the relevance ranking algorithm.

In a blog post about the conference of the Special Interest Group on Information Retrieval (SIGIR), Grant Ingersoll, a long-time Lucene and Solr committer and chief technology officer for LucidWorks, commented on the role Lucene’s relevance ranking model plays in relevance ranking for business use cases:

“Practically speaking, most people in Lucene simply tune relevance through query representation and external features that yield very good results … when focused on their real world cases (which is almost always precision @ 10, biased by business needs like price, time, page rank, social network, margins, inventory and other internal scoring models) rather than on the model itself.”      ( http://searchhub.org/2012/08/24/sigir-2012/ )

In web search, measures of popularity and authority based on links pointing to a web page, the anchor text of those links, and numerous other features derived from click logs and other usage data are combined with the basic relevance score of the content. Some of this usage data is used to guess user intent, to take into account the user’s past searches, and to understand the context of the web search. Web search engine companies such as Google and Microsoft (Bing) use sophisticated machine learning algorithms with hundreds of features, based on massive query logs and other usage data, to take these factors into account.[iv]

HathiTrust full-text search has a much lower query volume than Google (a few thousand queries per day compared to hundreds of millions), as well as fewer resources. Instead of trying to guess user intent algorithmically, we rely on the user to interactively refine their searches. For example, Google has sophisticated algorithms that try to determine when words in a query should be searched as a phrase (i.e., the words [New York] should be searched as a phrase rather than as the separate words “New” and “York”). HathiTrust relies on the user to surround any phrases with quotes.[v]

In contrast to web search or e-commerce search applications, the relative importance of the text is much higher in full-text book search, so correctly ranking the text’s relevance is crucial.[vi]

HathiTrust full-text search uses the Solr/Lucene open-source search engine software.[vii] We believe that Lucene’s default relevance ranking algorithm does not work well with our book-length texts because it tends to rank short documents too high.[viii] Unlike more recent algorithms, Lucene’s default algorithm has no tuning parameters to account for document length.
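
To make the problem concrete, here is a rough sketch of a single term's contribution under Lucene's classic TF-IDF scoring, with illustrative numbers; it omits the coord and queryNorm factors and the lossy norm encoding, so treat it as an approximation rather than Lucene's exact computation:

```python
import math

def lucene_classic_term_score(freq, doc_freq, num_docs, doc_len):
    """Rough sketch of one term's contribution under Lucene's classic
    TF-IDF similarity (coord, queryNorm and norm encoding are ignored)."""
    tf = math.sqrt(freq)                               # term frequency component
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))  # inverse document frequency
    length_norm = 1.0 / math.sqrt(doc_len)             # fixed norm, no tuning parameter
    return tf * idf * length_norm

# A 4-word OCR fragment vs. a 100,000-word book, with invented statistics:
short = lucene_classic_term_score(freq=1, doc_freq=1000, num_docs=11_000_000, doc_len=4)
book = lucene_classic_term_score(freq=20, doc_freq=1000, num_docs=11_000_000, doc_len=100_000)
print(short, book)  # the 4-word fragment scores far higher than the book
```

Even though the term appears twenty times in the book and only once in the four-word fragment, the fixed 1/sqrt(length) norm leaves the fragment with a score roughly 35 times higher.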

There are several fixes available for Lucene’s tendency to rank short documents too high.[ix] In addition, with the release of Lucene 4.x, a number of more recent ranking algorithms can now be used with Lucene. These include BM25, Language Models, Divergence from Randomness (DFR), and Information-Based models.[x] Most of these newer algorithms have tuning parameters that adjust the ranking to take document length into account.
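
For comparison, here is a sketch of the BM25 term weight (one of the models listed above); its b parameter explicitly controls how strongly document length is normalized against the collection average, and k1 controls term-frequency saturation. The values k1 = 1.2 and b = 0.75 are commonly cited defaults, and the statistics in the example are again illustrative:

```python
import math

def bm25_term_score(freq, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Sketch of the BM25 term weight. b controls length normalization:
    b=0 ignores document length, b=1 normalizes fully against the average."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm_tf = freq * (k1 + 1.0) / (freq + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# The same 4-word fragment vs. 100,000-word book (average length 100,000 words):
print(bm25_term_score(1, 1000, 11_000_000, 4, 100_000))
print(bm25_term_score(20, 1000, 11_000_000, 100_000, 100_000))
```

With the same statistics as in the previous sketch, the 100,000-word book now scores slightly above the four-word fragment instead of being swamped by it, and k1 and b remain available for tuning on book-length documents.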

This leaves us with the practical question: How do we determine which of the new models would work best with our collection of book-length documents?

There is a fairly large literature on the performance of BM25 and Language Models, with a smaller number of articles dealing with the newer DFR and Information-Based models. However, with a few exceptions, the published research on relevance ranking algorithms has been done on relatively short documents.[xi] Much of the work is based on newswire stories averaging about 500 words or 3 KB.[xii] Academic research on web retrieval has relied on test collections of truncated documents with an average document length of 16-25 KB, or around 2,000 words.[xiii] Most of the research on ranking algorithms is based on the test collections provided by the Text REtrieval Conference (TREC).

Even relatively recent work on passage retrieval (Bendersky and Kurland, 2010); length normalization (He and Ounis, 2007; Fang et al., 2011; Cummins and O’Riordan, 2012); and new relevance ranking algorithms such as DFR (Amati and van Rijsbergen, 2002) and Information-Based Models (Clinchant and Gaussier, 2011) is almost exclusively based on the TREC ad hoc and web collections of relatively short documents.[xiv]

The table below compares the average document size of HathiTrust documents with that of various research collections.[xv]

 

Collection                  Total Size (TB)   Documents (millions)   Avg. Doc Length (KB)   Avg. Doc Length (words/tokens)
HathiTrust                  7                 11                      760                    100,000
Clueweb2009(b)              1.2               50                      25                     2,000
Clueweb2009(a)              25                1,040                   25                     1,400
TREC WT10g                  0.01              1.70                    8.9                    611
TREC WT2g                   0.002             0.24                    6.2                    1,056
TREC ad hoc disks 4&5-CR    0.002             0.50                    -                      500
TREC AP (disks 1 and 2)     0.0005            0.17                    3                      473
TREC FR                     0.0005            0.05                    10.6                   1,340
HathiTrust pages            7                 3,240                   3                      350

Unfortunately, there is very little literature to help a practitioner extrapolate from the results of experiments on test collections to production collections with different characteristics. In our case, the lack of guidance is exacerbated by the difference of one to two orders of magnitude in document length between HathiTrust books and the documents used in standard test collections. This difference in document length affects all aspects of relevance ranking. The next blog post will discuss how document length affects relevance ranking.

 


[i] For a good overview of 50 years of research on relevance, see Saracevic (2007a, 2007b) and the citations therein. Another good overview is Mizzaro (1997). See Huang and Soergel (2013) for an argument about the primacy of topical relevance and the notion of sub-topics and topical relationships.

[ii] Voorhees (2008, pp. 1880-1881) and Sparck Jones (2001, 2005) discuss how the approach to testing ranking algorithms used in the Text REtrieval Conference (TREC) abstracts away user needs, user interfaces, and other kinds of relevance besides topical relevance. They argue that operational information retrieval systems are built upon the “core competence” of topical relevance ranking. See reference iv below for literature discussing how web search engines take other kinds of relevance into account.

[iii] Recency, also called freshness, refers to user preference for the most current information. See Dong et al. (2010).

[iv] Baeza-Yates and Maarek (2012) discuss usage data; Jiang et al. (2013) survey methods and applications of web log and usage data. Examples of user studies, and of the use of data from those studies, are available on the Microsoft CLUES team home page http://research.microsoft.com/en-us/groups/clues/ and at Ryen White’s home page http://research.microsoft.com/en-us/um/people/ryenw/publications.html. Kohavi et al. (2013) discuss Microsoft’s (Bing’s) experimental infrastructure, and Tang et al. (2010) discuss Google’s. Both companies run hundreds of experiments per year.

[v] Baeza-Yates and Maarek (2012) argue that massive usage data is necessary for web search engines to remain competitive, and that the lack of such data is a barrier to entry for new web search engines trying to enter the market. Academic search applications, including HathiTrust full-text search, have access to usage data several orders of magnitude smaller than the data used by web search engines. Issues of data sparsity make it difficult for academic search applications to effectively apply machine learning techniques to improve search.

[vi] Later posts will discuss other aspects of relevance ranking, such as weighting of fields (e.g., subject and title) and other techniques used to tweak the basic topical relevance score, but this post focuses on the core relevance ranking algorithm that provides a score for the full text of a book.

[vii] Solr is an application built on top of the Lucene search library and uses Lucene’s ranking algorithms. 

[viii] We index whole books as Solr documents rather than pages primarily for scalability reasons. See the previous post (http://www.hathitrust.org/blogs/large-scale-search/tale-two-solrs-0) for a discussion of indexing whole books versus pages of books.

As an example of short documents being ranked too high, the first result for the query [the book of job] in “Only Full Text” is a volume that contains only four words of OCR: “The Book of Job”. The rest of the book is in Welsh, and the OCR engine did not recognize the pages.

A second example can be seen by searching for “dog”. The first hit is “Dog Dream”, a musical score containing a total of only 655 words. Other short documents in the top ten hits are “The war dog”, a poem containing 1,300 words, and “Josie’s Buttercup”, a children’s book containing a total of 300 words. The average number of words for a HathiTrust volume is 100,000.

[ix] The alternatives are to use “sweet-spot” similarity (http://lucene.apache.org/core/4_7_1/misc/org/apache/lucene/misc/SweetSpotSimilarity.html), to write a custom similarity that sets a minimum length (http://mail-archives.apache.org/mod_mbox/lucene-dev/200605.mbox/%3CF9F27...), or to use Robert Muir’s pivoted document length normalization patch (https://issues.apache.org/jira/browse/LUCENE-2187). In 2007, researchers at IBM Haifa modified Lucene’s ranking formula and got very good results with “sweet-spot” similarity and a change to the tf normalization (http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team and http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf). It is not clear why sweet-spot similarity worked as well as or better than pivoted length normalization in this case. We do not believe that sweet-spot is a general solution. See Yonik’s comment here: http://lucene.472066.n3.nabble.com/Lucene-s-default-settings-amp-back-compatibility-tt590080.html#a590215 and Hoss’s explanation of the original use case for sweet-spot similarity:

“To give a practical example: when i was working with product data we found that the sweetspot for the length of a product name was between 4 and 10 terms.  products with less then 4 terms in the name field were usually
junk products (ie: "ram" or "mouse") and products with more then 10 terms in the name were usually junk products that had keyword stuffing going on. “
http://lucene.472066.n3.nabble.com/SweetSpotSimilarity-tt3747178.html#a3785703 and this thread http://lucene.472066.n3.nabble.com/jira-Created-LUCENE-577-SweetSpotSimi...

Nutch’s similarity algorithm prior to Nutch 2.0 had field specific length normalization and for content fields (text) imposed a limit: http://svn.apache.org/viewvc/nutch/tags/release-1.2/src/java/org/apache/nutch/indexer/NutchSimilarity.java?revision=1001087&view=markup
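
For reference, the idea behind pivoted document length normalization can be sketched as follows (a generic formulation in the spirit of Singhal (1997), not the code in the patch above; the slope value shown is just a common illustrative choice):

```python
def pivoted_norm(doc_len, avg_doc_len, slope=0.2):
    """Pivoted document length normalization (sketch).

    Documents at the average length get a norm of 1; shorter documents are
    boosted and longer ones penalized, but only as strongly as `slope`
    allows. slope=0 turns length normalization off entirely.
    """
    return 1.0 / ((1.0 - slope) + slope * (doc_len / avg_doc_len))
```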

[xi] Patent retrieval and legal document collections use long documents, but these are such a different domain that the results are not transferable. INEX for several years used a collection of around 12,000 journal articles averaging about 3,000 pages, but it was focused on element retrieval (Kamps et al. 2004). Norbert Fuhr, in a keynote talk at the CLEF 2010 conference, argued that we don’t understand the effect of document length on information retrieval because different IR evaluation exercises not only use documents of different lengths, but also use incomparable definitions of relevance, types of topics, and collection parameters (http://www.is.informatik.uni-duisburg.de/bib/pdf/talks/Fuhr_10t.pdf, slide 29). The INEX Book Track is an exception: from 2007-2010 it used a collection of 50,000 books. The 2007 track used a collection of 250 queries with relevance judgments provided by a commercial search engine. However, only two groups were able to participate in the track, and one group ran into technical issues that prevented it from indexing whole books. The subsequent tracks followed the INEX tradition and had participants generate topics and relevance judgments. Unfortunately, they were not able to get enough participation and sufficient relevance judgments for the collections to be used in research outside of the Book Track. We will discuss the INEX Book Track in more detail in a later post.

[xii] In the main text we are using the more familiar term “words.” All the counts given are actually counts of tokens. Tokens are units of indexing that usually correspond to words, but various decisions made in the “tokenizing” part of the indexing process may cause a divergence between the “word” count and the “token” count. For example, hyphenated words may be indexed as one token or as two (see http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html). The ad hoc collections for the Text REtrieval Conference (TREC) (http://trec.nist.gov/), the Conference and Labs of the Evaluation Forum/Cross-Language Evaluation Forum (CLEF) (http://www.clef-initiative.eu/), and the Forum for Information Retrieval Evaluation (FIRE) (http://www.isical.ac.in/~clia/) are based on collections of newswire articles. See Voorhees and Harman (2005) for statistics on TREC and Peters and Braschler (2004) for statistics on CLEF.
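
As a toy illustration of how tokenization choices change the counts (a generic sketch, not Lucene's analyzers), the two regular expressions below tokenize the same sentence into different numbers of tokens depending on how hyphenated words are handled:

```python
import re

text = "A well-known state-of-the-art OCR engine."

# Treat hyphenated words as single tokens:
print(re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text))  # 5 tokens

# Split on hyphens as well, so each part is its own token:
print(re.findall(r"[A-Za-z]+", text))                  # 9 tokens
```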

[xiii] For the ClueWeb09 collection, documents were truncated at 256 KB during the crawling process (this includes HTML markup, etc.), and the average document size is about 25 KB, or about 3,000 tokens. (The truncation at 256 KB and the 25 KB average document size are from a personal communication with Jamie Callan of CMU, March 12, 2012; the number of tokens is estimated from Zhou et al. 2011, based on an average document length of about 1,500 tokens after stemming and stopping.) One of the reasons for truncating documents, besides keeping the disk size of the collection manageable, is that much of the interest in web retrieval is in link structure rather than just content.

[xiv] Bendersky and Kurland (2010) used TREC ad hoc collections; the collection with the largest average document length was the Federal Register, at 935 tokens, and the smallest was the Wall Street Journal, with an average length of 263 tokens (p. 166).

He and Ounis (2007) used the TREC ad hoc, WT2G, and WT10G collections. They do not provide document sizes, but the statistics for these collections are given in the other papers below.

Fang et al. (2011) used the TREC ad hoc and TREC-8 Web collections. The collection with the largest average document length is the Federal Register, with an average of 1,340 tokens; the other collections range from about 120 to 1,000 tokens. Fang et al. (2011) is one of the few papers that also gives the standard deviation of the document lengths per collection (p. 7:14).

Cummins and O’Riordan (2012) used 5 TREC collections. The WSJ and FT collections had an average document length of about 200 tokens, FBIS about 240, and the Federal Register about 333. The WT2G web collection’s average document size is given as 623 tokens and the WT10G’s as 352. They also give the standard deviation of the document lengths. (The smaller size for the Federal Register documents is probably due to a different stop word list.)

Amati and Van Rijsbergen (2002) used two different combinations of the TREC ad hoc collections, the first with an average document length of 210 tokens and the second with 265 tokens. They also note that part of the average for the second collection is due to the inclusion of the Congressional Record collection, which by itself has an average of 624 tokens (p. 378).

Clinchant and Gaussier (2011) used the ROBUST (TREC), CLEF03 AdHoc Task, and GIRT (CLEF Domain Specific 2004-2006) collections, with average document lengths between 110 and 290 tokens (p. 17).

One reason for the lack of exploration of longer documents, such as journal articles and books, is the lack of suitable test collections. The INEX conference used a collection of journal articles for several years and ran a Book Track with a collection of 50,000 full-text books for several years. (See the comments on INEX in note xi.)

[xv] Estimates of document size vary depending on the source, even for the same collection; however, they should be close enough to show the order-of-magnitude differences. Variations in average document size in tokens are affected by the particular list of stopwords used. See Voorhees and Harman (2005) and Singhal (1997) for the document sizes for the TREC collections.


References

Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389. DOI=10.1145/582415.582416 http://doi.acm.org/10.1145/582415.582416

Ricardo Baeza-Yates and Yoelle Maarek. 2012. Usage Data in Web Search: Benefits and Limitations. In Proceedings of SSDBM 2012, Chania, Crete, June 2012. http://www.springerlink.com/index/58255K40151U036N.pdf

Michael Bendersky and Oren Kurland. 2010. Utilizing passage-based language models for ad hoc document retrieval. Inf. Retr. 13, 2 (April 2010), 157-187. DOI=10.1007/s10791-009-9118-8 http://dx.doi.org/10.1007/s10791-009-9118-8  (abstract here: http://dl.acm.org/citation.cfm?id=1773514)

Stephane Clinchant and Eric Gaussier. 2011. Retrieval constraints and word frequency distributions a log-logistic model for IR. Inf. Retr. 14, 1 (February 2011), 5-25. DOI=10.1007/s10791-010-9143-7 http://dx.doi.org/10.1007/s10791-010-9143-7

Doron Cohen, Einat Amitay, and David Carmel. 2007. Lucene and Juru at Trec 2007: 1-Million Queries Track  http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

Ronan Cummins and Colm O'Riordan. 2012. A constraint to automatically regulate document-length normalisation. In Proceedings of the 21st ACM international conference on Information and knowledge management (CIKM '12). ACM, New York, NY, USA, 2443-2446. DOI=10.1145/2396761.2398662 http://doi.acm.org/10.1145/2396761.2398662

Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruiqiang Zhang, Karolina Buchner, Ciya Liao, and Fernando Diaz. 2010. Towards recency ranking in web search. In Proceedings of the third ACM international conference on Web search and data mining (WSDM '10). ACM, New York, NY, USA, 11-20. DOI=10.1145/1718487.1718490 http://doi.acm.org/10.1145/1718487.1718490

Hui Fang, Tao Tao, and Chengxiang Zhai. 2011. Diagnostic Evaluation of Information Retrieval Models. ACM Trans. Inf. Syst. 29, 2, Article 7 (April 2011), 42 pages. DOI=10.1145/1961209.1961210 http://doi.acm.org/10.1145/1961209.1961210

Ben He and Iadh Ounis. 2007. On setting the hyper-parameters of term frequency normalization for information retrieval. ACM Trans. Inf. Syst. 25, 3, Article 13 (July 2007). DOI=10.1145/1247715.1247719 http://doi.acm.org/10.1145/1247715.1247719

Xiaoli Huang and Dagobert Soergel. 2013. Relevance: An improved framework for explicating the notion. J. Am. Soc. Inf. Sci. Technol. 64, 1 (January 2013), 18-35. DOI=10.1002/asi.22811 http://dx.doi.org/10.1002/asi.22811

Daxin Jiang, Jian Pei, and Hang Li. 2013. Mining search and browse logs for web search: A Survey. ACM Trans. Intell. Syst. Technol. 4, 4, Article 57 (October 2013), 37 pages. DOI=10.1145/2508037.2508038 http://doi.acm.org/10.1145/2508037.2508038. Also available : http://www.hangli-hl.com/uploads/3/1/6/8/3168008/tist_2013.pdf

Jaap Kamps, Maarten de Rijke, and Börkur Sigurbjörnsson. 2004. Length normalization in XML retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '04). ACM, New York, NY, USA, 80-87. DOI=10.1145/1008992.1009009 http://doi.acm.org/10.1145/1008992.1009009

Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '13), Inderjit S. Dhillon, Yehuda Koren, Rayid Ghani, Ted E. Senator, Paul Bradley, Rajesh Parekh, Jingrui He, Robert L. Grossman, and Ramasamy Uthurusamy (Eds.). ACM, New York, NY, USA, 1168-1176. DOI=10.1145/2487575.2488217 http://doi.acm.org/10.1145/2487575.2488217. See also http://www.exp-platform.com/Pages/default.aspx

Stefano Mizzaro. 1997. Relevance: the whole history. J. Am. Soc. Inf. Sci. 48, 9 (September 1997), 810-832. DOI=10.1002/(SICI)1097-4571(199709)48:9<810::AID-ASI6>3.0.CO;2-U http://dx.doi.org/10.1002/(SICI)1097-4571(199709)48:9<810::AID-ASI6>3.0.CO;2-U

Carol Peters and Martin Braschler. 2004. CLEF 2003 Methodology and Metrics. In Comparative Evaluation of Multilingual Information Access Systems, Lecture Notes in Computer Science, Vol. 3237. http://dx.doi.org/10.1007/978-3-540-30222-3_2

Tefko Saracevic. 2007a. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. J. Am. Soc. Inf. Sci. Technol. 58, 13 (November 2007), 1915-1933. DOI=10.1002/asi.v58:13 http://dx.doi.org/10.1002/asi.v58:13

Tefko Saracevic. 2007b. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. J. Am. Soc. Inf. Sci. Technol. 58, 13 (November 2007), 2126-2144. DOI=10.1002/asi.v58:13 http://dx.doi.org/10.1002/asi.v58:13

Amitabh Kumar Singhal. 1997. Term Weighting Revisited. Ph.D. Dissertation. Cornell University, Ithaca, NY, USA. UMI Order No. GAX97-14899.  http://ecommons.cornell.edu/bitstream/1813/7281/1/97-1626.pdf

Karen Sparck Jones. 2005. Epilogue: Metareflections on TREC. In TREC: Experiment and Evaluation in Information Retrieval, E. M. Voorhees and D. K. Harman, Eds. The MIT Press, Cambridge, MA, 421-448.

Karen Sparck Jones. 2001. Automatic language and information processing: rethinking evaluation. Natural Language Engineering 7 (2001), 1-18. http://www.cl.cam.ac.uk/archive/ksj21/ksjdigipapers/nle01.pdf

Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. 2010. Overlapping experiment infrastructure: more, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '10). ACM, New York, NY, USA, 17-26. DOI=10.1145/1835804.1835810 http://doi.acm.org/10.1145/1835804.1835810

Ellen M. Voorhees and Donna K. Harman. 2005. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing). The MIT Press.

Ellen M. Voorhees. 2008. On test collections for adaptive information retrieval. Inf. Process. Manage. 44, 6 (November 2008), 1879-1885. DOI=10.1016/j.ipm.2007.12.011 http://dx.doi.org/10.1016/j.ipm.2007.12.011

Xiaofeng Zhou, Jimmy Xiangji Huang, and Ben He. 2011. Enhancing ad-hoc relevance weighting using probability density estimation. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 175-184. DOI=10.1145/2009916.2009943 http://doi.acm.org/10.1145/2009916.2009943
