[Copied from the Blog for Library Technology]
A recent blog post pointed out that search is hard when there are many indexes to search, because results must be combined. Search is hard for us in DLPS for a different reason: our problem is the size of the data.
The Library has been receiving page images and OCR from Google for a while now. The number of OCR'd volumes has passed the 2 million mark. This raises the question of whether it is possible to provide a useful full text search of the OCR for 2 million volumes. Or more. We are trying to find out.
Our primary tool is Solr, a full text search engine built on Lucene. Solr has taken the Library world by storm due to its ease of use. Many institutions have liberated MARC records from the confines of the OPAC, indexed them with Solr and created innovative, faceted search interfaces to aid discovery. However, these systems are based on searching the metadata.
Metadata search is fine as far as it goes, but what if you'd like to find books containing the phrase fragment "Monarchy, whether Elective or Hereditary"? You need full-text search.
These Solr-based systems have indexes of millions of MARC records and perform well. Their index size is in the range of a few tens of gigabytes or less. We've discovered that the index for our first one million books is over 200 gigabytes, or 10 times the size of a metadata index for only one-tenth as many items. We hope to scale to 10 million books. Our data show that index size is linearly proportional to data size, so we expect to end up with a two terabyte index for 10 million books.
The large size of the index is due to a variety of factors. One is the fact that the OCR is "dirty" and contains many terms that are useless non-words. Another is that we are indexing works in many languages, again introducing many unique terms. But the biggest contributor is simply the index of word positions. At some point most unique terms are already present in the index, and the only growth is in the size of the list of documents containing each term. But every occurrence of every additional word adds its position to the index. The average size of the OCR for a book is about 680,000 bytes, so a single book adds a very large number of entries to the position index. In fact, we've found that this part of the index accounts for 85% of the total index size.
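To see why positions dominate, here is a back-of-the-envelope estimate in Java. The 680,000-byte average OCR size is the figure above; the bytes-per-word and bytes-per-position values are assumptions chosen for illustration, not measurements from our index.

    // Back-of-the-envelope estimate of the word-position index size.
    // The average OCR size per book is from our measurements; the
    // words-per-book and bytes-per-position figures are assumptions.
    public class PositionIndexEstimate {
        public static void main(String[] args) {
            long books = 1_000_000L;       // volumes indexed so far
            long avgOcrBytes = 680_000L;   // measured average OCR size per book
            long avgWordLen = 6;           // assumed bytes per word, with whitespace
            double bytesPerPosition = 1.5; // assumed cost of one compressed position entry

            long wordsPerBook = avgOcrBytes / avgWordLen; // ~113,000 positions per book
            double positionBytes = books * wordsPerBook * bytesPerPosition;

            System.out.printf("Positions per book: %,d%n", wordsPerBook);
            System.out.printf("Position index: ~%.0f GB%n", positionBytes / 1e9);
            // Prints roughly 170 GB, consistent with positions being
            // about 85% of a 200 GB index.
        }
    }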
Why do we need word positions, anyway? The answer is that Lucene does phrase searching by comparing the proximity of the terms in the phrase, and it uses positions to compute proximity. We absolutely want phrase searching. In our performance tests, queries using a few terms with great informational weight, such as "english" and "constitution", return results in under one second. This holds true even for the phrase "english constitution". These terms are infrequent enough that the computational load to search for them and compare proximities is small. Unfortunately, phrase or term queries that contain low-weight words like "of", "and", and "the", i.e. common terms, impose such a large computational load that they can take minutes to complete.
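The following is a simplified sketch of how position-based phrase matching works. It is not Lucene's actual implementation (real posting lists are compressed and intersected document by document), but it shows the core test, and why phrases of common terms are expensive: every position of every term may have to be examined, and common terms have enormous position lists.

    import java.util.*;

    // Simplified sketch of position-based phrase matching, the idea
    // behind Lucene's PhraseQuery. A phrase matches in a document when
    // its terms occur at consecutive word positions.
    public class PhraseMatchSketch {
        // positions.get(i) holds the sorted word positions of the i-th
        // phrase term within a single document.
        static boolean matchesPhrase(List<int[]> positions) {
            for (int start : positions.get(0)) {
                boolean found = true;
                for (int i = 1; i < positions.size(); i++) {
                    // The i-th term must appear exactly i positions later.
                    if (Arrays.binarySearch(positions.get(i), start + i) < 0) {
                        found = false;
                        break;
                    }
                }
                if (found) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            // Hypothetical position lists for "english" and "constitution"
            // in one document.
            List<int[]> positions = Arrays.asList(
                new int[] {4, 97, 211},  // positions of "english"
                new int[] {5, 150, 212}  // positions of "constitution"
            );
            System.out.println(matchesPhrase(positions)); // true: 4 then 5
        }
    }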
But even queries containing many common terms perform adequately if the index size is just a few gigabytes. So what's the problem? It has to do with the speed of data retrieval from RAM vs. disk. RAM is orders of magnitude faster than disk. When most or all of the index fits in RAM, performance is optimal. When Lucene must retrieve pieces of an index from disk because the index is too large to fit entirely in RAM, performance suffers. This is the case for our index. We are exploring several paths toward a solution.
One involves splitting the index into what Solr calls "shards" and distributing the shards across two or more machines. In this way, more of the index will fit into RAM. Solr supports merging results from several shards. So far it is unclear how many machines we will need to obtain good performance.
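As a rough illustration, here is how a sharded request might look using the SolrJ client. The shards parameter is Solr's standard mechanism for distributed search; the host names and shard layout below are hypothetical placeholders.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // Sketch of a distributed query across two shards using SolrJ.
    // The host names are hypothetical.
    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            // Any one Solr instance can coordinate the distributed request.
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://search1.example.edu:8983/solr");

            SolrQuery query = new SolrQuery("\"english constitution\"");
            // The shards parameter lists every index partition to search;
            // Solr queries each shard and merges the results.
            query.set("shards",
                "search1.example.edu:8983/solr,search2.example.edu:8983/solr");

            QueryResponse response = server.query(query);
            System.out.println("Hits across all shards: "
                + response.getResults().getNumFound());
        }
    }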
The second is a special kind of indexing that creates bi-grams: tokens composed of pairs of adjacent words. Because a pair such as "of the" is far rarer than either word alone, a query containing common terms can match against these pair tokens instead of wading through the huge position lists of the individual common words.
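Here is a minimal sketch of the idea in plain Java, rather than as a Solr token filter: whenever a common word appears next to another word, the pair is emitted as a single token. The common-word list and the underscore-joined token format are illustrative assumptions, not the configuration we have settled on.

    import java.util.*;

    // Minimal sketch of bi-gram indexing for common terms. Adjacent pairs
    // involving a common word are indexed as single tokens, so a phrase
    // query can match a few rare pair tokens rather than intersecting the
    // enormous position lists of "of", "the", etc.
    public class BigramSketch {
        static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("of", "and", "the", "a", "in", "to"));

        // Emits a pair token wherever a common word meets its neighbor.
        static List<String> tokenize(String[] words) {
            List<String> tokens = new ArrayList<>();
            for (int i = 0; i < words.length; i++) {
                tokens.add(words[i]);
                boolean pairable = i + 1 < words.length
                    && (COMMON.contains(words[i]) || COMMON.contains(words[i + 1]));
                if (pairable) {
                    tokens.add(words[i] + "_" + words[i + 1]);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            String[] words = {"history", "of", "the", "english", "constitution"};
            System.out.println(tokenize(words));
            // [history, history_of, of, of_the, the, the_english,
            //  english, constitution]
        }
    }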
Stay tuned for developments.