Available Indexes


As I understand your setup, your indexing process is batch-based and does not have near-realtime requirements. This makes it possible (and often desirable) to have hardware dedicated to indexing and hardware dedicated to searching. With such a setup, enterprise-level stability on the search side is not needed, as a catastrophic hardware crash does not mean loss of data. Your argument about not maximizing performance is technically valid: RAIDing or just connecting SSDs up to the TB level would probably saturate most standard controllers. However, not achieving the maximum possible performance still leaves room for a huge performance boost over conventional hard drives. Your setup is very interesting as you need both fast random IO and a high bulk transfer rate. Our setup is not heavy on the bulk side (we don't use phrase searches much), and on a 4-core machine with a single previous-generation SSD, 4 parallel searches performed at 308% of a single search, indicating that the CPU was the main bottleneck. Thus, I would not worry too much about the random access performance of a commodity RAID setup with Lucene. This still leaves bulk transfers, but here I guesstimate that hardware specs will be fairly accurate, as bulk transfer is simpler to design for and measure.
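
To make the 308% figure concrete, here is a minimal Python sketch of the arithmetic. It assumes that "308%" means the aggregate throughput of the 4 parallel searches relative to a single search; the function name `parallel_efficiency` is only illustrative and is not part of Lucene or of our test setup.

```python
def parallel_efficiency(throughput_percent: float, parallel_searches: int) -> float:
    """Return the fraction of ideal linear scaling that was achieved.

    throughput_percent: aggregate throughput relative to one search
                        (assumed meaning of the 308% figure).
    parallel_searches:  number of searches run at the same time.
    """
    if parallel_searches <= 0:
        raise ValueError("parallel_searches must be positive")
    speedup = throughput_percent / 100.0   # 308% -> 3.08x
    return speedup / parallel_searches     # 3.08 / 4 searches


efficiency = parallel_efficiency(308, 4)
print(f"{efficiency:.0%}")  # prints "77%"
```

Under that reading, throughput scaled to roughly three quarters of the ideal on 4 cores, which fits the SSD not being the limiting factor for random access in that test.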