HathiTrust Technology Summary

This summary was last updated in March 2023.

HathiTrust serves its repository from a data center in Ann Arbor managed by the University of Michigan, with a mirror site in Indianapolis managed by Indiana University. Each data center has a 1.4 petabyte spinning-disk storage array holding a complete copy of the images and OCR text of all 17.6 million digitized books. Apache Solr indexes hold over 12 terabytes of full text from these books, and a separate index holds library catalog metadata (MARC records) for each item. We manage a variety of metadata in MariaDB and MongoDB, including holdings information from member libraries, copyright and licensing information, records for US federal government documents, and more.
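
To give a sense of scale, the figures above can be turned into rough per-volume averages. The following is a back-of-the-envelope sketch only, not a published HathiTrust statistic. It assumes decimal units (1 petabyte = 10^15 bytes), and it treats the 1.4 petabyte array capacity as an upper bound on the data actually stored and the 12 terabyte index size as a lower bound, since the summary says "over 12 terabytes". The variable and function names are illustrative.

```python
# Rough per-volume averages derived from the figures quoted in this summary.
# Assumption: decimal (SI) units; the summary does not say which convention it uses.
PETABYTE = 10**15
TERABYTE = 10**12
MEGABYTE = 10**6

array_capacity_bytes = 14 * PETABYTE // 10   # 1.4 PB array at each data center
full_text_index_bytes = 12 * TERABYTE        # "over 12 terabytes" of Solr full text
volume_count = 17_600_000                    # digitized books in the repository


def average_per_volume_mb(total_bytes: int, volumes: int) -> float:
    """Return the mean number of megabytes per volume."""
    return total_bytes / volumes / MEGABYTE


# Upper bound: array capacity may exceed the space actually in use.
storage_per_volume_mb = average_per_volume_mb(array_capacity_bytes, volume_count)

# Lower bound: the index is described as "over" 12 terabytes.
index_per_volume_mb = average_per_volume_mb(full_text_index_bytes, volume_count)

print(f"Images and OCR: at most about {storage_per_volume_mb:.1f} MB per volume")
print(f"Full-text index: at least about {index_per_volume_mb:.2f} MB per volume")
```

Under these assumptions, page images and OCR average at most about 80 MB per volume, while the full-text index averages a little under 0.7 MB per volume.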

Our applications support search, discovery, and access for material in the repository, and they manage content ingest and indexing for data entering the repository. They were written in-house, primarily in the Perl and Ruby programming languages, and much of the code is publicly available in HathiTrust's GitHub. Our applications increasingly use containers: Docker for development and testing, and Kubernetes in production.

For more information about the HathiTrust technology environment, see Technology, Standards, and Specifications.
