You can get access to HathiTrust data through a variety of methods.
Datasets
Interested in doing text analysis and data mining against the HathiTrust collection? The HathiTrust Research Center provides a number of tools and services that allow you to analyze texts online. However, if you need to run your own tools against texts, we can provide you with a dataset for download to your own servers. More information can be found on the “Datasets” page.
HTRC Derived Datasets
The HathiTrust Research Center makes available two additional datasets based on HathiTrust collections. The Extracted Features Dataset includes basic bibliographic metadata as well as counts for various elements in each book (e.g., number of pages, number of words on a specific page). The Word Frequencies in English-Language Literature, 1700-1923 dataset provides word frequency counts for genres in the English language. Learn more about these datasets.
APIs
You can use the HathiTrust APIs to query and retrieve data when you have a known identifier. HathiTrust APIs are not search APIs (e.g., where you use a keyword to search across the collection).
Bibliographic API
You can use the Bibliographic API to do real-time querying against the HathiTrust collection and to retrieve a limited number of bibliographic records. Using a variety of common identifiers (e.g., ISBN, LCCN, OCLC) as well as HathiTrust identifiers, you can retrieve information about any works associated with those identifiers. The API can provide you with brief or full bibliographic records.
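As a rough illustration of a known-identifier lookup, the sketch below builds a request URL for a brief or full record. The base URL, the path layout, and the `.json` suffix are assumptions that should be checked against the Bibliographic API documentation. The helper name `bib_api_url` is hypothetical, and the OCLC number is only an illustrative value.

```python
from urllib.parse import quote

# Assumed base URL and path layout (not confirmed by this page):
#   <base>/<brief|full>/<identifier type>/<identifier value>.json
BASE_URL = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, id_value: str, detail: str = "brief") -> str:
    """Build a lookup URL for one known identifier (hypothetical helper)."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    # Percent-encode both parts so unusual identifier characters stay safe.
    return f"{BASE_URL}/{detail}/{quote(id_type, safe='')}/{quote(id_value, safe='')}.json"


if __name__ == "__main__":
    # Illustrative identifier only; substitute one you actually hold.
    print(bib_api_url("oclc", "424023"))
    print(bib_api_url("oclc", "424023", detail="full"))
```

The sketch only constructs URLs. Fetching and parsing the response is left out because the response structure is not described here.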
Data API
The Data API allows you to retrieve page images, OCR text for individual pages, and METS metadata. To retrieve the OCR for more than a few volumes, we recommend that you request a dataset. Restrictions apply.
OAI
You can retrieve bibliographic records for full-view content through HathiTrust’s OAI feed, managed by the University of Michigan. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol used in libraries and archives for the automated delivery of structured bibliographic metadata. You can use this option to retrieve metadata in MARC21 or unqualified Dublin Core formats.
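An OAI-PMH harvest is a series of `ListRecords` requests, each continued with the resumption token returned by the previous response. The sketch below builds those request URLs. `OAI_BASE_URL` is a placeholder rather than the real feed address, `list_records_url` is a hypothetical helper, and the prefix and set names the feed accepts should be taken from its own documentation (`oai_dc` is the Dublin Core prefix every OAI-PMH repository must support).

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint: replace with the base URL given in the OAI documentation.
OAI_BASE_URL = "https://example.org/oai"


def list_records_url(
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and set on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{OAI_BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url())                         # first page, Dublin Core
    print(list_records_url(resumption_token="abc"))   # follow-up page
```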
Tab-delimited Files (Hathifiles)
Metadata describing all works in the HathiTrust collection are available for download as tab-delimited files. Full documentation on these metadata is available under Hathifiles Description. These files include some bibliographic metadata as well as data elements unique to the HathiTrust collection. You can use this data to do some analysis of the HathiTrust collection. Libraries may find these files useful for collection management and deciding which records to retrieve from HathiTrust or building links to HathiTrust works.
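For a sense of what such an analysis could look like, here is a minimal sketch that tallies the values of one column in a tab-delimited file. The column names and sample rows are made up for illustration. The real field list and order are given in the Hathifiles Description, and the sketch assumes the file has no header row.

```python
import csv
import io
from collections import Counter

# Hypothetical column layout; take the real one from the Hathifiles Description.
COLUMNS = ["volume_id", "access", "rights", "title"]

# Made-up sample rows standing in for a downloaded file.
SAMPLE = (
    "test.0001\tallow\tpd\tExample title A\n"
    "test.0002\tdeny\tic\tExample title B\n"
    "test.0003\tallow\tpd\tExample title C\n"
)


def count_by_column(lines, column: str) -> Counter:
    """Count how often each value appears in the named column."""
    # QUOTE_NONE treats quote characters in titles as ordinary text.
    reader = csv.DictReader(
        lines, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    print(count_by_column(io.StringIO(SAMPLE), "access"))
```

For a real file, pass an open file object instead of the `io.StringIO` sample. The reader streams row by row, so the whole file does not need to fit in memory.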
Renewal ID file
A tab-delimited file that pairs US copyright renewal registration numbers with HathiTrust volume identifiers is available for download. This data resulted from CRMS-US copyright reviews. In the context of CRMS historical copyright review data, a Renewal ID might represent a renewal registration for the exact edition, for a prior edition of the work published in 1923 or later, or for partial content within the volume such as a short story. The data file for download is available at https://www.hathitrust.org/files/CRMSRenewals.tsv.
For more information on the file see the page Renewal ID data file.
HathiTrust Data in Discovery Products
Publicly available HathiTrust data is ingested into a number of vendor discovery products, including Summon, EBSCO Discovery Services, Ex Libris Primo, OCLC WorldCat Discovery, and Innovative Interfaces Inc Encore. In addition, a number of knowledge bases include HathiTrust data, including EBSCO’s Knowledge Base and OCLC’s WorldShare Collection Manager. Other vendor products may also include HathiTrust data.
Quality and accuracy of HathiTrust data in vendor products may vary. We are happy to work with vendors to determine how best to characterize collections or sets and manage data over time. Contact our User Support team.
HathiTrust and OCLC records
OCLC and HathiTrust work together to synchronize WorldCat with the HathiTrust catalog nightly. HathiTrust records are added to WorldCat as e-resource records. The vast majority of records representing the HathiTrust collection are in WorldCat today, with links to the HathiTrust content.
Notes on working with HathiTrust data
While working with HathiTrust bibliographic metadata or digital content, it may be helpful to keep the following in mind.
The HathiTrust collection is not static. Works get added to the collection every day, and sometimes a digital item may be updated with a new version. Bibliographic records can be updated when contributors send us corrections. Copyright and access statuses may change as items undergo copyright review or we receive permissions agreements from copyright holders.
The HathiTrust collection is composed of works from over 50 different libraries located in the United States and around the world. Bibliographic records represent many different cataloging practices and may even be in different languages.
We work closely with our contributing libraries to try to correct errors in bibliographic records and digital content (including poor OCR). Users can notify us about errors using the “feedback” link in the header or footer of most pages. Because the originating library or vendor may need to make the change on their end, it may take a while for corrections to be made in HathiTrust.