The amount of content you can submit depends on a number of factors, such as the content estimates you provided previously and what other contributors have estimated and ingested.
There are no charges above the yearly partnership fee for materials deposited by partner institutions (we no longer charge a per-GB fee for ingest). All members share in the cost of maintaining public domain content, and maintenance costs of in-copyright content are shared among partners who have indicated via their print holdings data that they hold those volumes in their collection.
Namespaces are used to avoid clashes between identifiers. Whenever a new identifier scheme will be used, a new namespace is needed. Information about preferred identifiers is available in our Deposit Guidelines.
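As a rough illustration of why a namespace prevents clashes, the sketch below prefixes each local identifier with a namespace before using it as a key. The dot-separated layout, the function names, and the namespaces `inst_a` and `inst_b` are assumptions made for this example only; they are not taken from the Deposit Guidelines.

```python
from typing import Dict, Tuple


def qualify(namespace: str, local_id: str) -> str:
    """Join a namespace and a local identifier with a dot (assumed layout)."""
    if not namespace or "." in namespace:
        raise ValueError("namespace must be non-empty and contain no dot")
    if not local_id:
        raise ValueError("local identifier must be non-empty")
    return f"{namespace}.{local_id}"


def split_qualified(item_id: str) -> Tuple[str, str]:
    """Split on the first dot only, so local identifiers may contain dots."""
    namespace, sep, local_id = item_id.partition(".")
    if not sep or not namespace or not local_id:
        raise ValueError(f"not a namespaced identifier: {item_id!r}")
    return namespace, local_id


# Two hypothetical identifier schemes that both issue the local id "000123".
registry: Dict[str, str] = {}
registry[qualify("inst_a", "000123")] = "volume from scheme A"
registry[qualify("inst_b", "000123")] = "volume from scheme B"

# Both entries survive because the namespace keeps the keys distinct.
for item_id, label in registry.items():
    print(item_id, "->", label)
```

Without the prefix, the second assignment would silently overwrite the first, which is the kind of clash a new namespace is meant to prevent.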
A new Digital Assets Submission Inventory (DASI) is needed whenever the responsible entity, content provider, or digitization source is different from previous submissions and/or when the number of volumes or time period covered by the DASI has been exceeded. We prefer that a DASI cover only a one-year period and span multiple batches, rather than receiving a separate DASI for each batch.
A new administrative coversheet is needed when the data previously provided in the coversheet changes. Typical events that prompt submission of a new administrative coversheet include: migration to a new bibliographic management system, use of a new identifier scheme, and ingest of content from a new digitization source.
Yes, you can submit that content into HathiTrust. On the Digital Assets Submission Inventory, please indicate the HathiTrust partner in the “depositing institution” field and the name of the non-HathiTrust partner in the “content provider” field.
Currently, no such capacity exists. All content stored in HathiTrust is, at the least, accessible through the full-text search feature.
We try to communicate with project leads as things progress, and we will certainly be in touch when problems occur; however, for larger projects with large amounts of content, or for Internet Archive or Google content where ingest is automated, you may find it easier to track this yourself. We provide two kinds of ingest reports. Our ingest reports provide weekly overviews of ingest activities, and each report is broken down by partner. The ingest logs provide item-specific reports, also available for each week of activity. The ingest logs include the item identifier and whether ingest succeeded or failed.
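If you track ingest yourself, a small script can tally successes and failures from a week's ingest log. The sketch below is only illustrative: it assumes a hypothetical log layout of one item per line, with the item identifier and a status word (`succeeded` or `failed`) separated by a tab. The actual log format may differ, so adjust the parsing to match the files you receive.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_ingest_log(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count statuses and collect the identifiers of failed items.

    Assumed (hypothetical) line layout: "<item identifier>\t<status>".
    """
    counts: Counter = Counter()
    failed: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split("\t")
        if len(parts) < 2:
            counts["unparsed"] += 1  # keep a tally of lines we cannot read
            continue
        identifier, status = parts[0], parts[1].lower()
        counts[status] += 1
        if status == "failed":
            failed.append(identifier)
    return counts, failed


# Made-up sample data in the assumed layout.
sample = [
    "inst_a.000123\tsucceeded",
    "inst_a.000124\tfailed",
    "",
    "inst_a.000125\tsucceeded",
]
counts, failed = summarize_ingest_log(sample)
print(dict(counts))
print(failed)
```

To run it against a real file, pass an open file handle instead of the sample list, since the function accepts any iterable of lines.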
This depends on a number of factors. Ingest of the digital objects cannot begin until the bib records have been made available to repository systems, so bib data should be submitted before digital content is submitted. Once you submit your bib data, it takes 2 days for the records to be loaded to Zephir and then exported and added to the HathiTrust catalog. Once we have received the digital content, it takes time to remediate and package content for ingest into the repository. If there are significant problems with the content, we may send it back to you for additional work on your part. Other ingest activities are typically under way at the same time and also take up staff and machine bandwidth. Ingest of volumes occurs overnight, and there is a limit on the number of volumes that can be ingested in one night.
At this time we are only accepting master files in TIFF ITU G4 or JPEG2000 format. Our general practice is to compress continuous tone TIFF files into JPEG2000. Specific image requirements are described in our Technical Requirements for Digitized Page Images Submitted to HathiTrust. For information on packaging images for submission, see our Submission Package Requirements. These pages will also likely be helpful:
OCR is required wherever it can be generated. We recognize that OCR is infeasible or impossible for some materials (e.g., handwritten manuscripts).
Because lossless JP2s are large, we typically do not ingest them without first confirming why lossless compression is desired. A common reason is that the materials are special in some way, e.g., rare books or special collections materials, where it is important to preserve the artifactual elements of the content. For general collections materials where there is no particular reason to capture the artifactual elements, our general practice, based on research led by Harvard and on considerations of file size and storage, is to use a certain compression rate for images. We understand that you would like the highest possible quality files preserved. Our preservation practices, however, aim to match the fitness of preserved files to their intended uses, so we would like to understand whether any particular aspects of these files set them apart from other general collections materials, for which we have identified a compression rate that maintains quality while being sensitive to the resources expended for preservation.
A description of the formats that we download and our rationale is available in this document.
Internet Archive is just the delivery mechanism in this case, and materials that were not digitized directly by Internet Archive often do not meet our requirements and will be rejected. Prior to beginning ingest, please provide a few samples of typical items by linking us to the items in Internet Archive. We will inspect our preferred file formats to ensure they meet our specs. If other formats are desired for ingest, they should meet our specifications, and it’s possible that direct ingest may be preferable.