Update on December 2011 Activities

January 13, 2012

Late Breaking News

HathiTrust Passes 10 Million Volumes

View statistics and a timeline on the HathiTrust blog.

Top News

Changes to Tab-delimited Files

On February 1, HathiTrust will be adding three additional columns to the tab-delimited inventory files (“hathifiles”) available at http://www.hathitrust.org/hathifiles. The files are frequently used by partners and non-partners as a means to obtain full bibliographic records for HathiTrust items to load into local catalogs (see HathiTrust Data Availability and APIs). The additional columns will identify the publication date and publication location of volumes in HathiTrust, as well as volumes that have been identified as U.S. federal government documents.
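For anyone with scripts that read the hathifiles, the following is a minimal sketch of a reader that tolerates rows both with and without the three new trailing columns. The column names and the three-column "existing" layout are hypothetical placeholders, since the actual file layout is not described here; substitute the real column list before use.

```python
import csv
import io

# Hypothetical column names: the real hathifiles layout is not given here.
EXISTING_COLUMNS = ["volume_id", "access", "title"]
NEW_COLUMNS = ["pub_date", "pub_place", "us_gov_doc_flag"]
ALL_COLUMNS = EXISTING_COLUMNS + NEW_COLUMNS


def read_inventory(stream):
    """Yield one dict per tab-delimited row, padding rows that lack the new columns."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Older files have fewer fields; fill the missing ones with empty strings.
        padded = row + [""] * (len(ALL_COLUMNS) - len(row))
        yield dict(zip(ALL_COLUMNS, padded))


# One made-up row in the new layout and one in the old layout.
new_rows = list(read_inventory(io.StringIO("vol.001\tallow\tAn Example Title\t1898\tnyu\t1\n")))
old_rows = list(read_inventory(io.StringIO("vol.002\tdeny\tAnother Title\n")))
```

Padding short rows means a loader written against the new layout can still process files generated before February 1.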


Works Digitized Locally and by Internet Archive

Staff at Michigan continued conversations with staff at the University of Florida regarding ingest of locally-digitized materials, and staff at several other institutions regarding ingest of Internet Archive-digitized materials.

Working Groups and Committees

Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups for more information.


Communications Working Group

The Communications Working Group continued to work on a public services-oriented communications package, as well as announcements for new partners and the major milestone of 10 million volumes.

User Experience Advisory Group

The User Experience Advisory Group began reviewing the current home page and discussed additions and issues that will need to be addressed in a forthcoming redesign. Group member Jenny Emanuel contributed a "Perspectives from HathiTrust" blog post about the group's persona work that was completed in November. 

User Support Working Group

The User Support Working Group is still seeking nominations for new members. See the Update on November Activities for details.

The table below contains a summary of the issues received by the User Support Working Group in December.

Issue Type*                            November  December
Content                                      10         7
  Non-partner Digital Deposit
Cataloging                                   43        30
Access and Use                              103       107
  Print on Demand
  Inter-library loan
  Full-PDF or e-copy requests
  Data Availability and APIs
  Reuse of content
Web applications                             24        18
  Functionality problems
  Problems with login specifically
  General Questions about login
  Partners setting up login
  Usability issues
  Feature requests
Partner Ingest                                3         5
General                                      47        50

*See User Support Working Group Issue Types for a description of the types of issues included in each category.


Bibliographic Data Management System

Team members from California Digital Library continued work on processes to compare bibliographic records in Zephir, the new metadata management system under development, with records in HathiTrust’s existing system. Zephir team members continued to load and test new records as well, and refine the timeline for migration of bibliographic metadata management services to Zephir in coordination with staff at the University of Michigan.

HathiTrust Publishing (HTPub)

Staff at the University of Michigan revised the goal statement for HTPub (see the project web page) and plans for system architecture. Staff also began work on establishing a project timeline.

HathiTrust Research Center

Several changes were made to the HTRC leadership in December. John Unsworth, a key member of the Team at Illinois, accepted a position as vice provost for Library and Technology Services and chief information officer at Brandeis University. He will be leaving the University of Illinois but remain on the Executive Management Team. The Team will keep its base composition of 2 members from the University of Illinois and 2 from Indiana University, so this change will add one new member. Stephen Downie, Associate Dean for Research at the University of Illinois Graduate School of Library and Information Science, will fill the position left by John. Stephen’s research has focused on music information retrieval and data mining. This work has involved building significant infrastructure for research, including grappling with issues of allowing computational access to in-copyright material. Finally, Marshall Scott Poole is stepping aside as co-director of the HTRC for personal reasons, though he will remain on the Executive Management Team. Stephen Downie will take his place as co-director of the HTRC with Beth Plale, who is co-director on the Indiana University side. Beth also chairs the Executive Management Team. The changes are in effect as of January 1, 2012.

IMLS Quality Grant

In December, project staff completed physical review of more than 90% of the volumes in the first 1,000-volume sample drawn from HathiTrust. Staff are working to arrange on-site review, with cooperation from HathiTrust member libraries, for the approximately 70 volumes that are not available via inter-library loan due to poor condition, non-circulating status, or other reasons.

Project staff concluded page-level data collection for the second production sample in December (see the Update on September 2011 Activities for details on the composition of the sample). The full dataset will be sent to the project statistician in early January for analysis. Data collection for the third production run began in late December. The third production run focuses on Internet Archive-digitized volumes published pre-1923.

Project staff continue to define requirements for a new quality review interface, targeted specifically for review of volume-level errors such as missing, duplicate, and out-of-order pages. Please visit the project website for updates.

Development Updates

Full-text Search

Michigan staff released a new version of the full-text search index in December. The new release corrected an error in the “Original Location” metadata facet and provided additional metadata for advanced search and relevance ranking. It also made it possible for full-text search results and facets to reflect whether users from partner institutions are able to view in-copyright items in cases where lawful access is permitted (HathiTrust is currently pursuing access to in-copyright works for users who have print disabilities, for preservation uses, and in circumstances where works are copyright-orphaned). Access in these circumstances, which is still pending deployment to partners, depends on whether partner institutions own or previously owned print copies of the works in question and on whether users are located inside or outside the United States.

Michigan staff continued development on an advanced search feature for full-text search, including preliminary testing of the first working prototype in HathiTrust’s development environment.

California Digital Library continued work on a spelling suggestion feature for full-text search queries. A CDL developer established an account in the HathiTrust development environment and used a sample index of public domain materials to test strategies for automatically building a bigram dictionary of words with different spellings users might enter.
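As a rough illustration of how a bigram dictionary can support spelling suggestions, here is a minimal sketch. It assumes character-level bigrams and Jaccard similarity, neither of which is confirmed as the approach CDL is testing; the function names, the threshold, and the four-word vocabulary are hypothetical.

```python
from collections import defaultdict


def char_bigrams(word):
    """Return the set of adjacent character pairs, with boundary markers."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(vocabulary):
    """Map each bigram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in char_bigrams(word):
            index[gram].add(word)
    return index


def suggest(query, index, min_similarity=0.4):
    """Rank indexed words by Jaccard similarity of their bigram sets to the query's."""
    query_grams = char_bigrams(query)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    scored = []
    for word in candidates:
        grams = char_bigrams(word)
        score = len(query_grams & grams) / len(query_grams | grams)
        if score >= min_similarity and word != query.lower():
            scored.append((score, word))
    # Highest similarity first; ties broken alphabetically.
    return [word for score, word in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy vocabulary standing in for terms harvested from a sample index.
index = build_index(["colour", "color", "catalogue", "catalog"])
```

With this toy vocabulary, `suggest("color", index)` returns `["colour"]`, because the two spellings share most of their bigrams while the "catalog" variants fall below the threshold.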

Tom Burton-West's proposed talk, "HathiTrust Large Scale Search: Scalability meets Usability", was accepted by popular vote for the 2012 Code4Lib Conference in Seattle, WA.


Staff at Michigan released a new throttling mechanism for HathiTrust, which allows throttling levels to be set at more granular levels. Users are now less likely to be throttled in the course of normal use as the new throttling policies are applied to specific scenarios such as viewing thumbnail or page images, or downloading PDFs, as opposed to all use generally. Throttling ensures compliance with third-party restrictions on bulk download of materials, and helps to ensure a consistent and reliable experience for all users.
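The idea of applying separate limits to separate scenarios can be sketched as follows. This is an illustrative sliding-window limiter, not a description of the HathiTrust implementation; the scenario names, the numeric limits, and the `ScenarioThrottle` class are all hypothetical.

```python
import time
from collections import defaultdict, deque

# Hypothetical per-scenario limits: (max requests, window in seconds).
LIMITS = {
    "thumbnail": (600, 60.0),
    "page_image": (120, 60.0),
    "pdf_download": (5, 60.0),
}


class ScenarioThrottle:
    """Sliding-window limiter that tracks each (user, scenario) pair separately."""

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.history = defaultdict(deque)

    def allow(self, user, scenario):
        """Return True and record the request if the user is under the scenario's limit."""
        max_requests, window = self.limits[scenario]
        now = self.clock()
        recent = self.history[(user, scenario)]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= window:
            recent.popleft()
        if len(recent) >= max_requests:
            return False
        recent.append(now)
        return True


throttle = ScenarioThrottle(LIMITS)
first_request_ok = throttle.allow("reader-1", "page_image")
```

Because each scenario keeps its own history, exhausting the PDF download limit in this sketch does not block the same user from continuing to view page images.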


In connection with HTPub, Michigan staff continued work to adapt the HathiTrust PageTurner to display XML content.

Security Risk Assessment and Vulnerability Test

Michigan Library staff continue to work with central IT security analysts to complete the Risk Assessment that was started in November, and have received the final report of the vulnerability penetration test. The report revealed no vulnerabilities that enabled direct or indirect access to the repository, but it noted software issues such as a cross-site scripting vulnerability and made recommendations for increased firewalling at the Michigan site. All software issues noted in the report were addressed in December. A broader firewalling project for the data center where the Michigan instance is hosted is already in progress but not yet complete, so some provisional steps were taken to tighten security while that effort continues.


HathiTrust services were inaccessible or diminished for several periods in December due to problems related to the release of the new throttling system (all times EST). All page viewing was affected on Tue, Dec 13, 4:25-4:30pm; Wed, Dec 14, 11:10am-12:00pm; and Wed, Dec 21, 7:30-10:30am. Full-book PDF download was affected on Tue, Dec 13, 3:45-5:00pm. Additionally, page viewing of volumes classified as "Public Domain in the United States" in HathiTrust was intermittently unavailable on Wed, Dec 21, from approximately 1-4:30pm EST due to an apparent outage of an externally hosted proxy detection system.

HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org.

Papers & Presentations

All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth

As of November 1:

Institution December Total
Columbia University 4
Cornell University 9,871
Duke University 21 4,522
Harvard University 434 53,440
Indiana University 324
Library of Congress 15,769 89,411
North Carolina State University 0 3,196
University of North Carolina - Chapel Hill 0 8,087
Northwestern University 237 5,649
New York Public Library 76
Penn State University 1,821
Princeton University 350
Purdue University 0
University of California 114,906
The University of Chicago 1,730
University of Illinois 0 14,503
Universidad Complutense 28 108,668
University of Michigan 22,907 4,504,601
University of Minnesota 916
University of Wisconsin 15,902 527,334
University of Virginia 12 47,396
Utah State 0 46
Yale University 0 23,674
Total 185,311 9,966,572

Public Domain (~27%)

Total* 50,434 2,712,626

January Forecast

  • Continue work on the advanced search feature for full-text search

