Perspectives from HathiTrust

Collection Creation Case Study: African American Fiction

jbelle — Fri, 29 Jul 2022 18:15:18 +0000

By Chris Powell, Coordinator, Encoded Text Services, University of Michigan Library

I've been building collections since the summer of 2008, when collection building functionality was added to HathiTrust's predecessor, MBooks. What started as a task to demonstrate the functionality and utility of the feature has turned into an ongoing fascination with stumbling onto a topic and seeing if I can locate material on that topic in HathiTrust. If there is intriguing material, I inevitably build a collection – generally a private one for my own use – but if there are enough titles and a reasonably interesting focus I will make a public collection. One such collection is African American Fiction, which contains fiction (including plays) by Black American authors.

I believe that I started this collection in early 2019, probably after seeing some Black History Month list or another. It is possible that the continuing popularity of Nella Larsen's Quicksand on the HathiTrust top ten list spurred me to see what else might be available that people would be interested in reading if they knew it was in HathiTrust. This is the point where I need to say that while I'm a librarian, I'm not a reference librarian and I'm not a scholar in the field, so while the collection is fairly carefully curated, it's not in any way authoritative. I'm just a curious person with good searching skills who believes that other people probably share my curiosity, so I should pass along what I find.

As I usually do, I started with the Advanced Full-text Search and searched for the title and author of each book on the list. Although about a thousand catalog records from Emory University carry a local Index Term-Genre/Form field containing "African American author.", there is no straightforward way to search the entire collection for Black authors, or for fiction. You have to work by authors and titles individually. The Advanced Full-text Search returns results for individual copies of books, so you can check the boxes for the volumes you want and immediately add them to a collection. Early on I had to decide whether to include copyrighted works (identified in the interface as Limited (search-only)) and whether to add duplicates of individual titles. I opted to include search-only works, to choose the earliest version of each title, and, when a title was open for reading, to include multiple copies in case of bad scans or heavy reader annotations.

Over time, I filled out the collection by consulting other sources. The stacks are right outside my office in the Hatcher Graduate Library and the Zs (bibliography) are just a half-floor away, so I picked up some bibliographies that were on the shelves: Afro-American Fiction, 1853-1976; A Century of Fiction by American Negroes, 1853-1952; A Selected Bibliography of Black Literature: The Harlem Renaissance; and Black American Women in Literature. In early 2020, a colleague at another institution tweeted a list of the "100 greatest books ever written by African American women" and I searched for the authors included, choosing not only the greatest novels but others that were in HathiTrust as well. This list and Black American Women in Literature together probably skew the collection toward women, so I noted that in the collection description.

Finding out about the Novel Collections at the Black Book Interactive Project was probably the biggest leap forward. As I mention in my collection description, I worked my way through the entries for 1800-1965, where the greatest number of titles open for reading might be found. At some point I'll resume searching but I found fewer and fewer titles as the 20th century progressed, so it became a bit of an unrewarding slog. That doesn't stop me from searching individual titles as I find them, though. I read a review of James Hannaham's Delicious Foods after the Kara Walker cover caught my eye, and while it isn't in HathiTrust, his earlier novel God Says No is. I am adding volumes as I encounter them, and always appreciate a pointer to a new resource. If you have titles to suggest for inclusion in the collection, pass them on to feedback@issues.hathitrust.org.

Read more about creating HathiTrust collections or watch this video.

HathiTrust's Top Ten Books of 2021

jbelle — Thu, 10 Feb 2022 16:09:19 +0000

Passing by Nella Larsen. New York, A. A. Knopf, 1929. (Image credit: IMDB)

With the Netflix release of the 2021 movie based on the book, the novel Passing soared to the top of the list in the HathiTrust collection.

Member Perspectives on the Collection: Findings from the 2020 Community Week Session by the Digital Collection Strategy Working Group

jbelle — Wed, 11 Aug 2021 15:42:27 +0000

Authored by: Digital Collection Strategy Working Group (DCSWG)

During HathiTrust’s inaugural Community Week in October 2020, the Digital Collection Strategy Working Group (DCSWG) hosted two facilitated conversations to explore member perspectives on issues related to the content of the HathiTrust Digital Library (HTDL), including diversity and representation in collections, real and perceived barriers to contributing content to HathiTrust, and potential approaches to identifying and addressing collection gaps in the HTDL.

During the sessions, several participants highlighted their library’s renewed emphasis on diversity, equity, and inclusion in their collections strategies. HathiTrust member libraries identified projects in the following areas:

Acquiring materials focused on social justice
Working with more diverse publishers and vendors
Connecting collections to curricular initiatives addressing diversity
Developing values-based collection development plans

The members of the DCSWG see these and similar initiatives as opportunities for HathiTrust to partner with, support, and benefit from member libraries seeking to make their collections richer and more representative. As an example, HathiTrust could work with member libraries acquiring books from BIPOC publishers to ensure that these materials are digitized and contributed to the HTDL, thus broadening the impact of these important works and ensuring their digital preservation and long-term access.

In order to add content to the HTDL, libraries must establish the necessary local workflows to enable digitization and ingest; in both Community Week sessions, participants described barriers to participation in this aspect of HathiTrust’s work. HathiTrust’s process for submitting materials for ingest is perceived as dauntingly complex, especially for libraries with limited staff capacity in areas such as digitization and metadata. Of particular interest to members of the DCSWG was the observation that libraries, especially those newer to HathiTrust membership, would appreciate more guidance about the kind of content HathiTrust seeks. Members are eager to help fill gaps in the HTDL but are unsure of how to prioritize their library’s activities in support of this goal.

In addition to discussing how their own libraries could contribute to the growth of the HTDL corpus, attendees brainstormed about other possible paths for community contribution of important content, especially materials that could address underrepresented groups and subject areas. For example, public libraries, especially those with specialized holdings such as the Chicago Public Library’s African-American cultural collections, could be very welcome contributors of material for digitization. Thinking beyond HathiTrust’s current focus on digitized monographs and serials, participants suggested that archival and born digital materials be considered for inclusion in the HTDL as well.

The DCSWG recognizes that moving into new areas like these will be a significant shift for HathiTrust and will require careful thought, planning, and resourcing, as well as significant input from and investment by members. A commitment to filling collection gaps and diversifying the HTDL will require including content that has been subject to historic barriers, such as structural racism in collecting practices and infrastructure and funding limitations, which have contributed to its exclusion from traditional publishing platforms and digitization efforts in the past. The DCSWG looks forward to further engagement with HathiTrust members over the coming year. Together we can reach the goal of including a broader range of content and more diverse voices found in (and perhaps beyond) current members’ collections in the expanding HTDL.

1950 U.S. Census Publications Available in HathiTrust

jbelle — Fri, 16 Apr 2021 17:32:19 +0000

In anticipation of the National Archives releasing the Seventeenth Census of the United States, 1950 on April 1, 2022, we are highlighting a new collection curated from reports available in HathiTrust, the U.S. Census of Population: 1950. We expect that the 2022 release of the head-of-household primary source records will generate tremendous public interest in the census, and this collection of published census volumes in HathiTrust offers a complementary and equally rich insight into the United States at the beginning of a pivotal decade.

Collection Description

HathiTrust contains many digitized reports from the 1950 census of population, and we’ve gathered key reports into this collection. If you are not familiar with these reports, here’s an overview of what you will find:

Volumes 1 & 2 contain a national overview as well as reports from every state.
Volume 3 contains reports about metropolitan areas, divided into census tracts (tracts contain around 4,000 people and 1,400 housing units).
Volume 4 contains reports on various topics, including occupation, place of birth, and education.
Volume 5 contains special reports, which are indexes to data reported elsewhere, plus a report on farms and farmers.

Also included are the procedural studies on how the census was conducted, and the census monograph series, which draws on government data reported elsewhere and highlights significant results of the 1950 census. These monographs provide detailed analyses of topics such as fertility, families, and agriculture; title examples include American Agriculture: Its Structure And Place In The Economy; Immigrants And Their Children, 1850-1950; Social Characteristics Of Urban And Rural Communities; American Families; Residential Finance, 1950; America’s Children; and Fertility Of American Women. Census monograph reports were not produced for 1930 and 1940 due to the Great Depression and war efforts respectively. From these titles, one can see that these reports provide open access, primary source, quantitative and qualitative data of interest to all disciplines.

To have all of these census reports immediately accessible in one location is a great service for researchers given the difficulty of finding and traveling to a library that has a complete set of print reports. This collection makes it possible for scholars, students, and librarians to browse more than 190 digitized reports. The availability of keyword searching within the scoped collection makes it easier to find buried topics or obscure geographic locations, and can save researchers a considerable amount of time. For example, keyword searching for the name of a state will bring up publications that don’t have the name in the title.

The collection is easily accessible under the “Collection” header once you’ve accessed HathiTrust, or directly by using this shareable link: https://babel.hathitrust.org/cgi/mb?a=listis&c=1986287266.

A Group Effort

This collection is an example of what one group can accomplish using their shared expertise. Working together, the HathiTrust Federal Documents Advisory Committee created this resource for researchers interested in the 1950 census and the discovery of lesser-known or less-used 1950 census special reports. To guide their work, the committee used the Bureau of the Census Catalog of Publications, 1790-1972 (SuDoc #: C 56.222/2-2:790-972) which provides a complete list of publications associated with the 1950 census.

The committee created two evaluation rubrics to assist in selecting the best volume available in HathiTrust. These were Digitized Rendering Quality: text, images, and gutters/margins; and Metadata Quality: item description overall, volume enumeration and chronology, analytic cataloging of monographic series, and completeness of series. Because the committee was familiar with the reports, it knew to watch for their very narrow interior margins, which can be a digitization problem when scanning volumes that are not disbound and can result in incomplete page scans.

The group also used a spreadsheet to inventory the reports, to track HathiTrust identifiers such as volume ID# and volume URL, and to record important notations. Searching HathiTrust yielded a complete list of the reports found and also identified potential gaps, that is, reports that may not be in HathiTrust. From this inventory, reports were selected for the collection, and HathiTrust staff assisted with the final steps necessary to create a new collection in HathiTrust.

Further Assistance with Census Reports

The HathiTrust Federal Documents Advisory Committee encourages you to browse, use, and promote this collection of federal documents, which are available to HathiTrust members and the public. We also encourage you to seek out federal depository library collections near you and to consult with depository coordinators at those libraries. If you have questions about any of these reports, please contact a library that is part of the Federal Depository Library Program or contact the U.S. Census Bureau.

For information on the U.S. Census of Population: 1950 HathiTrust collection or to report quality issues, contact feedback@issues.hathitrust.org.

Assessment and Advancement of our Shared Print Program

jbelle — Thu, 18 Mar 2021 17:25:00 +0000

Contributed by Heather Weltin, Shared Print Program Officer

“We are stronger and better when working together in most things, especially in preserving resources in the future.” - discussion attendee

At the end of August 2020, HathiTrust began a process to gather feedback from our then 77 retention libraries regarding their views and opinions about our Shared Print Program. The goal was not only to learn about satisfaction levels, but to also discuss and identify potential new services and enhancements for our program.

Our assessment was conducted through a survey of our retention libraries and discussion groups. The survey was HathiTrust’s first opportunity to hear directly from retention libraries regarding their opinions on the goals, satisfaction, and potential new services of the shared print program. During the survey period, we also organized 7 stakeholder discussions with 36 participants representing 32 member institutions. The stakeholder participants were identified through a review of members that are shared print retention libraries and have shown a high level of commitment to digitization and ingest.

Below is a summary of our key findings from both the survey and discussions, and an outline of advancements we plan to focus on for the rest of this year and into next, but we encourage you to read the full report here for more robust details.

As one respondent noted, “I see the benefits [of the shared print program] as extending beyond the membership. All libraries, especially research libraries, have an obligation to preserve and share (in print and digital forms) the collective collection now and in the future.”

Key Findings

Member Satisfaction and Program Value

● Most libraries have a high level of satisfaction with the HathiTrust Shared Print Program (HTSPP) and their own library’s level of participation in the program.

● The HTSPP goal of preserving both the print and digital collections remains important and is one of the main reasons why libraries participate.

● Several services proposed in the HathiTrust Print Monographs Archive Planning Task Force Report[1] still resonate with participants but a few ideas mentioned there were perceived to be of lesser value today than in 2015.

Services

● HTSPP is seen as an expansion of membership and a service that members are inclined to take part in simply as part of their roles as members. It is of high value because, like HathiTrust’s other services, HTSPP demonstrates that libraries can do more collectively than alone.

● Analytics and data are essential to enhance shared print programs. HathiTrust and members require this data to responsibly manage local and global collections, assign value to shared print, and have a better understanding of collection risks.

● Preservation continues to be seen as a unique value of HTSPP because of the connection between the digital and print, but access to materials is also important. More data is needed to understand the risks to preservation that stem from circulation for the global shared print collection.

● In general, there are very few barriers to participation in the HTSPP, but data requirements and submissions (where improvements are wanted), insufficient local resources, and a lack of talking points around the value of shared print are challenging for some.

Future Directions

● The lack of verification and condition assurances on commitments is of concern to members when thinking about long term preservation of shared print items. On the other hand, members recognized that not requiring verification of committed volumes made the barriers to participating as a retention partner quite low.

● Expansion of HTSPP is critical to respondents. Extending the length of commitments, which is currently 25 years, is of less concern; rather, HTSPP should focus on new commitments, new formats, and data gathering first.

● There is general interest in developing a non-circulating sub-collection of commitments, as long as the focus is on special collections or collections of distinction and future digital access is considered. However, respondents indicated it would be difficult to commit space to non-circulating items.

● Unless unique services provided by retention libraries are involved (validation, digitization services for shared print, etc.), financial compensation for retention libraries is seen as less important.

● More scanning and digitization are needed and considered closely aligned with shared print and opportunities for development, but enabling unique discovery to delivery systems for just HathiTrust shared print commitments is a lower priority. Members would rather we focus on helping improve the landscape of current resource sharing services for shared print materials.

Using the survey and discussion findings as our guide, the Shared Print Advisory Committee (SPAC) will be focusing on all of these throughout the year, but overall our key focus will be on data and analytic capabilities and needs, considerations around more preservation-like efforts (e.g., validation and commitments), and the collaborative role HathiTrust can play for our members around different shared print initiatives. Above all else will be our role in finding a balance among new services and ideas, their costs, and the value they add for our members.

[1] HathiTrust Print Monographs Archive Planning Task Force Report. June 2015. https://www.hathitrust.org/print_monographs_archive_charge.

Beyond Access: Using ETAS to Improve 15K Catalog Records

jbelle — Wed, 12 May 2021 19:22:50 +0000

May 13, 2021

With the onset of the pandemic, many member librarians found themselves working remotely for the first time. While some work easily translated to a new virtual reality, other work did not, and those teams and individuals found themselves turning to creative solutions to continue the important work of their library. Leigh Billings, Metadata Management Librarian at the University of Michigan Library (HathiTrust member since 2008), discovered a way to continue a project that had previously relied on interaction with physical volumes. She and others at the UM Library used digital access to in-copyright works provided by ETAS to review and correct cataloging errors and omissions.

Using the scans available through the Emergency Temporary Access Service (ETAS), the project focused on cleaning up bibliographic records with missing or erroneous metadata that originated from the initial transference of information from the printed catalog cards into online MARC records.

Leigh says, “Many of these materials are stored offsite, and before HathiTrust ETAS access was provided, we had to request that materials be physically transferred, creating issues with the availability of staff to pull materials and space to store them — a similar project in 2019 took over 9 months to fix about 1,200 records in one unit, while in the past year staff working remotely have been able to fix over 15,000 records.”

In addition to enabling member patrons to continue teaching, learning, and researching despite library closures due to the pandemic, the temporary access service also provided a means to improve discoverability of the U-M library collection overall, a benefit that will extend beyond these turbulent times.

“We simply would not be able to do this work without the ETAS access and scans. It was a matter of perfect timing: Folks needing remote work and having the ability to view materials online came together to allow this project to happen.”

HathiTrust Access to 1.4 Million+ U.S. Federal Documents

Heather — Fri, 08 May 2020 19:13:24 +0000

By HathiTrust Federal Documents Advisory Committee

Are you now faced with the difficult path of continuing your scholarly work remotely? Have a research question involving federal government information? Or just looking for interesting reading? During the current health emergency when many libraries are closed, HathiTrust’s U.S. Federal Documents Collection remains open for use, with resources addressing government response to pandemics throughout U.S. history.

What Can You Find in the Collection?

The HathiTrust U.S. federal documents collection is a “library at web scale,” made up of digitized volumes of the print items held in research libraries. One of the largest sets of openly available U.S. federal publications on the web, our collection covers the full range of topics that intersect with the U.S. federal government. Wondering which questions have been included in the U.S. Census, going back to the first census in 1790? Here is a document for that. Wondering about various imports to the U.S. in the 1800s, including opium? We have that too. Or maybe you want to read the 9/11 Commission Report, the congressional hearing on the Titanic disaster, or a U.S. Navy cookbook, with recipes.

Our collection of digitized Congressional publications is likely the largest outside of the federal government and commercial providers. Some examples of the range of Congressional hearings in HathiTrust include the topics Record labeling (1985); Russia : how Vladimir Putin rose to power and what America can expect (2000); Public safety and civil rights implications of state and local enforcement of federal immigration laws (2009); and U.S. Membership in the World Health Organization (1947).

Our digital collection is unique in that it also reflects the breadth and the “long tail” of what is held in research library collections. In particular, you can find a wide variety of Executive Branch (agency and department) publications, for example the Department of Energy’s assessment of The Potential Climatic Effects of Increasing Carbon Dioxide (1985), NASA’s Commercial Development Plan for the International Space Station (1998), and the newsletter of the President’s Council on Physical Fitness and Sports.

If you’re looking for a diversion, try our Never A Dull Moment collection of fun finds that features both popular and offbeat federal publications. Included are cookbooks, travel and language guides, theatrical performance scripts, and comic books, as well as the timely title Telecommuting: a 21st century solution to traffic jams and terrorism (2006).

A Historical Pandemic Perspective

HathiTrust is a source of rich historical context regarding the COVID-19 pandemic. For example, the collection includes documentation of federal government responses to epidemics from Avian Flu to H1N1; Congressional hearings regarding the medical supply chain and the SARS threat; serial publications from the Surgeon General tracking the Influenza outbreak of 1918; and more recent reports from the National Center for Infectious Diseases covering Emerging Infectious Diseases.

Using HathiTrust to Access U.S. Federal Publications

As government information librarians, we encourage anyone who has internet access to make use of this rich collection of federal publications in HathiTrust. You can access it via browser or mobile device, and the Using the Digital Library help section or Information Sheet for Students offers general guidance. The Federal Depository Library Program (FDLP)-hosted presentation Using the HathiTrust Fed Docs Collection also provides background.

HathiTrust offers both full-text and “catalog” search of the entire collection (17.4 million volumes, including documents). Almost all U.S. federal documents are in the public domain and available for reading and page download. If you are affiliated with a HathiTrust member institution you can log in with your university credentials to do all this, plus create collections of publications within HathiTrust and download public domain items.

Search Tips for HathiTrust U.S. Federal Documents

We have a number of suggestions for zeroing in on U.S. federal documents and tips for using the search capabilities of HathiTrust specifically for federal documents, based on front-line librarians’ experience:

Use the Author Facet: The simplest way to find federal publications in the HathiTrust Digital Library is to perform any search, then use the author facet in the left-hand navigation bar to limit the results to a particular government agency, or to the U.S. Government Printing Office in general. One advantage of the author facet is that state and federal agencies with the same name (such as “Department of Agriculture”) are easily distinguished, so you can view federal materials without state or even foreign agency materials in the mix. Or, choose a subject facet that includes the phrase “government publications” in some way, and add a few additional subjects for more highly refined results.
Understand Publishing Protocols: Being aware of early 20th century government publishing protocols helps in interpreting HathiTrust’s search results. Many federal agencies’ most important reports were issued in series with such prosaic titles as Bulletin, Circular, and perhaps the most creatively challenged one of all, Miscellaneous Publication. If such titles show up in your full-text search results, this could be a sign that you have uncovered some highly relevant material. In series like these, each issue was structured like a full-length book, focusing on a single topic in great depth. For example, the 1913 document with the cover title Ten Hour Maximum Working-Day for Women and Young Persons is in HathiTrust under the title Bulletin of the United States Bureau of Labor Statistics. The take-away is that you might find some of the best material hidden under titles like “Bulletin.” Don’t hesitate to click on them and search deeper into the text if they turn up in your full-text search results.
Dig into the Statistics: One of the common reasons for seeking government information is to obtain trustworthy statistics. Most figures are presented in tables, but tables do not always read accurately in OCR, which can stymie a full-text search. Instead, be like the librarians of yesteryear and simply consult the Statistical Abstract of the United States. Published annually from 1878 to 2012, it’s the most complete compendium of statistics from all the major federal agencies. Although it presents only the statistics that were frequently requested, each table is cited, so one can use it as a starting place for tracking down the source agency’s own publications to find even more data.
A Tip on OCR Misreads: When searching for words in the full text of 18th-century documents, try spelling words with an “f” in place of a medial “s.” Printers of the period used the long s (ſ) in that position, and OCR software often misreads it as the letter “f.” To find information on horses, include the variant spelling horfes. Similarly, you can try Congress or Congrefs, British or Britifh, commissioner or commiffioner, president or prefident, etc.
Know Your Stages from Your Stages: One of the best full-text search tips applies not only to government documents but to all historical publications in HathiTrust. Choose words and phrases that were commonly used during the desired time period. For example, one may start with a search for the word “stagecoach” but, upon viewing results, discover that it used to be written as two words (stage coach) and that it was often abbreviated as “stages,” as in the sentence, “The steamboat line connects with stages and railroad cars running east to Chicago.”
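
The OCR tip above lends itself to a small helper for generating search variants. The following is a minimal sketch in Python, not part of HathiTrust's search tools: the function names `long_s_variant` and `search_terms` are hypothetical, and the rule used (replace every lowercase "s" that is neither the first nor the last letter of a word with "f") is a simplifying assumption drawn from the examples in the tip.

```python
def long_s_variant(word: str) -> str:
    """Return `word` with each medial lowercase 's' replaced by 'f'.

    Assumption for illustration: "medial" means any position other than
    the first or last character of the word.
    """
    if len(word) < 3:
        return word  # no medial characters to replace
    return word[0] + word[1:-1].replace("s", "f") + word[-1]


def search_terms(word: str) -> list[str]:
    """Return the modern spelling plus its OCR variant, if one exists."""
    variant = long_s_variant(word)
    return [word] if variant == word else [word, variant]


if __name__ == "__main__":
    for term in ["horses", "Congress", "British", "commissioner", "president"]:
        print(search_terms(term))
```

Run on the examples from the tip, this pairs each modern spelling with horfes, Congrefs, Britifh, commiffioner, and prefident; either form can then be tried in a full-text search.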

HathiTrust U.S. Federal Documents Collections

Given that HathiTrust is huge, we’ve scoped some full-text searchable sets as starting points for you to discover U.S. federal documents in HathiTrust. These collections include all of the in-scope volumes digitized by our member libraries, and we continue to add to them and fill in gaps as new digitized items come in.

U.S. Federal Documents: The more than 1.4 million volumes in HathiTrust that have been identified as U.S. federal documents. A great place to start if you want a way to narrow your search at the beginning.
U.S. Congressional Serial Set: The Congressional Serial Set contains a wealth of information on any topic discussed by Congress over the course of the last 200+ years (a lot!). This collection gathers all the digitized Serial Set materials that we have in one searchable place.
Bureau of Indian Affairs publications: This collection includes publications on such topics as education, health care, art, environmental impact assessments, ethnographic studies, and government policies towards Native Americans over time.
U.S. Environmental Protection Agency publications: This collection contains materials produced by the U.S. Environmental Protection Agency as well as some predecessor offices. Topics include water quality studies, pollution abatement assessments, gas mileage, and environmental impact assessments.
Foreign Relations of the United States: Foreign Relations of the United States, produced by the Department of State, is a series that “presents the official documentary historical record of major U.S. foreign policy decisions and significant diplomatic activity.”
Statistical Abstract of the United States: A complete set of Statistical Abstract of the United States volumes produced by the U.S. government between 1878 and 2012.
U.S. Civil Rights Commission: This collection gathers together publications produced by the U.S. Commission on Civil Rights, including annual reports, investigations of inequity including age and racial discrimination, and the text of hearings held around the country in the 1960s and 1970s.

HathiTrust’s U.S. Federal Documents collection is an outcome of many years of library investment in digitization and curation, and HathiTrust is committed not only to broad access, but also to digital preservation of these documents for the long term. We thank the many libraries who have contributed!

--HathiTrust Federal Documents Advisory Committee

It's No Secret - Millions of Books Are Openly in the Public Domain

keden — Thu, 10 Oct 2019 04:00:00 +0000

By
Kristina Hall, Copyright Review Program Manager, HathiTrust
Greg Cram, Associate Director of Copyright and Information Policy, New York Public Library

Since 2008 the HathiTrust Copyright Review Program has been researching hundreds of thousands of books to find ones that are in the public domain and can be opened for view in the HathiTrust Digital Library. Over the past 11 years, 168 people across North America have worked together for a common goal: the ability to share public domain works from our libraries. As of September 2019, the HathiTrust Copyright Review Program has performed copyright reviews on 506,989 US publications; of those, 302,915 (59.7%) have been determined to be in the public domain in the United States. The opening of these works in HathiTrust has brought the total of openly available volumes to 6,540,522.

The Copyright Review Program, now an operational program of HathiTrust, began as a grant-funded ambition of the University of Michigan Library, under the leadership of Melissa Levine. The Institute of Museum and Library Services (IMLS) funded three consecutive grants enabling the University of Michigan Library and grant collaborators to build a copyright review management system. The program is still going strong eleven years later, resulting in hundreds of publications determined to be in the public domain each week.

One way the Copyright Review Program determines the copyright status of items in the HathiTrust corpus is to determine whether they were properly renewed. In the United States, the copyright in works published between 1924 and 1964 had to be renewed about 28 years after the item was published; works that were not renewed moved into the public domain when their initial term of protection expired. The Stanford Copyright Renewal Database was one of the first to host monograph renewal records in an open access database, but much of the initial copyright registration information remains difficult to search.

In 2018, the New York Public Library began the difficult process of unlocking the record of American creativity embedded in the US Copyright Office’s Catalog of Copyright Entries (CCE), which comprises 450,000 pages of registration and renewal records. The CCE is the published index of the records that are critical to understanding the copyright status and ownership of copyrighted works. The Copyright Office has been working to make images of these records available online, but searching these imaged records with precision and confidence remains elusive. No search function exists to reliably search the entire CCE; instead, users rely on analog techniques by opening multiple digitized volumes and paging through the records.

NYPL has embarked on an effort to enable accurate searching of the CCE by converting CCE records for 1923–1977 publications into a machine-searchable format. To make the records searchable, NYPL has begun to extract the CCE data as text. NYPL’s approach is to accurately transcribe the data, then parse the data into the appropriate fields so that users can facet their searches. The raw data is then made available on the project page and is freely accessible and usable. NYPL is actively gathering user stories for how users might access and use this data to build a set of requirements for a search interface.

HathiTrust has been enthusiastically following the work of NYPL to determine the possibilities of this data set. Copyright determination is an information problem at heart. The work that NYPL is doing to make information about copyright registrations available in a powerfully searchable format will greatly assist libraries who want to make digital collections broadly available to the public.

Before we jump to the possibilities, remember that not all books lacking a copyright renewal are public domain. From our experience, the media articles claiming 80% of titles from this period are public domain don’t appear to take into account the complexities such as:

restoration of copyright for foreign authorship or foreign publication
layers of copyright in a translated work
qualifying for copyright registration in another format like serialized novels, drama, poetry, lectures, and short stories
inclusion of materials reproduced by permission such as illustrations
renewal of a prior edition

HathiTrust sampled a small set of the NYPL registration records, specifically records lacking a renewal where the item had not yet been opened as public domain in HathiTrust. HathiTrust staff discovered that the data could help prioritize items awaiting a copyright review and identify places where incomplete metadata in catalog records was hindering the search for items in the public domain. Out of 1,946 registration records in the NYPL sample set, 15% were already awaiting a HathiTrust copyright review and could be prioritized to go first. Another 18% were completely new items to add to the review queue; they had originally been passed over due to a missing place of publication in the catalog record. A further 42% had already received a HathiTrust copyright review and were either opened as public domain or had encountered one of the previously mentioned complexities. The remaining 25% had some indication in the catalog record that a public domain outcome would not be likely, such as having a foreign publisher.
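
To make the proportions above more concrete, here is a minimal Python sketch that turns the reported percentages into approximate record counts. The category labels and the helper name are ours, and because the published percentages are rounded, the counts are estimates only, not figures reported by HathiTrust.

```python
# Approximate record counts for the NYPL sample, derived from the rounded
# percentages quoted in the text. Labels are informal shorthand (assumption).
SAMPLE_SIZE = 1946

shares = {
    "already awaiting review": 15,
    "new to the review queue": 18,
    "already reviewed": 42,
    "public domain outcome unlikely": 25,
}


def approximate_counts(total, percentages):
    """Convert whole-number percentages into rounded record counts."""
    return {label: round(total * pct / 100) for label, pct in percentages.items()}


counts = approximate_counts(SAMPLE_SIZE, shares)
for label, n in counts.items():
    print(f"{label}: about {n} records")
```

The four shares add up to 100%, so every record in the sample falls into exactly one category; the rounded counts may differ from the true figures by a record or two.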

Based on the sample set, HathiTrust has begun acting on the NYPL data to prioritize items for copyright review. The next round of work will include efforts to match more of the NYPL records to HathiTrust records. Then HathiTrust will be able to identify more books where insufficient metadata in the catalog record has prevented earlier review.

NYPL’s efforts to convert the entire CCE to machine-searchable text continue. NYPL was awarded a National Leadership Grant from IMLS in July 2019 to convert another 10,000 CCE pages, which would complete the registrations for Class A works registered between 1970 and 1977. At the conclusion of this grant, NYPL will have converted all Class A registrations for books registered between 1923 and 1977. Because the Copyright Office’s modern, machine-readable records begin in 1978, completing these registrations would close the gap to the historical records and enable searching across nearly a century of records. In addition to IMLS, NYPL’s project has been generously funded by the Ford Foundation and Arcadia Fund.

This is an exciting time for libraries as we strive to make digital collections broadly available. HathiTrust and NYPL are just one example where new access to old data has helped improve ongoing work in copyright. Yes, millions of books are public domain, it’s no secret. We are grateful for the chance NYPL and HathiTrust have to work together and put more public domain books in the hands of the public.

HathiTrust Shared Print Program - From Task Force Recommendations to Today

weltin — Thu, 22 Aug 2019 20:42:40 +0000

By Heather Weltin, HathiTrust Shared Print Program Officer

Since its start in 2016, the HathiTrust Shared Print Program has stayed true to its goals: to ensure preservation of print and digital collections by linking the two, to reduce overall costs of collection management for HathiTrust members, and to catalyze national/continental collective management of collections. All three were identified as major goals in the Task Force Report.

Today, HathiTrust member libraries have committed to retain almost 18 million monograph volumes for 25 years under the HathiTrust Shared Print Program. These volumes represent more than 5.4 million individual titles held in the HathiTrust Digital Library (about 76% of all HathiTrust digital monographs). As we wrap up the current phase of the program, we thought it was a good time to look at the history and success of the program compared to the original Task Force recommendations, and ponder the future.

A Blueprint for Shared Print: Task Force Report

The HathiTrust Shared Print Program has its roots in 2011. In October 2011, HathiTrust members approved a ballot initiative to develop a Distributed Print Monographs Archive corresponding to volumes represented within HathiTrust Digital Library. From Spring 2014 to March 2015 the HathiTrust Print Monographs Archive Planning Task Force worked on a Final Report outlining recommendations for a HathiTrust Print Monographs Archive, now called the HathiTrust Shared Print Program.

The HathiTrust Print Monograph Archive Planning Task Force Final Report provided a useful blueprint for the HathiTrust Shared Print Program. The goal of the Shared Print Program, called HathiTrust Shared Print Monographs Program (HTSPMP), was to secure matching print copies of every digital monograph in the HathiTrust Digital Library’s corpus and to do so through a quick launch of the program. The Task Force Report had an “initial target of matching 50% of the digital collection of monographs, roughly 3,000,000 titles, and will be built from large commitments by a group of volunteer libraries.”

Starting Phase I and Expanding to Phase II

In June 2017, HathiTrust launched Phase 1 of the Shared Print Program. During that initial Phase, 50 HathiTrust member libraries committed to retain more than 16 million volumes for 25 years. This is equivalent to 4.8 million individual book titles held in the HathiTrust Digital Library (about 65% of all HathiTrust digital monographs). Through volunteer member library commitments, HathiTrust was able to secure more print commitments than the 50% estimate the Task Force Final Report expected during Phase 1.

Following the quick launch of Phase 1, Phase 2 focused on a more thoughtful collective collection building analysis. HathiTrust worked with Sustainable Collection Services (SCS) to analyze print holding records across shared print retention libraries and to develop retention commitment lists. For this Phase, we also built in one of the Task Force’s recommendations “to disperse commitments geographically.” In Phase 2, which was completed in June 2019, we secured another 1,578,626 monographs (more than 750,000 individual titles) distributed across the U.S. Census locations.

We adopted several Task Force recommendations that would help establish a lightweight program that relied on the “common interests and values, low-cost voluntarism, and services currently in place among members.” Following the recommendations, we accepted retention commitments for circulating items, placed no conditions on the type of facility or shelving in which they would be held, and did not require validation or verification of these retained volumes.

Because of the comprehensive details outlined in the initial Task Force Final Report and the number of responses from Member Libraries interested in participating, HathiTrust’s Shared Print Program was able to quickly launch a new program while also building on and strengthening existing shared print programs, allowing it to become the largest monograph shared print program nationally.

Using the Blueprint to Envision the Future of HathiTrust’s Shared Print Program

As we look to the future, we'll begin to reevaluate several Task Force Final Report recommendations that we deferred. These recommendations include infrastructure development, discovery and access of materials, and business modeling. HathiTrust’s Shared Print Program relies upon our Shared Print Registry that records the commitments made by all of the Retention Libraries. This data includes item-level information for commitments and limited metadata. As our program develops, we will start investigating ways to develop and provide Members with collection reports comparing library print holdings to retention commitments. We also plan to include the ability to search, display, and download information about shared print retention commitments made by the Member's own or another HathiTrust library.

Along with these capabilities, we will also turn an eye towards resource sharing networks and the ability to facilitate discovery to delivery of HathiTrust Shared Print Program materials. We plan to rely on the strength of existing resource sharing networks in order to do this. The goal is to enhance resource sharing for this collection and not disrupt services already in place at all Member Libraries.

Disclosures of commitments in local systems will continue to be a cornerstone preservation aspect of our program in order to prevent deselection of these volumes that the retention libraries have committed to retain. HathiTrust will continue discussions with OCLC regarding their shared print registration service. We anticipate including HathiTrust’s Shared Print Program commitments in aggregated data on library collections OCLC accumulates in order to support shared print collection management and advancement.

Intentionally Growing the Shared Print Community

The initial costs for developing and maintaining the repository’s technical infrastructure were funded by HathiTrust. As we begin to develop member services and evolve the Shared Print Program, HathiTrust will work toward understanding the total cost of shared print activities among the retention libraries and at HathiTrust, to better ensure that the program is financially resilient and sustainable well into the future. While the initial Task Force Final Report did evaluate options, several years have passed so we plan to start reviewing again.

With all of the above, HathiTrust plans to focus on ways to ensure an annual pace of growth for our program. These ideas include securing commitments on newly deposited materials, the expansion to different formats, and partnering with other Member Libraries. As called for in the Task Force Final Report, our Shared Print Program will continue to enable HathiTrust to “make further, transformative impacts on the management of libraries” and enable and “encourage a community-wide approach to the management of the collective collection by producing a critical mass of public retention commitments, defining new preservation and collection management standards, and catalyzing enhanced service development.” Between growth and development in services and structure, HathiTrust’s Shared Print Program will continue to position itself as a leader in print retention as it has been in digital preservation.

Read more about the program, see lists of retention libraries, program metrics, and the original planning documents online at HathiTrust Shared Print Program.

2019: HathiTrust at Mid-Year & Upcoming Opportunities

furlough — Tue, 06 Aug 2019 17:54:45 +0000

Six months into the first year of implementing Strategic Directions 2019-2023, I want to share some highlights and take a look at where we are headed in the next six months. HathiTrust’s Strategic Directions, 2019-2023, reinforces our commitment to digital preservation and access, to our role as stewards of the cultural record, our interdependent success as libraries, and to our collective work to support this mission. The specific investments and actions we are undertaking in 2019 align with the plan’s overarching objectives to Transform, Enhance, and Empower.

Between January 2019 and June 2019, HathiTrust and its members

Welcomed new staff Eleanor Dickson Koehl (Digital Scholarship Librarian) and Heather Weltin (Shared Print Program Officer).
Approved formal Membership Criteria, Fee Model changes, and Board Bylaws to clarify policies and support diversity in membership.
Improved accessibility for those with print disabilities by updating the user interface and features of the HathiTrust Digital Library.
Opened nearly 54,000 publications published in 1923 that are now in the public domain in the United States. Over 48,000 additional volumes are now also open to global users.
Completed Phase 2 of the Shared Print Program, with 79 libraries committing 5.4 million individual print titles also held in the HathiTrust Digital Library (about 76% of all HathiTrust digital monographs).
Added eight new members: Clemson University; Monash University (Australia); University of Auckland (New Zealand); University of Kentucky; University of North Texas; University of Southern California; Queen’s University (Canada); San José State University.

HathiTrust staff and member-led advisory groups continue to advance other strategic projects, as well. We are also pleased to have a role in the Mellon Foundation-funded Federating Repositories of Accessible Learning Materials for Higher Education (FRAME).

Between July 2019 and December 2019, we will continue to improve accessibility for those with print disabilities; plan new services of the Shared Print Program; delve deeper into collection analysis; and build our organizational expertise by hiring an Enterprise Architect and a Metadata Analyst.

In the near term, I want to draw your attention to several opportunities to participate that you’ll hear about in the coming weeks. As core HathiTrust staff planned for the latter half of 2019, they identified several activities in which members can meaningfully participate and add their unique perspective. These include opportunities to

Serve on one of several new Program Steering Committee working groups and task forces, coordinated by PSC chair, Karla Strieb. We will soon begin to establish, charge, and invite members.
Host a HathiTrust Research Center 2019-2020 Workshop in the new series coordinated by Digital Scholarship Librarian, Eleanor Koehl.
Register for the 2019 Member Meeting, to be held October 23, 2019 in Rosemont, IL, coordinated by the program planning group.
Support access for those with print disabilities on your campus by passing along improved accessibility documentation or enrolling your campus in the Equitable Access service for those with print disabilities.

Please consider each opportunity and pass along the information to those in your organization who may be interested. I appreciate all that you have contributed to our work in 2019 thus far and look forward to seeing many of you at the October Member Meeting.

mike

HathiTrust Marks Global Accessibility Awareness Day 2019 by Planning Website Update

azaytsev — Thu, 16 May 2019 13:47:34 +0000

By Angelina Zaytsev

On Global Accessibility Awareness Day 2019, when many developers and designers commit to learning more about accessibility and improving the websites under their control, HathiTrust looks forward to an exciting update that will improve the accessibility of the HathiTrust Digital Library. The Digital Library portion of the website allows users to search, retrieve, and read books and other items within the library. This update to the Digital Library interface will be made in the next couple of weeks (follow @hathitrust on Twitter to receive news).

In the past few months, we have undertaken a comprehensive accessibility evaluation of our Digital Library, which was conducted by University of Michigan student, Luke Kudryashov (read more about our Accessibility Review Process below). As a result of this review, we identified a number of areas where we could improve the HathiTrust website and make it easier for users with disabilities to navigate the website and read books.

The last significant overhaul of the website occurred in 2013. Internet technologies have changed in the last six years, allowing us to do things in 2019 that weren't possible in 2013. In addition, recommendations have changed, and additional guidance and best practices for creating accessible web content have emerged. This was a lesson to us of the costs of stasis: even if we haven't changed, the world around us has.

While these changes are intended to most benefit users who navigate the Digital Library using screen reader software and other adaptive technology tools, our sighted patrons will also notice some of the changes that will improve their ability to use the Digital Library.

Changes Coming Soon

We will soon release the following updates that collectively increase the accessibility of the HathiTrust Digital Library:

Replace search tabs with radio buttons
Increase the base font size from 13-16 pixels to 16 pixels throughout the site
Increase the contrast of text and other visual elements
Move some content from the left side of the screen to the right to allow users to get to critical content more quickly
Move “jump to page” feature to the bottom of the book display
Add new scrollbar to navigate between pages in the book display
Modify use and function of pop-ups in the book display
Fix spelling and labelling issues for features detected by screen reader software
Enhance book display graphics to help users identify when a page is loading vs. blank
Implement more informative download dialog to tell users what percentage of the download is complete
Improve tab focus throughout the digital library

The following screenshot depicts some of the expected changes. Overall, the style and design of the Digital Library will remain familiar, with some small tweaks in the look of the site.

For more on HathiTrust’s commitment to increasing access to the HathiTrust Digital Library for patrons with low-vision or other disabilities, visit the Accessibility page on the website.

HathiTrust's Accessibility Review Process

We reviewed the accessibility of the HathiTrust Digital Library using the following methods. Although many automated tools can check some elements for accessibility, these tools must be used in coordination with manual processes to identify all accessibility problems.

Checked the website against the main website accessibility guidelines, notably:

W3C WCAG 2.1 at Levels A and AA,
United States Revised Section 508 standards, and
European Union EN 301 549 standards (PDF).

Conducted manual keyboard checks to make sure all features are accessible using only a keyboard without a mouse.
Tested color contrast with the ColorZilla Chrome extension and the WebAIM Color Contrast Checker.
Tested the site’s compatibility with NVDA (in Firefox) and VoiceOver (in Safari) screen readers.
Tested various operating system and browser settings to ensure that the site preserved user-determined settings (color contrast, overall zoom, text zoom, font type, text spacing).
Tested downloadable PDFs with NVDA and VoiceOver.
Evaluated the consistency of navigation and layout of analogous pages.
Evaluated the accuracy and understandability of error identification and suggestions for correction.
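
For readers curious about what a color contrast check involves, the following Python sketch computes the contrast ratio defined in WCAG 2.x and compares it with the Level AA thresholds (4.5:1 for normal text, 3:1 for large text). This is only an illustration of the published formula; it is not the code behind ColorZilla or the WebAIM checker, and the function names are our own.

```python
# Sketch of the WCAG 2.x contrast-ratio calculation for two sRGB hex colors.
# Function names are illustrative (assumption), not part of any tool named above.

def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Ratio of the lighter to the darker luminance, each offset by 0.05."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """Check the WCAG Level AA minimum contrast for text."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


print(round(contrast_ratio("#000000", "#ffffff"), 2))
print(passes_aa("#767676", "#ffffff"))
```

A check like this only covers one criterion; as noted above, automated checks need to be paired with manual review.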

Many resources provide more information about conducting accessibility reviews, and we recommend starting with the W3C’s “Evaluating Web Accessibility Overview” and WebAIM’s many articles.

HathiTrust Members: Opening State and Local Agriculture Documents

keden — Tue, 07 May 2019 14:49:50 +0000

Wealth of Agriculture Documents

HathiTrust Digital Library is a trove of preserved agriculture documents and traditional or historical practices. Several institutions have made these items available using a Creative Commons license, including a recent release of more than 1,000 agricultural bulletins and reports published by Michigan State University’s Agriculture Experiment Station and Cooperative Extension Office.

Suzanne Teghtmeyer, Agriculture, AFRE, Botany, Forestry & Horticulture Librarian with Michigan State University Library, has been instrumental in working with MSU to get the university’s agricultural documents opened. MSU’s story is part of a growing awareness among HathiTrust members that they can play an important role in securing rights information for their institution’s own publications.

Document Collection: Agricultural Documents of Michigan State University

MSU Extension Bulletin 468: Michigan Agriculture. https://hdl.handle.net/2027/uiug.30112019644803?urlappend=%3Bseq=117

Ensuring Continuing Access

What was going on one hundred years ago to fight a disease, grow crops, or prepare food? The state publications of the Agriculture Experiment Station and Cooperative Extension Office preserve a wealth of information for researchers today. Most states produced them. The mission of land-grant institutions such as Michigan State University was to ensure continuing access to these documents for research and historical agricultural practices. Many are brittle and in need of preservation.

Localized Agricultural Research

Suzanne routinely gets calls from Agriculture Experiment Station personnel and Extension educators asking for access to older documents. Sometimes they’re requested through interlibrary loan. Now that the collection is available to access through HathiTrust, patrons can be pointed directly to the HathiTrust online repository.

The agricultural documents of Michigan State University are a record of university research from the university’s early days. The localized nature of the agricultural research highlights regional crops, so Michigan would have more information on cherries and blueberries. For someone wanting to grow those crops using a heritage method, the collection gathers options for those not interested in using modern technologies. The information is tailored to the growing conditions and thus is very state and region oriented.

Belle Sarcastic, the record holding milk producer. Annual report of the Agricultural Experiment Station, Michigan State University. Vol. 9, 1896. https://hdl.handle.net/2027/uc1.$b647868?urlappend=%3Bseq=266

Copyright Challenges

Unlike federal documents, state publications published after 1923 may be protected under copyright law. They remain categorized as such, and thus in limited view in HathiTrust, unless individually reviewed for copyright status or opened with a Creative Commons license. Many of these agriculture documents were freely disseminated to the public at the time of publication. In recent years there has been a big push to get all the bulletins and circulars digitized, preserved, and released to the public.

The biggest challenge for Suzanne was in finding and identifying the rights holder of the documents. She didn’t know if that would be the Director of the Agricultural Experiment Station, Dean of the College of Agriculture, or the University Board of Trustees. After speaking with the Library's copyright librarian, Suzanne reached out to MSU Technologies, the technology transfer and commercialization office that manages the university's extensive intellectual property portfolio. They in turn reached out to the agricultural deans for guidance and permissions, and once they received it, the Creative Commons license was signed.

“The National Agricultural Library is also in support of public dissemination of these documents,” says Suzanne. “It makes the nation stronger in the realm of food and livestock production. If they know which states have experiment station bulletins available in HathiTrust they can direct people to find the open access information.”

To view all MSU agriculture documents, both Creative Commons licensed and public domain, see the collection Agricultural Documents of Michigan State University.

HathiTrust Members Act to Open More Ag Docs

HathiTrust members Cornell University, Texas A&M, and the University of California have all been actively improving access to agricultural documents through Creative Commons licensing agreements for open access. For a good summary of how to get your agricultural documents open in HathiTrust see “Opening Ag Pubs in HathiTrust” by Cornell University, University of California, and the California Digital Library, https://sites.google.com/view/open-pubs-in-ht/home. Also see McGeachin, Robert B. 2017. “Set Your Agricultural Publications Free in the HathiTrust Repository.” Journal of Agricultural & Food Information, 18:1, 3-8, DOI: 10.1080/10496505.2016.1263201. https://pubag.nal.usda.gov/catalog/5621977

If you are interested in working with HathiTrust to get your institution's publications open with a Creative Commons license, Kristina Eden of HathiTrust will present a webinar with tips for compiling a list of items and working with your institution to seek permission.

“Opening Your Institution’s Publications in HathiTrust”
May 30, 2019 from 2-3pm ET
Kristina Eden, Copyright Review Program Manager
No registration required. Anyone with the link can join: https://bluejeans.com/957875719

Opening Up 15,000+ Federal Documents: An Algorithm Story

Heather — Wed, 10 Apr 2019 16:52:38 +0000

By Heather Christenson

Over 15,000 federal documents previously in limited view are now full view in HathiTrust thanks to an adjustment in our metadata management system. This is something to celebrate for a number of reasons. First and foremost, these thousands of digitized federal documents are now available to anyone! Additionally, we’ve improved our system so that new federal documents coming into HathiTrust will be identified more effectively and thus available in full view. This work is a great example of librarians owning and managing our system, and in this case, engaging with an algorithm and optimizing it for public good.

When our libraries contribute digital volumes to HathiTrust, we receive bibliographic data along with them. For many volumes in HathiTrust we have received multiple contributions of the same title, both the metadata and digital volumes. In those cases the multiple bibliographic records are clustered together in our system, with one record chosen as the preferred record that is then used in HathiTrust discovery and access services. The preferred record is chosen by a record scoring algorithm that assigns points to different aspects of each record and chooses the record with the highest score. Importantly, the viewability status of an item relies on data within the preferred record, so viewability within the HathiTrust Digital Library can be affected by which record the algorithm designates “preferred”.

In HathiTrust, it is our intent to provide federal documents in full view to the extent legally permissible. A group of us decided to look at the record scoring algorithm to see if there were adjustments that could be made to prefer any record within a cluster that indicates that a given volume is a federal document.

For some libraries, cataloging of federal documents has necessarily been minimal, and this is generally reflected in the bibliographic data that HathiTrust receives. We also saw cases where more richly cataloged bibliographic records for federal documents — that had been sometimes specifically invested in by member libraries — were not being chosen as a preferred record by the record scoring algorithm.

We found that the record scoring algorithm’s generalized actions sometimes chose preferred records for federal document volumes that did not include the information that the volume was a federal document. When these preferred records were acted upon in the bibliographic rights determination process, we were not able to identify them as federal documents, so in many cases were not able to present them to users in full view.

We decided to improve the algorithm by assigning higher weight to records indicating U.S. federal document status. (For those wanting details: the algorithm checks the MARC 008 field for two flags, an “f” in 008/28 and a “u” in 008/17).
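
As a rough illustration of the kind of check described above, the Python sketch below tests the two 008 positions and adds a bonus to a record's score before picking the preferred record in a cluster. The function names, the record layout, the base scores, and the bonus value are all hypothetical; this is not HathiTrust's actual scoring code or weighting.

```python
# Hypothetical sketch: give extra weight to records whose MARC 008 field
# flags a U.S. federal document. Names and weights are assumptions.
FEDERAL_BONUS = 100  # assumed weight, not HathiTrust's real value


def is_us_federal_document(field_008: str) -> bool:
    """Return True if 008/28 is 'f' and 008/17 is 'u' (0-based positions)."""
    if len(field_008) < 29:
        return False
    return field_008[28] == "f" and field_008[17] == "u"


def choose_preferred(records):
    """Pick the highest-scoring record in a cluster.

    Each record is a dict with an 'id', a 'base_score' (assumed to come
    from the rest of the scoring algorithm), and the raw 'f008' string.
    """
    def score(record):
        bonus = FEDERAL_BONUS if is_us_federal_document(record["f008"]) else 0
        return record["base_score"] + bonus

    return max(records, key=score)


def make_008(place: str, gov_pub: str) -> str:
    """Build a 40-character 008 string with only the two positions of interest set."""
    chars = [" "] * 40
    chars[15:18] = list(place.ljust(3))  # 008/15-17: place of publication code
    chars[28] = gov_pub                  # 008/28: government publication code
    return "".join(chars)


cluster = [
    {"id": "rec-a", "base_score": 40, "f008": make_008("dcu", " ")},
    {"id": "rec-b", "base_score": 25, "f008": make_008("dcu", "f")},
]
preferred = choose_preferred(cluster)
print(preferred["id"])
```

In this toy cluster the second record has a lower base score, but the federal document bonus makes it the preferred record, which is the effect the adjustment was meant to achieve.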

By making this change we’re now providing more than 15,000 federal documents in full view that were previously in limited view, bringing the total to 1,182,913 federal documents in full view in HathiTrust as of April 1, 2019. As new digitized volumes and associated bibliographic data are contributed by our member libraries, federal documents will be identified more effectively and thus provided in full view.

Examples of the range of federal documents now in full view include volumes of Statistics of Income from the Internal Revenue Service, initially requested by a user, and this U.S. Senate Hearing on the Role of Giant Corporations, from 1971. Or, the briefly titled Leaflet from the U.S.D.A., covering the Cotton Aphid, The Meadow Spittlebug and How to Control It, or Centipedes and Millipedes in the House (eek!). These publications and many many more can be found in the U.S. Federal Documents collection.

Reflecting on this project, it is important to recognize that it’s possible to make these kinds of improvements because we as libraries control the processes, we can bring collective expertise to bear on problems, and we do this in the spirit of greater access!

Many thanks to all who contributed to this project, especially: Tim Prettyman, Charlie Collett, Angelina Zaytsev, Josh Steverman, Kathryn Stine, Sandra McIntyre, and former HathiTrust staffer Valerie Glenn.

Part II Popular LibGuide Draws on HathiTrust Repository: Development and Maintenance of an Integrated LibGuide

jbelle — Thu, 04 Oct 2018 17:41:07 +0000

Marie’s inspiration for the “Prices and Wages by Decade” LibGuide was the well-worn, paperback copy of The Value of a Dollar. With the HathiTrust’s collection of digitized U.S. Federal Documents, as well as those maintained by FRASER (Federal Reserve database), Marie has been able to build the site using links to primary source documents freely available in the public domain. Hosting and maintaining the site requires a site host, some website tools, a little raw labor, and a lot of passion for making information more easily available.

“This fills the gap that the general Internet can’t,” she explains, noting that Internet records or born-digital items in the federal government tend to express more recent data. Using the digitized and historic federal documents contributed to the HathiTrust corpus by member institutions, Marie provides a searchable, reputable source database. People want what she calls “fine-tuned historical evidence.”

The site points to U.S. government publications listing retail prices for typical consumer purchases and wages for common occupations. This includes prices on everything from homes to “boxes of corn flakes . . . canned corn . . . and dairy products.” In an example of retail prices from 1955, the site links directly to a page in HathiTrust’s digital repository citing Agricultural Statistics from the U.S. Department of Agriculture, digitized and contributed by The Ohio State University (HathiTrust member since 2008).

To build and operate the site, she uses her library’s existing content management system to integrate HathiTrust repository links, references, and stats. For some details, she relies on manual coding by student researchers. To track the site usage and measure its reach, Marie uses Google Analytics and Data Studio. But there’s always more she’d like to do . . .

She works on the site over the weekends and, after 28 years in the library field, recognizes it as a highlight of her career. “There are so many more things to add to it!” she says. When she identified that most users of the LibGuide came from a mobile device, she optimized the site for mobile use. She’d like to add an accordion feature to make the amount of information easier to read.

As HathiTrust continues to acquire digitized U.S. Federal Documents and fill in missing volumes, Marie will continue to enrich the data in the “Prices and Wages by Decade” LibGuide and to refine the site. The 500,000+ annual site visitors may not know the behind-the-scenes story, but they will continue to gain the benefits.

What will you do with HathiTrust?

Read Part I: Unleashing the Power of U.S. Federal Documents

Is your institution using HathiTrust materials in a similar way? Do you have a resource you would like to share with other HathiTrust member institutions? Please contact Jessica Rohr, member engagement and communications specialist, jbelle@hathitrust.org.

HathiTrust Research Center Extends Non-Consumptive Research Tools to Copyrighted Materials: Expanding Research through Fair Use

jbelle — Thu, 20 Sep 2018 17:53:02 +0000

HathiTrust has reached a tremendous milestone in the history of its services and those of the HathiTrust Research Center (HTRC).

Since 2011, HTRC has been developing services and tools to allow researchers to employ text and data mining methodologies using the HathiTrust collection. To date, this service has been available only on the portion of the collection that is out of copyright. With the development of a landmark HathiTrust policy and an updated release of HTRC Analytics, HTRC now provides access to the text of the complete 16.7-million-item HathiTrust corpus for non-consumptive research, such as data mining and computational analysis, including items protected by copyright.

This extraordinary opportunity to use copyrighted materials for non-consumptive research purposes expands research access to the entire HathiTrust digital collection, which is sustained by HathiTrust’s 140+ member libraries. Researchers may access HTRC’s easy-to-use computational tools ideal for beginners, as well as more complex tools to meet advanced data analysis needs.

Tool	Description	Access
HTRC Algorithms	A set of tools for assembling collections of digitized text from the HathiTrust corpus and performing text analysis on them.	Including copyrighted items for ALL USERS.
Extracted Features Dataset	Dataset allowing non-consumptive analysis on specific features extracted from the full text of the HathiTrust corpus.	Including copyrighted items for ALL USERS.
HathiTrust+Bookworm	A tool for visualizing and analyzing word usage trends in the HathiTrust corpus.	Including copyrighted items for ALL USERS.
HTRC Data Capsule	A secure computing environment for researcher-driven text analysis on the HathiTrust corpus.	Public domain for all users. Exclusive member benefit: full corpus access for the Data Capsule service, including copyrighted items.

How is This Possible?

This work has been several years in the making. A primary goal of HathiTrust is to enable the widest possible lawful research and educational uses of the HathiTrust collection. In recent years, US courts have recognized the solid legal basis for non-consumptive research on copyrighted materials. In 2016, HathiTrust established a working group to develop the Non-Consumptive Use Research Policy to ensure the responsible research use of copyrighted items.

The policy is now enacted in an updated release of HTRC Analytics, which allows researchers to conduct computational text analysis on copyrighted items as permitted under US copyright law. Non-consumptive research use DOES NOT change the legal status of items protected under copyright.

Thanks to all who have helped HathiTrust reach this milestone in our 10th anniversary year. HathiTrust looks forward to supporting researchers in using these resources.

Additional Resources

HTRC Analytics
HTRC Help & Documentation
Chart on HTRC Analytics Tool Access
Getting Started with HTRC Guide
HathiTrust Non-Consumptive Use Research Policy

If you have specific questions or need help getting started, please contact htrc-help@hathitrust.org. Media inquiries, please contact Jessica Rohr, Member Engagement & Communications Specialist, at jbelle@hathitrust.org.

Popular LibGuide Draws on HathiTrust Repository: Unleashing the Power of U.S. Federal Documents

jbelle — Fri, 07 Sep 2018 13:53:12 +0000

Prior to becoming the Head of Government Information & Data Archives at the University of Missouri (HathiTrust member since 2011), Marie Concannon spent ten years at the State Historical Society of Missouri. A self-proclaimed history lover and purveyor of government sources, she enjoyed answering visitor questions such as ‘What did such-and-such cost in such-and-such time period?’

When she joined the University of Missouri, she discovered that one of the most popular reference books in the collection was a tattered copy of The Value of a Dollar, a collection of United States prices and wages over the decades with data derived primarily from the U.S. Bureau of Labor Statistics.

Always considering how to better connect patrons with sought-after data, Marie thought, “I should just tell them where to find the information!” And thus she began building the “Prices and Wages by Decade” library guide.

Marie knew she wanted to make a website and in 2013-2014, she enlisted the help of a work study student to get it started. Hosted at the University of Missouri, the site uses HathiTrust’s collection of more than 1.2 million US federal documents for 75% of its references. When the student graduated and Marie looked through the usage statistics of the site, Marie realized what a powerful resource she had created. “People are finding this! Oh my gosh. I’ve got to fix this thing up,” she recalls thinking.

She further refined the site to improve the HTML and basic search functionality. Today, the interactive site receives approximately 1,300 hits per day and more than 500,000 annually, more than the library's main webpage. Marie says the site has already hit its half-million mark in 2018.

So just who is finding and using the “Prices and Wages by Decade” LibGuide?

“Most researchers appear to be in education in some way,” Marie observes. The site is free and open to anyone online. She receives email messages from individuals at middle and high schools, as well as instructors in higher education. Teachers often reference the site or information derived from the site in classroom materials.

While California and Texas account for tens of thousands of visitors, usage data shows visits from researchers around the world. In the month of June 2018, for example, Marie identified visits from Denmark, Venezuela, Macau, Oman, Brunei, Canada, Australia, and the UK, as well as Guam. Because U.S. governmental documents often include source data from countries outside the United States, researchers may find data pertaining to more than just United States wages and prices. Currently, Congressional Hearing reports citing costs of prescriptions in the 1970s are among the most popular.

The site receives the most traffic on weekdays (owing to its strong educational user base), but Marie has observed a jump on the weekends in certain categories of questions.

“‘How much did beer cost in 1978?’ is a big theme,” she laughs. Even that essential data is preserved in the HathiTrust repository and made even more accessible through “Prices and Wages by Decade.”

In Part II: Development and Maintenance, go behind-the-scenes to learn how Marie integrates HathiTrust sources into the site, how other member institutions can do the same, and what she does to keep it relevant and running today.

Thousands of Historical California Legislative Publications Digitized and Openly Available Online!

jbelle — Fri, 07 Sep 2018 13:20:06 +0000

By Paul Fogel, Manager & Technical Lead, Mass Digitization, California Digital Library

California historical legislative research just got a bit easier. As a result of a collaboration between the California Office of Legislative Counsel and librarians at the University of California, Stanford University and the California State Library, nearly 4,000 California Assembly and Senate publications are now online and have been opened for reading access to everyone worldwide. They are available in the HathiTrust Digital Library as a featured collection, as well as individually in Google Books.

The project was initiated at the University of California's California Digital Library (CDL) by current HathiTrust Program Officer for Federal Documents and Collections Heather Christenson. CDL worked with California's Office of Legislative Counsel to clarify language in recently approved California Assembly Bill no. 884 to confirm that the collected set of historical publications of California legislative output is indeed in the public domain and can be broadly shared. The recently opened volumes were digitized as part of the Google Books project from copies collected by UC Berkeley and many other university libraries and have been aggregated in the HathiTrust Digital Library, a partnership of over 140 academic and research libraries.

What are the California Legislative publications?

The collected California legislative materials include introduced bills, amended bills, and statutes of the California Assembly and Senate, some dating back to 1849, as well as published materials that support, augment or contextualize the bills. The supporting materials include legislative journals (dating back to 1849) that briefly summarize the proceedings of the California legislature (recording votes, who proposed or withdrew what legislation), in addition to journal appendices (which contain annual and special reports from various executive departments, commissions, and special panels, along with Senate and Assembly committee reports and documents), final bill histories, and constitutional amendments and resolutions.

What is in this collection?

In the table below are given the total number and date ranges for each of the various types of publications represented.

Type	Number of Volumes	Dates
Bills	1,832	1911 - 1988
Statutes	183	1849 - 2008
Final Histories	32	1973 - 2010
Journals	1,483	1849 - 2009
Journal Appendices	268	1855 - 1931
Constitutional Amendments	55	1964 - 2016
Final Calendars	80	1899 - 2011
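
As a quick check on the "nearly 4,000" figure quoted earlier, the volume counts in the table can be summed. The short Python sketch below does only that arithmetic; the dictionary name is ours, and the numbers are copied from the table.

```python
# Volume counts copied from the table above.
volumes_by_type = {
    "Bills": 1832,
    "Statutes": 183,
    "Final Histories": 32,
    "Journals": 1483,
    "Journal Appendices": 268,
    "Constitutional Amendments": 55,
    "Final Calendars": 80,
}

total_volumes = sum(volumes_by_type.values())
print(total_volumes)  # 3933
```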

The collection is not yet comprehensive, as there are gaps in the series for each publication type, and work will continue to locate copies of missing volumes, to digitize them, and to include them in this set of open materials. One characteristic to note is that the historically published and bound volumes of California Bills do not capture all versions and revisions of a bill in the same way as the state's Legislative Information website (http://leginfo.legislature.ca.gov) or the Office of the Chief Clerk of the California State Assembly (http://clerk.assembly.ca.gov/archive-list). Often, only the introduced version and an amended version of a Bill are printed in the publication.

Another caveat of this collection is that the volumes sourced for digitization were library copies that have been available for patron use for decades and may contain library stamps and marginalia. Occasional errors resulting from the digitization process may also be present.

To contribute to the efforts to complete and correct these materials, please send communications to HathiTrust at feedback@issues.hathitrust.org.

Use of the Collection

The HathiTrust collection built from these materials can be found at https://babel.hathitrust.org/cgi/mb?a=listis;c=1808948120. Gathering these materials into a distinct collection within the HathiTrust Digital Library enables users to search within the full text of these materials without having to also search the full text of HathiTrust's 16 million other books. The library's interface allows users to browse, directly read each volume, download one-page-at-a-time, and share. Individuals affiliated with one of HathiTrust’s 140+ member institutions have special access to download full-work PDFs of the volumes. Data mining and textual analysis can be performed on the publications in the HathiTrust Research Center.

The authors would like to thank the following people for their support and contributions to this project: Diane Boyer-Vine and Lindsay Pealer from the California Office of Legislative Counsel; UC Berkeley's Erik Mitchell, Elizabeth DuPuis, Marlene Harmon, Julie LeFevre, and Jesse Silva; Jutta Wiemhoff of the University of California's Northern Regional Library Facility (NRLF); Ivy Anderson, Katie Fortney, Renata Ewing, Kathryn Stine and Paul Fogel from the California Digital Library; Stanford University's Kris Kasianovitz; Heather Christenson, Jessica Rohr, Kristina Eden, Angelina Zaytsev and staff from the HathiTrust Digital Library; and the Google Books project. For more information, please contact feedback@issues.hathitrust.org.

1) The California LegInfo website only covers bills 1999 to the present, although a former version of the site does provide access to Assembly and Senate Bills from 1993-2016. The website of the Office of the Chief Clerk of the Assembly collects Journals, Histories and Statutes going back to 1849, as well as collecting Daily publications dating back to 1995, but does not support text searches.

2) The text is produced from optical character recognition (OCR) systems and will contain errors.

Findings from the Federal Documents User Needs Survey

Anonymous — Wed, 20 Jun 2018 20:37:14 +0000

By Valerie Glenn, Federal Documents Analyst, and Heather Christenson, Program Officer for Federal Documents and Collections

As reported previously, the HathiTrust US Federal Documents Program has been conducting a user needs investigation to learn more about users of HathiTrust’s federal documents. As part of this investigation, in April 2018 the HathiTrust Federal Documents Team conducted a survey of users, asking for feedback on their use of HathiTrust to access federal documents content. We received 185 responses, both from individuals affiliated with HathiTrust member libraries and from individuals who are not affiliated with HathiTrust. All respondents were from North America, primarily the United States. The majority of respondents were librarians or library staff.

HathiTrust: A “Go To” Place for US Fed Docs

Overwhelmingly, respondents have visited the HathiTrust Digital Library looking specifically for US federal documents. Of those who had not (3), 2 definitely used federal documents in their results and the other was not sure.

We also asked respondents if they thought of HathiTrust as a place to go for access to US federal documents. Of the 173 responses, 151 answered “yes” and 22 answered “no.” Among the popular “Yes” reasons were:

Lots of digitized publications available
HathiTrust has many documents that can’t be found elsewhere
Open/free access to the full text
Great access to older content
The libraries who deposit content with HathiTrust are also Federal depository libraries with large collections

Among the popular “No” reasons were:

I try to find information through federal websites
I generally use HathiTrust for searching full text of older books
We can’t download them (this is primarily a concern of non-members)

Reasons to Access Federal Documents in HathiTrust

We also asked why respondents use HathiTrust federal documents content. 144 respondents are looking specifically for federal documents; 108 use HathiTrust because the content is open/fully viewable. Another popular reason is to help with local collection management decisions - several respondents, regardless of affiliation, rely on HathiTrust content as a digital surrogate, particularly when making collection management (or collection retention) decisions.

Other Themes From HathiTrust Fed Docs Users

A final question asked respondents to share with us “anything else” about their use of federal documents in HathiTrust. This yielded a wide variety of responses; however, the following themes emerged:

Respondents have questions about the scope of federal documents content in HathiTrust, and about how comprehensive it is (overall, or in specific content areas such as Congressional hearings). Several users urged us to keep adding to the collection and make it more comprehensive.
Serials are challenging. Respondents commented on how difficult it can be to search for serials; related, multiple users mentioned that serials (and other items with the same title) should be grouped together on one catalog record.
Respondents would like more content open/fully viewable. There were also requests to remove download restrictions for those not affiliated with member libraries (all but one of those came from users not affiliated with a HathiTrust member).

Overall observations

Respondents consider HathiTrust a go-to place for older federal government documents. They like the ability to search the full text of many publications that aren’t digitally available anywhere else. At the same time, there are a lot of questions about the scope and comprehensiveness of the HathiTrust collection.

Many respondents use the HathiTrust copy as a digital surrogate and to help with collection management at their institution, regardless of affiliation with HathiTrust.

Criticisms centered around the quality of metadata. While some users appreciate being able to search by subject heading, the majority suggested improvements that could be made to the catalog records: adding a SuDoc call number; improving the way dates are recorded; having one record for a single title, not multiple records.

We plan to take survey results into account in planned and future projects such as the third phase of our user needs investigation, metadata improvement project(s), collection development efforts, and improvement of HathiTrust user services.

Thank you to all who participated in the survey!

Interview with Boston University Professor & HathiTrust Researcher Cathie Jo Martin

jbelle — Mon, 02 Apr 2018 14:09:52 +0000

By HathiTrust

"I want to understand how reformers in Britain and Denmark in the 18th and 19th centuries thought about education and I use computational linguistics techniques applied to literature to do so."

In Her Own Words: Research Summary

Despite their nineteenth-century political economies, poor, rural Denmark becomes a leader in public, mass primary education (1814) and secondary vocational training; while rich, industrial Britain creates public, mass schooling in 1870 and embraces unitary, academic secondary education.

I want to understand how reformers in Britain and Denmark in the 18th and 19th centuries thought about education and use computational linguistics techniques applied to literature to do so. The Boston University Hariri Institute funded the project. Ben Getchell and Andrei Lapets wrote code and helped me learn to calculate word frequencies and to implement unsupervised topic modeling.

A close reading of coming-of-age novels in Britain and Denmark demonstrates that authors differ cross-nationally on their views of education as well as on the role of the individual in society. British novels beginning in the early eighteenth century largely portrayed learning as an individualistic activity of self-discovery for the upper classes and novelists later sought to expand the right to education to improve the circumstances of the poor.

In comparison, Danish novels portrayed education as a tool for building a strong society, and youth were required to submit to the wisdom of elders for the good of society.

I used corpora of British and Danish literature to analyze word frequencies in snippets of text surrounding education words, and demonstrated that my observations about selected novels hold true in the analyses of larger corpora.
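
The snippet-based word counting described here can be illustrated with a minimal Python sketch. This is not the project's actual code: the function name, the window size, and the example keyword are assumptions made for illustration, and a real analysis would add steps such as stop-word removal.

```python
import re
from collections import Counter


def context_word_frequencies(text, keywords, window=5):
    """Count words appearing within `window` tokens of any keyword.

    Words in overlapping windows are counted once per window.
    """
    # Runs of Unicode letters, so Danish characters such as æ, ø, å are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    keywords = {k.lower() for k in keywords}
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in keywords:
            start = max(0, i - window)
            snippet = tokens[start:i] + tokens[i + 1:i + 1 + window]
            # Leave the keywords themselves out of the tally.
            counts.update(w for w in snippet if w not in keywords)
    return counts


sample = "The school taught duty. Duty to society shaped every school lesson."
frequencies = context_word_frequencies(sample, {"school"}, window=2)
print(frequencies.most_common(3))
```

Running the same function over every text in a corpus and adding the counters together gives corpus-level frequencies for the words that appear near education terms.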

On Using HathiTrust Datasets and Services

I constructed lists of classic corpora in Britain (562 items) and Denmark (521 items), using the Archive of Danish Literature and online lists of great works in the two countries. Most of the Danish sources were available from the Archive of Danish Literature; however, HathiTrust was able to supplement the pieces that I did not have. I got all of the full-text files for Britain from HathiTrust.

I [also] looked at Ted Underwood’s extremely helpful list of British literature. This is a great resource; however, the first dates of the publications of the volumes in my corpora are very important to my research design; but many of the works in HathiTrust are later editions of works. Therefore, I had to manually adjust the dates. This motivated me to develop my own lists of important works.

I am so grateful to HathiTrust for answering endless questions at the beginning of my research. The HathiTrust folks such as Eleanor Dickson have been extremely helpful to me in my efforts to develop this research project.

I have an article that is forthcoming in World Politics, the premier journal in comparative politics and one of the top journals in the field of political science. I simply could not have written the article without the resources provided to me by HathiTrust.

If you could have dinner with any person, living or dead, who would it be and why? What would you order?

… It would have to be Johann Sebastian Bach, because of the hours of happiness that he has given to me. Ludvig Holberg would be a close second, because he was the architect of the Danish model of social democracy as well as the early eighteenth-century father of Danish literature. I’m a vegetarian, so dinner might be a bit difficult.

Cathie Jo Martin is Professor of Political Science at Boston University and Director of the BU Center for the Study of Europe. Read her complete bio on the BU website. Boston University has been a HathiTrust member since 2011.

Charting HathiTrust’s Strategic Directions: Update

furlough — Wed, 28 Feb 2018 18:42:53 +0000

By Mike Furlough, Executive Director

Last summer we began work to define HathiTrust’s next strategic directions for 2019-2023. We published a draft plan in late summer, then held several webcasts to discuss these proposals with members. At our 2017 Member Meeting we devoted nearly the entire day to these issues and left tremendously energized by the membership’s clear commitment to HathiTrust and our mission.

We also left with a substantial amount of commentary from the attendees about our draft plan, which helped to clearly identify key areas of strategic value. It also reinforced the importance of continuing many efforts currently underway, such as our existing programs, development of a collection policy, a focus on metadata enhancement for improved access, publication of a membership development strategy, and modifications to the budgeting process. Among the themes we heard:

Clarify what HathiTrust is uniquely situated to accomplish and build our plans around those issues.
Clarify what the collection should support, re-state/reaffirm our focus on books and textual materials.
Incorporate more service assessment--internal and external--into the ongoing work.
Develop connections with users both through libraries and directly. There seemed to be more interest in educational uses, in addition to research uses, of HathiTrust.
In general, be more specific about desired outcomes wherever possible.

We’ve spent the last several months taking those comments into account and working on a final version of our plans. Thanks to the candid advice the membership offered, our final strategic directions will be a much stronger, more actionable set of plans. The Board of Governors meets in March to review the progress and finalize our plans. We anticipate sharing HathiTrust’s 2019-2023 Strategic Directions with our members in early spring.

Accessing Members-only Services with OpenAthens instead of Shibboleth

mcintsan — Wed, 28 Feb 2018 14:19:42 +0000

By Sandra McIntyre, Director of Services & Operations

Working with the library staff at Macalester College, we have successfully set up log-in access to HathiTrust member services for the first time using OpenAthens rather than Shibboleth for providing information about users’ identities. OpenAthens is a hosted identity provider service created at the University of Bath in the UK in 1996 and managed by Eduserv. It is now commonly used by educational institutions and healthcare organizations in parts of Europe and Asia, with new member institutions in North America. Like Shibboleth, it is compliant with the Security Assertion Markup Language 2.0 (SAML 2.0) standard for exchanging authentication and authorization data between security domains. It manages single sign-on for users to a variety of service providers (e.g., publishers and other authenticated access sites like HathiTrust) through its gateway.

In coordination with Macalester staff, HathiTrust has completed its first pilot to accept identity data from the OpenAthens system. Minor configuration adjustments by OpenAthens staff were required to release the attributes that HathiTrust needs. The new service enables Macalester to offer access to HathiTrust members-only services, such as full-book download for public domain volumes, to its students, faculty, and other affiliates. It also enables Macalester to set up access to HathiTrust in-copyright volumes for a staff member on behalf of members with print disabilities.

“We have some happy faculty on campus with our new access to HathiTrust,” says Katy Gabrio, assistant library director for collection development and discovery at Macalester. The library also is setting up a staff member in the College’s Disability Services department for proxy server access. According to Katy, a “hiccup” occurred recently in communication between OpenAthens and Macalester, leading to a temporary glitch in the service, but quick changes were made and all is well again.

Currently, HathiTrust requires that partners using OpenAthens adhere to the same InCommon Federation standards for SAML 2.0 exchange that our Shibboleth-using partners observe, and partners must join the InCommon Federation. “Our requirement for InCommon membership is based on a principle of leveraging the trust fabric of SAML federations for our login relationships,” explains Sebastien Korner, head of the Architecture & Engineering group at the University of Michigan Library, which configures the authorization systems for HathiTrust.

We welcome inquiries to feedback@issues.hathitrust.org from other members or potential members who are interested in using OpenAthens for accessing members-only services at HathiTrust. Current requirements for user identity authorization in HathiTrust include:

Operation of a SAML 2.0-compliant identity provider, such as Shibboleth or OpenAthens
Membership in the InCommon Federation and adherence to its standards for trusted shared management of access to online resources
Provision of required attributes for HathiTrust use
Membership in HathiTrust

In the future, HathiTrust will explore the benefit of registering directly with the OpenAthens Federation as a (paying) service provider, which would possibly eliminate the need for InCommon membership by OpenAthens institutions. We look forward to hearing about members’ needs as we evolve the authorization service.

Engaging the Collection: By the Numbers

azaytsev — Thu, 08 Feb 2018 16:57:47 +0000

By Angelina Zaytsev

The HathiTrust collection experienced continued growth in 2017.

The collection size hit two big milestones in 2017, reaching 15 million works in February and 16 million in December.
1.2 million works were contributed by 37 partners.
27,634 works were opened through the Copyright Review Program.
5,705 works were opened with permission of the author.

The following bar chart shows the growth of works in the collection since 2008. Each year is broken down into limited-view works (in gray) and full-view works (in orange).

In this report, we are primarily attempting to track various indicators of engagement in order to start to understand how well we are meeting the needs of users. We begin by looking at all users of the HathiTrust Digital Library, and how those engagement metrics may vary based on different factors. Then we look into two separate subgroups, members and genealogists, and compare the activity of those users to all users.

For members in particular, we explore how to identify users using three different methods: logins with university accounts, on campus access, and referrals from university websites. Login data is the most accurate data, but by looking at users on a university network and users who are referred to HathiTrust from a university’s websites, we can start to identify and track the broader world of member-affiliated users who aren’t being tracked in the login data.

One of the key findings in this report is that referrals from websites (as opposed to direct traffic or using web search engines) tend to result in more engagement by users overall.

For more analysis and data, read the full report (PDF).

HathiTrust's Six Millionth Open Book Highlights Congressional Investigations

furlough — Wed, 31 Jan 2018 17:43:53 +0000

"The Roaring 20's returned to Atlanta, Georgia, on October 24-28, 1970.”

This sentence appeared not in the style section of the Atlanta newspaper but at the start of an interoffice memo between Internal Revenue Service offices. The full memo and the backstory can be found in the six millionth openly available item deposited in HathiTrust, digitized by the University of California, Riverside.

There are many reasons HathiTrust's six million open volumes are open: some are in the public domain in the United States, some worldwide, and some have been licensed for public view by the rightsholder, usually with a Creative Commons license. Over 1 million of these are publications of the US government, which by statute are not afforded protection by copyright in the US.

When a digitized book is deposited in HathiTrust, we run an algorithm against its bibliographic metadata, checking for place of publication, date, and whether or not it was published by a federal agency. We also have an expansive program to review the copyright status of subsets of works that are deemed likely to be in the public domain. Currently over 51 individuals from 32 partner libraries take part in that program. Our goal is to open as many materials as we possibly can within the limits of copyright, and our members’ contributions of time and effort have helped us open a large portion of these six million.
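As a rough illustration of the kind of metadata check described above, the following Python sketch combines the three signals the paragraph names (place of publication, date, and federal authorship). It is not HathiTrust's actual algorithm: the `BibRecord` fields, the `initial_access_guess` function, and the `cutoff_year` parameter are all hypothetical names chosen for this example, and the real determination involves many more rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BibRecord:
    """Hypothetical, simplified view of a bibliographic record."""
    place_is_us: bool          # place of publication is in the United States
    pub_year: Optional[int]    # year of publication, if known
    federal_agency: bool       # published by a US federal agency


def initial_access_guess(rec: BibRecord, cutoff_year: int) -> str:
    """Return a first-pass access guess from metadata alone.

    cutoff_year is an assumed parameter: US works published before it
    are treated as public domain for the purposes of this sketch.
    """
    # US federal publications are not protected by copyright in the US.
    if rec.federal_agency:
        return "full view"
    # Older US imprints are treated as public domain in this sketch.
    if rec.place_is_us and rec.pub_year is not None and rec.pub_year < cutoff_year:
        return "full view"
    # Everything else stays limited until a person reviews it.
    return "limited view"


# Illustrative records; the cutoff value here is only an example.
hearing = BibRecord(place_is_us=True, pub_year=1975, federal_agency=True)
novel = BibRecord(place_is_us=True, pub_year=1950, federal_agency=False)
print(initial_access_guess(hearing, cutoff_year=1923))
print(initial_access_guess(novel, cutoff_year=1923))
```

In this sketch, anything the metadata cannot clear automatically falls through to "limited view", which is where a manual copyright review program would pick up.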

Our six millionth open book documents in part the hearings conducted by the US Senate Select Committee to Study Governmental Operations with Respect to Intelligence Activities of the United States, held during the Ninety-Fourth Congress in 1975. Also known as the Church Committee (after their chair, Senator Frank Church of Idaho), their investigations were instrumental in exposing secret surveillance of US citizens by the FBI, CIA, NSA, and IRS. The Church Committee grew out of the 1973-74 Watergate hearings, which exposed significant executive branch abuses of power against US citizens, and was the predecessor to the standing US Senate Select Committee on Intelligence. The third volume of these hearings focuses on the intelligence activities of the Internal Revenue Service.

So, what was going on in Atlanta those days in October 1970? Our anonymous man in Atlanta explains that “people came in sleek limousines, customized automobiles, mink and flamboyant dress for the Muhammad Ali-Jerry Quary fight on Monday night, October 26. The styles of the 20's prevailed with males challenging the females for the extreme in dress and the brilliance of colors.”

However, it wasn’t the clothes that the IRS cared about. It was the cars.

The memo, headlined “Operation Bird Dog,” was distributed along with a list of the cars’ license plate numbers to all District Directors in the Southeast Region for “your use as leads to possible income tax violations.” (Names and plate numbers were redacted before publication as part of the hearings.)

This boxing match was notable for being Muhammad Ali's first officially sanctioned fight since he had been banned in 1967 for refusing to be inducted into the US military. Ali won in a TKO after the third round. An IRS whistleblower later noted that Operation Bird Dog seemed to have been set up to target African American leaders in particular.

Dig a little, and you’ll find that federal documents aren’t as boring as you were told.

HathiTrust Emphasizes Importance of Collaboration at US House Hearing on the GPO Federal Depository Library Program

mmstewa — Tue, 26 Sep 2017 20:45:45 +0000

By Heather Christenson, Program Officer for Federal Documents & Collections

On September 26th, HathiTrust Executive Director Mike Furlough testified at the House Administration Committee Hearing Transforming GPO for the 21st Century and Beyond: Part 3 – Federal Depository Library Program.

The Committee is conducting oversight hearings on Title 44 of the U.S. Code, which is the authority for the Federal Depository Library Program (FDLP) and the Government Publishing Office (GPO). There have been previous efforts to reform Title 44 that have been ultimately unsuccessful (and in fact, HathiTrust contains digitized versions of hearings on this topic). Currently at issue is the need for “modernization” of the FDLP to reflect our now digital age. HathiTrust and our member libraries have an important stake in this issue, as many HathiTrust partners also participate in the FDLP.

At Tuesday’s hearing, Mike testified that “Title 44 should support comprehensive digital access to future and retrospective government documents, and should provide measures that support the privacy of users who access digital government documents” and that “free access to U.S. government information is imperative.”

He also emphasized that the Government Publishing Office’s (GPO) programs should be aligned with the realities that many libraries face in the digital age: the increasing cost of managing large print collections and emphasis on coordinated collection management and “services to help individuals create, find, and use information.” He spoke to libraries’ progress in partnering to digitize over a million federal documents, and in implementing collaborative solutions for print management such as HathiTrust’s Shared Print Program.

He closed by stating that the GPO should be empowered to collaborate with library initiatives and make use of the results of library collaboration (such as HathiTrust’s US Federal Documents Registry), and concluded that “preservation and access is our business, and we’re ready to work with GPO and other organizations to help ensure that public information endures.”

For more information:

Video of the hearing; Mike Furlough’s testimony begins at 55:26
Mike Furlough’s full testimony
HathiTrust Federal Documents Program and U.S. Federal Documents Collection
HathiTrust Shared Print Program

HathiTrust Libraries Propose to Retain More Than 16 Million Volumes in Shared Print Program

Anonymous — Thu, 29 Jun 2017 17:59:24 +0000

by Lizanne Payne, Shared Print Program Officer

Fifty HathiTrust member libraries have proposed to retain more than 16 million volumes for 25 years under the HathiTrust Shared Print Program. These volumes correspond to more than 4.8 million individual book titles held in the HathiTrust Digital Library (about 65% of all HathiTrust digital monographs). This is a significant step toward the primary goal of the program: to ensure that print copies of all HathiTrust digital holdings remain available to scholars for many years to come. The Shared Print Program is a core program of HathiTrust, supported by and benefiting all of the more than 120 HathiTrust members.

This milestone marks the fruition of a goal that began in 2011 when HathiTrust members voted to “establish a distributed print monograph archiving program”. After several years of planning that involved many member libraries, the HathiTrust Board of Governors approved the program agreement (MOU) and associated policies in June 2017 with a goal to have the Retention Libraries execute the MOU by September 30, 2017.

Our next steps in Phase 2 will aim to secure retention commitments for the remaining ~3 million HathiTrust digital monographs; explore tools for collection analysis, collection management, discovery and resource-sharing; and collaborate with the HathiTrust Federal Documents Program and other shared print programs.

With completion of Phase 1, the HathiTrust program will constitute the largest shared print monograph retention agreement in the world. The fact that 50 participating libraries – almost half of all HathiTrust members – have offered to retain more than 16 million volumes is an important vote of confidence: these libraries trust HathiTrust’s ability to achieve success at scale and they value collaboration with their HathiTrust partners.

For additional information, including some preliminary statistics, see the full report (PDF), visit our Shared Print website, or contact HathiTrust Shared Print Program Officer Lizanne Payne.

My Experience and the Experience of Millions

mmstewa — Wed, 24 May 2017 12:57:09 +0000

Dr. Paul Harpur, Senior Lecturer with the School of Law, University of Queensland, Australia and International Distinguished Fellow of the Burton Blatt Institute at Syracuse University draws from his monograph and personal experience to speak on the HathiTrust.

Introduction

This article seeks to illustrate the massive impact of the HathiTrust on the lives of persons with print disabilities and upon me personally. I will write in the first person as I want this to be informative and to enable me to express my thanks to the HathiTrust and to all librarians who are involved in the great work of opening the book to the print disabled.

While I am writing from personal experience, I also bring significant academic expertise to this topic. I have recently published a monograph on the practical and legal issues with accessing E-Books: Discrimination, Copyright and Equality: Opening the E-Book for the Print Disabled (2017) Cambridge University Press. This monograph includes a legal analysis of how copyright and anti-discrimination laws interact and includes a chapter on Google and the HathiTrust. I want to leave this content for the monograph and deal with the impact of the HathiTrust.

The book famine for the print disabled

Depending on where you live in the world, a person with a print disability will be able to access between 0% and 5% of the books published in the world. Statistics arguably fail to illustrate the problem, so the following describes real-world situations faced by students with print disabilities.

Imagine a situation where you are in a classroom, whether it be K-12 or university, and the educator refers you to a book but it is not prescribed (required). If it is just a recommended book and not a prescribed book, then educators will generally provide no support to get access to the book. If the book is not in a format that a person with a disability can read, then they are denied the right to read the book. If the book is in a format that is accessible, then this might be braille, cassette tape or in large print. To get access to the book in one of these formats might take weeks. Meanwhile the class has read the content, discussed it and started building on the understandings gained from reading. If you get the book, then you have to try and catch up. In most situations you just fall behind.

Enter the transformational impact of the HathiTrust and EBooks. You are in the classroom and a book is mentioned. The student cohort talks about chasing down the reading after class. During the lecture break you get on-line, search HathiTrust, request the book from the library or office that supports visually impaired users, and return to your lecture. You finish your classes for the day to find the book ready for you to read. Soon you are happily sipping a coffee reading the book.

The new era of access: E-Books

HathiTrust, its member libraries, and the librarians and others that support it, are participating in a new era of reading for persons with print disabilities.

It could be said that the dark ages extended for persons with print disabilities until the advent of Braille. Prior to the emergence of Braille in the late 1800s, persons with print disabilities had to rely upon others to read books for them. While Braille was a great step, it is slow and expensive to produce, and a simple book may consist of 10 or more large volumes of braille books.

While scanners and technology have partially improved upon Braille, all of these methods required a hard copy book to be altered into a format that the print disabled could access. This has created a book famine; a book famine that E-Books have the chance to reverse.

How the HathiTrust’s E-Book library is transforming lives

In my recent monograph I noted that the HathiTrust is participating in a move that is having a significant impact on the lives of persons with print disabilities. I noted that the HathiTrust and other EBook libraries could make up to 15% of the world’s books available to the print disabled in formats that they can read (Harpur, 2017, p. 91). As mainstream publishers start publishing more titles in digital format, the figure of 15% will increase substantially. Of course, where the HathiTrust focuses on ensuring people with print disabilities can access works, commercial publishers and E-Libraries can arguably be less committed.

I am tenured at the University of Queensland, which participates in the HathiTrust. This has enabled me to access the HathiTrust database of digitised books, comprising over 15 million books. Only a few years ago I could easily access perhaps a few hundred books from a range of official and other sites. This transformational change in access cannot be overemphasised.

Conclusion

The social model of disability explains that people with impairments are disabled when barriers in society turn impairment into disability. The probability of the society becoming barrier-free might appear farfetched or impossible. I honestly might agree with you but for what has just happened.

A decade ago I was very print disabled. Now, with help from the HathiTrust and others, I am far less print disabled. In fact, when it comes to accessing many academic and cultural works, I am more print inconvenienced now. My parents, supporters and I used to spend hours scanning books for my studies and work. Now I ignore books that are not available in an E-Book format accessible to persons with print disabilities. While it would be ideal to have reading equality, to go from difficult, time-consuming and expensive access to a few books to easy, cheap and rapid access to millions of books in a matter of a few years is an amazing, liberating and joyous experience. I would like the HathiTrust to be aware of the substantial personal and professional impact they are having upon the world’s print disabled.

Federal Documents in HathiTrust: A Look at Our Collective Collection

azaytsev — Mon, 20 Mar 2017 20:53:41 +0000

By Heather Christenson

HathiTrust has an ambitious goal to build a comprehensive digital collection of U.S. federal documents distributed in print format. But what do we already have in our collective digital collection? And what can we learn about that collection? It is these questions that HathiTrust staff set out to answer in a project that we’ve called the “Federal Documents Collection Profile.” In January, we concluded an initial analysis of the U.S. Federal Documents collection as it existed September 1, 2016. “Initial”, because this hadn’t been done before, and because we expect it to be the precursor of more robust collection analysis and comparisons to come. A goal of the project was to investigate a variety of metrics based on the data available to us in order to establish a baseline for reporting on the collection. We were cautiously optimistic that we would be able to characterize at least some aspects of the collection.

We began the project by tackling the challenge of defining a set to analyze. What is the best way to identify federal documents in the large mass of HathiTrust metadata? Not that this question is entirely new to us, but, given the variations in completeness and accuracy in cataloging over the years, as we described in Detecting U.S. Federal Documents to Expand Access, is it even possible to accurately or reasonably delineate this set? We settled on an approach: detection of “f” and “u” in the MARC 008 field, detection of a SuDoc number in the MARC 086 field, and, making use of our U.S. Federal Documents Registry, checking for a match between the HathiTrust record and the Registry record. By this method, we narrowed the universe of federal documents in the HathiTrust digital collection to 412,205 bibliographic records and 970,315 digital objects. 94% of the bibliographic records in this set represent monographs and 6% represent serials, while 56% of the digital objects are monographs, and 44% are serials.
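The three signals described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code. It assumes records are plain dictionaries rather than parsed MARC, that the “f” is the government publication code at 008 position 28 and the “u” is the final character of the place-of-publication code at 008 positions 15-17, and that any non-empty 086 value counts as a SuDoc number. The helper names (`make_008`, `federal_signals`) and the record identifiers are hypothetical, and the sketch reports which signals fired rather than deciding how they should be combined.

```python
def make_008(place: str = "xxu", gov_pub: str = "f") -> str:
    """Build a 40-character 008 field with only the two bytes of interest set."""
    chars = [" "] * 40
    chars[15:18] = list(place.ljust(3)[:3])  # 008/15-17: place of publication
    chars[28] = gov_pub[:1] or " "           # 008/28: government publication code
    return "".join(chars)


def federal_signals(record: dict, registry_ids: set) -> set:
    """Return the names of the federal-document signals present in a record."""
    field_008 = record.get("008", "").ljust(40)
    signals = set()
    # Signal 1: federal-level publication with a US place code (assumed positions).
    if field_008[28] == "f" and field_008[17] == "u":
        signals.add("008")
    # Signal 2: a classification number recorded in 086 (assumed to be a SuDoc).
    if any(value.strip() for value in record.get("086", [])):
        signals.add("086")
    # Signal 3: the record matches an entry in a registry of known documents.
    if record.get("id") in registry_ids:
        signals.add("registry")
    return signals


# Illustrative records and registry; identifiers are made up for the example.
registry = {"rec-002"}
records = [
    {"id": "rec-001", "008": make_008("dcu", "f"), "086": ["Y 4.IN 8/17:IN 8/3"]},
    {"id": "rec-002", "008": make_008("nyu", " "), "086": []},
    {"id": "rec-003", "008": make_008("enk", " "), "086": []},
]
for rec in records:
    print(rec["id"], sorted(federal_signals(rec, registry)))
```

Keeping the signals separate makes it easy to see how many records are caught by only one method, which matters when cataloging is inconsistent.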

Because we did not limit our set to full view, we found that, in our federal documents collection, approximately 852,488 digital objects/documents are fully viewable in the U.S., while 117,827 are limited view/search only. Clearly more investigation can be done here to understand this breakdown.
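The figures quoted above are internally consistent, which a short Python check makes explicit: the full view and limited view counts sum to the 970,315 digital objects in the set, with full view making up roughly 88% of it. The variable names are chosen for this example only.

```python
# Counts quoted in the text for the federal documents set.
full_view = 852_488
limited_view = 117_827

total = full_view + limited_view
full_share = round(100 * full_view / total, 1)
limited_share = round(100 * limited_view / total, 1)

print(f"total digital objects: {total:,}")
print(f"full view: {full_share}%  limited view: {limited_share}%")
```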

One of the great strengths of HathiTrust is our large community of members. The power of aggregation was clear in our finding that fifty-one different organizations had deposited federal documents in HathiTrust. HathiTrust member partnerships with Google generated the great majority of digitized federal documents, but we have documents in our collection from almost twenty different digitization sources.

Things got more challenging when we dove into bibliographic data. Duplicates? We made some progress, but identifying true duplicates will require a focused in-depth analysis project. Breakdowns by subject? Corporate author? Publisher? Clearly there is work ahead of us to overcome decades of inconsistent cataloging practices and textual complications for a meaningful characterization. During this analysis we found more than forty-nine variations on “Government Printing Office” in the publisher field (260 $b), let alone the name change to “Government Publishing Office”!
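To give a sense of what reconciling publisher variants involves, here is a minimal Python sketch that collapses a few spellings into one canonical name. The variant strings are invented for illustration and are not taken from the report, the function names are hypothetical, and folding “Printing” and “Publishing” into a single label is a design choice of this example rather than anything the analysis prescribes.

```python
import re
from collections import Counter

# Matches common abbreviated and full forms after normalization (assumed variants).
GPO_PATTERN = re.compile(
    r"\b(?:g ?p ?o"
    r"|gov(?:ernmen)?t print(?:ing)? off(?:ice)?"
    r"|government publishing office)\b"
)


def normalize(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9]+", " ", raw.lower())
    return text.strip()


def canonical_publisher(raw: str) -> str:
    """Map recognized GPO variants to one label; otherwise return the normalized text."""
    text = normalize(raw)
    if GPO_PATTERN.search(text):
        return "Government Publishing Office"
    return text


# Invented examples of what a 260 $b value might look like.
samples = [
    "U.S. G.P.O.,",
    "Govt. Print. Off.",
    "U.S. Government Printing Office",
    "Government Publishing Office",
    "Supt. of Docs.",
]
print(Counter(canonical_publisher(s) for s in samples))
```

A pattern list like this only catches the variants someone has already noticed, which is one reason a meaningful breakdown by publisher takes focused analysis rather than a single pass.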

But other aspects of the collection did come into view. The date curve peaked nicely in a pattern mimicking overall government publishing, 1960s through mid-1990s. Our subset of records that contained SuDoc numbers (64% of the full set), broke out to show strengths in Congressional Publications, Forest Service documents, NASA, and more. We found 147 languages represented in the bibliographic records, the vast majority English with a very long tail (and including some head-scratchers like Ancient Greek--perhaps another sign of inconsistent cataloging in the collection).

A brief look at usage metrics revealed “Library of Congress Catalogs 1976 V. 4” and “Annual Report of the Commissioner of Patents for 1916” in first and second place, as well as “A short guide to New Zealand” in ninth place, apparently having gained fame by being discussed in a reddit thread.

Finally, in addition to analysis of the full HathiTrust federal documents collection set, we zeroed in on a set of individual titles, as well as one agency, the Civil Rights Commission, to see what we could learn about comprehensiveness in HathiTrust. Although we estimate that our comprehensiveness measures are on the conservative side, clearly HathiTrust has a ways to go to fill in our collection, since comprehensiveness for individual titles ranged from around 3% to 60%.

We believe that our collection profile is one of the first attempts to delineate and characterize a collection from within the aggregate mass digitized library collection. We know that identifying the gaps and filling them in will be a task measured in years. 970,315 digitized documents is a great starting point, and our sleeves are rolled up.

Read the full report, Collection Profile: U.S. Federal Documents in HathiTrust.

Operationalizing "Non-Consumptive" Fair Use to Revolutionize Humanities Research

mmstewa — Fri, 24 Feb 2017 16:28:03 +0000

By Brandon Butler

[NOTE: We are marking Fair Use Week with this post from Brandon Butler, Director of Information Policy at the University of Virginia Library. Earlier this week we published a new policy on "non-consumptive use" for services at the HathiTrust Research Center. Here Brandon explains the legal grounding of the policy and offers his take on the policy's significance.]

It is fitting to end Fair Use Week with some words about a new policy that, although it may seem dry and technical, may in fact represent the most compelling invocation of fair use in the last decade or more.

Earlier this week the HathiTrust Research Center (HTRC) published a new policy on non-consumptive use, which outlines the allowable types of data access and exports for researchers conducting computational research on the HathiTrust collection. Based on the latest case law, it is designed to foster groundbreaking research that enriches the public without intruding on the market prerogatives of copyright holders. With this policy, HathiTrust is operationalizing fair use to empower researchers to mine its rich corpus of millions of in-copyright texts for new insights, understandings, and information of a kind that literally could not have existed only a few years ago.

The key legal ingredient in this effort is the concept of “non-consumptive use.” A clear articulation of this concept ensures the HTRC is only enabling genuinely new and productive forms of research grounded in computer analysis of the texts: core protected fair use. It also rules as clearly out-of-bounds the ordinary reading that will typically require permission or another kind of fair use rationale.

The policy is grounded in a deep, well-established principle in fair use law: the signature characteristic of a fair use is that it does not supersede the original work in its ordinary and likely markets. The emphasis on non-substitution traces all the way back to the seminal case of Folsom v. Marsh, in which Judge Story first articulated the modern four factor test for fair use. Story explained:

"[A] reviewer may fairly cite largely from the original work, if his design be really and truly to use the passages for the purposes of fair and reasonable criticism. On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy."

As Judge Leval explained at length in his own seminal opinion in the Google Books case, non-consumptive uses are non-superseding par excellence. They serve the purpose of copyright by separating facts from expression, and allowing the former to circulate freely. (I described this aspect of the opinion at some length right after it came out.) Access to these facts is an enormous boon to society, and at the same time poses no risk at all to copyright holders, whose interest is in access to the expressive content of their works.

The non-consumptive use policy is also a natural development of a modern line of fair use cases that blessed internet search engines. Starting with Kelly v. Arriba Soft, this line of cases found that copying entire webpages and images posted online and showing snippets and thumbnails from these pages as search results was fair use because none of these activities provided the public with a substitute for ordinary access to the underlying material. To the contrary, search engines create a vast new form of knowledge: facts about the content of the web, which can be queried in order to help users find what they're looking for online. As this line of cases grew, the question naturally arose: is this logic limited to internet search engines, or is there a broader category of fair use here?

Perhaps the most exciting thing about the new policy is that it represents the generalization of the principle favoring web search so that it favors all computer analysis of any text for an open-ended variety of research purposes. Judge Leval's opinion, together with the opinion in the HathiTrust case, laid out the broad principles favoring this generalization, but this policy may well be the first attempt to really move beyond search to cover the full range of research projects that might benefit from non-consumptive methods.

So, take a look at the policy. I am proud to have been among the many talented people who worked on it, and one of our goals was to make it as readable as possible while staying faithful to the complex concepts at the policy's core. I hope that other institutions interested in fostering this kind of research will find our policy a useful reference.

Workshop Brings Together Staff from All Sites

mmstewa — Wed, 22 Feb 2017 15:02:28 +0000

by Sandra McIntyre

Thirty-eight HathiTrust staff members met February 2-3 in Chicago for the first-ever HathiTrust all-sites staff workshop. The attendees included managers, librarians, developers, systems administrators, and graduate students who work at HathiTrust’s distributed sites, including the sites at the University of Michigan, University of California, Indiana University, and University of Illinois Urbana-Champaign.

Before the workshop, the teams provided three “deep dive” web-conferences to share information with each other about the infrastructure and architecture for each part of the HathiTrust enterprise – HathiTrust repository and digital library, Zephir metadata management system, and HathiTrust Research Center. At the workshop, many staff members met others face to face for the first time, and the two days of intensive work together provided an opportunity to build new relationships as well as to address specific opportunities for collaboration.

The primary goal of the event was to coordinate planning, prioritization, and strategic development/deployment across the enterprise in line with the HathiTrust mission and member priorities. Team members analyzed use cases across all sites, conducted troubleshooting on a few issues, and generated a wealth of ideas about leveraging infrastructure, sharing common software services, and translating research results into production services.

Sessions included both plenary discussions and breakout working sessions, on such topics as tracking metadata flows in the HathiTrust ecosystem, the potential for sharing APIs and specific tools, promoting common development practices and communications, using “item-level” metadata, and detecting duplicates. Of special interest to many participants was a renewed exploration into providing a common user experience across all interfaces and services, based on the development of use cases.

HathiTrust managers are analyzing the recommendations from the workshop and prioritizing for collaboration going forward with input from various staff and governance groups. We anticipate enhanced communications and work across the HathiTrust teams in the months to come. Many thanks to the event’s Program Committee and all the participants!

HathiTrust staff from four sites collaborate at the all-sites staff workshop, Chicago, February 2-3, 2017.

14 Million Books & 6 Million Visitors: HathiTrust Growth and Usage in 2016

azaytsev — Fri, 10 Feb 2017 15:22:15 +0000

By Angelina Zaytsev, Collection Services Librarian

The HathiTrust collection continues to grow steadily. As of January 1st, 2017, there are 14,816,187 volumes in the collection. Over one million volumes were added to the collection over the course of the preceding year, scanned from the library collections of 39 contributors. These included several new, unique collections.

Within the HathiTrust certified trusted repository, 38% of the collection is available to users to access in full view, and the remaining 62% is made available in other ways: all users can search across and within those limited view books; researchers can now perform transformational, non-consumptive research within these books; and users with print disabilities can access the full text.

What is usage like for a digital library of 14.8 million volumes? And who are our users?

Over 6.17 million users visited the HathiTrust Digital Library website over the course of 2016, resulting in 10.92 million sessions. About 49% of our users were located in the United States in 2016. This was a rise from 46% the previous year, but the percentage has hovered around 50% for years. The remaining 51% of users are scattered across a long tail of 236 other nations, topped by the United Kingdom, Canada, Germany, Italy, India, France, China, Australia, and Spain. Our users are primarily English speakers, as detected by the language of their browsers.

They come to our Digital Library from a variety of sources, including referrals from search engines (40%) and referrals from other websites (39%), such as library catalogs and services that list digital books, and 1.16% of users arrived from social media sources. Once on our site, they read a miscellaneous assortment of books, including novels, genealogical books, and other historical titles.

Interested in learning more? We invite you to read the full report (PDF).

Reflections on the 2016 Member Meeting

mmstewa — Wed, 21 Dec 2016 21:18:23 +0000

By Mike Furlough

Note: This message was sent to HathiTrust members on December 20, 2016

At the end of a stressful, fractious, and divisive year in the US and beyond, I hope we can all agree that it’s good to see it drawing to a close. That said, all of us at HathiTrust have been buoyed all year long by the work we do, and that’s in no small part because our membership so strongly believes in the necessity of our work and in developing community-focused solutions to our common challenges. We saw this when we issued calls for voluntary participation in our copyright review and shared print programs. In both cases the response nearly doubled what we had expected and planned for. And it was evident at the 2016 Members Meeting, which we held in Chicago on November 10. (The agenda, presentations, and notes from the meeting are all available online.)

During a panel titled “HathiTrust and Its Members,” participants noted that HathiTrust members have relationships with many other consortial entities and wondered how we can truly collaborate within and beyond HathiTrust to “connect the dots” among disparate programs and activities. These comments were echoed throughout the day by other attendees, who drew attention to where HathiTrust’s current work provides a platform upon which to exert greater influence over the collections landscape and national policy.

HathiTrust has always had a “big footprint” mission. So where should we tread, and how, as we enter our ninth year? Even though some of the same challenges remain, it is a very different world for libraries and higher education than when we launched in 2008. At its Fall meeting, the Board of Governors discussed a process for strategic visioning to take place in 2017, which will necessarily engage the membership. This will be accompanied by development of a multi-year financial plan and some likely refinement of the HathiTrust cost model to ensure that it remains equitable. This work follows from the recommendations of the Committee on Membership and Finance, which we previewed during the Members Meeting. (We will share that work in a more formal way in the new year.)

We’ve already begun making plans for next year’s Member Meeting and will announce dates and location by the spring. It’s always a challenge to strike the right balance between reporting on our activities and diving deep on them in a single day. Based on attendee feedback for our most recent meeting, and based on our planned focus on future development in 2017, you can be sure that at the next meeting you will have more opportunities to explore and engage actively about the key issues brought before the membership.

To be honest, the US election week turned out to be a weird time for many to attend a professional conference. People were distracted and several mentioned to me some brewing issues that awaited them back home. As the day progressed, I did sense a subtle turn in mood, as if focusing on our collective commitment for a few hours gave us some break from the anxiety that we may have felt. As I said that day, the work we do to preserve and make collections accessible really does matter to me, and it matters to our users, and it matters to the future. It is not sufficient in itself, and it’s not the only thing that matters, but it’s a base on which we can continue to build and realize our own visions for the future.

On that note, on behalf of all of us at HathiTrust, let me offer you and your colleagues best wishes. We hope that your coming year takes you to the places in your work and your life that you most wish to go.

HathiTrust Collections Survey Report

mmstewa — Fri, 16 Sep 2016 20:26:45 +0000

By Mike Furlough and John Butler

We’re very pleased to announce publication of the HathiTrust Collection Committee’s Collection Priorities Survey Analysis Report. This report summarizes the data and findings of the Committee’s Fall 2015 survey of members and makes key recommendations regarding HathiTrust’s current and future collection-related priorities. The report has been reviewed by the Program Steering Committee (PSC) and the Board of Governors and both have endorsed the report’s recommendations. The PSC’s written endorsement and response to the report is also available for review.

The report has proven to be very timely, as the membership deliberates major content and collections issues for the corpus including expansion to other formats, collection development directions and foci, quality assurance of the digital corpus, among other considerations. Already, the report’s influence is becoming apparent and some substantial decisions have been made based on the findings.

Among the recommendations, the most notable is that HathiTrust should continue to “concentrate on enhancing the comprehensiveness of the digitized print corpus.” As the Program Steering Committee notes, the membership sent a clear message “urg[ing] the partnership to build on HathiTrust’s core and distinctive strengths rather than pivot in entirely new directions at this time (e.g., non-text formats). There is concern that such investments (e.g., in support of images, audio, video) might come at the expense of delivering a higher quality, more complete digitized print corpus.” During its June 2016 meeting, the Board of Governors agreed with this assessment.

Thus HathiTrust’s collection development priorities will remain focused on “books” for the foreseeable future, a decision that will be reevaluated periodically. “Books” here includes not only monographs, but also serials, government documents, and some other materials previously collected. However, “book” may be too imprecise a term given the increasingly fluid boundary between “books” and other text formats. Thus, the Program Steering Committee will ask the Collections Committee to initiate an exploration of expanding this scope from books (often understood primarily as “print”) to textual materials of a variety of sorts. This work has not yet begun, but we will be sharing more information as it gets underway.

A second recommendation emerging from this study is that HathiTrust should take steps to improve the quality of the existing corpus, which would include addressing scanning, processing, and metadata errors. To help address that recommendation, we are today announcing the formation of a new Quality Assurance and Standards Working Group. This group, led by Paul Fogel of California Digital Library, will examine needs and recommend strategies, processes and techniques for making scalable and persistent improvements to the quality of materials preserved by HathiTrust.

The third recommendation is that HathiTrust should focus on improving and expanding services to members. The survey highlighted several areas of interest among members, including improvements to the deposit/ingest process, simplifying access for users with print disabilities, and the development of collection analysis tools that members can use. The first two issues are matters of ongoing concern and HathiTrust staff have been evaluating and rolling out improvements over the course of this year. Still, more remains to be done and we are committed to making these services more useful for all members. Further, our shared print monographs and U.S. federal documents programs both require collection analysis functions and may provide opportunities for us to extend such functionality more broadly.

In recognition of these findings, the Collections Committee’s charge was revised in July to give increased attention to actions and directions recommended by the report. On behalf of the Board and the PSC, we wish to acknowledge and thank the members of the Collections Committee, who put significant effort into the design of the survey, data collection, analysis of results and development of recommendations in this important report. The membership included: Carmelita Pickett, Chair (University of Iowa), Martha Hruska, PSC Liaison (University of California, San Diego), Sharon Farb (University of California, Los Angeles), Bryan Skib (University of Michigan), Claire Stewart (University of Minnesota), and Tom Teper (University of Illinois at Urbana-Champaign).

Electronic Access and The "Collective Collection"

mmstewa — Fri, 17 Jun 2016 16:28:19 +0000

Written by Mike Furlough

Note: this is the lightly edited text of a talk presented at the 2016 CRL Collections Forum, @Risk: Stewardship, Due Diligence, and the Future of Print in Chicago, IL on April 14, 2016.

The audio for this talk is available on YouTube.

About twenty years ago, while I was serving a reference desk shift at the University of Virginia Library, a student approached me, looking for “old magazines.” He was working on a well-worn freshman composition assignment that required the student to find and analyze advertisements from issues of Time, Saturday Evening Post, or other popular magazines from the mid-twentieth century. I showed him the still relatively new web-based library catalog, instructed him on how to search for specific magazine titles, and gave him a guide to the stacks so he could find them by call number. Just as we wrapped up he asked: “Can’t I see the magazines on the computer?”

That was when I knew that users' expectations had changed forever, and I had no immediate expectation that we could meet them. At that time, Google didn’t yet exist. Large-scale digitization in libraries had begun, with a lot of support from the Andrew W. Mellon Foundation, but focused mostly on nineteenth century materials. JSTOR was already available but a young product. Science publishers were beginning to smell a new source of income from their own backfiles. Ambitions were so high that we could imagine tens of thousands, maybe a hundred thousand books and journals online in a few years.

Google blew those expectations away in 2004 when they announced their digitization partnership with five libraries[1]. If that student showed up today and bothered to ask for help, I’d show him how to use Google to find back issues of Life Magazine. I could also show him how to recall the magazines from the remote storage facility for twenty-four-hour delivery, but I suspect that I’d be met with greater incredulity than twenty years ago.

A lot has changed in those twenty years since I showed the student how to find the APs, and those twenty years went by awfully fast. Twenty years, a generational timeframe, seems to be something of a magic number for preservation. The Digital Preservation Network (DPN) has done business modeling and forecasting on a twenty-year timescale, and twenty years is also the minimum length of time DPN commits to preserving the bits deposited by its members. Twenty years is a common term for shared print retention agreements, the closest thing to a permanent commitment many of us feel we can make, given the unknowns of the future. Twenty years is as close to forever as we frequently can get.

Many of our discussions around shared print are driven by economics and opportunity costs: can we reduce the cost of maintaining our print collection by sharing the burden with others? Can we free up space for other purposes by doing so? The original ballot proposal for HathiTrust’s shared print monograph archiving program cited a survey of library directors who overwhelmingly agreed that “withdrawal of print books would be an important future strategy…if a robust digital alternative were available” and that “they would be more likely to withdraw their print book collections if their library could provide guaranteed on-demand access to print versions through a sharing network such as HathiTrust.” There’s nothing wrong with responding to near-term needs: it’s the reality of the academic world we are living in, and we’ve premised much of our work at HathiTrust on being able to do things more affordably at scale. (More about our own shared print program later in this talk.)

But we know these decisions have long-term implications, and thus at today’s meeting and elsewhere we as a community are turning to slightly different questions: How can we ensure future access to the print record? I propose that it would be useful if we spent some time discussing questions such as: What is our vision for the state of the print record in the year 2036?

I think we would want to see a sound, robust, continental network of print archives and interlinking access services that include digital access. Maybe there would be a few “mega-repositories” to serve “mega-regions.”[2] I think that we’d want to have some confidence that our successors could make choices for the next 20, 30, or 40 years with better information than we ourselves have. We would like for our successors in twenty years to be able to account for a significant majority—75, 85, 95%? —of what we know to have been collected by research libraries. We would want to have confidence that we’ve also been able to account for a wider range of print materials, often unique and at risk, that document topics, histories, cultures, and lives that are not now well represented in our research collections, much less our digital collections. Such materials will have been digitized and the digital files preserved; and multiple libraries will have committed to the long-term retention of each. The copies will be geographically distributed and will be accessible in appropriate formats that support the needs of the users.

Is it too much to expect that over 20 years we could both account for the totality of North American research collections and ensure that we have well documented print retention commitments for that totality? I’m usually the one who reminds people that preservation is about coming to terms with loss, not saving everything: “Look, I know that withdrawing a book reminds you of your own mortality, but we really don’t need to keep this one.” So I’m not saying these are even the correct questions. We all know that we will have to work through existing constraints to get to such a point, but sometimes some visioning can be an energizing exercise. I’m not proposing that the few sentences I’ve laid out are such a vision or that I have it right. I’m only suggesting we think about what we want, in addition to where we are now.

What is required for us to have greater certainty about what exists, what is retained, and its suitability for future use?

Obviously we have a very long way to go to anything like the very rough sketch I just offered. CRL’s analysis of the data in the PAPR database and last summer’s PAPR II summit both highlight some of the difficulties facing us. OCLC Research has posited that we are seeing the beginnings of “system-level” thinking about the collective collection, but CRL points out that there has been limited coordination among existing serials archiving programs. Their work speculates that perhaps only as little as 2% of the existing journal titles held by North American Libraries are currently covered in our archiving programs. High quality holdings data is hard to come by and so the analysis is only as good as the data we have.

HathiTrust has been focusing all of our planning around shared print support on monographs archiving. It is practically encoded in our organizational DNA. Two of the goals stated in our bylaws announce that we will

…develop partnerships and services that ensure preservation of the materials in HathiTrust and the entire print and digital scholarly record.

…reduce long-term capital and operating costs of storage and care of print collections through redoubled efforts to coordinate shared storage strategies among libraries.[3]

Initial planning for this program lays out a goal to eventually confirm retention commitments for the monograph titles with a digital copy in HathiTrust (our current, non-de-duplicated count is about 7 million titles).[4] Those retention commitments will be made by member libraries and publicly disclosed in commonly used resources, such as WorldCat, as well as other knowledge bases. A robust access system would be a part of the program. We are really only getting started with this work, but I would like to think that we can use this as an opportunity to help move us toward the vision I sketched earlier.

Although we are not primarily focused on serials print archiving, it would only make sense for HathiTrust to work with CRL and others to support serials print archiving based on the HathiTrust collection. The Keepers Registry has been tracking our holdings, treating us as a serials archive even though it is not our primary focus. Anecdotally I know that libraries consider title inclusion in HathiTrust as a factor when developing retention criteria; but through speaking with various experts I have not heard of any cases where a library has withdrawn solely because the title is in HathiTrust. Of course, our metadata is not perfect. Our un-de-duplicated serials count from the daily statistics is 370,000. OCLC, during last June’s summit, refined this count and put it at around 290,000 titles. Only about 95,000 of the records we hold for serials have an ISSN. We receive an annual report of holdings from each of our member libraries. Even so, it has proven very challenging to match our members' reported serials holdings data with our digital collection because of the different ways that libraries record holdings for serials and express the enumeration and chronology for these titles.

We do much better for monographs, and we can readily see that there is a great deal of duplication among our members’ physical collections. But we know there are gaps in our knowledge due to a few variables. These include how easy or hard it is to produce such data from a given ILS system[5] and in some cases whether a library has undertaken a reclamation project with OCLC.[6] The data may not be as up-to-date or may be missing some information we request. This of course will inhibit our shared print work to a degree, so improving the quality and extent of our holdings data will be important for shared print programs and operations.

Returning to digitization: Shared print projects are influenced by two decades of digitization, driven heavily by publishers or others in the commercial sector. Obviously HathiTrust's collection is one outcome of the mass digitization work begun by Google and to some extent by Internet Archive over a decade ago. It’s important to note that HathiTrust’s primary collection strategy has been to aggregate materials that have been digitized from our member libraries’ collections in general. Our focus is on unlocking additional utility from those collections for the entire partnership. We do this to build a public good while also creating services that benefit the members. We occasionally have collected materials from non-members where the material can be made full view, meets our specifications, and is highly valued. But these are the exceptions: the core of our collection is what you will find in circulating and special collections research libraries in North America. It is primarily published materials in bound form: books.

Much of this digitization has been done in partnership with Google or Internet Archive. Google has been focused on breadth of material, getting as close as it can to “completeness.” Partner libraries have latitude in what they send to Google, and Google has prioritized some types of materials by request of its partner libraries. But there are well-known limitations in what Google has been able to scan (think foldout maps, large format items, such as newspapers), and such materials are not generally found in HathiTrust.

HathiTrust is not a perfect representation of what has been scanned by Google or anyone else. There are Google partner schools that have not deposited all (or any) of their Google-digitized material into HathiTrust. While there is significant overlap between HathiTrust and print research collections, it is clear that it is hardly complete. From conversing with Constance Malpas at OCLC Research I know that the median overlap rate of ARL schools with HathiTrust is still hovering in the mid-30 percent range, and the maximum does not seem to exceed 50%.

With those caveats in mind, where should we focus our efforts, and how do those align with CRL?

I think it should be obvious that there is too much to do by any one organization. Duplication of effort is not a winning strategy when resources are scarce. And when there is too much to do you should focus on exploiting your strengths. I think that the many libraries that are members of both CRL and HathiTrust (and others who are not) would expect us to pursue complementarity. That is just obvious to me.

The results of a recent survey of our membership, which are still preliminary, offer some additional guideposts.[7] It’s clear that above all else our members highly value the work we’ve done with published textual materials. They care about improving the quality of the corpus and completing gaps in sets and in subjects more than they do about expanding our scope into other formats. This might include a focus on material that Google has not scanned or that has been missed.

There is continued strong support for our US Federal Documents Initiative, which is the one area we have identified as a collection/digitization priority. There we are attempting to identify a corpus of existing federal documents that can be scanned and added to HathiTrust. This program, our shared print monograph program, and improving and “completing” our existing corpus, all will require a much clearer sense of the characteristics of our digital corpus, the existing collective print corpus, and the relationship between the two. It’s clear our interest here overlaps with that of CRL, as well as OCLC, and we can be working together to advance both digital access and print archiving.

Stepping away from HathiTrust specifically, I believe that just as our “community” long-term vision for shared print is hazy, so is our community long-term vision for digitization. We have all operated in the last decade with the knowledge that Google has been scanning materials found in circulating collections, but the extent of that work is not widely understood. Appropriately we see some strategies developed around certain types of materials, such as federal documents within the HathiTrust membership, or the National Digital Newspaper Program from the National Endowment for the Humanities and the Library of Congress. Foundations and national funding agencies have generally shifted their priorities for digitization funding, if they have any, towards scholar-driven selection and commitment to use. Local strategies wisely focus on what is unique and distinctive from their collections. There are attempts to coordinate as we become aware of opportunities to do so. For example, CRL has focused on state and international documents, which for HathiTrust are of secondary interest compared with US federal documents.

To the extent that there is alignment among these different programs, it is far less intentional than it might be. All of these loosely coordinated programs have accomplished a great deal, but again I wonder if we can begin to more intentionally craft a vision that could be used as a guide to our collective future investments in digitization. Earlier I linked digitization to print archiving as a part of what a twenty-year vision might include. A twenty-year vision for digitization might assume that we will have achieved the creation of a comprehensive corpus of reformatted items, but would also assume the longevity of print as an important and necessary mode of access that exists symbiotically with digital. This implies that we must continue digitizing materials from the early 2000s onward if existing digital versions cannot be located, and in turn this implies that library digitization strategies rely heavily on the fair use decisions we have seen in the Authors Guild cases to include in-copyright materials. It would promote more strategic uses of digitized in-copyright materials (for example, non-consumptive, computational access should be a standard mode of access, just like on-screen reading).[8] And it should lay out principles of engagement with the commercial sector to help evaluate opportunities for partnership and licensing that will arise.

In the interest of ensuring the Future of Print, is it reasonable to treat digitization as a necessary, required step in print archiving? Here I am imagining a concerted effort to identify sets of materials that should be retained in print, followed by a commitment to digitize them.[9] In fact, I think that CRL and HathiTrust are both well positioned to examine how we could better link print archiving and digitization going forward, given our respective commitments to these issues and the large number of member libraries we share in common.

To link print archiving to a commitment to digitize would, at the large scale, require significant effort to identify the corpus of material that should be retained and registered as such. To do it well would require significant coordination across multiple existing programs. And significant funding too. Here is where reality comes back to keep us in our corners. We think we have come a very long way in digitization and print archiving, but when we begin to really look closely we can see that we have an even longer way to go. If we could take some time to step back and define where we want to end up in our print archiving work, I think we’d stand a better chance of finding our way there.

But I come back to where I began: what do you want that student in twenty years to be able to do with library collections?

NOTES

[1] The “Google Five” were Michigan, Stanford, Harvard, New York Public, and Oxford. See the New York Times story “Google is Adding Major Libraries to its Database,” December 14, 2004. http://www.nytimes.com/2004/12/14/technology/google-is-adding-major-libraries-to-its-database.html

[2] See OCLC Research’s 2012 report Print Management at “Mega-scale”: A Regional Perspective on Print Book Collections in North America, written by Brian Lavoie, Constance Malpas, and JD Shipengrover, online at http://www.oclc.org/content/dam/research/publications/library/2012/2012-....

[3] See the HathiTrust bylaws, specifically Article I – Purpose online at https://www.hathitrust.org/bylaws

[4] The preliminary planning for this program is summarized in the Final Report of the HathiTrust Print Monograph Archive Planning Task Force, published with commentary at https://www.hathitrust.org/files/sharedprintreport.pdf

[5] For example, some members have a great deal of difficulty extracting holdings data from their ILS systems, either because the system was not designed with this reporting in mind, or the library lacks staff resources to spend a lot of time on it.

[6] HathiTrust relies on OCLC IDs to match records between a library’s local holdings and the volumes in HathiTrust’s digital collection.

[7] We expect to publish the results of the survey in summer 2016.

[8] For us, enabling computational analysis for the HathiTrust collection is a critical mode of access, which we support through the HathiTrust Research Center at Indiana and Illinois. We’ve seen some promising work using content analysis that could contribute to improvements or enhancements in the descriptive metadata of our corpus. We’ve also learned that researchers pose questions that are corpus agnostic. The HathiTrust collection is great, but they also want to compute against several other collections, and it is complicated for them to do that. Extending Research Center services across other corpora, such as CRL’s, could be helpful to both scholars and to libraries.

[9] The partnership between CRL and Linda Hall Library is a model for this. Together they’ve committed to a joint partnership to preserve and develop historical research collections in the fields of science, technology and engineering. They have developed a joint collection strategy, and have recently announced plans to begin digitizing pre-1950 titles. See https://www.crl.edu/news/crl-and-linda-hall-library-digitize-historical-serials

On Extended Collective Licensing

mmstewa — Tue, 17 Nov 2015 14:58:49 +0000

Written by Mike Furlough

In June, the United States Copyright Office (USCO) released “Orphan Works and Mass Digitization,” a detailed report proposing new orphan works legislation and a pilot extended collective license (ECL) for in-copyright, published works that have been digitized. In response to the USCO’s request for feedback on the collective licensing proposal, over 80 individuals and organizations submitted comments.

HathiTrust opposes the implementation of Extended Collective Licensing in the United States. In our filed comments I write that “the USCO has made a good faith effort to find ways that the public could gain even more benefit from mass digitization programs. But the program as described threatens to erode rights already available to users and it has no obvious source of sustainable funding.”

An ECL applies broadly to a class of rights holders (songwriters, say), who are collectively represented by a rights management society (ASCAP or BMI, for example), which licenses their works to and collects royalties from other organizations (like radio stations) or individuals. ECLs are usually enacted through legislation that defines the role that a potential collecting organization must play and lays out the rules by which rights holders may opt out. Licensees are free to negotiate the terms of the license with collecting organizations or to choose not to participate, but if they don’t they will have limited options to make use of those works.

As my example above suggests, an ECL-type system is already in place in the US for music, but the Copyright Office’s proposal concerns, among other things, published “literary” works: books and serials. Such ECLs already exist in several Nordic countries, but not in a country with a publishing market as large as the United States. For an ECL to work, the rights management organization needs to be able to find those who own the rights to the works being licensed. In the context of mass digitization or extremely large collections of material, such as HathiTrust or Google Books, this is no small task, and one that would require significant upfront investment as well as potentially high licensing fees.

Users in the United States have many rights that are not subject to licensing. Since 2012 courts have consistently ruled that mass digitization to support activities such as full-text search or access for print-disabled users are fair uses under US law. Section 108 of the Copyright Act lays out many uses that libraries can make on behalf of their patrons. Supporters of ECLs argue that there are other uses that could be enabled through a license, which may be true, but libraries’ long history with electronic licensing has taught us to look critically at this argument. Many libraries have been presented with licenses that forbid fundamental activities such as document delivery. In theory it may be possible to create a collective license that is affordable and protects all existing rights that users and libraries have, but it is doubtful that this could actually be done for books and meet expectations of licensors and licensees.

Others have publicly expressed their opposition. The Authors Alliance analyzed all responses and noted that “only nine unambiguously supported the initiative” while 52 expressed opposition. Their analysis points out an interesting lack of consensus among creators and the organizations that would represent them. The supporters of ECL are organizations that would potentially act as collecting societies, licensing works and distributing payments to rights holders. The opponents are not only library-focused organizations like HathiTrust, but also creators’ organizations and individual authors.

The widespread opposition to ECLs should lay to rest this idea, but the Copyright Office has not yet announced its own response to these comments. Fundamentally, our biggest challenge to using copyrighted works comes from our limited knowledge about the copyright status of published works. While ECL is a bad idea, I hope that the community's comments will spur the Copyright Office to work more closely with the academic and library community to address this problem.

Beyond Google Books: Getting Locally-Digitized Material into HathiTrust

hathitrust — Thu, 18 Jun 2015 04:00:00 +0000

By Aaron Elkiss, University of Michigan Library, Cross-posted from the University of Michigan Library Tech Talk Blog

HathiTrust was founded in 2008 as a partnership of the (at the time) 13 universities of the Committee on Institutional Cooperation plus the University of California, all of whom were digitizing or had plans to digitize volumes through the Google Books program. Even those original partners had a variety of past and present digitization projects beyond Google Books, though, and a goal from early on was to support preservation and access to a wide array of digitized book material. Not all material is suitable for digitization with Google Books, and not all institutions with material to digitize are Google Books partners!

Early Attempts at Non-Google Content in HathiTrust

In the 2009-2010 time frame, HathiTrust started experimenting with non-Google content. In particular we added support for content digitized by the Internet Archive; we also made some early attempts at migrating content from Michigan's existing repository. (See our previous blog post on this effort.) Additionally, we added support for an impressive collection of incunabula digitized in-house by the Universidad Complutense de Madrid and volumes digitized at Yale University with support from Microsoft.

We quickly found that it would not be sustainable to add support for every new source of content coming into HathiTrust, especially when each source was only contributing a small number of volumes. Even when content coming from a source was relatively well-prepared and homogeneous, it usually differed enough from existing content in HathiTrust to require a great deal of investigation and craft work. Because HathiTrust started out with only material from Google Books, the process for getting material into HathiTrust was initially very specific to Google Books. The format for packaging content was essentially tied to what Google Books was providing, and the specifications for images in HathiTrust were based on digitization specifications that are geared towards technically sophisticated and well-resourced institutions who are newly digitizing content. There are stringent technical specifications for the images as well as specific descriptive metadata that image creators need to embed in the files. These specifications are in place for good reason: to support both preservation of and access to all material, the same standards must be applied to all digitized book content entering the repository. The HathiTrust book viewing application generates images on the fly from the preservation copies in the repository; so, the less variation in the repository, the easier it is to assure correct operation. Additionally, less variation means that (in theory) any future format migration or other preservation actions would have fewer issues to consider.

Our first attempt to solve the problem of getting content into HathiTrust from disparate sources was to distribute a version of the toolkit we had developed to handle material from Internet Archive, Yale, Madrid, and Michigan. The toolkit contained functionality to fix metadata embedded in image files, generate Metadata Encoding & Transmission Standard (METS) files, and validate submission packages. However, this toolkit could only work for a limited audience. Just installing the toolkit was non-trivial because of the large number of external dependencies. We tried to minimize the number of assumptions the toolkit made about its environment, but it still required Linux and a lot of prerequisite software. Also, the toolkit wasn’t a complete out-of-the-box solution, unless the content happened to be exactly like something we’d already handled. It required time and programming expertise to create code describing a new package format and the steps needed to transform the package into something meeting HathiTrust’s requirements. Because of these high technical requirements, it took a fair amount of our time to support the few institutions who tried making use of it. There were a few successes with the toolkit, most notably with content from Texas A&M University, but not many other institutions had the resources to make successful use of the toolkit.

A Simpler HathiTrust Submission Package

After these setbacks, we thought about what else we could do to make it easier for partner institutions to submit content. We didn’t really have the resources to make the toolkit easier to install or use, so instead, we came up with a way to relax requirements for submitting content to HathiTrust while maintaining the same high standards in the repository. We already had code to bridge the gap between the HathiTrust specifications and images coming from the Internet Archive, Michigan’s existing repository, etc., as well as to create METS files from external metadata.

So, we made specifications for a simpler submission package format that eased the requirements in several ways:

Rather than requiring a METS file, partners can just create a simple YAML file with some basic metadata and a checksum.md5 file listing the files and their MD5 checksums.
Instead of requiring JPEG 2000 images prepared in a specific way, partners can submit nearly any black & white, greyscale, or full-color red/green/blue TIFF file and we will automatically create the preservation JPEG 2000 copies.
Metadata no longer needs to be present in specific places in the images. Partners can provide it in the YAML file and we will insert it in the preservation copies of the images.
Partners can submit content to a Box share rather than needing to make individualized arrangements to transfer content to HathiTrust.
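
As a rough illustration of the checksum step, here is a minimal Python sketch that writes a checksum.md5 file for every file in a package directory. The function names are hypothetical, and the "checksum, two spaces, filename" line layout is an assumption borrowed from the md5sum tool's output; check the published specification for the exact format expected.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(package_dir: Path, name: str = "checksum.md5") -> Path:
    """List every file in package_dir (except the checksum file itself).

    Assumed line layout, as produced by md5sum: "<md5>  <filename>".
    """
    lines = []
    for path in sorted(package_dir.iterdir()):
        if path.is_file() and path.name != name:
            lines.append(f"{md5_of_file(path)}  {path.name}")
    out = package_dir / name
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Calling write_checksum_file(Path("my_package")) on a directory of page images, text files, and the YAML file would produce one line per file.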

There are some issues that the simple package format doesn’t solve — they’re core issues to digitization that any institution doing digital preservation would need to consider. Partners will need to:

Decide what individual objects will consist of in HathiTrust. For monographs, the object in HathiTrust is just the individual book, but for serials, other multiple items bound together, and other more complicated cases, institutions must make some decisions.
Scan images at appropriate resolutions. If the images aren’t high enough resolution to meet HathiTrust’s preservation standards (300 pixels per inch for greyscale or full-color images, 600 pixels per inch for black-and-white), then the only option might be to re-scan those volumes.
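
The resolution thresholds above can be expressed as a simple pre-submission check. This is only a sketch: the category names and the function are hypothetical, and the resolution value is assumed to have already been read from the image by whatever tool you use.

```python
# Minimum pixels per inch, as stated in the post.
MIN_PPI = {
    "bitonal": 600,    # black-and-white images
    "greyscale": 300,
    "color": 300,      # full-color red/green/blue images
}


def meets_resolution(kind: str, ppi: float) -> bool:
    """Return True if an image of the given kind meets the minimum resolution."""
    if kind not in MIN_PPI:
        raise ValueError(f"unknown image kind: {kind!r}")
    return ppi >= MIN_PPI[kind]
```

For example, a 400 ppi greyscale scan passes, while a 400 ppi black-and-white scan would need to be re-scanned.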

There’s also still some work institutions might need to do up front. Institutions must:

Convert non-TIFF/JPEG2000 images to greyscale or full-color red/green/blue TIFFs before submission. We can’t currently handle anything other than TIFFs or JPEG2000 images, non-red/green/blue or greyscale images, or images with 16 bit color depth.
Perform optical character recognition and produce one plain text file per page image. Most OCR software should be able to do this out of the box.

The requirements are much less technically demanding than using the ingest toolkit, though. So far 7 institutions have successfully submitted over 4,000 locally-digitized volumes to HathiTrust in this new format, and several others have started experimenting with it.

Get in touch with us at support@hathitrust.org if your institution is interested in submitting content to HathiTrust, or has any questions or comments about the process!

Further Information:

We’ve made available the specifications for the simple submission package format and an example YAML file.

Additionally, there are a number of free tools that can help with preparing submission packages. If your images are not already TIFF or JPEG 2000 images, take a look at ImageMagick. In most cases you can convert images to TIFFs just by running (for example):

convert infile.jpg outfile.tif

On Windows, IrfanView is a free tool that can do batch conversion between image file formats as well as batch renaming; md5sum from GNU CoreUtils for Windows is helpful for generating file checksums.

Editors like Notepad++ and TextWrangler can help with creating YAML files. YAML can be a little fussy about syntax and spacing, so there are online YAML validators available to check the files you create.

Quality in HathiTrust

hathitrust — Wed, 13 May 2015 15:21:18 +0000

by Jeremy York (HathiTrust) and Kat Hagedorn (University of Michigan Library)

As reported in our monthly updates, we receive well over a hundred inquiries every month about quality problems with page images or OCR text of volumes in HathiTrust. That’s the bad news. The good news is that in most of these cases, there is something we can do about it. This blog post is intended to shed some light on our thinking and practices about quality in HathiTrust. We hope it will also encourage you to report any problems you might find so that we might have the opportunity to fix them, and deliver the highest quality collections we can for educational and research needs.

HathiTrust and Quality

We go to great lengths to ensure we have the highest possible quality volumes in HathiTrust. Our approach to quality at a broad level is outlined in our commitment to quality. On a day-to-day level, we strive to offer one of the best user support teams around, responding to reported issues and providing updates as we make progress on addressing them. Someone might reasonably wonder, however, why there are quality problems in HathiTrust at all? Shouldn’t libraries, or HathiTrust, have better quality control? Aren’t librarians primarily concerned about information quality?

Of course we are. As libraries - as HathiTrust - we strive for the highest quality of digitization and digital production. This striving can be tempered, however, by very practical concerns about fulfilling the needs of users, and fitting the investment of time and resources in digital conversion to the purpose of providing greater, enduring access to the materials in our collections.

Collaboration at Scale: A Digitization Cornucopia

One factor contributing to the presence of errors in HathiTrust is the sheer scale at which we operate. It would be truly remarkable if there were no mistakes in the more than 13 million digital items contributed by partnering institutions and others to the digital library. Errors at this scale are par for the course. Why is this so?

There are a variety of approaches that libraries take toward digitization. The choice of a particular approach may be influenced by the amount of materials to be digitized, the time and resources available for digitization, and, significantly, the intended purpose of digitization. For instance, it could be important to preserve, as much as possible for users, the artifactual value of the print original - the texture and color of the pages, the wear and tear on the book, etc. On the other hand it may be important, or sufficient for the purpose (or at least, initial purpose), to target the intellectual content in the book for preservation, rather than the intellectual content and artifact together. In the first instance, manual review and manipulation of each digitized image may be needed to achieve the desired end. The second instance, however, may lend itself well to a larger scale digitization project, where a higher production rate could be achieved by focusing (for instance) on capture of the printed text in the book with a high level of accuracy (not necessarily to the exclusion of physical attributes of the book), rather than fuller characteristics of the artifact, which may be more time-consuming.

The variety of approaches libraries have taken to digitization is reflected in the materials available in HathiTrust. Some, such as the collection of Islamic Manuscripts from the University of Michigan or incunabula from the Universidad Complutense de Madrid, are the result of specialized digitization of rare or valuable materials. Others are volumes digitized in large-scale initiatives by Google or the Internet Archive, or smaller scale initiatives carried out by libraries or third party vendors. Still others, such as volumes from Utah State University Press or Knowledge Unlatched, are the original digital files that were used for print or digital publication.

For the digital content we ingest, HathiTrust has established specifications related to image formats, resolution, color space, and other characteristics. Rigorous validation ensures that these specifications are met. The methods of production or processing of digitized items may leave fingerprints of some sort, however. These may be benign, such as the presence of digitization color targets, added coversheets, book cradles, or a characteristic coloration of pages, which do not generally interfere with the display or understanding of the original object and its content. They may also be more serious, including mis-colorations of pages, human fingers in the images, systemic cropping, warping, or bolded or light text—problems that do interfere with legibility or clarity of the image.

Not surprisingly, materials produced through large-scale digitization are the most likely to have quality problems. In such projects, it is simply not feasible to review the quality of each individual digitized image, and libraries are more likely to rely on sampling pages for quality review, or using automated metrics (Google scanning is a good example) to understand the quality of digitization outputs. Though errors are observed in large-scale digitization, the benefits of being able to search and use millions of volumes from library collections in digital form are so significant, and have had such an impact on the types and scope of research performed using library collections, that it is something libraries worldwide continue to pursue.

New Approach, New Responsibility

Some might say that libraries that engage in large-scale digitization are trading quality for quantity, and this would not be completely wrong (or necessarily bad). When faced with a choice of digitizing books at a rate of thousands per year for hundreds of years, as opposed to millions of books in a decade, many libraries chose the latter. Another way to look at it, however, is that libraries are reconceiving the way they offer services to their constituencies and tailoring their strategies and investments to meet user needs in the most appropriate way. In our preservation and access activities in HathiTrust, for example, we strive for both quantity and quality, working to provide materials that are fit for the purposes they will be used for, and matching the investment of resources to the effort required to produce that fit result.

It is part and parcel of this approach (and critical to our mission) that when we receive reports of problems with materials in HathiTrust’s collections, we do our utmost to address them. Our users are in the best position to recognize problems and let us know when the quality of volumes is not sufficient to meet their needs, and we take this very seriously. We recognize that the trustworthiness of our repository lies not only in preserving and providing access to content over the long term, but also in responding to problems in the short term when possible, and in tracking problems and developing strategies to address them in the future when not.

Corrections: The Nitty Gritty

As might be expected, corrections are a manual process, one both labor-intensive and time-consuming. Not all problems can be addressed in an ideal time frame. Some problems (see the table below) may be systemic across volumes digitized using a particular approach. Foldout pages that are scanned while folded, obscuring content, are a good example. Moire problems in scanned images are another. In the future, large-scale remediation efforts may be the most effective strategy for addressing these problems. In the meantime, we are committed to working with vendors, our partner institutions, and any entities that deposit materials in HathiTrust, to prioritize and address quality issues and achieve a collection over time that is fit for the purposes our researchers want to make of it.

Here’s what we’ve done so far:

From the time HathiTrust was launched to the present, 6,499 volumes have been reported to have some kind of quality issue. As of May 4, 2015, we have managed to fix the problems in 2,310 of these. Overall, of 1,141 problem reports on full view volumes that are known to come from end users, which are prioritized (many problem reports come from staff at partner institutions engaged in copyright review of limited view materials), we were able to fix 913 of them, a total of more than 80%!

The correction methods for the 2,310 volumes break down as follows:

468 were corrected through the contributing institution rescanning and replacing individual problem pages with corrected ones;
128 were corrected by the contributing institution completely re-scanning and re-depositing the volume;
in the remaining cases the vendor made corrections and volumes were re-ingested into the digital library.
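
For readers who want to check the figures, the arithmetic behind these numbers can be reproduced in a few lines of Python. The variable names are ours; the counts are the ones quoted above.

```python
# Counts quoted in the post (as of May 4, 2015).
volumes_reported = 6499
volumes_fixed = 2310
fixed_by_page_replacement = 468
fixed_by_full_rescan = 128

# "The remaining cases" are those corrected by the vendor.
fixed_by_vendor = volumes_fixed - fixed_by_page_replacement - fixed_by_full_rescan

# Share of all reported volumes that have been fixed so far.
overall_fix_rate = volumes_fixed / volumes_reported

# Prioritized reports on full view volumes known to come from end users.
end_user_reports = 1141
end_user_fixed = 913
end_user_fix_rate = end_user_fixed / end_user_reports

print(f"Fixed by vendor: {fixed_by_vendor}")
print(f"Overall fix rate: {overall_fix_rate:.1%}")
print(f"End-user fix rate: {end_user_fix_rate:.1%}")
```

The vendor-corrected remainder works out to 1,714 volumes, and the end-user fix rate comes out just above 80%, matching the figure in the text.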

Types of Problems and Outlook for Correction

Some of the most common types of problems we receive reports of are listed below. Along with each problem we have given some information about what we are able, or in some cases not able, to do to address them. We hope this will provide somewhat of a guide to understand how we view and prioritize problems.

Problem	Description	Outlook
Warp	Either the words on a page are strangely stretched, or the page itself is.	This error can be introduced during the image processing stage after a digital image has been captured. It can frequently be corrected through re-processing of the image by the vendor. See appendix for example.
Skew or crop	Pages are sometimes not correctly aligned to the vertical or horizontal axis of display, or the pages are cropped inappropriately (such as when the inner gutter is cropped and it is not possible to see the first few letters of words on each line).	Similar to warp, this problem can often be fixed through re-processing by the vendor. In extreme cases, the page may need to be re-scanned and inserted. Sometimes cropping occurs because of a tight gutter in the book that was scanned that prevented content from being captured without damaging the book. In these cases also, re-scanning may be the only option. See appendix for example.
Upside-down pages	Self-explanatory	Upside down pages are generally fairly easy for vendors to address when we let them know. They re-package and deliver the content and we re-ingest.
OCR text quality	OCR may be unintelligible or absent.	We require deposited volumes to have OCR text when OCR text can reasonably be obtained (we do not require OCR, for example, for hand-written manuscripts). If the OCR quality is poor and is due to the nature of the material (e.g., tables and charts, small text, or text in columns can be difficult for machines to OCR), the outlook for correction is not good. If there are obvious errors, e.g., a volume is in English but the OCR is Russian, the outlook is good. And there are shades in between. In general, vendor OCR capabilities (certainly Google’s capabilities) have increased over time, so simple re-processing could go a long way. However, OCR corrections are exclusively machine generated--we do not currently manually correct OCR nor do we have a process in place for accepting manual corrections to OCR. See appendix for example.
Folded foldouts (maps, plates, tables)	Foldouts may be scanned while folded, causing much or all of the content to be inaccessible. This is a common practice in large-scale digitization due to goals for high throughput.	In these cases, we ask the contributing institution to rescan the foldout in its unfolded state and insert it into the volume. See more about this process below. See appendix for example.
Missing pages	Self-explanatory	Scanned volumes are sometimes missing pages because the original volume was missing pages, and sometimes due to problems in the way the scanned images are processed and assembled by the vendor. We have had a relatively high rate of success in obtaining missing pages (either through vendor or contributing institution correction) when they result from a vendor problem.
Moire	Moire is an often checkered or lined pattern superimposed on illustrations, that is introduced at the time of scanning.	Moire can only be fixed by re-scanning the affected pages. Because illustrations with moire are generally legible to some degree, moire is often not considered as high of a priority for correction in comparison with issues where content is clearly missing, obscured, or illegible. See appendix for example.

The following institutions currently have processes in place to rescan and replace individual pages of digitized books. Keep in mind that these processes are run, and volumes processed, only as resources allow.

Cornell University
Northwestern University
Penn State
University of Michigan

As mentioned earlier, your inquiries are the primary way that we identify problems with HathiTrust volumes. To report a quality issue in a volume, use the “Feedback” link at the top of the page where you see the error. This feedback goes to members of the HathiTrust User Support Working Group (view full membership and charge), who will reply promptly and keep you informed of progress. Sometimes fixes take days or weeks. They can also at times take upwards of 6 months because of the time it takes to rescan pages or volumes. Please don’t get discouraged. We may not be able to fix all problems immediately, but we will certainly try (an 80% success rate is pretty good!) or save them for a future time. Help us create a corpus that will meet your needs to the fullest, and the needs of future generations as well.

Appendix: Error Examples

Warp Example

Crop example

Skew Example

OCR Example

Unfolded Foldout Example

Moire Example

DPLA and HathiTrust Partnership Supports Open E-Book Programs

hathitrust — Tue, 21 Apr 2015 16:00:00 +0000

By Dan Cohen and Mike Furlough

The Digital Public Library of America and HathiTrust have had a strong relationship since DPLA’s inception in 2013. As part of our ongoing collaboration to host and make digitized books widely available, we are now working to see how we can provide our services to exciting new initiatives that bring ebooks to everyone.

The Humanities Open Book grant program, a joint initiative of the National Endowment for the Humanities and the Andrew W. Mellon Foundation, is exactly the kind of program we wish to support, and we stand ready to do so. Under this funding program, NEH and Mellon will award grants to publishers to identify select previously published books and acquire the appropriate rights to produce an open access e-book edition available under a Creative Commons license. Participants in the program must deposit an EPUB version of the book in a trusted preservation service to ensure future access.

HathiTrust and DPLA together offer a preservation and access service solution for these re-released titles. Since 2013 public domain and open access titles in HathiTrust have been made available through the Digital Public Library of America. HathiTrust recently added its 5 millionth open e-book volume to its collection, and as a result DPLA now includes over 2.3 million unique e-book titles digitized by HathiTrust’s partner institutions, providing readers with improved ability to find and read these works. Materials added to the HathiTrust collections are available through its full-text search, can be made available to users with print disabilities, and become part of the corpus of materials available for computational research at the HathiTrust Research Center. By serving as a DPLA content hub, HathiTrust can ensure that open access e-books are immediately discoverable through DPLA.

Improving the e-book ecosystem is a major focus of DPLA’s and is an important theme at DPLAfest 2015 in Indianapolis. The Humanities Open Book program is just one example of current work to make previously published works available again in open electronic form. A parallel initiative from the Authors Alliance focuses on helping authors regain the rights to their works so that they can be released under more permissive licenses. Publishers are also exploring open access models for newly published scholarly books through programs such as the University of California Press’s Luminos. DPLA and HathiTrust applaud these efforts, and we hope that these initiatives can avoid becoming fragmented by being aggregated through community-focused platforms like DPLA and HathiTrust.

We are both very pleased that we can provide additional support for the fantastic work that NEH and Mellon are supporting through the Humanities Open Book Program. Publishers who are contemplating proposals to NEH may find that works are already digitized in HathiTrust, and may choose to open them as part of the grant planning process (see http://www.hathitrust.org/permissions_agreement). In the coming months we’ll be happy to advise potential applicants to this program, or any other rightsholder who would like to know more about the services of DPLA and HathiTrust.

***

Contact:

Mike Furlough, furlough@hathitrust.org

Dan Cohen, dan@dp.la

Getting to 5 Million: HathiTrust's Collection of Open Books

hathitrust — Fri, 10 Apr 2015 10:42:02 +0000

by Mike Furlough, Executive Director

April 10, 2015

Just before the end of March we reached a significant milestone when we added the 5 millionth volume that is open for reading and downloading. Like any research library collection, HathiTrust is nothing if not eclectic, as evidenced by our 5 millionth volume, contributed by Ohio State University: A treatise on the disorders and deformities of the teeth and gums, explaining the most rational methods of treating their diseases by Thomas Berdmore (London, 1770). Berdmore was King George III’s dentist. According to his VIAF record he died at the age of 45 and his work was also published in Dutch. (Alert the medical historian in your family!)

Earlier this week, Rick Anderson offered in the Scholarly Kitchen a nice commentary on the significance of this event for the public at large and for the library community. Rick sketches the broad outline of how we got here today, through partnerships with Google and among research libraries. I want to add that the HathiTrust collection is the product of hundreds, if not thousands, of people at our 105 member institutions, at Google, at Internet Archive, and many other content producers and distributors. This includes executives and front-line staff, librarians and publishers, technologists and end users. It’s an honor to be involved with it today.

In this post, I’d like to go into some detail to shed more light on the characteristics of these materials, our work to open them, and how we are endeavoring to make them as useful as possible. Numbers geeks, get ready.

Public Domain and Open

I’m referring to 5 million open works, not public domain works. That’s because nearly 20,000 works in this group have been licensed by the rights holder to be opened for public display. (That’s a very small percentage, but is still a substantial number of affirmative choices to support open access.) We’ve tended to use the “public domain” as shorthand for the entire set of works, but in this post I’ll split hairs to better explain our processes and policy.

Some of our collection is public domain only in the United States and thus can’t be accessed outside of the US (there is also a portion that are in the public domain only outside the US). Here’s a snapshot as of April 1. (The total number of volumes held in HathiTrust on that date was 13,305,071.)

Status	Count	Percentage of Open Works	Percentage of Entire Collection
Public Domain - Worldwide	3,080,031	61.54%	23.15%
Public Domain - US	1,901,044	37.98%	14.29%
Public Domain - Only outside US	4,451	0.09%	0.03%
Total Public Domain	4,981,075	99.52%	37.44%
Creative Commons and Open Access Licensed	19,425	0.39%	0.15%
Total Open Volumes	5,004,951	100%	37.62%
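
The percentages in the table can be recomputed from the counts, which also makes the row structure explicit: the "Total Public Domain" row is the sum of the worldwide and US rows, and the open total adds the outside-US and licensed rows on top of that. The short Python sketch below does this; the variable and function names are ours.

```python
# Counts from the April 1, 2015 snapshot in the table above.
pd_worldwide = 3_080_031
pd_us = 1_901_044
pd_outside_us_only = 4_451
licensed = 19_425
collection_total = 13_305_071

# "Total Public Domain" in the table covers the worldwide and US rows.
total_public_domain = pd_worldwide + pd_us

# "Total Open Volumes" adds the outside-US-only and licensed rows.
total_open = total_public_domain + pd_outside_us_only + licensed


def percent(part: int, whole: int) -> float:
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100 * part / whole, 2)


for label, count in [
    ("Public Domain - Worldwide", pd_worldwide),
    ("Public Domain - US", pd_us),
    ("Public Domain - Only outside US", pd_outside_us_only),
    ("Total Public Domain", total_public_domain),
    ("Creative Commons and Open Access Licensed", licensed),
    ("Total Open Volumes", total_open),
]:
    print(label, count, percent(count, total_open), percent(count, collection_total))
```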

Furthermore, these 5 million volumes correspond to 2,365,771 unique titles. This is because some of these titles are multi-volume works. At the end of this post I’ve appended a few links and graphics to provide you with more detail about the nature of the collection. I don’t have space to discuss each one here, but take a look and comment if you have questions.

Let’s look at how we determine the rights status of books in HathiTrust.

How did we get 5 million open volumes?

There are several processes that we use to determine if we can open a book for public view.

Automated bibliographic metadata review

Whenever an item is added to the collection, an initial automated rights determination is made using information in the item’s bibliographic metadata, such as publication date and place of publication (details of the process are available on our website). For United States works, anything published before 1923 is identified as being in the public domain worldwide. US federal government documents, no matter when they were published, are identified the same way. There are more than 612,484 federal documents available in the collection today.

Because copyright term in other countries is often determined by the death date of the author, and because the term varies country by country, it’s harder to verify copyright status of works published outside of the US. We’re cautious here and will only open works published outside of the US more than 140 years ago. In other words, all works published before 1875 are now marked as public domain worldwide. We treat works that were published outside the United States less than 140 years ago but before 1923 as public domain only when viewed from within the United States. There are a small number of works that are in the public domain only outside the United States. These are manually determined to be so due to provisions of the General Agreement on Tariffs and Trade (GATT).
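
To make the rules above concrete, here is a simplified Python sketch of the date-based logic. It is not HathiTrust's actual code: the record fields, function name, and status labels are hypothetical, the cutoff years are the ones given in this post (they move over time), and the real process uses more bibliographic detail than shown here.

```python
from dataclasses import dataclass

# Cutoffs as described in this post (April 2015).
US_CUTOFF_YEAR = 1923       # US works published before this year
NON_US_CUTOFF_YEAR = 1875   # non-US works published before this year


@dataclass
class Record:
    pub_year: int
    published_in_us: bool
    us_federal_document: bool = False


def initial_rights(record: Record) -> str:
    """Return an initial, automated rights status for a catalog record."""
    # US federal government documents are open regardless of date.
    if record.us_federal_document:
        return "public domain worldwide"
    if record.published_in_us:
        if record.pub_year < US_CUTOFF_YEAR:
            return "public domain worldwide"
        return "in copyright or undetermined"
    # Works published outside the US get the more cautious cutoff.
    if record.pub_year < NON_US_CUTOFF_YEAR:
        return "public domain worldwide"
    if record.pub_year < US_CUTOFF_YEAR:
        return "public domain in the US only"
    return "in copyright or undetermined"
```

Under these rules, a London imprint from 1770 is open worldwide, while a London imprint from 1900 is open only to readers in the United States. The small "public domain only outside the US" category is not modeled, since those determinations are made manually.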

More than a third of the total collection is thus made available, but a lot is left marked as in-copyright or of undetermined status. Because we know that much of this may be in the public domain, we also have a manual review program for some significant parts of our collection.

Manual Copyright Review: the Copyright Review Management System

Since 2008, the University of Michigan has had National Leadership Grants from IMLS to fund the development and operation of the Copyright Review Management System. This ambitious project draws on the expertise of copyright specialists at Michigan and research skills contributed by staff in dozens of partner libraries. CRMS incorporates a double-blind review process, in which two persons at different institutions are randomly assigned a work in a designated review queue to investigate. Two reviewers must confirm that an item is no longer protected by copyright for us to open the work. Split decisions are reviewed by a third, expert reviewer. If the evidence is incomplete and we can’t make a determination, we err on the side of caution and keep the work closed for viewing. (The full text of works that are not available for viewing is still available to be searched.)
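
The decision rule described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the CRMS implementation: the function name and the use of True, False, and None to stand for "public domain", "in copyright", and "could not determine" are our own assumptions.

```python
from typing import Optional

# True  = reviewer found the work to be in the public domain
# False = reviewer found the work to be in copyright
# None  = reviewer could not make a determination
Finding = Optional[bool]


def resolve_review(first: Finding, second: Finding, expert: Finding = None) -> str:
    """Combine two independent reviews, with an expert breaking any split."""
    if first == second:
        # Both reviewers agree; open only on two public domain findings.
        return "open" if first is True else "closed"
    # Split decision: the expert reviewer decides. Anything short of a
    # clear public domain finding keeps the work closed.
    return "open" if expert is True else "closed"
```

The work stays closed by default in every branch where the evidence is incomplete, which mirrors the "err on the side of caution" rule.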

The first phase of the project, CRMS-US, focused on works published in the US from 1923 to 1963, during the period when copyright status had to be renewed or certain formalities had to be followed to receive protection. The second phase, CRMS-World, has focused primarily on works published between 1876 and 1944 in the United Kingdom, Canada, and Australia, and uses primarily author death date and place of publication as factors to determine copyright status. The CRMS team also worked with the Berlin School of Library and Information Science at Humboldt University on a pilot project involving German language materials.

Over the past six years the CRMS project staff have reviewed 511,520 items and have been able to open 270,979, or 52.96%. In this final year of IMLS funding the team is developing a toolkit for future work on copyright determinations, and we’re examining what other parts of the HathiTrust collection might be well suited for this distributed, manual review process. We’d like to hear from any of you who have ideas about how this project could be expanded in the future.

Permission of the rights holder

As noted above, a small portion of these 5 million books are, strictly speaking, not public domain, but are licensed for public access, either through a Creative Commons license or a more general license for open access. This includes items licensed by individual authors (e.g., http://hdl.handle.net/2027/mdp.39015038573807), a publisher, such as the Brooklyn Museum, or some other creator or agent stewarding the rights of various works.

Giving permission to open your work under a CC or other license is straightforward, and we have instructions online (http://www.hathitrust.org/permissions_agreement). We’re more than happy to discuss this process with anyone interested, and can help with gathering the information needed for a large number of items.

Using Open Works

It’s our goal to ensure that the works that are open access or in the public domain are as widely accessible and usable as we can make them. To aid discovery we’ve been serving as a content hub for the Digital Public Library of America since 2013, and any of the open works are findable through their outstanding search interface. All of these 5 million items can be read online in the United States and users can download pages of the works regardless of whether their library is a member of HathiTrust or not; just over 3 million of them are available in this way anywhere in the world. More than 600,000 items carry no license restrictions (read more about restrictions) and can be downloaded in their entirety in the US; 380,000 are available for full download worldwide. Users at member institutions may download full copies of a broader range of works when they authenticate via their institution.

Third-party agreements and licenses may govern what you can do with some of the works you access. For example, the full range of Creative Commons licenses, from CC-BY to CC-BY-NC-ND, is in use in the collection, and works digitized by Google can be accessed via HathiTrust but may not be re-distributed or re-hosted. You can get more details about appropriate use of works by looking at the rights designation indicated for each. We explain the meaning of our rights designations here: http://www.hathitrust.org/access_use

Public domain, open access, and CC-licensed works are available for computational uses via the HathiTrust Research Center (HTRC). The HTRC, hosted by Indiana University and the University of Illinois at Urbana-Champaign, has developed infrastructure, services, and staffing to help researchers perform complex research using various data mining techniques. The HTRC has also created a dataset of extracted features derived from open volumes at the page level. It includes line counts, sentence counts, and counts of parts of speech (e.g., nouns and verbs). Through a recent grant from the National Endowment for the Humanities we’re also making an n-gram viewer using the Bookworm software, which allows anyone to track the appearance of words over time in the “public domain” corpus: http://bit.ly/1CndJle.

Beyond 5 Million

We’ll continue to increase the number of open items in HathiTrust, through natural growth and specific programs. As part of our Government Documents Initiative we are continuing to identify US federal publications in our collection that were not correctly cataloged, and many of our contributing members are still scanning these materials in large numbers. By the end of the year the number of federal documents should increase by tens of thousands at least. We’ve also been quietly working with several publishers and other organizations to collect and open more materials in the collection. And we are analyzing the collection to identify potential pools of materials that can be manually reviewed using the CRMS processes.

HathiTrust--that is, the entire collective of members--is proud to provide access to these 5 million volumes. But there’s still a lot of material to collect. In Rick’s Scholarly Kitchen post, he argues that libraries who “consider themselves to be on the front lines of providing open access to scholarly information” should be actively considering how to digitize and make accessible collections that can be made available without restriction. I agree with him, but please don’t forget that in-copyright works in the HathiTrust collection are also very valuable and can be used lawfully in specific ways. Everything is searchable--all 13 million volumes, regardless of copyright status---and we are planning to offer non-consumptive services on the entire collection through the HTRC (read more about the HTRC Data Capsule). In-copyright works can be made available to users at member institutions who have a print disability, opening up many more millions of volumes for a significantly underserved population of library users. Finally, in the United States we provide access for lost or damaged and out-of-print items under Section 108 of the Copyright Act.

These are all good reasons, supported by law, to continue to digitize in-copyright works, and this shouldn’t be forgotten. It’s our mission to preserve the record of human knowledge for future access. There will be a day--far in the future perhaps, but the day will come--when all of today’s HathiTrust collection will be readable by anyone. We’re preparing for that day now.

Appendix: Additional Data and Visualizations

Our Statistics and Visualizations page provides reports about the distribution of languages, publication dates, and LC classifications for the public domain (open) collection and the collection as a whole. These are updated each day, but here are some snapshots as of April 8, 2015, the day I am completing edits to this post.

Language Distribution of Open Works in HathiTrust, April 2015

English is closer to 50% of the entire collection, but because the CRMS project has focused primarily on English language materials it represents a larger part of the public domain collection.

Date of Publication Distribution of Open Works in HathiTrust, April 2015

Not surprisingly, most of the open works were published in the late 19th and early 20th centuries. The slight uptick from the 1970s to the 1990s is most likely due to the increasing numbers of US federal documents published in that period. Because the majority of our collection was digitized from library circulating collections, we have very little material dating before 1800.

Type of Works

In addition to these reports, we track other characteristics of the collection, including the type of work. The table below provides a recent snapshot of the collection by the type of work.

Type of Open Work	Number of Volumes	Number of Unique Titles
Single-volume monograph	2,416,746	1,922,399
Multi-volume monograph	1,083,469	330,699
Serials	1,519,766	112,673
Total	5,019,981	2,365,771
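The totals in this table can be checked with a few lines of Python. This is only a sanity-check sketch: the Serials title count is read as 112,673, the value that makes the title column add up to the stated total, and the volumes-per-title ratio is an extra figure computed here rather than one reported above.

```python
# (volumes, unique titles) for each type of open work, copied from the table.
rows = {
    "Single-volume monograph": (2_416_746, 1_922_399),
    "Multi-volume monograph": (1_083_469, 330_699),
    "Serials": (1_519_766, 112_673),
}

total_volumes = sum(volumes for volumes, _ in rows.values())
total_titles = sum(titles for _, titles in rows.values())

# Average number of volumes per unique title, rounded to two places.
volumes_per_title = {
    name: round(volumes / titles, 2) for name, (volumes, titles) in rows.items()
}
```

Both sums match the Total row, and the ratio shows why serials account for so many volumes relative to their title count.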

Overlap Analysis

Finally, we can track the prevalence of public domain works in the collections of HathiTrust Members. We periodically create graphics we call “H-plots.” To get these we run an overlap analysis for each member library’s collection holdings against the rest of the membership (see HathiTrust Print Holdings for details about the information we use in matching). This gives us a sense of the “uniqueness” of a member’s collection. In the example below, which I picked at random, we see how widely distributed are the items in the Library of Congress (LC) collection.

Here the X-axis is the number of times a book is found on the shelves in a HathiTrust member library; the Y-axis is the number of items in the collection that are duplicated. So, for example, there are about 160,000 items in the Library of Congress that can be found in five other member libraries, and of those about 125,000 are in-copyright and the remainder are open. This does not tell us about what has been digitized from the Library of Congress; it only reports on their physical collection holdings relative to the rest of the HathiTrust members. (In fact, every digitized item that the Library of Congress has added to HathiTrust is out-of-copyright.)

These H-plots are useful to get a sense of the general “rarity” of a library’s collection, but they may not always show a complete picture. For instance, there are items from institutions’ collections that are not represented (only items whose records include OCLC numbers are included - see the Print Holdings link above) and we may not be able to definitively match holdings across institutions (these will often show up as unique items held by an institution). Still, the H-plots provide a view of what is known about library collections, which we expect to increase over time as we continue our work. The most recent collection of graphs is available at http://bit.ly/1Ofrjyw.
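The counting behind an H-plot can be sketched roughly as follows. This is an illustration under stated assumptions, not our production analysis: holdings are modeled as one set of OCLC numbers per member, matching is exact on that number, and the library names, identifiers, and function name are made up.

```python
from collections import Counter

def overlap_histogram(holdings, member):
    """Map 'number of other members holding an item' to 'number of such items'."""
    others = [ids for name, ids in holdings.items() if name != member]
    histogram = Counter()
    for oclc in holdings[member]:
        # Count how many other members report the same OCLC number.
        held_by = sum(1 for ids in others if oclc in ids)
        histogram[held_by] += 1
    return histogram

# Hypothetical holdings keyed by member, with OCLC numbers as strings.
holdings = {
    "lib_a": {"1", "2", "3", "4"},
    "lib_b": {"2", "3"},
    "lib_c": {"3", "5"},
}
hist = overlap_histogram(holdings, "lib_a")
```

In this toy data, two of lib_a's items are held nowhere else, which is how unmatched records end up looking like unique holdings. Splitting each bar into in-copyright and open counts would need a rights status per item, which this sketch omits.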

Reflections on the First HathiTrust Member Meeting

hathitrust — Tue, 18 Nov 2014 23:31:18 +0000

By Mike Furlough, Executive Director, HathiTrust

Since I started as Executive Director of HathiTrust in May of this year, I have done nothing but learn: learn about the organization, our operations, our finances, our people, and our partnership. I have traveled quite a bit, especially this fall, paying visits to HathiTrust members (thank you, libraries of Carnegie Mellon, Pittsburgh, Harvard, and Northwestern), several meetings of library organizations and consortia (thank you TRLN, GWLA, COPPUL, and ASERL), as well as a couple of special focus meetings on digital humanities and newspaper digitization (thanks to you all as well).

Although I usually give a talk about HathiTrust during these trips, I consider these listening, not proselytizing visits. I am there to learn about the turning points at which these organizations find themselves today and what strategic issues they are focusing on. And through their questions I find out what matters to them about HathiTrust—or what would matter to them more if we were to tackle this problem or that. It’s also useful to find out what people don’t know or don’t understand, because it sometimes means that users may not be getting the full benefit of HathiTrust.

The standout event of my first six months was the 2014 HathiTrust Members Meeting, held in Washington, DC on October 11. This was the first meeting of our membership since the 2011 Constitutional Convention, after which we developed our new governance structure, and adopted our current financial model. This was a unique chance to bring our partners together to update them on our current initiatives and engage them to begin planning for the future. Evident throughout the day were the membership’s strong sense of shared responsibility for the success of HathiTrust and the excitement for what we have done and will do together.

Here I’d like to offer some reflections on the day’s discussions, highlighting a few specific initiatives and some questions about where we are going as an organization. There’s no way in a short blog post to cover every single issue that came up that day, let alone in the last six months, so I hope you will forgive omissions for the sake of brevity and post questions or contact me directly. If you are interested, we have a more detailed report on the Member Meeting, along with slides of most of the presentations.

First of all, the partnership is strong and continues to grow. After the 2011 Convention our membership increased from 64 to 101 member libraries and now includes four in Canada, one in Spain, and one in Australia. Over 60 individuals from 30 different member libraries currently serve on a HathiTrust working group, standing committee, or governance committee. Our infrastructure is strong and we have repeatedly confirmed that our work is grounded solidly in the law. A growing number of our members have identified staff to work with us to obtain access to the HathiTrust collection for users who have print disabilities. The collection has grown significantly. At the moment I am writing this we stand at 12.96 million volumes, 4.8 million of which are open for full-text access because they are no longer covered by copyright or because an author or publisher has made the material available using a Creative Commons license. We have struck some outstanding partnerships in the last several years, including one with the Digital Public Library of America, which is now a notable source of viewers and readers of HathiTrust collections. In short, our preservation and access services provide a very solid basis for future work.

And that future work will continue to transform how libraries serve their users and manage collections. During its inaugural year the Program Steering Committee (PSC) launched working groups to plan programs passed as ballot initiatives at our 2011 Constitutional Convention. One of these, a proposal to develop a shared and distributed print monographs archive, will promote collective and coherent decisions about the retention and long-term management of print collections. By organizing a distributed print collection corresponding to the HathiTrust digital collection, we can strengthen our preservation commitments and better ensure future access to the cultural record. The working group studying these issues will make their first recommendations in early 2015. The chair of this group, Tom Teper of the University of Illinois at Urbana-Champaign, reported on their work at the Member Meeting.

We have already taken action on another proposal from the Convention, one to expand and enhance access to US federal government publications. In 2013 we began the development of a Registry of US Federal Government Documents. More recently, the Government Documents Initiative Planning and Advisory Working Group, led by Mark Sandler of the Committee on Institutional Cooperation, has made preliminary recommendations that are now under review by the Program Steering Committee. Currently HathiTrust holds over 575,000 known US federal publications, but we believe there to be a substantial number of unidentified documents in the collection, and a much larger number of documents left undigitized. The recommendations of the Advisory Working Group include several that will strengthen the Registry project, and others that will help us to identify, source, and collect federal documents over the next several years. Mark Sandler also provided a report at the Washington meeting.

Stephen Downie of the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign reported on the HathiTrust Research Center. Our goal in supporting the Research Center is to simplify advanced computational access to our digital collection through services and infrastructure developed by experts. Downie, who along with Beth Plale from the School of Informatics and Computing at Indiana University co-directs the Research Center, outlined an ambitious agenda of service development, which will be furthered with substantial funding from HathiTrust and from both Illinois and Indiana. These plans include the development of training and services that can be integrated into services in a library’s research commons or in similarly-defined programs of advanced support for faculty and students. In addition to the development of these services, the Research Center has received funding for research from the Alfred P. Sloan Foundation, the Andrew W. Mellon Foundation, and the National Endowment for the Humanities. There is tremendous potential for the work undertaken by the Research Center to enable great improvements in the metadata and the content of the HathiTrust collections. They have recently announced the date for their next “Uncamp” (March 30-31, 2015 in Ann Arbor, MI) and released a request for proposals from which they will select projects for advanced research support from HTRC staff. (Researchers, including faculty and students, from HathiTrust member institutions have priority in this call.) The RFP includes detailed information.

Because we have developed such a strong organization, collection, and infrastructure, we can readily address these challenges of print management, document identification, and services for computational research. Yet with all of this underway, we are still growing as an organization, and much of our discussion during the Member Meeting focused on how we can collectively chart HathiTrust's future paths. At the 2011 Convention, attendees referred a ballot measure to expand the mission of HathiTrust to the new Board of Governors for action. In Washington, board member Brian Schottlaender presented a draft of new language for the Bylaws (Section I - Purpose), developed in response. The language proposed makes clear that HathiTrust should not be as format-bound as we have been in the past. The original bylaws state that we are building a “digital archive of library materials converted from the print collections of the member institutions.” In proposed revisions, our purpose would be to collect “digital content of value to scholars and researchers, including a variety of formats and born-digital materials.” There was general support for these edits, though some members asked for further clarification on other points. We are finalizing the new text and it will be presented to the members for a vote in the near future.

Assuming these changes to the bylaws are passed, we will have to think about what it means for HathiTrust to collect the record of human knowledge in “a variety of formats.” Obviously we must pursue partnerships with publishers and other organizations to collect newly published materials in born-digital format. We made a start with that by collecting newly published university press books made openly accessible through the Knowledge Unlatched pilot project. But this is only a start, and we must be ready to collect material from other sources. In this regard, the discussions around future funding for scholarly monographs remain very important to monitor.

In our first several years we did undertake a pilot project that collected images in HathiTrust, and had plans for a pilot for audio materials that we did not complete. During an open discussion period in Washington I asked “How important are non-text formats for HathiTrust?” and the responses varied. No one disputed their importance, but some cautioned on the timing. For certain members they are critical. These members believe that we must better support visual and graphical materials, including those found in the books in our existing collection, as well as materials at-risk or otherwise less accessible in our archives and special collections. Some observed that as a body of materials, the government publications--on which we are so heavily focused--are and have always been multi-format. However, others cautioned that we still have much yet to do with the textual materials we’ve collected, and that there are other types of text collections we haven’t touched, such as newspapers. We should, in this view, not lose sight of what we do well and be mindful of the resources required to expand into new formats.

Making clear choices about what you are not going to do can be powerful. Our success stems in part from our clarity and focus on text over the last six years. We’ve now developed great capacity and expertise in managing re-formatted print/text collections, and I am a strong believer in playing to your strengths. Expanding beyond text might be seen to diminish that focus. Although new format choices implicate development and would affect our resource allocations, “What formats?” is not the only question. What are we trying to achieve, and what types of future access do we need to envision?

Of course, books do not exist in a vacuum, and even a text-bound collection must in the future be able to connect its materials with users regardless of their working environment. Works of fiction, poetry, and other creative genres found in HathiTrust can be related to letters, draft manuscripts and other materials in archives around the world. The long-form arguments embodied in monographs are dependent upon those of other books, as well as articles in serials, primary source documents, collections of data, and so on. Virtually anything can be evidence in a scholarly argument, and for two decades now we have seen many experiments in multi-modal scholarship that attempt to make these relationships between argument and evidence manifest and seamlessly available. In what way can we prepare our infrastructure to connect to, if not collect, those related materials? As our friends at OCLC Research have observed, the “evolving scholarly record” has become more heterogeneous and parts are at risk due to fragmentation in our mechanisms of management and preservation. We will have to address this format question squarely in the coming year, but we will do so in the context of our overall mission, the services we can build together, and related strategic issues. Earlier this year the Program Steering Committee began outlining some issues related to collecting non-text formats. This is only a start of the discussion, and this issue is also in the charge of the newly re-charged Collections Committee.

It’s a very different world now than when we began and clearly it’s time for longer-range planning at HathiTrust. In 2008 “mass digitization” was still less than four years old, opinion about its value was mixed, and its future was uncertain. That is the moment we came from, but as we start 2015 we have many new venues in which to work on these problems as a collective. These include, among others, DPLA, the Digital Preservation Network (DPN), and Academic Preservation Trust (APTrust). These initiatives and others can also transform discovery, preservation, and access to the diverse scholarly products of our researchers and students, especially if we are coordinating our strategies. When we began HathiTrust some commentators doubted that we could be successful, but our success has partially enabled such a flourishing ecosystem of digital library infrastructure. Precisely because of our success, HathiTrust has a special obligation to work with others to help bring “coherence” (to borrow a term) in this environment.

Whatever we do, these issues need to be addressed from multiple perspectives and with the needs of the membership at the center of the discussion. At the Member Meeting, and in private conversations I have had, some representatives have urged that we undertake our future development initiatives in the most inclusive and transparent manner possible without interfering with our agility. These are important and natural concerns. Our governance structures, including the Board of Governors, the Program Steering Committee, and various working groups, are providing mechanisms for this. For example, the PSC will work on creating processes for identification and evaluation of proposals for major new technical or service developments. In our startup years we have drawn heavily on the resources of the University of Michigan Library. But in 2013 we launched Zephir, developed and operated by the University of California’s California Digital Library to manage metadata for the repository, and the HathiTrust Research Center is co-located at two of our member institutions. HathiTrust increasingly must stand up on its own and continue to draw upon the expertise of all of its members, enabling our libraries to build and offer their own services based on the HathiTrust collections and platform. Some attendees at the Member Meeting offered ideas aimed at making this possible, such as “microgrants” to fund investigations or research and development beyond the scope of the Research Center. Others expressed their hope to form a strong HathiTrust community, and want to see opportunities for member institutions to share programs or projects they’ve initiated based on the HathiTrust collection and services. These are great ideas to explore, and there are others found in the full report on the Member Meeting. I welcome others from you now and at any time. HathiTrust is a partnership focused on sharing responsibility for preserving and curating our resources, and your involvement is necessary.

What's in your collection?

hathitrust — Wed, 06 Jun 2012 19:23:31 +0000

Are you familiar with the Collections area of HathiTrust?

http://babel.hathitrust.org/cgi/mb

There are currently 940 public collections created by users and library staff members at partner institutions, including several we have featured. HathiTrust collections provide a way to aggregate digital items related to a common theme, or associated with a given physical collection or location (for instance, the University of Michigan has created a collection of its Hatcher Graduate Reference reading room). Items can be added to a collection from HathiTrust full-text search results pages. Once they have been added to a collection, the full-text and bibliographic metadata of items can be searched independently of the larger repository. Items in collections can also be quickly copied to new or existing collections. These features make collections an easy way to refine a set of search results, share batches of items with others, or (in the case of the Michigan Graduate Reference collection), allow staff and users to search within specific collections to find the book with that one particular index or obscure term that they can pull from the shelf for more information.

Staff at some of our partner institutions have been talking about how great it would be to have even more high-quality collections to help demonstrate the usefulness of this feature (and be used!). We'd also like to explore how this kind of feature could better support library needs.

It’s easy to create a collection

Once you are logged into HathiTrust (either as a member of a partner institution or using a University of Michigan Friend Account), you can easily create collections when viewing a volume or from full-text search results as shown below:

We can help

Because large collections can be somewhat cumbersome to create manually, we can work with you to help build them! To create a custom collection, we need to know the specific items to include. We can work from a list of item identifiers, or from one or more search queries. Item identifiers for large collections, or for collections made from criteria that are not easy to search for, can be obtained using one of the methods below:

HathiTrust's tab-delimited metadata files. These files are an inventory of repository holdings, containing a variety of identifiers for volumes (ISBN, LCCN, OCLC, etc.), copyright information, and limited bibliographic metadata for each volume in HathiTrust. A description of the files is available at http://hathitrust.org/hathifiles_description.
HathiTrust Data API. In addition to retrieving entire volume packages from HathiTrust (including images and OCR), the Data API can be used to find ids for volumes digitized from a particular source. The University of Michigan has built a demonstration application using the Data API that illustrates how this can be done. Please see http://www.lib.umich.edu/two-over-threehundred.
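As a rough illustration of the first method, a tab-delimited inventory can be filtered for item identifiers with a few lines of Python. This is a sketch under assumptions: the column names and sample rows below are invented for the example and are not the real file layout, so check the file description linked above for the actual fields.

```python
import csv
import io

# Assumed column names for illustration only; the real files define their own.
COLUMNS = ["volume_id", "access", "oclc", "title", "language"]

def select_ids(tsv_text, predicate, columns=COLUMNS):
    """Return the identifiers of rows for which predicate(row) is true."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=columns,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in titles literally
    )
    return [row["volume_id"] for row in reader if predicate(row)]

# Made-up sample rows standing in for a downloaded metadata file.
sample = (
    "vol.0001\tallow\t1234\tA genealogy of the Smith family\teng\n"
    "vol.0002\tdeny\t5678\tModern chemistry\teng\n"
    "vol.0003\tallow\t9012\tGenealogie der Familie Braun\tger\n"
)

ids = select_ids(
    sample,
    lambda row: row["access"] == "allow" and "geneal" in row["title"].lower(),
)
```

For a real file you would pass an open file handle to `csv.DictReader` instead of an in-memory string, then send us the resulting list of identifiers.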

Custom collections

Here are some examples of collections that have been custom-built. If you haven’t yet become familiar with our Collections feature, give it a try. If you are, and have some great ideas for collections but have had trouble making them, give us a holler and we can point you in the right direction or help you create it.

Collections built from one publication -- United States Congressional Serials Set: http://babel.hathitrust.org/cgi/mb?a=listis;c=1597493732
- Given the title, we were able to locate all items associated with that title in HathiTrust and build a collection.
Collections built from a search term in the HathiTrust catalog -- Ancestry and Genealogy: http://babel.hathitrust.org/cgi/mb?a=listis;c=332123463
- This collection was based on a catalog search for full view items where "genealogy" occurred anywhere in the bibliographic record. The owner has added items individually since the collection was created.
Collections built on holdings information in a partner catalog -- UM Hatcher Graduate Reference: http://babel.hathitrust.org/cgi/mb?a=listis;c=30688098
- The list of ids for this collection was assembled using location data from the University of Michigan Library.
Collections built from analysis -- English Short Title Catalog: http://babel.hathitrust.org/cgi/mb?a=listis;c=247770968
- HathiTrust is collaborating with the ESTC to determine which volumes in a candidate set (English-language volumes published before 1800) are both in the ESTC catalog and in HathiTrust. The matching volumes are included in the ESTC collection.

Note that once collection(s) are built, we will transfer ownership to the requester so the collection(s) can be updated and maintained.

Please contact feedback@issues.hathitrust.org with any questions or to get started!

When a simple search just won't do

hathitrust — Thu, 26 Apr 2012 14:07:57 +0000

By Heather Christenson, HathiTrust Communications Working Group

With over 10 million volumes, and full text search free from commercial results ranking, HathiTrust is a go-to place for researchers who are serious about exploring the research library collection. Many technologists in our community who follow the HathiTrust Large-scale Search Blog are aware of the work that has been going on since full text search went into beta and then live on the HathiTrust site in 2009. We’re pleased to report that in 2012 HathiTrust continues to make progress with the implementation of more new advanced search features. With the leadership of Tom Burton-West at the University of Michigan, there have been two new feature releases in the past few months.

In February we released the first part of the advanced search interface for HathiTrust full-text search.

Advanced search allows users to combine a full-text search with searches within specific fields such as Title, Author, or Subject. For example, if you want to find out where Charles Dickens used the phrase "the best of times" you can search for: [All of these words] [Dickens, Charles] in [Author] AND [This exact phrase] [the best of times] in [Just Full Text]
The advanced search interface also allows users to set limits by publication date, format, or language. Multiple languages or formats can be selected.

We have now released the second phase of advanced search.

Users can now combine up to four different fields connected by the "AND" or "OR" operators, and any limits set are retained if you click on the "Revise this advanced search" link on the search results page.
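One way to picture how such a search is put together is the sketch below. It is not the HathiTrust implementation or its query syntax: the field names, the `render` function, and the Lucene-like output string are assumptions chosen only to show up to four field clauses joined by "AND" or "OR".

```python
MAX_CLAUSES = 4  # the interface described above combines up to four fields

def render(clauses, operators):
    """Render (field, match, text) clauses into one illustrative query string.

    match is "all" (all of these words) or "phrase" (this exact phrase).
    """
    if not 1 <= len(clauses) <= MAX_CLAUSES:
        raise ValueError("between one and four clauses are supported")
    if len(operators) != len(clauses) - 1:
        raise ValueError("need exactly one operator between each pair of clauses")
    if any(op not in ("AND", "OR") for op in operators):
        raise ValueError("operators must be AND or OR")

    parts = []
    for field, match, text in clauses:
        if match == "phrase":
            body = '"' + text + '"'
        else:
            # "All of these words": every word must appear, in any order.
            words = text.replace(",", " ").split()
            body = "(" + " AND ".join(words) + ")"
        parts.append(field + ":" + body)

    query = parts[0]
    for op, part in zip(operators, parts[1:]):
        query += " " + op + " " + part
    return query

# The Dickens example, with hypothetical field names.
query = render(
    [("author", "all", "Dickens, Charles"),
     ("full_text", "phrase", "the best of times")],
    ["AND"],
)
```

Keeping the clauses as structured data until the last step is also what makes a "revise" feature simple, since the form can be refilled from the same list.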

For those moments when a simple search just won’t do, we encourage you to give it a try!

Go to Advanced Full-text Search!

Ten Million and Counting

hathitrust — Fri, 06 Jan 2012 08:23:11 +0000

HathiTrust reached a major milestone on January 5, 2012, exceeding 10 million volumes in its digital collections. More than 2.7 million of these volumes are in the public domain, with viewing and downloading options available online. Statistics about the collections and a graph charting growth over time are available below (see also Statistics and Visualizations). We have also prepared a timeline noting significant events on our way to 10 million volumes. As of January 5, 2012, 23 of HathiTrust's 67 partners are depositing content in the repository. Details on contributions by institution can be found in our monthly updates. See also our News and Publications page for press releases, papers, presentations, and more about HathiTrust over the last several years.

Copyright Distribution by Type

Copyright Distribution by Date

Volume Distribution by Date

Volume Distribution by Language (1)

Volume Distribution by Language (2)

Growth Over Time

Timeline

January 2008

First formal multi-institutional commitments made to building HathiTrust

March 2008

First instance of HathiTrust repository infrastructure in place in Ann Arbor, Michigan
Storage purchased for second instance of repository in Indianapolis
University of Michigan coordinates site visit by a team from DRAMBORA
- Results of the DRAMBORA review were published as

Seamus Ross, Andrew McHugh, Perla Innocenti, Raivo Ruusalepp: Investigation of the potential application of the DRAMBORA toolkit in the context of digital libraries to support the assessment of the repository aspects of digital libraries, Glasgow: DELOS NoE, August 2008, ISBN: 2-912335-41-8

April 2008

Loading and testing of Google-digitized content from the University of Wisconsin begins
Preparations begin to establish second instance of repository in Indianapolis

May 2008

Testing of Lucene/Solr begins to provide full-text search across the repository
PageTurner application released with specialized accessible interface, allowing reading and full-text searching of individual volumes in the repository

June 2008

Lucene/Solr installed on development and production servers
Collection Builder application released

July 2008

Ingest of content begins from the University of Wisconsin
Tab-delimited metadata files are made available to facilitate local loading of HathiTrust bibliographic records
- Read more about HathiTrust Data Availability and APIs

August 2008

HathiTrust “about” website is released, including information about HathiTrust compliance with criteria for Trustworthy Digital Repositories (TRAC) and other documentation
Benchmarking for full-text search indexing begins

September 2008

Plans initiated to enable distributed development of applications and services by partner institutions
- 3-prong strategy: to enable access to the PageTurner via an API, to create a development ‘sandbox’ for shared development, and to develop a public discovery interface for the repository

October 2008

HathiTrust formally launched, including the institutions of the CIC, the University of California system, and the University of Virginia
- See the press release
Storage installed at Indiana site and an additional 90 TB of storage is installed at both instances, bringing capacity at each site to 190 TB
Public beta full-text search application released, allowing full-text search of 500,000 volumes

November 2008

Data synchronization between Michigan and Indiana sites is completed and routinized

December 2008

Agreement concluded with OCLC to create discovery interface for HathiTrust
Indiana site becomes fully operational mirror of storage at Michigan site

January 2009

Load testing for full-text search begins

February 2009

Work begins on temporary beta catalog interface for HathiTrust

March 2009

Redundancy (in Indiana) for Web hosting infrastructure and full-text search indexing is established
Sample datasets containing full-text OCR of repository volumes are made available to researchers
New storage purchased, bringing total capacity at each site to 320 TB

April 2009

Temporary beta catalog released
Ingest of Google-digitized content from Indiana University and the University of California begins

May 2009

HathiTrust Research Center and Collaborative Development Environment working groups launched
- The groups are charged to develop specifications for a HathiTrust Research Center and establish collaborative development environment for HathiTrust repository, respectively
Alpha version of Data API released
Michigan ingests legacy digital collections into the repository to pilot non-Google ingest

June 2009

California Digital Library begins work on improvements to PageTurner application
A record 379,000 volumes are ingested in June

July 2009

Working group formed to investigate need for 3rd instance of storage

August 2009

Report released on HathiTrust Disaster preparedness
HathiTrust releases METS profile version 1.0
- See HathiTrust Digital Object Specifications

September 2009

University of Michigan Press opens access to backfile publications in HathiTrust
UM and CDL staff begin collaboration for ingest of Internet Archive-digitized materials
Michigan staff contribute common-grams code to Solr code base

October 2009

Ingest of content begins from Penn State
Ingest of content begins from UC Santa Cruz and UC San Diego
A record 553,963 volumes are ingested in October

November 2009

Full-text search released (across 4.6 million volumes)
- See the Full-text Search Blog

December 2009

Columbia University joins HathiTrust
Center for Research Libraries begins audit of HathiTrust for compliance with TRAC
- See the HathiTrust TRAC documentation for information and results.
HathiTrust Bibliographic API released
HathiTrust begins work to implement Shibboleth
- View information about Shibboleth in HathiTrust
Redundancy of search index established at Indiana site

January 2010

Executive Committee approves new pricing model for HathiTrust
- The new model allows participation of institutions that do not have large amounts of digital content to contribute. View the new pricing model FAQ.
Storage Working Group submits final report to Executive Committee

February 2010

Sample of IA-digitized volumes from UC ingested for testing
Ingest of Google-digitized volumes begins from the University of Minnesota
Full-text search index exceeds Solr/Lucene's limit of 2.1 billion unique terms
- Lucene core developer Michael McCandless creates patch allowing up to 274 billion. View the full-text search blog post.

March 2010

UM staff receive samples of locally-digitized materials from several CIC institutions (Iowa, Illinois, Northwestern) to begin working on scalable mechanisms and processes for ingesting locally-digitized content
OCLC begins loading records for HathiTrust volumes into WorldCat

April 2010

Ingest begins of an initial set of nearly 100,000 IA-digitized volumes from the University of California

May 2010

New York Public Library joins HathiTrust
HathiTrust passes 6 million total volumes and 1 million volumes in the public domain
Executive Committee launches Communications Working Group

June 2010

HathiTrust enables authentication via Shibboleth
- In the short run, this allows partners to download full PDFs of all public domain materials in the repository and use the Collections application through a local sign-on. Implementation of Shibboleth paves the way for future partner services, such as expanded access to in-copyright materials.
Full-text search index is mirrored at Indiana site

July 2010

Yale University Library joins HathiTrust
Strategic Advisory Board launches Collections Committee
Executive Committee launches User Experience Advisory Group
Collection-building functionality integrated into full-text search

August 2010

Princeton University Library joins HathiTrust
Ingest of Google- and Internet Archive-digitized volumes from Columbia University begins
HathiTrust adds 160 new TB of storage bringing total capacity at each site to 475 TB
October 31 deadline announced for joining HathiTrust to participate in "constitutional convention" of partners in 2011

September 2010

The Triangle Research Libraries Network and Dartmouth College join HathiTrust
Ingest of content begins from New York Public Library and the University of Illinois

October 2010

HathiTrust announces the 52 partners that will take part in 2011 Constitutional Convention
- Newly announced partners include:
  - Baylor University
  - Emory University
  - Harvard University Library
  - Johns Hopkins University
  - Library of Congress
  - Massachusetts Institute of Technology
  - New York University
  - Stanford University Library
  - Texas A&M University
  - Universidad Complutense de Madrid
  - University of Maryland
  - University of Pennsylvania
  - University of Pittsburgh
  - University of Utah
  - University of Washington
  - Utah State University
Image ingest pilot begins
- The University of Minnesota, Minnesota Historical Society, and Minnesota Digital Library begin working with staff at Michigan to develop a prototype workflow for depositing images and associated metadata into the HathiTrust system for access, storage, and preservation purposes. Read more about the project.
California Digital Library begins work on a new bibliographic data management system for HathiTrust
Discovery Interface Working Group charges Full-text Search sub-group
Ingest begins of content from Princeton University and the University of Chicago
Collaborative Development Environment is released, used actively for development, testing, and release of code for HathiTrust systems

November 2010

Ingest from Cornell University begins

December 2010

Policy and specifications framework for ingest of locally-digitized materials is finalized
HathiTrust begins working with CIC institutions on ingest of locally-digitized content

January 2011

OCLC releases WorldCat Local prototype catalog for HathiTrust
HathiTrust ingests nearly 60,000 images and associated metadata from the University of Minnesota and partners
HathiTrust adds support for rights holders to open access to works with Creative Commons licenses
- The Brooklyn Museum, Society of American Archivists and many others are early adopters. View the rights holder Permissions Agreement.

February 2011

HathiTrust makes datasets of public domain materials available on a large scale
- See HathiTrust Datasets for more information

March 2011

HathiTrust certified by the Center for Research Libraries as a Trustworthy Digital Repository
- See HathiTrust’s TRAC documentation
Ingest from the Library of Congress begins
HathiTrust signs agreement with ProQuest to make the HathiTrust full-text index available via Serials Solutions' Summon service
Executive Committee launches User Support Working Group

April 2011

HathiTrust releases new viewing functionality in PageTurner application
- See the Update on April 2011 Activities for details
Ingest from Harvard University begins
HathiTrust concludes first storage replacement cycle, replacing storage purchased in 2007
Planning begins for the HathiTrust Constitutional Convention

May 2011

HathiTrust begins investigation to identify orphan works in HathiTrust
Ingest of content from University of Virginia begins

June 2011

Boston University and Lafayette College join HathiTrust
UM announces plans to provide access to orphan works to partner institutions
The HathiTrust Research Center is launched, led by Indiana University and the University of Illinois
HathiTrust begins ingest of materials digitized by Yale University Library
"Perspectives on HathiTrust" blog is launched, with inaugural post on HathiTrust and Discovery by John Wilkin

July 2011

The University of Notre Dame and University of Florida join HathiTrust
3-year review of HathiTrust is posted on the HathiTrust website and distributed to partners
- The 3-year review was prepared by Ithaka S+R with oversight by the Strategic Advisory Board in advance of the Constitutional Convention to lay the groundwork for discussions about HathiTrust’s future. View the 3-year review and the Constitutional Convention information page.
HathiTrust posts the first set of orphan candidate works
HathiTrust releases improvements to the Collections application interface and full-text search
- Improvements to full-text search include the 2 highest priorities from a full-text search features analysis prepared by the Full-text Search Working Group: the incorporation of bibliographic metadata into the full-text index to allow faceting of results by bibliographic data and improved search results ranking.
First version of partner print holdings database released
- The holdings database is to act as the basis for the new pricing model, and expanded access to in-copyright materials for members of partner institutions. See the Update on July 2011 Activities for more information.
The HathiTrust Research Center receives a $600,000 grant from the Sloan Foundation to investigate “non-consumptive” research
- The term “non-consumptive” was first used in the proposed Google Settlement to refer to computational research performed on in-copyright works in such a way that significant reading or "consumption" of the works does not occur.

August 2011

University of Connecticut joins HathiTrust
Cornell, Duke, Johns Hopkins, Emory University, and the University of California system announce participation in the Orphan Works Project
- View information about the proposed terms of access to orphan works. See also the Orphan Works Project page on the University of Michigan Library website. Note: No orphan works are currently available in HathiTrust (as of January 6, 2012).
Proposal to establish print monographs archive distributed to partners
- The proposal is submitted by the Collections Committee for the Constitutional Convention. View the final accepted proposal and the Constitutional Convention information page.
HathiTrust releases mobile interfaces for catalog and PageTurner applications
HathiTrust begins ingest of rare books and incunabula digitized by Universidad Complutense de Madrid
HathiTrust begins working with the University of Pittsburgh and University of Utah on ingest of locally-digitized materials
HathiTrust begins ingest of Utah State University Press backfile publications, to be made available in HathiTrust on an open access basis
HathiTrust begins ingest of Google-digitized volumes from Northwestern University and Purdue University, and Internet Archive-digitized volumes from North Carolina State University
HathiTrust concludes agreements with OCLC and EBSCO to make the HathiTrust full-text index available via their discovery services

September 2011

The University of Connecticut and University of Missouri join HathiTrust
HathiTrust, Google, and Duke University Press sign agreement to open access to DUP backfile volumes in HathiTrust under Creative Commons licenses
The Authors Guild and others file a lawsuit against HathiTrust alleging copyright infringement
- View information about the lawsuit
HathiTrust begins working with the University of Florida and the University of North Carolina-Chapel Hill on ingest of locally-digitized materials
Partners submit final ballot proposals for the Constitutional Convention. 7 are submitted in all.
- View the proposals and the Constitutional Convention information page.

October 2011

The University of Miami and University of Arizona join HathiTrust
The Constitutional Convention takes place; 5 out of 7 ballot initiatives are passed
- View the blog post about the Convention, and the Constitutional Convention information page which includes notes from the Convention.
Ingest of Internet Archive-digitized content begins from Duke University and University of North Carolina-Chapel Hill

November 2011

Boston College joins HathiTrust
The University of California begins offering reprints of UC-digitized public domain materials via HathiTrust
The User Experience Advisory Group releases HathiTrust User Personas

January 2012

HathiTrust reaches 10 million volumes

February 2014

HathiTrust reaches 11 million volumes

Personas: Understanding HathiTrust Users

hathitrust — Fri, 16 Dec 2011 14:46:39 +0000

By Jenny Emmanuel, HathiTrust User Experience Advisory Group

The HathiTrust User Experience Advisory Group recently released a set of “personas” depicting typical users of HathiTrust Digital Library. Personas are aggregate statements that display information about typical users and their needs. They are a commonly used usability method: data is collected from multiple sources, including website analytics, search logs, first-person stories, and researcher observations, and then aggregated into narratives depicting who HathiTrust users are and how they use the information within HathiTrust.

Personas are typically used throughout the development process so that staff working on the HathiTrust interfaces, staff communicating with users, and librarians can have a shared idea of who uses HathiTrust. With personas, they can easily keep the end user in mind while they are improving HathiTrust, developing support materials, or developing user education programs, among many other uses.

The HathiTrust User Experience Advisory Group worked on the personas for several months, with the additional help of the University of Michigan’s User Experience Department. The group gathered information from analytics, anecdotes from HathiTrust partners, various online publications about HathiTrust (blogs, articles, comments, etc.), reports from similar projects, and user feedback to identify major groups who use HathiTrust for research. These data sources led to the creation of seven distinct groups of both academic and non-academic researchers, each of which became the basis of one of the personas. The collected data was then collated for each of these groups and written up as a narrative with a generic use case, a given identity, and a stock image to give each persona a personal touch. Even though the personas appear to describe actual people and actual use, each persona is fictional, though supported by collected evidence.

The HathiTrust personas are used to guide further development of HathiTrust. They are also being used by the HathiTrust Communications Working Group in its publicity and educational materials related to HathiTrust.

To view the personas, see: http://www.hathitrust.org/personas.

Is that the library in your pocket?

hathitrust — Sat, 03 Dec 2011 20:40:02 +0000

By Suzanne Chapman, Chair, HathiTrust User Experience Advisory Group

Looking for books to read on your shiny new tablet or other mobile device? This fall we officially released a mobile version of the HathiTrust Digital Library. The mobile site offers mobile-friendly access to key functionality including searching the HathiTrust catalog and reading HathiTrust "Full view" texts. Users from HathiTrust partner institutions can also download these "Full view" texts in PDF or ePub format to allow reading offline. Since the mobile interface is web-based, it works on all platforms and may be viewed either from mobile devices or from desktops and laptops. The interface has special functionality for tablets with two ways to read texts: either in the vertical scrolling format, or in a horizontal flip format.

Please give it a try and let us know what you think!

http://m.hathitrust.org/

Many thanks to the University of Michigan Library User Experience Department for designing and developing this exciting new interface.

HathiTrust Constitutional Convention on Record

hathitrust — Thu, 10 Nov 2011 16:02:38 +0000

On October 8-9, 2011 delegates from across the U.S. and around the world gathered in Washington, DC for a landmark event, the HathiTrust Constitutional Convention. Our goals were to review the work and accomplishments of the now 3-year-old HathiTrust, and chart its future governance and priorities. Before the group were seven different ballot proposals that had been submitted by HathiTrust partners ahead of the meeting. On a beautiful autumn weekend, the delegates headed indoors, gathered around tables, and deeply engaged in the proceedings and discussion.

As a result of these proceedings, HathiTrust:

Will establish a governance structure consisting of a Board, a Board Executive Committee, and Board-appointed committees, and will articulate bylaws
Will formalize a transparent process for inviting, evaluating, ranking, launching and assessing development initiatives
Will establish a shared print monograph archiving program among the member libraries
Will expand and enhance access to U.S. federal publications including those issued by GPO and other federal agencies
Will develop and vet a fee-for-service model to allow contribution of content from non-partner entities

The Convention was also an opportunity to celebrate the achievements of HathiTrust in less than three short years: over 60 partners, infrastructure that preserves and makes discoverable close to 10 million volumes, and the HathiTrust Research Center that will enable new forms of research.

For a full account of the proceedings, please consult the official minutes of the Constitutional Convention.

HathiTrust's Past, Present, and Future

hathitrust — Mon, 17 Oct 2011 14:27:17 +0000

Opening remarks given at the HathiTrust Constitutional Convention, October 8, 2011 (view presentation slides)

By John Wilkin, HathiTrust Executive Director

Think back to 2004 and the conversations going on in our community around digitization and the challenge of making big things happen at the intersection of our institutions. Digitization on a grand scale was 10,000 volumes, and we rejected any notion of digitizing a large corpus of materials like US federal government documents for countless reasons. In the years since our 2005 announcement that we were undertaking digitization on a large scale, our community, in collaboration with Google and the Internet Archive, has digitized over half of the collective holdings of ARL libraries. Three years later, we launched HathiTrust, an organization that facilitates collective action on a grand scale. Seldom has so much in our world changed in such a short time. Together, we have utterly transformed parts of the library landscape.

My plan today is to talk about HathiTrust’s past, present and future. Don’t worry—I won’t do a history of HathiTrust. My discussion of the “past” will be primarily about the organization’s early accomplishments, and begins with a review of our Short- and Long-Term Functional Objectives. I’ll then talk briefly about a few things in the HathiTrust pipeline, and finally conclude with an overview of some of the larger changes that have taken place since 2008. A point I’d like to emphasize now and throughout is that this is a “libraries writ large” success story. What has happened is something that we accomplished collectively. This is not a story of an external organization—Google, a government agency, or some external champion—doing something for us. This is our story, and one that we need to understand and celebrate.

Short- and Long-Term Functional Objectives

In those early, heady days of HathiTrust, the first partners established a list of Short- and Long-Term Functional Objectives. These objectives were not meant to encompass all of HathiTrust development, but were a vehicle to articulate goals for a quickly emerging organization, a way to give some initial direction until other mechanisms could create a more nuanced roadmap. We needed to define goals in order to test responsiveness for this new organization.

Short-term

Page turner mechanism
Branding (overall initiative; individual libraries)
Format validation, migration and error-checking
Development of APIs that will allow partner libraries to access information and integrate it into local systems individually
Access mechanisms for persons with disabilities
Public ‘Discovery’ Interface for HathiTrust
Ability to publish virtual collections
Mechanism for direct ingest of non-Google content

Long-term

Compliance with required elements in the Trustworthy Repositories Audit and Certification (TRAC) criteria and checklist
Robust discovery mechanisms like full-text cross-repository searching
Development of an open service definition to make it possible for partner libraries to develop other secure access mechanisms and discovery tools
Support for formats beyond books and journals
Development of data mining tools for HathiTrust, and use by HathiTrust of analysis tools from other sources

For every one of these functional objectives, HathiTrust has delivered something meaningful to the partnership. It’s worth noting that some of these objectives were monumentally difficult and there was absolutely no certainty that we would succeed in all of them. In the end, what we accomplished was the creation of a rich, open system with a nuanced understanding of rights and the ability to deliver various forms of content to different audiences in different ways. All of the content in HathiTrust is discoverable with a superb balance of precision and recall, and the services we offer around the preservation of the content are without peer.

Although I won’t cover the Functional Objectives in detail, I would like to highlight three of the more ambitious accomplishments: our TRAC certification, the full-text cross-repository searching, and the creation of a research center.

HathiTrust is only the second repository (after Portico) to receive certification by CRL. HathiTrust’s process for certification involved countless hours of staff work developing processes and products, and creating and providing documentation. And that is as it should be. Certification is all about accountability and openness, and we can take pride in obtaining it. We are a distinctive type of organization, not analogous to OCLC or Portico, and our organizational distinctiveness tends to confound those who want to see a central office and central staff. It was important to document for CRL the large commitment of staffing across the partnership to help them understand that HathiTrust is not apart from us, but rather a part of us—that HathiTrust is not separable from our institutions. We excelled in the technical components of the review, but CRL has lingering questions about the organization. I believe we too have lingering questions about the organization. We want this effort to be part of us and not separate, and there are few models of how to make that work. This tension between something central and something that we are all a part of will, I believe, be a leitmotif in our meeting over the next three days. I think we’ve made good progress and that we’ve created a productive and healthy tension. {At this juncture, I’d like to pause to introduce Heather Christenson, the chair of the HathiTrust Communications Working Group. We owe this group a great debt of gratitude for showcasing our successes, but this group also highlights the value of the inter-institutional work and the tension it creates.}

A second grand accomplishment I’ll highlight is the creation of a viable full-text search mechanism that works with all of the content in the repository. I hope no one here is so jaded as to think that full-text searching across millions of volumes is a slam-dunk. Many were skeptical, and I can’t tell you how many calls I fielded from vendors telling me that what we were attempting was impossible—or at least impossible without their help. The effort required a large amount of research and testing, and what we learned required deep collaboration with the broader community of developers working on the Apache Solr search engine project. The resulting service is sensitive to the amount of content—unparalleled in size—to the hundreds of languages and character sets, and to requirements like phrase searching that reflect the distinctive ways users approach a vast and diverse library collection. Our users can now search over 3 billion words and get results in a split second. Collective work in the partnership has produced faceted results in our full text, and ranking that takes bibliographic information in the full text into account. The functionality that we have today is tremendous, and it provides a foundation for a next generation of search that gives our users access to bibliographic information where needed, and full text where desired.

The creation of a research center is a very different kind of example and helps underline the value of collective action. Indiana University and the University of Illinois assembled the cyberinfrastructure resources to create a research center supporting uses of the HathiTrust collection. The consolidation of collections and institutional focus made HathiTrust a valuable partner for researchers at those two institutions. It was so valuable that they redirected institutional resources to create the infrastructure and leadership needed for this initiative—they created the research center at little or no cost to us. How much more compelling it is that the research center comes from faculty leadership (from those who would do the research), drawn to use of this immense library, rather than from us in support of those faculty. Indeed, because of their commitment and credibility, the research center has attracted significant funding from Sloan to deal with problems like security in use of the in-copyright materials, and I think we can expect them to be a magnet for other funding in the future. The research center will soon offer a platform for uses we could imagine but could not otherwise support. We’re accomplishing the functional objective of support for research uses of the data in a number of ways, including by distributing public domain data, but the creation of a research center was a significant win for all of us and comes as a result of our working together to create a compelling library resource.

Other accomplishments

Holdings and the New Cost Model

Our accomplishments in other areas are equally impressive, and equally reflective of HathiTrust’s role as a community resource. I hope that all of you are familiar with the work done by OCLC Research and Constance Malpas showing how HathiTrust’s collection overlaps with those of our libraries. The first results of that work show a median ARL overlap of 19% in June 2009 and 31% in June 2010. The overlap rate was remarkably constant from large to small ARL libraries. That is, by June 2010, nearly every ARL library could depend on finding approximately 31% of its collection online in HathiTrust. The rate of overlap continued to grow; I estimate that by June 2011 the median overlap rate had hit about 45%, and that it will reach something like 50% early next year. Remarkably, the numbers for non-ARL institutions and particularly the Oberlin Group libraries are even greater. Materials not ingested—materials from partners like Harvard, Virginia, the CIC and Stanford, and from non-partners like Texas—could increase that number to more than 75%. The breadth of our holdings is so significant that HathiTrust is being used as one of the key resources for the just-announced (Oct. 3, 2011) European serials preservation registry, The Keepers.

That any one of our libraries could find more than 50% of its collection digitized and online in HathiTrust creates real possibilities, and in this regard HathiTrust’s leadership shows vision and commitment. The new cost model, which is based on overlap, is designed to share the burden of archiving in ways that are reflective of the value we derive from the collection. Our institutions share the cost of in-copyright volumes where we hold corresponding print volumes; all members of the partnership share the cost of public domain materials evenly. In order to make that cost model work, we needed a holdings database and are very close to unveiling the first examples of calculations that result from that system.
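
The overlap-based split described above can be sketched in a few lines of Python. This is only an illustration, not HathiTrust's actual calculation: the flat per-volume cost, the even split of each in-copyright volume among the partners that hold it in print, and every name and number in the example are assumptions.

```python
from collections import defaultdict


def partner_costs(partners, public_domain_count, in_copyright_holders, cost_per_volume):
    """Return a dict mapping each partner to its yearly cost.

    partners: list of partner names (hypothetical)
    public_domain_count: number of public domain volumes
    in_copyright_holders: one list per in-copyright volume, naming the
        partners that hold the corresponding print volume
    cost_per_volume: assumed flat yearly cost of preserving one volume
    """
    costs = defaultdict(float)

    # Public domain: every partner pays an equal share.
    pd_share = public_domain_count * cost_per_volume / len(partners)
    for partner in partners:
        costs[partner] += pd_share

    # In-copyright: only holders of the print volume share its cost
    # (an even split among holders is assumed here).
    for holders in in_copyright_holders:
        if not holders:
            continue  # volumes with no print holder are outside this sketch
        share = cost_per_volume / len(holders)
        for partner in holders:
            costs[partner] += share

    return dict(costs)


example = partner_costs(
    partners=["A", "B", "C"],
    public_domain_count=300,
    in_copyright_holders=[["A"], ["A", "B"], ["A", "B", "C"]],
    cost_per_volume=1.0,
)
print(example)
```

In this toy run partner A, which holds print copies matching all three in-copyright volumes, pays the most, and partner C, which holds only one widely held volume, pays the least, while the public domain share is identical for all three.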

Collection overlap is an interesting phenomenon, with the various collections showing both important similarities and important differences. Focusing again on ARL institutions as the exemplar, you’ll see in the scattergram that we look remarkably similar in the rate of our overlap. However, as one might expect, the overlap profiles for a collection like Harvard’s and a collection like Lafayette’s are so different that they will mirror each other, with Harvard holding more print corresponding to HathiTrust volumes uniquely, and Lafayette holding more volumes in common with other institutions, with a smaller number of unique volumes. These are the extremes, but all institutions will have distinctive overlap profiles. Here are just a few examples: [SLIDES]. What this means, then, is that each institution’s cost will vary a great deal by size, of course, and by the nature of the collection. We’re at a point where I can give you a preview of what that will look like.

Costs are attributed to three elements of our preservation work: the public domain; in-copyright books; and serials. Keep in mind that all partners share the cost of the public domain equally. As of the end of September, we have 2.6 million public domain volumes and 62 partners; thus, the cost of the public domain and open materials comes out to $9,300 per partner. Based on our overlap data to date, the cost for in-copyright books ranges from a low of less than $1,000 per year to a high of about $75,000 per year. I’ve masked the institutional names in the data here because it’s still a bit early, but these numbers are largely right, and entirely based on holdings data. The high number is Michigan because Michigan’s collection is the source for so much digitized content. Institutions with low costs would be institutions like Merced and Lafayette, with smaller collections and sometimes less overlap. Finally, the cost of serials is preliminarily based on holdings at the title level, rather than the volume level. Here are the same institutions arrayed along an X-axis with costs for serials on the Y-axis. The sum of these three costs gives us a low cost of less than $15,000 per year and a high cost of roughly $200,000 per year. Bear in mind that this is a likely reflection of the general shape of 2013 costs, with the bulk of the institutions paying much less than $50,000 per year. As more content comes in, costs go up; as more institutions come on board, costs go down; and as time passes, many elements of cost go down because of declining costs in the technology. So far, this has created a fairly flat picture of cost year-to-year rather than a dramatically increasing cost.

What I’d like to emphasize here is not only a concrete sense of the costs for the partnership—what they’ll be and how we calculate them—but that we’re well down the path to having in place the infrastructure to do this work. That is, we have a collection that represents a broad, common set of needs—not just public domain works, but in-copyright works that aid us in managing our print collections. We have technology that understands questions of holdings and overlap, which can produce cost calculations and also serve as an access control tool. Although the technology and metadata will benefit from refinement (e.g., our individual serials data could use some work), the partnership now has a good start on something that has tremendous practical value for our institutions individually and collectively.

At this juncture, and before turning to other accomplishments, I’d like to pause to consider one of the bogeymen of the new cost model: some have wondered, “what if an institution joins HathiTrust and brings with it one million public domain volumes? Won’t that dramatically increase costs in uncontrollable ways?” Keep in mind the effect of scale, both of preservation costs and of the number of institutions. Adding one million public domain volumes increases each of our costs by less than $4,000 per year, with a corresponding benefit of access to a phenomenal amount of content. There’s nothing in the e-book marketplace that compares to this.
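
The figure of under $4,000 can be checked against the numbers quoted earlier in the talk (2.6 million public domain volumes, 62 partners, $9,300 per partner). The sketch below is a back-of-the-envelope calculation, assuming that cost scales linearly with the number of volumes and that the partner count stays at 62; the variable names are illustrative.

```python
partners = 62
public_domain_volumes = 2_600_000
cost_per_partner = 9_300  # dollars per year, as quoted in the talk

# Total yearly public domain cost implied by the per-partner figure.
total_public_domain_cost = partners * cost_per_partner
cost_per_volume = total_public_domain_cost / public_domain_volumes

# Marginal cost of one million more public domain volumes, shared evenly.
added_volumes = 1_000_000
added_cost_per_partner = added_volumes * cost_per_volume / partners

print(f"implied cost per volume: ${cost_per_volume:.3f} per year")
print(f"added cost per partner:  ${added_cost_per_partner:,.0f} per year")
```

Under those assumptions the increase works out to roughly $3,600 per partner per year, consistent with the "under $4,000" figure. If the new institution also counted as an additional partner, each share would be slightly smaller still.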

Publisher relations and publishing work

Never once in conceiving HathiTrust did we see this enterprise as being solely about digitized content: we believed that the digitized version of the published record provided an excellent foundation on which to add newly published materials in their original digital formats. To that end, we have set in motion three distinct efforts related to publishing:

Making it possible for rights holders to open access to works.
Making it possible for publishers to deposit digital master files for archiving and open access.
Making it possible for publishers to publish directly into HathiTrust.

The second and third initiatives are in their infancy, but all deserve a quick review.

In the first case, authors and publishers have opened thousands of works in an effort to share them more widely. Several presses, including university presses, and associations like ARL, have already opened substantial bodies of work with no expectation for compensation. They have relied on already extant files in the repository and have granted permissions where possible. Duke University Press recently announced an agreement with HathiTrust and Google, and will apply Creative Commons licenses to its materials, receiving in return digital files (from HathiTrust and with Google’s permission).

Using born-digital materials rather than digitized versions of the books can improve the quality of HathiTrust content and the user experience. One university press is already depositing PDFs of published content. We are in discussions with two academic presses regarding an agreement where, in return for open access to their materials, we will store and provide access to the archival version.

Finally, the University of Michigan’s MPublishing unit is working on a mechanism to publish open access content directly via HathiTrust. By binding together a publishing process informed by archival needs and an access mechanism informed by audience needs, they hope to build a system that makes an archival commitment to readers and libraries without losing the functionality needed for a credible publication. They hope to have the first iteration of this system available next year and to begin sharing their specifications and development process with partners following that.

Uses of in-copyright materials

We have made tremendous strides in facilitating lawful uses of in-copyright materials. US copyright law in particular contains clear provisions for such uses, in the form of limitations on the exclusive rights of copyright owners. We have legal and moral obligations to our users to provide services for these materials. And there have been important, untested questions that we need to explore as a community. I would like to briefly list work we’ve done to support access to in-copyright materials:

We have laid the groundwork for access to in-copyright works by users with print disabilities. Our technology incorporates Shibboleth for inter-institutional authentication, the holdings database as a check of a partner library’s purchases, and cooperation with campus offices that provide services to users with print disabilities. We are ready to launch this service, which will provide unparalleled access to millions of works by this small group of users at our institutions. Never before have persons with print disabilities had ready access to libraries of content this large. This will be one of our proudest accomplishments.
Again using the holdings database and Shibboleth, we will soon be able to provide access to works that meet Section 108 criteria (i.e., that the work is damaged, deteriorating, lost or stolen and is not available on the market at a reasonable price). At the very least, we can make it possible for partners to create print replacements; it is also the case that the DMCA gives us some leeway for digital access to these works. The infrastructure is in place and we will soon use Section 108 provisions in US copyright law to extend access.
And, famously, we will soon be testing the concept of Fair Use and our ability to serve the imperative of preserving the materials in HathiTrust. How could I give this talk without touching on the suit by the Authors Guild against HathiTrust and several of the partners? Despite the well-documented missteps in our first orphan works identification process, our ability to make these uses under Fair Use and our ability to store the digital copies as part of an overarching preservation strategy are two of the most important principles underlying the HathiTrust effort. The access mechanisms that we have developed (e.g., taking into account holdings of the partner institution and relying on authentication of users) are thoughtful and appropriately conservative. We have taken steps to define lawful uses without antagonism of or disregard for the interests of rights holders. This was an important step for the library community.
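The access checks described in this list (authentication, a holdings check, and either a print-disability certification or the Section 108 conditions) can be sketched as a single decision function. This is only an illustration of the logic as stated here, not HathiTrust's actual implementation: the class names, field names, and the shape of the holdings lookup are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical view of one access request."""
    authenticated: bool    # user signed in through the home institution (e.g., via Shibboleth)
    institution: str       # partner institution asserted at login
    print_disabled: bool   # certified by the campus office serving users with print disabilities
    volume_id: str


@dataclass(frozen=True)
class HoldingRecord:
    """Hypothetical holdings-database entry for one institution and one volume."""
    damaged_or_missing: bool     # damaged, deteriorating, lost or stolen
    replacement_available: bool  # obtainable on the market at a reasonable price


def may_view_in_copyright(req, holdings):
    """Return True if the request passes the checks described in the text.

    `holdings` is assumed to map (institution, volume_id) -> HoldingRecord.
    """
    if not req.authenticated:
        return False
    record = holdings.get((req.institution, req.volume_id))
    if record is None:
        # The partner library does not hold the work.
        return False
    if req.print_disabled:
        return True
    # Section 108 style condition: the held copy is unusable and no
    # reasonably priced replacement exists.
    return record.damaged_or_missing and not record.replacement_available
```

The point of the sketch is that every path to access runs through both authentication and a holdings match, which is what makes the approach conservative.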

Big issues

Creating a 10-million-volume digital repository in and of itself changes the library landscape, and these things I’ve just discussed do as well, in that they change our sense of who we are and what we’re doing: we have had a positive impact on our institutions, our users and on the profession. Additionally, there are several other developments worth considering as we look back at the last several years.

Our institutions are now pooling resources in ways we rarely saw in the past. We have pooled resources to solve the digital archiving problem, to address collection building, to perform collection analysis, and I hope we will soon do so to address print monograph storage issues. We have shifted our investments from funding spent in isolation to common pools of funds to solve common problems. Before someone accuses me of being historically myopic and draws the comparison to WorldCat, keep in mind that in HathiTrust our resource pooling replaces (rather than enables) local work, whereas WorldCat makes it possible to devote resources in our separate institutions more efficiently.
We have begun to mobilize resources and expertise from within the various partner institutions to deal with problems common to us all, such as copyright determination, digitization of government documents, and the refining of bibliographic information. These problems can’t all be met by pooling our resources; instead, we must rely upon our individual institutional resources and perspectives. The diversity of our resources and perspectives improves the quality of our work and so makes us all stronger. (Consider the example of the copyright expertise advisory group for the new grant, which has extraordinary talent, and talent that would not be assembled in one place.) The Copyright Review Management System is a good example of early collaboration, and now IMLS has funded us again for a much more ambitious effort to work on copyright determination for publications from around the world. We have used HathiTrust to galvanize the community to address problems collaboratively. If we can find a way to deal efficiently with metadata remediation—changes and improvements to our bibliographic records—this too will surely be done by working within various institutional contexts rather than by pooling resources.
And, finally, we have begun to approach the question of fair use in a large and coordinated way. For some time, libraries have recognized the need for coordinated action on best practices in order to bolster our use under this part of copyright law. A few of our institutions made bold and solitary moves, and the rest of us have tried to learn from the experience. Working together on this question of fair use does, I believe, position us to develop defensible best practices and establish a clear legal precedent. In the lawsuit brought by the Authors Guild, whether or not we win remains to be seen; that we undertook this work collectively is important and a big change.

In each of these cases, we can see new modes of collective action in libraries. Where it makes sense to pool resources, we do; where it makes sense to work together on common problems, we do; and where we need to act collectively to show a unified front, we do. These are important times.

Connecting the dots

Let’s pause for a moment to put all of this together.

Together, we have built a collection of nearly unparalleled size and richness. With our future work, it will only grow larger and richer.
We are devoting collective resources to getting a bead on what we actually have here: rights determination is the big example, but we’re beginning to see interest in bibliographic remediation, at least for things like government documents.
We are working to create a record of contemporary publishing within this corpus by working with publishers, and in some cases those publishers that are our libraries, our organizations and our university presses. We are doing that by getting permissions from authors and publishers to open access to materials, by striking deals with presses like the one we just signed with Duke University Press, and, importantly, we will soon be publishing via the repository.

We have charted a path forward for an increasingly comprehensive shared collection, a collection that contains a vast body of open materials, a collection that facilitates lawful uses, and a collection that houses new publishing. This is a collection we can use for many things—to gain a better understanding of the shape of the published record and our collections, to shape shared storage strategies, to rationalize our collections and to serve our users.

What next for the partnership?

The short answer to the question of where the partnership goes next is that it depends entirely on the discussions we will have over the next few days. In 2008 our intention was to get the effort off the ground, and then bring the community together in 2011 to plan next steps with a clearer understanding of what we might accomplish, and that’s where we are today. I hope that we leave the Constitutional Convention with a course charted for clearer, more collective governance and strategies for defining future priorities. In the meantime we—i.e., HathiTrust, our community—will continue to move HathiTrust forward. We will continue to enhance the systems you see today, providing better full text searching, supporting more functionality through the APIs, and adding more content. The holdings database and the cost model will be fleshed out, and we will all have a better sense of what our costs will be in 2013. These are important things.

I’d like to use this bully pulpit to share my personal opinion, and declare that it’s time to beef up the organization. We have made a good start in creating an organization that reflects our collective interests and I feel confident that with the right governance and leadership we can create a stronger HathiTrust without creating a new 501(c)(3) or intensively consolidating staff. To create a large, centralized organization would be to create a HathiTrust divorced from our institutional contexts. This is also an opportunity for me to suggest that it’s time for us to look for a full-time executive director. Although I’ve enjoyed this work immensely and feel proud of my accomplishments, I believe that a full-time, independent director, a visionary with strong organizational skills, will make it possible for us to build a stronger sense of community and more fruitfully talk to funding agencies, both things that can make HathiTrust all that much more durable. I’m not leaving this post today; however, I would like to urge the partnership to strengthen the core of HathiTrust by building a small central staff and hiring a director.

Closing

In closing, I’d like to return to this theme of the community and working collectively. As we know, so many of the challenges we face are shared challenges. Our metadata are not our metadata in isolation from each other; our collections are not our individual collections in isolation from each other; and many of the baseline services or capabilities we strive to offer are ones that all of us would like to offer in our institutions. The last several years have seen us move markedly in the direction of collective action on collective problems. Indeed, working collectively on collective problems makes it all the more feasible to create distinctive or tailored services for our individual campuses or communities. Whether we call it “group scale,” as Lorcan Dempsey does, or “working globally so that we can better deliver services locally,” HathiTrust is a remarkable example of collective action, of our community working together to solve a common problem. Although there are many rough edges and many things to work out, our first steps have been monumentally successful in beginning to change the work we do and the way we do it. This is a tribute to each of you and to your institutions: we did this as a community, and we did it because it made sense. I hope we’ll reconvene every few years to ponder where we’ve come from and where we go next, and that we will look back on this moment as a powerful example of the changes we can effect for our users and for the profession.

HathiTrust and Discovery

hathitrust — Fri, 24 Jun 2011 12:41:33 +0000

By John Wilkin, Executive Director, HathiTrust

It is a core tenet of HathiTrust that preservation cannot take place without access. The coupling of preservation and access is both philosophically and strategically central to HathiTrust’s mission, as awareness of the materials in our collections helps to create the value that leads to preservation. And because discovery is integral to access, HathiTrust has worked hard on a multi-pronged strategy for discovery.

Key to this strategy are our ongoing efforts to ensure that HathiTrust content is “in the flow” of library discovery more generally, as illustrated by our recent agreement to integrate the HathiTrust full text indexes into the Summon discovery service, and our collaboration with OCLC to create a permanent bibliographic catalog for HathiTrust.

The catalog as a tool for collection management

HathiTrust serves two primary constituencies: librarians as collection managers, and scholars and other users of our collections. This may seem like an artificial distinction—the lines between these two types of users and their discovery methods are often blurred, with bibliographically astute users wanting to look through the lens of the catalog, and reference librarians exhibiting some of the most sophisticated source-intensive research skills. Nevertheless, a central part of the work of libraries, and particularly the partner libraries, is collection management, and HathiTrust has as part of its design (both in its mission and goals) seamless integration into collection management strategies.

To best serve librarians as collection managers, a well-designed catalog is a critically important tool. A well-designed catalog for a collection manager always offers bibliographic precision. It allows the librarian to know (and find) exactly what is held, and also how that holding—that bibliographic instance—relates to other similar holdings. As we move into large-scale collection management across many of our cooperating libraries, this kind of well-designed catalog will play a critical role.

When HathiTrust launched its enterprise, we provided an extremely popular “temporary beta” catalog based on VuFind. It sported tremendous features like faceted results and the ability to sort results by date and rankings. It was well-received and reliable. At the same time, we announced a partnership with OCLC to build a replacement for this temporary beta, which we expect to launch sometime this year. Why replace the VuFind-based catalog, which works so well? Situating HathiTrust’s holdings in the larger OCLC WorldCat database is a tremendous boon to librarians in understanding what we have online, how the collections of the partner institutions relate to each other, and how those online holdings connect to libraries around the world. By managing HathiTrust’s records in the same place that other libraries do, we are better positioned to perform collection analysis and to shape future strategies to close gaps. In short, working with OCLC to build the HathiTrust catalog is an important strategy with regard to our collection management goals.

But should the creation of an effective catalog with OCLC cause us to abandon other bibliographic discovery strategies? Absolutely not. HathiTrust works in a number of ways to distribute bibliographic information to partners and the world. Our APIs allow libraries to add URLs to their catalogs where their library has a matching record. Our OAI distribution of brief records makes it possible for many libraries and other bibliographically-oriented entities to add records for materials unique to their collections. And the hathifiles, an inventory of HathiTrust holdings now numbering approximately 9 million lines, can help drive institutional processes to identify materials and shape more sophisticated record-oriented strategies. And of course OCLC’s efforts to load information about HathiTrust holdings are also a boon for libraries wishing to get records from OCLC. The creation of a catalog is critical, but does not by itself fulfill users’ needs to find records in other discovery venues.
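As one illustration of how an inventory like the hathifiles might drive a local process, the sketch below matches OCLC numbers from a library's own catalog against a tab-delimited inventory and builds a link for each matching volume. The two-column layout, the column positions, the handle-style URL prefix, and the sample identifiers are all assumptions for the example; a real inventory file should be read against its published field list.

```python
import csv
import io

# Assumed link form for a volume identifier; adjust to the documented URL pattern.
HANDLE_PREFIX = "https://hdl.handle.net/2027/"


def links_for_local_records(inventory, local_oclc_numbers, id_col=0, oclc_col=1):
    """Map each locally held OCLC number to links for matching inventory lines.

    `inventory` is any iterable of tab-delimited lines. `id_col` and `oclc_col`
    are hypothetical column positions for the volume identifier and the
    (possibly comma-separated) OCLC numbers.
    """
    local = {n.strip() for n in local_oclc_numbers}
    links = {}
    for row in csv.reader(inventory, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= max(id_col, oclc_col):
            continue  # skip blank or short lines
        for oclc in row[oclc_col].split(","):
            oclc = oclc.strip()
            if oclc in local:
                links.setdefault(oclc, []).append(HANDLE_PREFIX + row[id_col])
    return links


# Made-up sample data: volume identifier, then OCLC number(s).
sample = io.StringIO(
    "mdp.390150000001\t12345\n"
    "uc1.b0000002\t12345,67890\n"
    "mdp.390150000003\t99999\n"
)
matches = links_for_local_records(sample, ["12345"])
```

Because the file is streamed line by line and only the local identifiers are held in memory, the same approach scales to an inventory of several million lines.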

Full text discovery and support of scholarship

HathiTrust’s full text strategy is very similar to its bibliographic discovery strategy, though it flips the paradigm a bit. After an extraordinary research and development effort, HathiTrust launched a full text search service in 2010, and ever since then we’ve been working to chart a course for a better, more sophisticated service. This summer (2011), we will launch a new full text search service that will incorporate fuller bibliographic information in the full text, use facets, and offer other features such as weighting of results depending on where the results were found in a text. And of course this will only be one more step in a process of continual enhancement.
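To make concrete what "weighting of results depending on where the results were found" can mean, here is a minimal sketch of field-weighted scoring, in which a match in the title counts for more than a match in the body. The field names and weight values are invented for illustration and say nothing about how the HathiTrust service actually ranks results.

```python
# Hypothetical weights: a hit in the title counts five times a hit in the body.
FIELD_WEIGHTS = {"title": 5.0, "author": 3.0, "body": 1.0}


def score(document, term, weights=FIELD_WEIGHTS):
    """Sum weighted counts of `term` across the fields of one document."""
    term = term.lower()
    total = 0.0
    for field, weight in weights.items():
        words = document.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


def rank(documents, term):
    """Return documents containing `term`, highest weighted score first."""
    scored = [(score(doc, term), doc) for doc in documents]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]


docs = [
    {"id": "a", "title": "Copyright basics", "body": "an overview"},
    {"id": "b", "title": "Essays", "body": "copyright and more copyright"},
    {"id": "c", "title": "Other", "body": "nothing relevant"},
]
ranked_ids = [doc["id"] for doc in rank(docs, "copyright")]
```

In this toy data, the single title match in document "a" outranks the two body matches in document "b", which is the behavior field weighting is meant to produce.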

While HathiTrust believes the catalog function must be in OCLC, where libraries already manage their records, we also insist that the full text service must be in HathiTrust, where the materials are managed. Therefore we will focus increasingly on the standalone HathiTrust full text search service as a vehicle for end-user discovery. As such, it will always work to distinguish itself from the services offered by Google and other commercial services by enabling scholars to search for information precisely and exhaustively. Appealing as it is, Google Search’s lack of precision and complete recall can be a hindrance to much scholarly work, and here HathiTrust must step up. After all, our collection of content is different from Google’s (with our locally-digitized content and content that comes from partnership with other large-scale digitization initiatives), and our academic orientation ensures that our search results are not influenced by a connection with commerce, such as advertising.

Just as our OCLC strategy does not end our pursuit of other bibliographic discovery strategies, our decision to mount a robust full text search service in HathiTrust does not eliminate the need to ensure discovery elsewhere. Because so much of our content is in Google Book Search and the Internet Archive, we achieve this goal in part without much additional effort. Still, much of the content in HathiTrust is only accessible in HathiTrust, and so getting in the flow of our users’ discovery methods (particularly users of academic and research library collections) is very important. By making the HathiTrust indexes searchable in Summon, we begin to accomplish this. Although Summon is the first and best of these services, the marketplace will produce others, and we remain committed to ensuring that our content is discoverable in as many of these services as possible. Negotiations are underway with Summon’s competitors, and press releases will follow as we conclude these agreements.

That which can be found is more likely to be preserved

In order to effectively support its preservation mission, HathiTrust must constantly improve the discovery experience and must seek to situate discovery wherever our users search for information. “Either/or” strategies are bound to fail us. Indeed, we will continue to implement a range of discovery strategies in collaboration with all appropriate partners and in every appropriate location. Our strong connection to scholars will lead us to refine the approaches we take to discovery, and our knowledge of where they seek information will guide the approaches we take to distributing records and making our full text indexes available. By making the information we store as discoverable as possible, we stand the greatest possible chance of having that information found, valued and preserved.