During these six months, staff shifted their focus from research and analysis toward registry development. Work centered on building and refining procedures for detecting duplicate and related records, and on specifications for system development.
Project staff
During the past several months we advertised the position of Registry Applications Developer and interviewed several applicants. We hope to have someone in place no later than the first of the year.
Relationship detection
Much work has been done in the area of relationship detection over the past six months. The project team has:
- Documented methods for identifying a US government document from a MARC record;
- Drafted, tested, and continually updated an algorithm for relationship detection, involving identifiers such as OCLC number, LCCN, ISBN, ISSN, and SuDoc call number, along with text such as titles, authors, publishers, and publication dates;
- Drafted and finalized initial rules for normalization of data, including enumeration and chronology;
- Drafted specifications for merging duplicate records;
- Drafted specifications for validation of submitted records;
- Conducted testing on known duplicate and related records in HathiTrust, as well as a set of records contributed by partner libraries.
We anticipate that the initial relationship detection process will be finalized and in place by November 15th.
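As an illustration of the identifier-matching side of relationship detection, the following is a minimal Python sketch. It is not the project's algorithm: the normalization rules, the function names `normalize_identifier` and `candidate_pairs`, and the record layout are all assumptions made for this example. It only shows the general idea of normalizing identifiers and flagging records that share one; comparison of titles, authors, publishers, and publication dates is not covered.

```python
import re
from collections import defaultdict
from itertools import combinations


def normalize_identifier(kind, value):
    """Reduce an identifier to a comparable form.

    These rules are illustrative assumptions, not the project's rules.
    """
    value = value.strip().upper()
    if kind in ("isbn", "issn"):
        # Keep digits and the check character X; drop hyphens and spaces.
        return re.sub(r"[^0-9X]", "", value)
    if kind == "oclc":
        # Drop prefixes such as "(OCoLC)" or "ocm" and any leading zeros.
        return re.sub(r"\D", "", value).lstrip("0")
    if kind in ("lccn", "sudoc"):
        # Remove internal whitespace only.
        return re.sub(r"\s+", "", value)
    return value


def candidate_pairs(records):
    """Return pairs of record ids that share a normalized identifier.

    `records` maps a record id to a list of (kind, value) tuples.
    """
    index = defaultdict(set)
    for rec_id, identifiers in records.items():
        for kind, value in identifiers:
            normalized = normalize_identifier(kind, value)
            if normalized:
                index[(kind, normalized)].add(rec_id)

    pairs = set()
    for rec_ids in index.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs
```

Pairs produced this way are only candidates. Deciding whether two records are duplicates or merely related (for example, different volumes of one serial) would still depend on the text fields and on enumeration and chronology.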
Manual review of registry metadata
While the relationship detection process will aid in the automated de-duplication of Registry metadata, we recognize that some records will need to be reviewed manually. Project staff began identifying categories of records that are likely to need manual review, and began considering how to implement the manual review process and recruit people to conduct these reviews.
Gap detection/comprehensiveness work
Staff reviewed several government documents titles in an attempt to identify any comprehensive sets of government publications in HathiTrust. Titles included Foreign Relations of the United States, Budget of the United States Government, Statistical Abstract of the United States, the Monthly Catalog of United States Government Publications, the United States Code, and the World Factbook. Missing volumes were noted.
Additionally, this work has highlighted some possible methods and challenges for the automated detection of gaps in Registry metadata. We anticipate that more detailed work surrounding gap detection will take place in the next three months.
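One simple approach to automated gap detection can be sketched in Python as follows. This is an assumption-laden illustration rather than a method the project has adopted: the function name `find_gaps` is hypothetical, and the sketch assumes that each held item carries a single sequence number (a volume number or a year) as the first number in its enumeration or chronology string.

```python
import re


def find_gaps(held_items):
    """Return sequence numbers missing between the lowest and highest held.

    Assumes the first run of digits in each string is the volume number
    or year. Strings without digits are ignored.
    """
    numbers = set()
    for item in held_items:
        match = re.search(r"\d+", item)
        if match:
            numbers.add(int(match.group()))
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1)
            if n not in numbers]
```

For example, `find_gaps(["1990", "1991", "1993", "1996"])` returns `[1992, 1994, 1995]`. A check like this cannot see volumes missing before the first or after the last held item, and it does not handle combined issues, supplements, or titles whose numbering restarts, which are among the challenges noted above.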
Project staff also began identifying records currently in HathiTrust that are US government documents but are not coded as such in the 008 field. Thus far more than 2,300 items have been identified.
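For context, a check of the 008 coding might look like the Python sketch below. In MARC 21, for books and several other material types, position 28 of the 008 field holds the government publication code (`f` for federal/national) and positions 15-17 hold the place of publication code, where US codes end in `u` (for example `dcu`). The function names, and the use of a SuDoc number as outside evidence, are assumptions made for illustration and do not describe how project staff identified these records.

```python
def is_coded_us_federal(field_008):
    """True if the 008 codes a US place and a federal government publication.

    Applies only to material types where 008/28 is the government
    publication position (for example books and continuing resources).
    """
    if len(field_008) < 29:
        return False
    place = field_008[15:18].strip()
    return field_008[28] == "f" and len(place) == 3 and place.endswith("u")


def is_uncoded_candidate(field_008, has_sudoc_number):
    """Flag a record that has a SuDoc number but lacks federal 008 coding.

    `has_sudoc_number` is a hypothetical flag the caller derives from
    other fields in the record.
    """
    return has_sudoc_number and not is_coded_us_federal(field_008)
```

A flagged record would still need review, since a SuDoc number alone does not guarantee that the item is a federal publication.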
Presentations
Staff presented the Registry project to several groups of government documents librarians (see HathiTrust Papers and Presentations):
- Federal Depository Library Conference (May 1, 2014)
- Minnesota GovDocs Forum (May 9, 2014)
- Kentucky Library Association GODORT (May 16, 2014)
- ALA GODORT Cataloging Committee (June 28, 2014)
- CIC Heads of Government Publications (June 29, 2014)