Center for Research Libraries - Global Resources Network

Overview

Summary

HathiTrust was launched in 2008 by the thirteen libraries of the CIC, the University of California system, and the University of Virginia. It is intended to be a shared digital repository for storing partner libraries’ digital content. It incorporates all the content from Michigan’s MBooks catalog plus content from other HathiTrust partners. It currently has 26 partners who all contribute or intend to contribute to the project. The goal is to become an organization that is not dependent on one particular university library for work, location, or funding.  

CRL’s April 2011 certification report details some of the strengths and weaknesses of the HathiTrust repository as of the end of 2010, based on an audit of the repository performed between November 2009 and December 2010. The CRL Certification Advisory Panel concluded that “the practices and services described in HathiTrust public communications and published documentation are generally sound and appropriate to the content being archived and the general needs of the CRL community.” That report can be found at: http://www.crl.edu/news/7244

Sources

The information in this report is based on review of extensive documentation gathered by CRL from published and other open sources; data and documentation provided by HathiTrust between November 2009 and December 2010; and a site visit held in May 2010.

CRL thanks HathiTrust executive director John Wilkin and Jeremy York, who fielded a multitude of CRL questions and requests for documentation throughout the course of CRL’s research.

Authors

Center for Research Libraries

  • Kayla Ondracek and CRL staff

Analysis

Mission and History

The stated mission of HathiTrust is to “contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.”[1]

The stated goals of the HathiTrust Digital Library, as of April 2011, were: 

  • To build a reliable and increasingly comprehensive digital archive of library materials converted from print that is co-owned and managed by a number of academic institutions.
  • To dramatically improve access to these materials in ways that, first and foremost, meet the needs of the co-owning institutions.
  • To help preserve these important human records by creating reliable and accessible electronic representations.
  • To stimulate redoubled efforts to coordinate shared storage strategies among libraries, thus reducing long-term capital and operating costs of libraries associated with the storage and care of print collections.
  • To create and sustain this “public good” in a way that mitigates the problem of free-riders.
  • To create a technical framework that is simultaneously responsive to members through the centralized creation of functionality and sufficiently open to the creation of tools and services not created by the central organization. [2]

HathiTrust originated with MBooks, a project created at the University of Michigan. The 2005 cooperative agreement between the University of Michigan and Google authorized the University to provide access, through the Michigan library, to copies of all of its scanned content. Access was originally provided through the library’s Mirlyn Library Catalog. When the libraries of the CIC decided to use this infrastructure to manage and provide access to their own digital content, the Mirlyn catalog was renamed “HathiTrust Digital Library.” At first, the options for viewing, printing, and creating collections were the same as in MBooks. With new partner participation, however, HathiTrust improved access to materials within its collections through enhancements such as the PageTurner interface. (See Content and Services, below.)

Governance and Staffing

HathiTrust is governed by an Executive Committee consisting of the executive director of the Trust and senior officers from the founding institutions: University of Michigan, University of California, Indiana University, and the CIC. The Executive Committee “bears final responsibility for the activities and functions of the HathiTrust operations and for the partnership, as well as for the long-term integrity and accessibility of deposited materials.”  The Executive Committee is guided by a Strategic Advisory Board consisting of senior professional staff from constituent universities, with special representation from the CIC and the University of California.[3]

Funding and Planning

HathiTrust is a cooperative enterprise undertaken by a number of major U.S. research libraries led by the University of Michigan Libraries. Development and operation of the repository rely heavily on funding from the University of Michigan and the California Digital Library. Additional funding is provided by the libraries of the CIC and by other major U.S. research libraries such as Cornell University, Yale University, and the University of Pennsylvania.

Many of the participating libraries deposit digital files and metadata produced through mass digitization of their holdings in partnership with Google, the Internet Archive, and others. 

Development and operation of the repository also rely upon substantial contributions of technology, technical infrastructure, services, and staff time and expertise by library partners such as the University of Michigan, Indiana University, the California Digital Library, and others. 

The basis on which the annual contributions of member libraries are determined has evolved over time. Fees have been determined variously by: a) a cost-sharing formula based on real or projected costs; b) the amount of the member library’s digital content expected to be stored in the repository; and/or c) the extent of overlap between the member library’s local print holdings and digital files included in the repository. 

Stakeholders and Designated Community

The repository is designed primarily to serve the needs of research libraries and the scholarly constituencies of those libraries. As much of the original content contained in the repository was generated through mass digitization projects undertaken in partnership with Google and the Internet Archive, the repository serves as an additional host for some of the digital assets maintained by those organizations.

Content and Services

Content Characteristics – Types and Formats

As of April 2011, HathiTrust delivered primarily digitized textual documents such as books, serials, and government documents from the holdings of U.S. research libraries. Because HathiTrust has received most of its content from digitization projects initiated through the Google Books Library Project, the nature of this content has been largely informed by Google’s selection preferences and processes. This means that content in HathiTrust comes from materials preserved for a wide variety of disciplines and selected on the basis of format, condition, size, and uniqueness. 

Some collections scanned by the University of Michigan are also present in the repository, as well as content from the Internet Archive. HathiTrust is developing a formalized process for partners to scan their own materials locally and deposit the resulting digital objects. Implementation of this process will allow for a greater variety of specialized collections as well as the inclusion of larger-format books and visual materials. As of this writing, HathiTrust was investigating the possibility of delivering other formats, such as digital audio, images, electronic publications (such as EPUB, an option often offered by Google Books), and born-digital publication formats. Currently, bibliographic information for these formats exists in the search system, with links to find physical copies in a library, if the format has a corresponding textual document that has been digitized, such as a pamphlet or a set of notes.

The availability of certain content types follows patterns tied to copyright status, which restricts or allows access. Many government documents are considered public domain and so are freely viewable.[4]

HathiTrust uses the digitization specifications from the University of Michigan as its standard for the preservation metadata associated with incoming files, quality control, and acceptable file formats.[5] These specifications document the file formats acceptable for preservation. HathiTrust accepts:

  • TIFF ITU G4 files stored at 600 dpi
  • JPEG or JPEG2000 files stored at resolutions ranging from 200 dpi to 400 dpi
  • XML files with accompanying DTD (METS) (HathiTrust “Preservation”).

Content Characteristics – Ingest and Quality

HathiTrust ingests digital image files in formats that generally carry a metadata layer as part of the format package. It also ingests METS (Metadata Encoding and Transmission Standard) documents; METS is an industry-standard XML Document Type Definition designed specifically for the metadata needs of digital repositories. HathiTrust uses PREMIS (Preservation Metadata Implementation Strategies) terms to extend the METS document in the areas of provenance, fixity, and context. These methods are intended to keep digital objects within the repository ready for refreshing and/or migration in the future. Correct metadata associated with the files will be used to record such events and to help ensure that the objects can be read with hardware and software as yet unforeseen. For delivery, the PageTurner application creates a PNG (for image files stored as TIFF) or JPG (for those stored as JPEG2000) image for viewing by the user, as these formats tend to be smaller and to load faster (Shallcross 24).

HathiTrust has adopted certain standards for ingest that determine the format and quality of files that come from partner universities. Ingest is handled by the GROOVE application, which includes the JHOVE validation tool and other tools to handle METS document creation and fixity checks. Objects are fed through this system, which is fully automated in order to minimize opportunities for operator error.

Partners first submit bibliographic data in MARC format to the University of Michigan’s Aleph bibliographic management software, and then submit the content itself. Content is subject to these tests:

  • Metadata: internal tests to ensure MARC21 conformance and completeness
  • OCR text: tested for well-formedness using JHOVE [validation tool nested within GROOVE]
  • Image files: tested for well-formedness using JHOVE
  • Metadata in image files: internal tests for consistency with conventions
  • Digital signatures (MD5 checksums) for all OCR text and image files: checksum verification
  • Additionally: a one-to-one correspondence is ensured between OCR text and image files.
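Two of the tests listed above, checksum verification and the one-to-one correspondence between OCR text and image files, can be illustrated with a minimal sketch. This is not the GROOVE implementation: the function names, the in-memory `files` and `manifest` mappings, and the file extensions (`.txt`, `.tif`, `.jp2`) are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of a byte string as lowercase hex."""
    return hashlib.md5(data).hexdigest()


def check_package(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return a list of problems found in a submitted package.

    files    -- hypothetical mapping of file name to file contents
    manifest -- hypothetical mapping of file name to expected MD5 hex digest
    """
    problems = []

    # Checksum verification for every file listed in the manifest.
    for name, expected in manifest.items():
        if name not in files:
            problems.append(f"missing file: {name}")
        elif md5_hex(files[name]) != expected.lower():
            problems.append(f"checksum mismatch: {name}")

    # One-to-one correspondence between OCR text and page images,
    # matched here on the file name without its extension (an assumption).
    ocr = {PurePosixPath(n).stem for n in files if n.endswith(".txt")}
    images = {PurePosixPath(n).stem for n in files if n.endswith((".tif", ".jp2"))}
    for stem in sorted(ocr ^ images):
        problems.append(f"no OCR/image counterpart: {stem}")

    return problems
```

An empty list means the package passed both checks; anything else names the file or page that failed, which suits a fully automated pipeline that must report rather than repair.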

Rights Management: Google Books

HathiTrust currently receives much of its content from the Google Books Library Project through partner universities. This means that the uses HathiTrust makes of the content are subject to the terms of the original agreements between Google and the partner institutions. The terms of the 2007 amendment to the agreement between Google and the University of Michigan stipulate that a separate written agreement must be established between Google and any third party, preventing that third party from allowing mass downloading of material, sharing with unspecified parties, etc. Under the new terms, the third party may not use the material in a way that is less restrictive than the original agreement between Google and its partner university. Complications may arise as each partner university creates its own unique agreement with Google concerning the digitization project and the access and use of the resulting digital files. These idiosyncrasies likewise may affect flexibility in content use, the nature of the content, and the timing of the content’s presence in HathiTrust.[6]

Rights Management – General

When it comes to rights management, HathiTrust’s philosophy is to keep rules as simple and automatable as possible, allow for some mistakes, and remain open to suggestions and feedback so that mistakes can be corrected after the fact. HathiTrust adheres to the simpler copyright rules in order to automate ingest and keep the process as efficient as possible (HathiTrust “Rights Management”). The result is a fairly reliable and conservative system.

After HathiTrust has received bibliographic data and upon ingest of the material, the content is assigned a number based on its publisher and publication date. Because there is no designated field within MARC to store rights information, such information is stored in a MySQL rights database and accessed upon request by a user. The numbers are as follows:

  1. public domain
  2. in-copyright
  3. out-of-print and brittle (implies in-copyright)
  4. copyright-orphaned (implies in-copyright)
  5. undetermined copyright status
  6. available to UM affiliates and walk-in patrons (all campuses)
  7. available to everyone in the world
  8. available to nobody; blocked for all users
  9. public domain only when viewed in the US (HathiTrust “Rights Database”).

Note: Number 6 is not currently in use; no occasion has yet arisen to assign it.

For most countries, material can be said to fall into the public domain if it was published before 1870.[7] In the United States, material is in the public domain if it was published before 1923. Because of this difference, some material may be viewed by users located in the United States but not by those outside the country. To respect these differing access rights, HathiTrust has implemented a GeoIP database that maps IP addresses to locations; it is consulted when a user requests a volume through the PageTurner interface. Materials that have been assigned the number 3 may only be viewed from IP addresses originating in University of Michigan campus library buildings, as previous agreements have given the University of Michigan this right. Many documents published by the United States government are in the public domain and so may be fully viewed, even those published after 1923.
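The date thresholds described above lend themselves to a simple automated rule. The following is a minimal sketch only, not HathiTrust's actual rights determination: the function name is hypothetical, treating every U.S. federal document as public domain simplifies the report's statement that "many" are, and defaulting everything else to in-copyright is an assumption chosen to be conservative.

```python
# Rights numbers taken from the list in this report.
PUBLIC_DOMAIN = 1
IN_COPYRIGHT = 2
PUBLIC_DOMAIN_US_ONLY = 9

NON_US_CUTOFF = 1870  # the "safe figure" for most countries outside the US
US_CUTOFF = 1923      # US public domain threshold cited in this report


def provisional_rights_code(pub_year: int, us_federal_document: bool = False) -> int:
    """Assign a provisional rights number from the publication year.

    Hypothetical helper: the real process also considers the publisher
    and is open to later correction.
    """
    if us_federal_document or pub_year < NON_US_CUTOFF:
        return PUBLIC_DOMAIN
    if pub_year < US_CUTOFF:
        return PUBLIC_DOMAIN_US_ONLY
    # Conservative default (an assumption): anything later is in copyright.
    return IN_COPYRIGHT
```

A rule of this shape errs on the side of restriction, which is consistent with the conservative, easily automated approach the report describes.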

Publishers or other rights holders can give HathiTrust permission to provide reading access to its users by signing a permissions agreement; these agreements are kept on file at the University of Michigan. Rights holders may also object to the display of their work or its presence in the archive and can issue a take-down notice to HathiTrust, which HathiTrust will investigate (HathiTrust “Rights Management”). In addition, users can provide feedback to HathiTrust on a specific volume if they suspect it has come out of copyright.

 

Collection Discovery and Services

User Interface

Since April 2009, HathiTrust has used VuFind as its temporary public interface (HathiTrust “Objectives”). It allows for faceted bibliographic search of HathiTrust’s holdings. After an initial search, users can further restrict results by viewability, subject, place of publication, original format, and contributing institution.

Since April 2009, HathiTrust’s Discovery Interface Working Group and OCLC have been working together toward a public interface that will replace the temporary beta interface currently in place. May 2010 saw the first successful installation of the HathiTrust WorldCat Local instance, which is now undergoing testing and evaluation by HathiTrust and OCLC (HathiTrust “Update May 2010”). Beyond the public interface, indexed searching of a volume’s full text is made possible through HathiTrust’s use of Solr, an open-source search server built on the Lucene search engine. It was determined that Solr in its existing form would not be able to accommodate the needs of HathiTrust and its users, so the programmers at HathiTrust would work with Solr’s development community to develop the application further. Issues that arise with the application, as well as improvements made to it, are recorded in updates to the Large-Scale Search blog.[8]

OCLC recently added HathiTrust records into its WorldCat worldwide bibliographic catalog, which is another step toward a more discoverable collection (OCLC). HathiTrust results are also displayed in the online catalogs of partner universities, including volumes digitized elsewhere.

Collection Builder

HathiTrust’s Collection Builder “provides the ability for end-users and collection development staff to create and ‘publish’ virtual collections of volumes held in the repository.” This tool began in 2008 with a University of Michigan-specific version and has since been generalized to include participating libraries. Students, faculty, and staff at these libraries can log in using their university ID number and password in the Shibboleth authentication portal and build personal or public collections that can be shared with others. This is a valuable tool for instructors to gather pertinent material for courses, or for subject librarians to similarly create collections in their subject area to serve as a resource for students. Collection Builder can be accessed through http://babel.hathitrust.org/cgi/mb.

Shibboleth Authentication

As of June 2010, users at participating universities were able to log in to HathiTrust using the Shibboleth authentication portal in order to access certain services such as the Collection Builder. PageTurner recognizes the authentication as well, and subsequently authorizes full PDF downloads of public domain materials. Without authentication, users are only able to view the volume with PageTurner or download 1-page PDFs. This step expands use for such participating universities and opens the door for the development of more valuable tools for designated users.

PageTurner

PageTurner is the viewing interface for public domain or otherwise authorized volumes. It is also the mechanism through which viewing rights are granted to users. When a volume is requested, the application queries the rights database and the GeoIP database to either grant or deny access. The GeoIP query compares the user’s IP address with the location information mapped in the database and can further grant or restrict access (York “Poster”). For instance, those users whose IP addresses originate from University of Michigan libraries have expanded access to volumes which are determined to be brittle, and those users who are outside of the United States and are trying to view documents published between 1870 and 1923 will find their access restricted.
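The access decision described above can be sketched as a function of the volume's rights number and the location resolved from the user's IP address. This is an illustration under stated assumptions, not PageTurner's code: the `Request` fields and the `may_view` function are hypothetical, the location lookup is assumed to have already happened, and every rights number not explicitly discussed in this report is simply denied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Location facts assumed to come from a GeoIP lookup (hypothetical)."""
    country: str         # country code resolved from the user's IP address
    in_um_library: bool  # True if the IP maps to a University of Michigan library


def may_view(rights_code: int, request: Request) -> bool:
    """Decide whether a volume may be displayed to this request."""
    if rights_code in (1, 7):  # public domain / available to everyone
        return True
    if rights_code == 9:       # public domain only when viewed in the US
        return request.country == "US"
    if rights_code == 3:       # out-of-print and brittle
        return request.in_um_library
    # Numbers 2, 4, 5, 6 and 8, and anything unknown, are denied in this sketch.
    return False
```

For example, a volume carrying number 9 would be viewable for `Request(country="US", in_um_library=False)` but not for a request resolved to another country.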

The interface itself is intuitive and effective, with simple arrow buttons to guide the user page-by-page or to skip to the end or beginning of the volume. Users can enlarge/shrink the size of the page, rotate the image, jump to a different page, use the table of contents to navigate, or change how they would like to view the document (OCR text, image, 1-page PDF, full PDF [if logged in]). Users can also bookmark the page, search the full-text, and provide feedback to the Digital Library Production Service of the University of Michigan to suggest improvements or point out any issues with the volume. Images are delivered to the interface in either PNG (for archival files stored as TIFF) or JPG (for files stored as JPEG2000) formats. Familiar navigation methods help users become comfortable with the application more quickly. Inviting user feedback, with the option to provide an e-mail address for a response, helps to improve both the service to the individual user and the application as a whole.

Bib and Data APIs

HathiTrust makes some APIs (Application Programming Interfaces) available to partners in order to expand access to the mechanisms behind HathiTrust. Through the release of the Bib API in January 2010, partner universities and OCLC have been able to put HathiTrust bibliographic records into their own search catalogs. The Bib API returns bibliographic, rights, and volume metadata about the collection (HathiTrust “Bibliographic API”). The Data API provides a means to access the content and metadata themselves: “The HTD [HathiTrust Repository Data] API provides extensible, efficient and secure access to the data and metadata resources of the HathiTrust Repository,” so that those who already have identifying data about the collection can have corresponding access to the content, such as page images, OCR text, and METS files. The API returns an XML, JSON, or binary representation of the content requested (HathiTrust “Data API”). The Data API has been used for applications such as PageTurner and the Collection Builder, and a specification has been posted for review and comment. Limited access by researchers to content may also be available through such APIs, and more may be developed in the future.
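As an illustration of how a partner catalog might consume the Bib API, the sketch below builds a lookup URL and picks out fully viewable volumes from a JSON response. The base URL, the `brief/<id type>/<id>.json` pattern, and the field names `items`, `htid`, and `usRightsString` are assumptions to be checked against the current API documentation; the sample response is hand-written with made-up identifiers and is not real output. No network request is made.

```python
import json
from urllib.parse import quote

# Assumed base URL for the Bib API; verify against current documentation.
BASE = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, value: str, detail: str = "brief") -> str:
    """Build a lookup URL for one identifier (for example an OCLC number)."""
    return f"{BASE}/{detail}/{quote(id_type)}/{quote(value, safe='')}.json"


def full_view_ids(payload: str) -> list[str]:
    """Return the volume identifiers marked as fully viewable in a response."""
    data = json.loads(payload)
    return [
        item["htid"]
        for item in data.get("items", [])
        if item.get("usRightsString") == "Full view"
    ]


# Hand-written sample with made-up identifiers, for illustration only.
SAMPLE = """{"records": {}, "items": [
  {"htid": "mdp.0001", "usRightsString": "Full view"},
  {"htid": "mdp.0002", "usRightsString": "Limited (search-only)"}
]}"""
```

Keeping URL construction separate from response parsing lets a local catalog test its display logic against stored responses before querying the live service.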

OAI [Open Archives Initiative]; Tag-delimited Files; Datasets

An OAI feed of MARC21 records and unqualified Dublin Core records for public domain records, as well as tag-delimited files containing metadata identifying the content held in the HathiTrust repository, are available from the HathiTrust website (HathiTrust “Data Distribution”). Two datasets, one of 5,000 and the other of 50,000 public domain volumes, have been made available to researchers to explore, or to use in the 2009 “Digging into Data” challenge. The 50,000-volume dataset is still available. HathiTrust hopes to participate in the development of data mining tools, and as of July 2009, it has “engaged members of partner institutions in a working group to develop specifications for a HathiTrust Research Center” to support data mining and large-scale analysis (HathiTrust “Objectives”).
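An OAI feed is harvested with OAI-PMH requests, which are ordinary HTTP GET requests carrying a `verb` and a `metadataPrefix`. The sketch below only builds such a request URL. The base URL is a placeholder, not HathiTrust's endpoint, and while `oai_dc` is the standard prefix for unqualified Dublin Core, the prefix a given repository uses for MARC21 (shown here as `marc21`) is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode


def oai_list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL.

    base_url  -- the repository's OAI endpoint (placeholder in the examples)
    set_spec  -- optional OAI set to restrict the harvest
    from_date -- optional lower date bound, e.g. "2010-06-01"
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

Calling `oai_list_records_url("https://example.org/oai", from_date="2010-06-01")` yields a request for Dublin Core records changed since that date, which is the usual way to keep a local copy of the metadata current without re-harvesting everything.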

Technical Systems and Analysis

Beyond file formats and metadata standards, HathiTrust employs strategies accepted by the digital repository community as standard for long-term preservation. Redundancy of data is of the utmost importance in any digital repository, in order to avoid permanently losing data to natural, malicious, or accidental incidents of any scale. To fulfill this requirement, HathiTrust has two functional locations: the University of Michigan in Ann Arbor, which is the main storage and ingest site, and a mirror site at Indiana University that continually syncs with Ann Arbor, completing a full sync approximately every three days (Shallcross 6). The mirror site is fully functional except that it cannot ingest new material. Besides these two sites, HathiTrust uses the IBM Tivoli Storage Manager (TSM) application of the University of Michigan ITCS to make nightly backup tapes at two separate locations in Ann Arbor.[9] The main storage site’s servers are physically secured in a locked cage accessible only to approved staff, and storage hardware is constantly monitored by MACC (Michigan Academic Computing Center). Regularly released updates to software and hardware are applied to the system only after a weeklong period of testing and quarantine (TRAC 11).

HathiTrust plans to refresh its hardware every 3-4 years, or as necessary, to ensure a stable medium for the digital content. Content is refreshed or migrated as necessary. HathiTrust has demonstrated its ability to handle a large-scale migration of content to new hardware as well as to new formats: it has migrated all content from its old storage to its new Isilon servers, and “has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode.”

HathiTrust’s insistence on open specification and standard file formats has lessened the need for normalization and migration. To date the repository has not needed to use resources such as format registries (such as PRONOM) for assistance in migration. However, as HathiTrust expands to include a wider range of content sources, it may be necessary to use these services to aid with ingest and the subsequent management or normalization of nonstandard file formats. HathiTrust encapsulates its Archival Information Packages in a pairtree file hierarchy structure.[9] These encapsulations consist of page image files, OCR text, coordinate OCR files, a Google METS file, and a HathiTrust METS file (York “Notes” 3).
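A pairtree hierarchy maps an object identifier to a directory path by cleaning the identifier and splitting it into two-character segments. The sketch below follows the character-cleaning rules of the Pairtree specification as generally published; using the cleaned identifier as the final, encapsulating directory name is a common convention and an assumption here, not a confirmed detail of HathiTrust's layout. The function names and sample identifiers are illustrative.

```python
import re


def pairtree_clean(identifier: str) -> str:
    """Apply the two-step character cleaning from the Pairtree specification."""

    # Step 1: hex-encode, as ^hh, any byte outside visible ASCII and the
    # characters that are unsafe in file names.
    def hex_encode(match: re.Match) -> str:
        return "".join(f"^{b:02x}" for b in match.group(0).encode("utf-8"))

    step1 = re.sub(r'[^\x21-\x7e]|["*+,<=>?\\^|]', hex_encode, identifier)

    # Step 2: single-character substitutions for very common characters.
    return step1.translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str, segment_length: int = 2) -> str:
    """Return the relative directory path for an identifier."""
    cleaned = pairtree_clean(identifier)
    segments = [
        cleaned[i:i + segment_length]
        for i in range(0, len(cleaned), segment_length)
    ]
    # Assumption: the object lives in a final directory named after
    # the cleaned identifier.
    return "/".join(segments + [cleaned])
```

With these rules, a made-up identifier such as `39015012345678` maps to `39/01/50/12/34/56/78/39015012345678`. The shallow, evenly distributed directories keep any single directory from growing too large, and the path can be derived from the identifier alone without consulting a database.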

Conclusions

The CRL report on HathiTrust certification identified the strengths and limitations of the repository, as of January 2011.  That report is available at:  http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-digital-repositories/hathi

For judging ongoing performance of the repository, there are a number of specific indicators that may be useful in determining the extent to which HathiTrust services and resources are maintaining their value to the community: 

  1. Growth of financial support from participating libraries:  This growth should be measured in terms of both the number and diversity of the participant population. Diversity should be defined in terms of institution type and size, geographic reach, major disciplines, and sectors (i.e., libraries, publishers, private institutions, public institutions) represented.  
     
  2. Progress toward clearing rights for non-public domain content: As most of the content of the repository falls, or may fall, within the period potentially covered by copyright, efforts by HathiTrust to clear rights for those materials will increase the value of the corpus.  
     
  3. Rate of successful ingest of Google and other content from participants:  The growth of the corpus of well-formed, available content will increase the value of the HathiTrust Digital Library.  Success here will depend upon the ability of HathiTrust to provide quality inspection of materials from Google and other providers in a time frame that permits provider re-scanning of the original source materials where HT standards are not met.  

Endnotes

[1] “HathiTrust Digital Library, Mission and Goals” at http://www.hathitrust.org/mission_goals, accessed 4/11/2011.

[2] Ibid.

[3] “HathiTrust Digital Library: Governance” at http://www.hathitrust.org/governance,  accessed 4/11/2011.

[4] This is an instance in which HathiTrust currently differs greatly from Google Books: Google Books tends to restrict viewing of government documents despite their public domain status, sometimes even despite their pre-1923 publication.

[5] http://www.lib.umich.edu/files/UMichDigitizationSpecifications20070501.pdf as of 6/24/2010. Last updated in 2007. A new version is currently in development.

[6] For instance, where some universities have permitted Google to digitize a certain number of unique volumes, whether they are in the public domain or in copyright, others have permitted Google to scan public domain volumes only. The Committee on Institutional Cooperation’s (the CIC) agreement allows the digitization of any unique materials in its member institutions that are in good condition, but the CIC will only immediately receive a digital copy of materials in the public domain. Materials that are in copyright are to be held in escrow, paid for by Google, and given to the CIC as the volumes pass into the public domain (CIC/G 4.11). The lack of in-copyright scanned works from such institutions potentially undermines HathiTrust’s mission as a digital repository working toward the long-term preservation of digital materials. In the case of institutions that do not allow Google to scan in-copyright works at all [such as the New York Public Library, a recent HathiTrust partner], and in light of a potential partnership with HathiTrust, unique volumes that were not scanned could remain unscanned even as these works pass into the public domain. These contingencies affect, at the very least, the timing of a volume’s acceptance into HathiTrust.

[7] Or later, depending on the country. This is a “safe figure” that allows HathiTrust to respect the copyright laws of most countries outside of the United States.

[8] The blog can be accessed at http://www.hathitrust.org/blogs/large-scale-search as of June 24, 2010.

[9] These two sites are at the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr. Page 7, Disaster Recovery document.
