Toward Greater “Transparency” in News Databases: Assessing Policy and Practice of Digitized Newspaper Repositories

The expansion of access to digitized newspapers affords many opportunities for researchers and libraries today. Since 2001, a growing and diversified body of providers—national and academic libraries, state and public institutions, publishers, third-party commercial producers—have brought scalable digital technologies to bear on the provision of legacy newspaper collections.

Image from Readex’s America’s Historical Newspapers collection showing issues held for the Chicago Herald.

However, as with the challenges of exposing structural metadata on available collections (see previous article), the diversity of players and the multitude of platforms and preservation systems involved have left libraries with little insight into the sustainability and likely persistence of digitized news content. To meet this challenge, with the support of The Andrew W. Mellon Foundation, CRL is gathering detailed, comparative information on the infrastructure, content management, and digital repository platforms maintained by commercial publishers and libraries that support digitized newspaper collections.

CRL’s Assessment Framework

CRL has created an assessment framework to evaluate the major platforms and repositories, as well as the practices and capabilities of the supporting organizations with respect to their ability to provide long-term access to digital news resources. The two primary purposes of the framework and the assessment efforts are:

  1. to inform library decisions about the purchase and licensing of newspaper databases, and
  2. to support strategic library and publisher investment in the development, implementation, and management of digitized newspapers programs.

CRL has a rich background in facilitating informed decision-making at academic and independent research libraries. Through its reviews of electronic resources, CRL provides objective information on major databases and digital collections of high interest for licensing by participating CRL libraries. CRL also publishes provider profiles on its eDesiderata site that analyze the business practices and sustainability of the resource providers. At the most rigorous level, CRL conducts in-depth assessments of digital archives that preserve research materials of interest to the CRL community. CRL reports are independent, critical analyses based upon information obtained from the repositories themselves, from reliable independent sources, and in some instances through formal audits.

These initiatives have informed CRL’s development of a new framework for assessing digitized newspaper repositories. The assessment is based on two major sets of criteria for trusted digital repositories: Trustworthy Repositories Audit & Certification: Criteria and Checklist [1] and ISO 16363: Audit and certification of trustworthy digital repositories [2]. However, the framework under development for digitized news repositories seeks to strip down the checklist criteria to a level of assessment that is credible and actionable, while employing a rating scheme less rigorous than full certification. The framework will provide “actionable intelligence” about repositories and programs, presented in the form of structured repository profiles that will be widely shared with libraries and other newspaper digitization stakeholders.

In assessing the major libraries and publishers conducting newspaper digitization, CRL seeks to answer the following questions:

  1. Does the organization provide adequate support, staffing, and financial resources to ensure the long-term sustainability of the repository?
  2. Does the organization demonstrate sufficient plans and policies for the selection, digitization, and storage of content?
  3. Does the organization follow content-management practices that utilize a workflow model demonstrating valid ingest and archiving processes as well as facilitating sufficient intellectual control of content?
  4. Does the organization utilize systems that adequately protect repository content and enable the monitoring of content integrity?

To fully answer these questions, CRL is documenting the policies, platforms, and technologies used by the libraries and publishers to manage digitized historic newspaper content. The information CRL is gathering falls into the following broad areas:

Organization Apparatus

To understand how well digitized newspaper content is managed by a given publisher or organization, CRL will examine the overall mission, project scope and documentation, general structure of the organization, and contingencies in place for ensuring long-term content integrity and persistence.

Program Planning and Management

To identify the key requirements for program implementation, CRL will examine the planning and preparation that go into managing the organization’s repository. CRL will gather documentation of formal decision-making responsibilities, guidelines for content retention and disposition, rights and permissions regimes, specifications for digitization and enhancement of content, and the organizational framework or model for storage and access to repository content.

Content Management

To assess the effectiveness of the workflow in promoting sustainability, CRL will survey the technical processes and workflows involved in submitting/acquiring, ingesting, and maintaining content in the repository. CRL will also assess the nature and quality of repository metadata and how it is maintained; the ability to express and expose structural metadata for purposes of assessment and discovery; and the ability to structure hierarchical relationships between content data objects and content sources (i.e., documenting the provenance of content).
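As an illustration only, the kind of structural metadata and provenance relationships described above might be modeled as a small title-to-issue hierarchy. This is a minimal sketch under stated assumptions, not CRL's or any provider's actual schema: the class names, the fields, and the assumption that the title was published daily are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class Issue:
    issue_date: date
    page_count: int
    source: str  # provenance of the scan (hypothetical free-text field)


@dataclass
class Title:
    name: str
    issues: List[Issue] = field(default_factory=list)

    def missing_dates(self, start: date, end: date) -> List[date]:
        """List dates in [start, end] with no held issue.

        Assumes daily publication, which will not hold for every title.
        """
        held = {issue.issue_date for issue in self.issues}
        span = (end - start).days
        candidates = (start + timedelta(days=n) for n in range(span + 1))
        return [d for d in candidates if d not in held]


# Hypothetical holdings record for a single title.
herald = Title(
    name="Chicago Herald",
    issues=[
        Issue(date(1890, 1, 1), page_count=8, source="microfilm"),
        Issue(date(1890, 1, 3), page_count=8, source="microfilm"),
    ],
)
gaps = herald.missing_dates(date(1890, 1, 1), date(1890, 1, 3))
```

Exposing holdings at the issue level in some structured form like this is what lets a library check coverage and trace each issue back to its source without relying on a provider's summary description.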

Technical Components

CRL will analyze the systems and technologies utilized to support the repository and sustain a digitized newspaper program.
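One widely used technique for monitoring content integrity is fixity checking: recording a checksum for each file at ingest and periodically recomputing it. The sketch below is illustrative and does not describe any assessed repository's system; the function names and the manifest layout (relative path mapped to a SHA-256 hex digest) are assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: Dict[str, str], root: Path) -> Dict[str, str]:
    """Compare stored files against a manifest of expected digests.

    Returns a mapping of relative path to "ok", "changed", or "missing".
    The manifest format is a hypothetical one chosen for this sketch.
    """
    report = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report[rel_path] = "missing"
        elif sha256_of(target) == expected:
            report[rel_path] = "ok"
        else:
            report[rel_path] = "changed"
    return report
```

A repository that can produce reports of this kind on a regular schedule, and show what happens when a file is flagged, gives an assessor concrete evidence of integrity monitoring.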

The Framework in Action

CRL has begun applying the framework in the profiling and rating of several library programs and commercial newspaper products, including the Library of Congress National Digital Newspaper Program, the National Library of Australia Newspaper Digitisation Program, and Readex’s historical newspaper databases. CRL consults open-source information (staff papers, publications, conference presentations, annual reports, budgets, agreements and contracts, website content, repository samples, and media coverage), confers with administrative and technical managers from the participating organizations, and enlists subject matter experts within the CRL community of libraries to build accurate and current assessment profiles.

The rating scheme proposed by CRL is based on the level of transparency of the repository operations, program accountability, and trustworthiness as demonstrated by the organization. In rating the programs, CRL will provide an objective measurement of the ability of a provider to demonstrate credible policies, practices, and infrastructure in support of the program, through such metrics as:

  • Organizational policies that explicitly define the financial and bureaucratic commitment to the newspaper program;
  • Plans for preserving the institution’s digital/digitized content over time;
  • Presence of appropriately trained staff thoroughly engaged in the newspaper program;
  • Evidence of content selection priorities, offering rationale and guidelines for carrying out selection procedures with attention to accountability;
  • Documented workflows and metadata specifications;
  • Content and metadata retrievable through digital platforms and repository samples that indicate the level of compliance with standardized content management practices;
  • Specifications for open source and/or proprietary software and hardware employed by the repository; and
  • Evidence and details on the application of backup and security tools used in the repository’s storage and dissemination systems.
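To make the idea of a transparency-based rating concrete, the metrics above could be recorded in a structured profile and summarized as a single score. The following is a hypothetical sketch only: the three evidence levels, the metric keys, and the equal weighting are assumptions for illustration and do not represent the scale CRL actually uses.

```python
from enum import IntEnum
from typing import Dict


class Evidence(IntEnum):
    NONE = 0        # nothing disclosed
    ASSERTED = 1    # claimed by the provider but not documented
    DOCUMENTED = 2  # supported by documentation available to the assessor


# One key per metric in the list above (names are hypothetical).
METRICS = (
    "organizational_policies",
    "preservation_plans",
    "trained_staff",
    "selection_priorities",
    "workflows_and_metadata_specs",
    "retrievable_content_and_metadata",
    "software_hardware_specs",
    "backup_and_security",
)


def transparency_score(profile: Dict[str, Evidence]) -> float:
    """Return the share of available evidence points, from 0.0 to 1.0.

    Metrics absent from the profile count as Evidence.NONE.
    """
    earned = sum(int(profile.get(m, Evidence.NONE)) for m in METRICS)
    return earned / (len(METRICS) * int(Evidence.DOCUMENTED))


# Hypothetical profile: half the metrics documented, the rest undisclosed.
example_profile = {m: Evidence.DOCUMENTED for m in METRICS[:4]}
example_score = transparency_score(example_profile)
```

A single number hides which areas are weak, so a profile built this way would still need to report the evidence level for each metric alongside any summary score.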

The assessments and profiles will identify the credible aggregators of historical newspapers, illuminate community standards and best practices, and analyze the risks to persistent access to digitized newspapers over time. These tools will develop the knowledge base for CRL’s efforts to build broad consensus among libraries, publishers, and commercial providers for the further digitization of newspapers globally.

  1. Trustworthy Repositories Audit & Certification: Criteria and Checklist. Version 1.0. OCLC and CRL, February 2007.
  2. International Standard ISO 16363:2012, Space data and information transfer systems – Audit and certification of trustworthy digital repositories. Geneva, Switzerland: International Organization for Standardization.