The ICON Database: New Information for Decision-making on Newspaper Digitization and Preservation

The Challenge

Image of ICON database display for the Times of India, showing issues held and gaps in the aggregate holdings represented in ICON.

Effectively managing and providing access to historical newspapers are matters of consequence for libraries. Academic libraries and many national libraries invest considerable sums to digitize newspapers to make them more accessible to historians and other researchers. And each year, research libraries in the aggregate spend millions of dollars to purchase databases of digitized historical newspapers from commercial publishers. The pressure to provide these resources is rising: News content is in high demand from researchers, who are now able to use sophisticated software and applications to mine the large bodies of data-rich text that newspapers represent.

At the same time, original back files of newspapers are increasingly threatened. National libraries and many major academic libraries, long able to maintain extensive and bulky collections of these files, are now under intense pressure to repurpose scarce storage space to provide public amenities such as exhibits, study rooms, and social spaces. In other instances, newspaper back files are imperiled by inherent vice: they were printed on poor-quality paper and often stored under unfavorable environmental conditions. As a result, libraries must make consequential decisions daily about whether to preserve, conserve, reformat, or even retain these important materials.

Unfortunately, few precise figures are available to inform these decisions. However, The Andrew W. Mellon Foundation recently awarded the Center for Research Libraries major funding to expand CRL’s collecting and analysis of data on archived and digitized newspapers and, based on that data and analysis, to promote coordinated, strategic action by libraries and consortia.

CRL’s Role in Newspaper Preservation and Acquisitions

CRL today provides data and analysis to enable academic and independent research libraries to make informed decisions about their own acquisition and management of news materials locally. CRL’s white papers on the lifecycle of electronic news (2011), the adoption of the web by African newspaper publishers (2013), and the comparative coverage of news broadcast transcripts by the major aggregators (2013) are a few of the resources designed to help illuminate the complex landscape of digital news for librarians.

CRL now also brokers the terms of purchase and licensing agreements between many of its libraries and the vendors and publishers of databases and electronic resources, through which most scholars today obtain access to primary source materials. CRL publishes critical reviews of major newspaper databases on its online eDesiderata platform.

Newspaper Collection Data and Analysis: the ICON Database

One of the major resources CRL provides for librarians is the ICON Database, produced under the auspices of the International Coalition on Newspapers program. ICON is a registry of information on the hard copy, microform, and digitized holdings of US and foreign newspapers held by several major US research libraries. The largest single repository of information about such holdings, ICON currently contains records for over 172,000 newspaper titles, published in 51 US jurisdictions, 9 Canadian provinces, and 159 other nations. The publication dates of these holdings range from 1649 to 2014. At present, ICON lists over 40 million issues of these titles represented in the print and microform newspaper holdings of CRL and several major US research libraries, the extensive print newspaper holdings of the American Antiquarian Society, and digitized newspapers in two databases considered by CRL to be trustworthy and persistent: LC’s Chronicling America database and the Readex World Newspaper Archive.

For each newspaper title, the ICON database displays:

  • Publishing history of the title, on an issue-by-issue basis, which is generated using an algorithm developed by CRL from sampling, extrapolation, and hands-on data generated as a by-product of the digitization process, microfilming process, and/or intensive shelf-reading.
  • Names and characteristics of repositories/databases holding the title, and any or all of the formats in which the title is held.
  • Holdings of the accredited repositories in microform and paper down to the issue level, obtained through CRL harvesting or publisher direct submission. CRL harvests this information from the Chronicling America website using an API, and obtains direct submission from Readex and the American Antiquarian Society.
  • The contents of the major “trustworthy” databases by issue level, obtained through direct submission from the publisher and through CRL harvesting.
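As a rough illustration of the kind of record these four elements imply, here is a minimal sketch in Python. The class names, field names, and sample values are all hypothetical and are not drawn from the ICON schema; the sketch only shows how an issue-by-issue publishing history and per-repository holdings might sit together in one structure.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Holding:
    """One repository's or database's holdings of a title (hypothetical shape)."""
    repository: str
    fmt: str  # e.g. "hard copy", "microform", or "digital"
    issue_dates: frozenset  # dates of the issues actually held


@dataclass
class TitleRecord:
    """A newspaper title with its publishing history and known holdings."""
    title: str
    published: list  # issue-by-issue publishing history, as dates
    holdings: list = field(default_factory=list)

    def formats_for(self, issue_date):
        """Formats in which at least one repository holds the given issue."""
        return sorted({h.fmt for h in self.holdings if issue_date in h.issue_dates})

    def unheld_issues(self):
        """Published issues that no listed repository holds in any format."""
        held = set().union(*(h.issue_dates for h in self.holdings))
        return [d for d in self.published if d not in held]


# Invented sample data for illustration only.
record = TitleRecord(
    title="Example Gazette",
    published=[date(1900, 1, 1), date(1900, 1, 2), date(1900, 1, 3)],
    holdings=[
        Holding("Library A", "microform",
                frozenset({date(1900, 1, 1), date(1900, 1, 2)})),
        Holding("Database B", "digital", frozenset({date(1900, 1, 1)})),
    ],
)
formats_first_issue = record.formats_for(date(1900, 1, 1))
missing = record.unheld_issues()
```

In this invented example the first issue is held in both digital and microform formats, while the issue of 3 January 1900 is held nowhere.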

We envision three practical uses of ICON data, namely to support the following decision scenarios.

  1. Strategic digitization
    For libraries and publishers investing heavily in digitizing newspaper content, there is currently no single source of information on newspapers that have already been digitized. Reliable, granular information on the location of complete print and microform holdings, which might serve as potential source materials for digitization, is also scarce. A comparison of the contents of the Chronicling America database and the vast collection of early US newspapers held by the American Antiquarian Society, using ICON, reveals what a small portion of the latter have been digitized.
  2. Collection management
    ICON data can confirm the existence of original and reformatted copies, indicating not only the completeness of a library’s holdings but also the conditions under which those holdings are maintained. Such information can be relevant to library decisions on whether to retain locally held original copies of a given title, or whether to invest scarce resources in their conservation, attempt to fill gaps, or implement better security, environmental conditions, or controls on handling.
  3. Investment in database purchases and acquisitions
    Gaps in coverage are a common flaw in databases of digitized newspapers. ICON’s increasingly reliable inventory of the editions of many titles actually published, in the form of an issue-by-issue publishing history, provides a frame of reference for judging the completeness of a given newspaper database.
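The third scenario, judging a database against a publishing history, can be sketched as a simple comparison of two sets of issue dates. The following Python is an illustrative assumption about how such a check might be computed, not a description of ICON's own analysis; the function name and the sample dates are invented.

```python
from datetime import date, timedelta


def find_gaps(published, held):
    """Compare held issues against a publishing history.

    Returns (coverage, gaps): coverage is the share of published issues
    that are held; gaps is a list of (first_missing, last_missing) runs
    of consecutive missing issues, in publication order.
    """
    published_sorted = sorted(set(published))
    held_set = set(held)
    gaps = []
    run_start = None
    last_missing = None
    for issue in published_sorted:
        if issue not in held_set:
            if run_start is None:
                run_start = issue
            last_missing = issue
        elif run_start is not None:
            gaps.append((run_start, last_missing))
            run_start = None
    if run_start is not None:
        gaps.append((run_start, last_missing))
    held_count = sum(1 for issue in published_sorted if issue in held_set)
    coverage = held_count / len(published_sorted) if published_sorted else 0.0
    return coverage, gaps


# Invented example: a daily title published 1-10 January 1900,
# of which a database holds six issues.
published = [date(1900, 1, 1) + timedelta(days=n) for n in range(10)]
held = [date(1900, 1, d) for d in (1, 2, 3, 6, 7, 10)]
coverage, gaps = find_gaps(published, held)
```

Because gaps are measured against the issues actually published rather than against the calendar, a weekly title or a title that skipped holidays would not be reported as incomplete merely for lacking issues that never existed.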

Historically, this kind of information has not been available. Bibliographic and holdings information for newspapers, where available, tends to be incomplete, general, or unreliable. Holdings reported to utilities like OCLC are often described in summary terms, without noting gaps. Even this information can sometimes be out of date, largely because of the labor required in obtaining such data, not to mention the scarcity of information on the publishing histories of newspaper titles.

However, the widespread mass digitization of newspaper collections is now creating new opportunities for capturing this data: the digitization process creates issue-by-issue, article-by-article, and even page-by-page data. If we capture that data, as CRL is doing, it becomes possible to describe newspaper titles with unprecedented detail and granularity.

Funding from The Andrew W. Mellon Foundation is now enabling CRL to enhance the ICON database to handle data at a more granular level, and to expand our ability to collect data on newspaper titles held and digitized by other major world libraries and by key commercial news database publishers.

During 2014–15, CRL is creating the necessary protocols and an ingest pathway for automating the importing of issue-level information into ICON. These will enable new digitization projects to contribute to the ICON database data in various common metadata schemas and packages. CRL is developing software to parse and normalize issue-level data on newspaper holdings in hard copy, microform, and digital formats, and to enrich publication histories based on existing ICON data.
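To illustrate what parsing and normalizing issue-level data can involve, the sketch below converts records with differing date styles and format labels into one uniform shape. It is a minimal example under stated assumptions: the input field names (title_id, date, format, holder), the list of date patterns, and the format vocabulary are hypothetical and do not represent the software CRL is developing.

```python
from datetime import datetime

# Date patterns a contributor might use (an assumed, non-exhaustive list).
DATE_FORMATS = ("%Y-%m-%d", "%Y%m%d", "%d %B %Y", "%B %d, %Y")

# Map assumed contributor vocabulary onto three normalized format labels.
FORMAT_ALIASES = {
    "print": "hard copy",
    "paper": "hard copy",
    "hard copy": "hard copy",
    "microfilm": "microform",
    "microfiche": "microform",
    "microform": "microform",
    "digital": "digital",
    "online": "digital",
}


def parse_issue_date(text):
    """Try each known pattern in turn and return a date object."""
    cleaned = text.strip()
    for pattern in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, pattern).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized issue date: {text!r}")


def normalize_issue(raw):
    """Return one issue-level record in a uniform shape."""
    return {
        "title_id": raw["title_id"].strip(),
        "issue_date": parse_issue_date(raw["date"]).isoformat(),
        "format": FORMAT_ALIASES[raw["format"].strip().lower()],
        "holder": raw["holder"].strip(),
    }


# Invented sample record for illustration only.
normalized = normalize_issue({
    "title_id": " example-title-1 ",
    "date": "January 5, 1900",
    "format": "Microfilm",
    "holder": "Example Library",
})
```

A record with an unknown date style or format label raises an error here rather than being guessed at, which seems the safer default when the aim is a reliable issue-level inventory.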

Challenges in Securing the Metadata

CRL faces real challenges in obtaining the detailed metadata necessary to fuel ICON analysis. The Library of Congress exposes Chronicling America metadata to open harvesting through an API, and we hope to convince other major US and European libraries to follow suit. The capability to create such metadata is already present in the widely used CCS docWorks software and in other newspaper digitization applications. We hope that exposure of this data will become the norm in library digitization.
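Once issue-level metadata is exposed for harvesting, extracting a list of issue dates from it is straightforward. The sketch below assumes a JSON payload for a single title that carries an "issues" list whose entries have a "date_issued" field in ISO form; that shape, the sample identifier, and the sample URLs are assumptions made for illustration, and the current API documentation should be consulted for the actual structure. Fetching the payload over the network is deliberately left out.

```python
import json
from datetime import date


def issue_dates_from_payload(payload_text):
    """Extract the distinct issue dates from a title-level JSON payload.

    Assumes (hypothetically) a top-level "issues" list whose entries
    each have a "date_issued" string such as "1900-01-05".
    """
    payload = json.loads(payload_text)
    dates = set()
    for issue in payload.get("issues", []):
        raw = issue.get("date_issued")
        if raw:
            dates.add(date.fromisoformat(raw))
    return sorted(dates)


# Invented payload: two editions on one date collapse to a single issue date.
sample_payload = json.dumps({
    "lccn": "sn00000000",
    "name": "Example Gazette",
    "issues": [
        {"date_issued": "1900-01-02", "url": "https://example.org/issue/2"},
        {"date_issued": "1900-01-01", "url": "https://example.org/issue/1"},
        {"date_issued": "1900-01-02", "url": "https://example.org/issue/2b"},
    ],
})
harvested_dates = issue_dates_from_payload(sample_payload)
```

A list of dates produced this way could then be compared with a title's publishing history to show which issues a given database actually contains.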

The cooperation of other electronic publishers will also be critical to the usefulness of ICON, and that cooperation is not yet assured. Several electronic publishers have already agreed to submit or expose for ongoing CRL harvesting the issue-level bibliographic and descriptive metadata for titles in their existing databases and digitization pipelines. Readex provides CRL metadata on the World Newspaper Archive databases as a condition of its cooperative agreement with CRL. As of October 2014, ProQuest has tested delivery of metadata for one long-running title (Times of India), and is considering additional contributions. Gale and East View have agreed to explore submission of data but we have not yet seen results.

CRL maintains that exposure of metadata at the issue level should be considered by libraries a basic prerequisite of transparency and thus trustworthiness in commercial databases. CRL will prevail upon the other major national site-licensing consortia, such as JISC (UK), the Deutsche Forschungsgemeinschaft (DFG, Germany), and CRKN (Canada) to help us convince publishers that making their metadata harvestable by CRL or submitting it to ICON is in their best interest.

Toward a Coordinated, Rational Strategy

With support from The Andrew W. Mellon Foundation, CRL will also work with several major US, UK, Canadian, and European organizations to make future newspaper preservation and digitization more systematic and strategic. In early 2015, using ICON data, CRL will conduct a comparative analysis of the coverage of world newspapers by the major digitization efforts to date. That analysis will examine and evaluate the major “trustworthy” databases, and identify significant weaknesses, gaps, and areas of overlap and duplication. The analysis will identify, by title and country of origin, many newspapers that are not yet digitized and are at intrinsically high risk (e.g., titles that are neither digitized, nor preserved in microform, nor widely held; titles historically prone to vandalism or theft; titles published during eras of highly acidic paper).

The findings of the analysis will then be shared with representatives of the major actors in newspaper digitization: the Library of Congress and the National Endowment for the Humanities, JISC, the Europeana Newspapers partnership, the DFG, interested national libraries, and the major database publishers. The findings will be the basis for deliberations at an international “summit” on newspaper archiving and digitization that CRL will convene in conjunction with the meeting of the IFLA News Media Section in April 2015. There, representatives of national and academic libraries, consortia, electronic publishers, and others will weigh the findings of the ICON analysis and their implications for further mass digitization of newspapers.

The summit will also be an opportunity to decide on acceptable and achievable norms and protocols for sharing data about newspaper digitization projects, and perhaps even to outline a sustainable and mutually advantageous “division of labor” among commercial publishers, national libraries, and major library consortia for the future digitization of international newspapers.

A common agenda for newspaper digitization and preservation that is rational, strategic, and achievable would create much-needed clarity around the mass-digitization undertaken by libraries and publishers. It would minimize duplication of library and publisher investment, and ultimately optimize the usefulness of news databases for scholars.