Archiving the Latin American & Caribbean Web: Three U.S. Initiatives


In this article, we consider three programs in the U.S. which are key library-based initiatives for archiving ephemeral, freely accessible web content from Latin America and the Caribbean. The programs we review are: The Library of Congress’s Web Archives (LCWA); Columbia University’s Human Rights Web Archive (HRWA); and the Latin American Government Documents Archive (LAGDA), along with its close affiliate, the Human Rights Documentation Initiative (HRDI), at the University of Texas at Austin. This article describes how these programs approach the sustainable capture of open web content, and the extent to which they succeed in providing archived content useful in the teaching, research, and publication mainstream. More detailed descriptions of each initiative can be found in the CRL report, An Evaluation of Web Archiving Programs in the US Relevant to International and Area Studies (2018).

Web Archiving at the Library of Congress1

History

National libraries around the world have recognized since the 1980s that their collecting responsibilities need to encompass the digital realm, especially where materials historically submitted to them in print form for legal deposit now exist only in digital form. Starting in 1996 some countries, among them the UK, Australia, Sweden, and Denmark, passed legislation to allow or mandate the collecting and preservation of their nation’s digital output. In the United States in the year 2000, Congress established the National Digital Information Infrastructure and Preservation Program (NDIIPP) to develop a national strategy to collect, preserve, and make available to the public significant digital content. At the same time, the Library of Congress established its first digital archiving program, called MINERVA (Mapping the INternet Electronic Resources Virtual Archive), today simply called The Library of Congress Web Archives (LCWA). As its first major project, MINERVA worked with the Internet Archive (IA) to archive the 2000 presidential election. Then, in the wake of the 9/11 terrorist attacks on the Pentagon and the World Trade Center, LC harvested domestic and foreign websites reflecting world reaction to these events, preserving this content before it disappeared. Over 30,000 websites were captured at that time: the September 11, 2001 Web Archive is today LC’s single most visited web archive collection.2

Today, LCWA still contracts with the Internet Archive for crawling services. Uniquely among IA’s hundreds of partners, however, LC does not make its harvested content available through IA’s Archive-It interface or the Wayback Machine; instead, the archived content is loaded on LC servers and is available only through the LC portal. Moreover, much content is available only on the LC premises, analogous to the treatment of deposited print publications. Unlike national libraries in many other countries, LC has never been granted a legal mandate requiring publishing entities and individuals to deposit their digital output, and conversely, it is not legally required to archive websites. This has led to a complex system of permission requests.

As of 2018, there are about 100 event and thematic collections administered by LCWA, with detailed information—though not necessarily public access—provided through the gateway at www.loc.gov/webarchiving, with the actual web archives grouped by subdomain at webarchive.loc.gov. There are currently 51 foreign collections, of which only three have to do directly with Latin America and the Caribbean: the Brazil Cordel Literature Web Archive3, and the archives of the two Brazilian presidential elections of 2010 and 2014, the latter of which was as of this writing not yet officially posted.

LCWA’s most recent annual report indicates the total size of the archive is currently 1.3 petabytes.

Governance and Selection

Policies governing the selection and archiving of the foreign sites (of importance to Latin Americanists and other area studies researchers) are given special attention in a set of “Supplementary Guidelines”4 to LC’s Collections Policy Statements:

  • Foreign websites are collected on a highly selective basis. To avoid duplication of effort, recommenders of international sites should verify that the content is not already being archived and made publicly available by the host country. Exceptions to this policy can be made if there are concerns over the long-term accessibility of a foreign website.

Proposals are being actively encouraged since LC has emphasized the growth of web archiving as part of its overall digital collecting plan. “We recognize that there is a lot on the Internet that is within scope and not being actively archived by anyone, and we currently have the capacity to add additional projects without compromising our core web archiving efforts (federal websites, elections, etc.).”5

On the topic of Twitter and social media, much has been made of the LC Twitter archive, first acquired in 2010 with tweets going back to 2006 and with the charge to include all public tweets going forward.6 The collecting mandate is no longer comprehensive, reflecting many concerns on the part of LC, among them the need to honor deletion requests, but especially the widely varying needs of researchers who want to use the vast amounts of data collected for projects in a multitude of fields and disciplines. Permission must be granted by both LC and Twitter for any such use. For this reason, most researchers seeking to use Twitter data go directly to Twitter itself or to commercial services licensed by Twitter like Dataminr and Gnip, typically to mine feeds for certain topics, opinion research, trends, and other patterns.

Support and Collaboration

Not having the sweeping digital depository mandate of national libraries in other countries, such as Britain, France, and Denmark, LC relies on the willing cooperation of site owners, both domestically and abroad. The Library has a notification and permissions process based on the country of publication and the type or category of the nominated site, and two requests are addressed in email messages that are sent out to most site owners: one for notification or permission to crawl; and another for notification or permission to provide access outside the Library’s premises.

Matters are much more complicated with hosts of websites in foreign countries. With them, crawling permissions are based on U.S. law (since the crawling activity is happening in the United States), but the access permissions are based on the laws of the country that the site is published in. Often explicit permissions are required—and this turns out to be a bigger challenge than almost any other. And yet, ultimately a remarkable level of coverage has been achieved for LC’s archives of foreign elections and of the international response to 9/11.

Use Analysis and User Feedback

Use data for LCWA is collected using the Adobe Marketing Cloud. The number of total visits in 2017 for the archive interface at loc.gov/websites was 180,238, or roughly 500 a day. Recognizing that the value of archived content will grow over time as live websites now being archived disappear, the LCWA Team is primarily focused on building the archive rather than on performing downstream use analysis, at least at this time. This is not unusual across the country.7

Not much is known about specific uses of archived content in the LC Reading Room or elsewhere in the country or world. As an independent investigation suggests, LCWA is only infrequently cited in published research. LC is beginning to experiment with creating data sets that will allow researchers to use the archives in new ways. It will continue to collect expansively, working with partners across the globe, especially through organizations such as the International Internet Preservation Consortium (IIPC).

Web Archiving at Columbia University Libraries: The Human Rights Web Archive8

History

Columbia University Libraries began exploring web archiving in 2008 out of a recognition that freely available websites were an increasingly important but ephemeral research resource that university libraries were not actively collecting. By 2013, Columbia was funding its own Web Resources Collection Program, which includes large thematic web collections in areas such as human rights, historic preservation and urban planning, and New York City religions, in addition to archiving the university’s institutional web domain.

Columbia’s first and still largest collection is the Human Rights Web Archive, a collecting focus inspired in part by a 2007 CRL-cosponsored conference on human rights documentation held at Columbia. Organizationally, it is an initiative of the University Libraries’ Center for Human Rights Documentation and Research (CHRDR).9 It represents an effort to preserve and ensure access to freely available human rights resources created mainly by non-governmental organizations, national human rights institutions, and individuals. Project work on HRWA transitioned to programmatic work in 2010. As of early 2018, the project had collected 15 terabytes of data and had active harvests of about 700 seeds.

Governance and Selection

HRWA is the largest of four thematic web collections being built by CUL’s Web Resources Collection Program (WRCP). A high priority at Columbia is mainstreaming web archiving with all other collecting and archival activity being undertaken by the Libraries. This is reflected in HRWA’s thorough integration with other administrative units of CUL and the campus at large. There is frequent interaction, both formal and informal, between HRWA staff and the university’s faculty and students: an important source of useful intelligence about existing as well as new potential seeds for the web archive.

The value of conjoining traditional collection expertise with the selection and management of seeds at HRWA is made clear in a publication co-authored by Pamela Graham of CHRDR:

  • Knowledge of existing publishing streams (including print) forms a basis for understanding the broader cultural production landscape and traditional modes of dissemination; in turn we can identify publishing that sits outside the mainstream . . . Examples include websites of marginalized social groups or movements, or emerging writers or artists who only disseminate their work online.10

Use Analysis and User Feedback

Google Analytics data from the HRWA Archive-It account show that since tracking began in November 2014, there have been 13,095 sessions, with 49.7% of views from the United States and 50.3% of views from the rest of the world. The Internet Archive’s public collection page statistics for the copy of HRWA archived content added to the general Wayback Machine show dramatically higher use: 4,482,392 views since 2011, or about 650,000 views per year or 1,850 per day. Views are not citations, of course, but these numbers still document great attention paid to HRWA content. The actual impact of the archiving activity on published research and scholarship has been difficult for Columbia’s web archiving staff to assess; it has also not been the highest priority to do so.11 As with the Library of Congress Web Archives, questions of documented use tend to be put aside for the present in favor of creating and enhancing well-curated, technologically robust archives. Use will inevitably rise as live-web versions of archived sites go offline or their content “drifts.”

Challenges and Future Plans

There are, of course, still a host of challenges for Columbia’s HRWA, including technical issues involving the locally developed search interface, and legal issues related to copyright. To expand use, institutional commitment must continue at the current high level. Looking at the big picture nationally, as Graham and Norsworthy do in their forthcoming book chapter, there is a keen sense at Columbia that the “primary obstacles to expanding these activities in libraries are less on the ‘technology’ side and more on the ‘cultural’ side.”

Web Archiving at the University of Texas at Austin: LAGDA and HRDI12

History

The history of the two principal active web archiving projects at the University of Texas at Austin, the Latin American Government Documents Archive (LAGDA) and the Human Rights Documentation Initiative (HRDI), is an integral part of overall library and archive growth at the Teresa Lozano Long Institute of Latin American Studies (LLILAS) Benson Latin American Studies and Collections. LLILAS Benson is one of the world’s most important centers for the study of Latin American history, culture, politics, and society. LLILAS’s interdisciplinary program integrates more than 30 academic departments across the university. The Nettie Lee Benson Latin American Collection is one of the world’s premier repositories of Latin American and U.S. Latina/o materials.

The Benson’s physical collections number over a million volumes, to which are added a wealth of original manuscripts, photographs, and various media related to Mexico, Central and South America, the Caribbean, and Latina/Latino presence in the United States. The creation of LLILAS Benson’s digital collections began in the early 1990s. A website followed in 1994, also almost prehistoric in the history of the Internet. Surely the highest profile digital project of the early years was the engagement of the University of Texas in an international effort to preserve, digitize, and make accessible the Guatemalan National Police Historical Archive Project (Archivo Histórico de la Policía Nacional, or AHPN), launched in 2011, based on more than eighty million pages of documents discovered in an abandoned Guatemala City barracks in 2005.13

Archiving born-digital web content, rather than digitized materials, at LLILAS Benson did not, however, begin as an extension of document digitization, but out of sheer necessity. Benson Library had systematically collected Latin American official government documents, including annual State of the Union reports as well as annual reports from individual government ministries. Beginning in the late 1990s, however, Latin American governments began releasing these documents only in digital form. Initially, the Benson just collected and organized links to these documents, not anticipating either “link rot” or “content drift,” as when a new annual report, for example, replaced the old at the same address. This led to the establishment of LAGDA, which started in 2003 with an investigation and planning grant from The Andrew W. Mellon Foundation and became operational when LLILAS Benson enlisted Archive-It in 2005.14

Today, LAGDA comprises over a million discrete documents/files from approximately 300 ministries and presidencies in 18 Latin American and Caribbean countries. From a preservation perspective, a recent review showed that thousands of documents and speeches, which are available through LAGDA, no longer exist on the live web, including virtually the entire web presence of the Honduran government under Manuel Zelaya. In light of the recent election of Jair Bolsonaro as president of Brazil, LAGDA’s importance as an archive of vulnerable government publications may once again be highlighted.

In addition to LAGDA, LLILAS Benson is also home to several other smaller web archiving projects, including the legacy web archiving projects of LANIC, the Latin American Network Information Center, which, though no longer being actively maintained or updated, remains a serviceable and valuable archive.

The most significant of the other web archiving endeavors relevant to Latin America and the Caribbean actively maintained at the University of Texas is the Human Rights Documentation Initiative, or HRDI, which monitors, crawls, and archives ephemeral materials from the websites of human rights groups around the world. HRDI was founded in 2008, originally to preserve records documenting the genocide in Rwanda. Since then, its mandate has grown, especially regarding Latin America. As to whether there has ever been collaboration between the HRWA at Columbia and the HRDI at the University of Texas, David Bliss at Texas reported only that “an effort was made when compiling the initial seed list to avoid overlap with the HRWA.”

“Post-Custodial” Archiving

In spring 2017, the University of Texas at Austin received a grant from The Andrew W. Mellon Foundation to fund a project titled “Cultivating a Latin American Post-Custodial Archival Praxis.”15 The project focuses on building local capacity in Latin America to preserve vulnerable historical documentation, making the resulting documents digitally accessible. Building on earlier projects supporting the digitization of materials from Nicaragua, El Salvador, and Guatemala, the new grant will support similar post-custodial initiatives with partners in Brazil, Colombia, and Mexico, with an emphasis on documenting underrepresented communities.

LLILAS Benson did not originate the “post-custodial” approach toward partner organizations, but it has embraced this paradigm wholeheartedly.16 As a policy, “post-custodial archiving” resides somewhere between “governance” and “collaboration,” reflecting a shift in archival theory overall as it relates to area studies. Kent Norsworthy summarizes this partnership-based approach as it applies to Latin America and the Caribbean, forming the basis of the LLILAS Benson collaboration philosophy:

  • The field of Latin American studies has been changing for some time, requiring an end to the previous paradigm—benevolent study of our “southern neighbors” from an unreflectively northern perspective—and replacing it with the principles of horizontal collaboration among sister institutions across the hemisphere and critical theoretical engagement from a true diversity of perspectives . . .17

The post-custodial paradigm seeks to break through the colonial and post-colonial approach based on the acquisition or copying of cultural resources from their source communities—another form of resource extraction, in other words. The new paradigm was pioneered by HRDI with digitization projects done jointly with the Kigali Genocide Memorial Centre in Rwanda and the aforementioned Archivo Histórico de la Policía Nacional in Guatemala.18

In the field of web archiving, this approach involves calling on in-country partners to provide seed nominations and ensuring that access to all born-digital archives is open to those partners, while at the same time protecting individuals in source countries from negative consequences of exposing their data and personal stories. It is, therefore, no surprise that the “National Forum on Ethics and Archiving the Web” held at the New Museum in New York on March 22–24, 2018, specifically called for contributions on “recognizing and dismantling digital colonialism and white supremacy in web archives,” as well as “strategies for protecting users: from one another, from surveillance, or from commercial interests.”

In-house Use Analysis and User Feedback

As at Columbia University, the focus of web archiving at LAGDA and HRDI at present is building the archive. Use analysis is writ small: Google Analytics data is not maintained, and there is no archive of student use of the archived resources.

The use data page for LAGDA at the Internet Archive, accessed April 4, 2018, records 3,154,453 views since the creation date of August 3, 2011—on average about 470,000 views per year.19 The Internet Archive also posts data on views of the HRDI, also accessed on April 4, 2018: there have been 1,806,613 views since July 30, 2011, on average 270,000 views per year.20

Challenges and Future Plans

According to LAGDA and HRDI staff members, the biggest challenges their work faces today are not on the technical side: they have to do instead with the availability of sufficient staff and resources to properly curate the 300 active seeds and to add new ones as existing seeds go dead. LLILAS Benson staff recognize the importance of strengthening relationships on campus and developing the feedback loop between researchers and web archivists, both to improve the quality of LAGDA and HRDI and to encourage their more active use. Graham and Norsworthy touch on at least one important part of this challenge in their forthcoming book chapter:

  • . . . anecdotal evidence suggests that researchers are creating their own personal archives of information saved, copied, or captured in some manner from the web. How can those scholar-led archiving efforts inform more systematic and comprehensive collection building carried out by libraries?

Ideally, then, such collaboration would make researcher “scrapbooking” largely unnecessary.

Conclusion

The 2018 CRL report identified measures that archiving efforts can take to become more useful to researchers and therefore more sustainable. These include standardizing metadata across the library/archives divide; developing better finding aids and exposing them to web crawlers; and introducing certification standards to enhance the credibility of archived web content among skeptical scholars. Education and outreach at discipline-specific professional meetings will be useful. Ultimately, interinstitutional and international collaboration will be necessary to leverage the strengths of multiple library, archival, and publishing partners in validating and preserving information distributed on the web.


  1. Discussion of web archiving activities at the Library of Congress is based primarily on publications by the LC Web archiving team led by Abigail Grotke; on a phone interview with Grotke and LC Collection Development Analyst Michael Matos on December 21, 2017; and on a meeting at the Library of Congress on January 24, 2018.
  2. https://www.loc.gov/collections/september-11-2001-web-archive/about-this-collection/.
  3. https://www.loc.gov/collections/brazil-cordelliterature-web-archive/.
  4. http://www.loc.gov/acq/devpol/webarchive.pdf.
  5. Michael Matos, Library of Congress collection development analyst, personal communication.
  6. It should be noted that this project is organizationally entirely separate from LCWA.
  7. As concluded in the 2016 NDSA Survey: “Given the relative youth of many programs, as well as the fractional nature of staffing and other resource limitations, lack of knowledge of downstream use is perhaps not surprising.” Jefferson Bailey, Abigail Grotke, et al. “Web Archiving in the United States: A 2016 Survey.” (February, 2017). http://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf.
  8. Sources for this description include an initial phone interview with project coordinator Pamela Graham on December 15, 2017, and an onsite meeting on January 26, 2018.
  9. For a full description of the project, refer to https://hrwa.cul.columbia.edu/about.
  10. Pamela M. Graham and Kent Norsworthy. “Archiving the Latin American Web: A Call to Action.” In Latin American Collection Concepts: Essays on Libraries, Collaborations and New Approaches, edited by Gayle Williams and Jana Krentz. Jefferson, N.C.: McFarland, 2019 [forthcoming]. Quotations are based on a prepublication version of this chapter provided by the authors.
  11. Results of the author’s own analysis of citations to HRWA content in published research suggest that scholarship is either passing it by or not acknowledging use, at least in any explicit form.
  12. With thanks to the director of LLILAS Benson, Melissa Guy, as well as to David A. Bliss, AJ Johnson, and the now retired Kent Norsworthy of UT Libraries for providing important background for this section. Unless otherwise indicated, statements attributed to them were contained in personal communications.
  13. The AHPN represents “the largest single repository of documents ever made available to human rights investigators.” Around ten million pages were publicly accessible at the time of the launch. Kent Norsworthy, “Digital Resources: LLILAS Benson Latin American Studies and Collections, University of Texas at Austin,” in Oxford Research Encyclopedia of Latin American History, 2016, p. 9. http://latinamericanhistory.oxfordre.com. See also “The Archivo Historico de la Policia Nacional de Guatemala at the University of Texas,” Focus on Global Resources, Winter 2012, 31 (2) https://www.crl.edu/focus/article/7499
  14. The successful proposal to The Andrew W. Mellon Foundation was submitted by CRL, four U.S. universities (New York University, Cornell University, Stanford University, and the University of Texas at Austin), and the Internet Archive. Proposal, final report, and other documents related to the “Political Communications Web Archive Project” can be found at http://www.crl.edu/reports/politicalcommunications-web-archive.
  15. https://legacy.lib.utexas.edu/benson/announcements/university-texas-austin-receives-mellon-foundation-grant-pioneer-archival.
  16. The terms “non-” and “post-custodial” are most clearly articulated in writings and speeches by Canadian archivist Terry Cook going back more than 30 years. See also Melissa Guy, “The ‘Post-Custodial’ Model for Preserving At-Risk Archives in Latin America,” presentation at CRL Global Collections Forum, May 18, 2018. https://www.crl.edu/events/crl-global-resourcescollections-forum-2018.
  17. Former director of the UT Libraries, 2016.
  18. UTL’s Fred Heath, referring to the collaboration which brought the AHPN Digital Archive to Austin, noted: “. . . the cultural heritage of the [Guatemalan] nation will remain in country—a reversal of a century or more of ‘tail lights going north’ with national patrimony and a total volte-face in the way U.S. research universities are viewed by nations to our south. Now we just have to prove ourselves worthy of their trust.” Quoted in Focus on Global Resources, 2012.
  19. https://archive.org/details/ArchiveIt-Collection-176&tab=about.
  20. https://archive.org/details/ArchiveIt-Collection-1475&tab=about.