In 2005, as part of its Political Communications Web Archiving investigation, CRL conducted a pilot assessment of Archive-It, a subscription service developed by the Internet Archive to harvest, catalog, manage, and display web-based information. Archive-It allows subscribers to select born-digital content for harvesting on a flexible crawling schedule (from daily crawls to annual, in addition to “on demand”).
CRL issued a report in 2006 on “Middle Eastern Political Parties Web Harvesting and Other Efforts,” assessing the crawling capability and administrative tools of Archive-It in the context of a subject-based harvest of websites of Middle Eastern political parties. The detailed report highlighted successes as well as deficiencies in capture, and discussed broader concerns over the complexity of managing and making discoverable archived content.
Five years later, web archiving continues to pose significant challenges to libraries and archives around the world. National libraries and cultural heritage institutions have coalesced around the International Internet Preservation Consortium (IIPC) to foster the development and use of common tools, techniques, and standards that enable the creation of international archives. The Internet Archive plays a key role in this initiative, providing open source crawler technology, based on IA’s Heritrix web crawler, for further development.
Meanwhile, Archive-It has expanded its subscriber base to more than 175 academic, government, and nonprofit partners, reportedly harvesting more than 3.2 billion URLs contained in 1,687 public collections (as of September 2011). In this sea of content, numerous collections have developed around Middle Eastern and Islamic themes. These range from one-off captures of organizations’ sites (IslamAmerica, Institute for Palestine Studies) to event-based captures of such world events as the Libyan uprising and the Tunisian revolution.
To evaluate how such crawling technologies currently serve research purposes, CRL recently performed an assessment of Archive-It content based on three public collections related to the Middle East. As the samples below demonstrate, a combination of human and technical pitfalls may cause errors in capture, resulting in an “imperfect history” of these events.
CRL Assessment of Archived Web Collections
Partner: New York University, Islamic and Middle Eastern Collection
Crawling Activity: 2008–2011
Sites crawled: approximately 755
This collection of sites (split among multiple collections within Archive-It) features an array of blogs produced by Iranian scholars, politicians, journalists, and the general public. Blogging figures prominently in Iranian youth and opposition movements. While print and broadcast media are tightly controlled by the state, a large percentage of Iran’s young population turns to the Internet and social media for free expression. During the parliamentary election in 2008 and more prominently following the presidential election in 2009, opposition sites were routinely blocked by authorities and many bloggers and journalists were suppressed and detained. Some of the collection’s sites are still active on the web, but many are no longer updated or have since been removed entirely.
As blogs tend to feature less-advanced programming and follow a consistent template, content capture for these sites is reasonably successful. Embedded videos, however, frequently do not work, or else point to live versions of videos hosted by external sites (e.g., YouTube, Al Jazeera). External links were generally not included in the crawl, limiting the functionality of posts that refer to external news items or posts (a common occurrence).
The archived collection includes a number of prominent sites, such as those of Mohammad Ali Abtahi,1 former Vice President and reformist arrested in 2009; and Hossein Derakhshan (aka Hoder),2 widely considered the “father of Persian blogging,” who was arrested on charges of spying for Israel. Missing from the collection are the sites of several other well-publicized bloggers, such as Mohammad Pour Abdullah,3 a student activist sentenced to six years in prison on charges of anti-government publicity, and Omid Reza Mir Sayafi,4 who was similarly detained and later died in prison.
Across the various collections archived by NYU, roughly 755 unique seed URLs were input for crawling (not all submitted sites returned crawl results). The periods of capture vary according to each phase of crawling, but span from early 2008 to the present. Many sites were crawled only once or for a brief period of time, with some pages crawled more intensively over longer periods. Information about selection and frequency methodologies is not publicly available.
The Iranian blogosphere is large (by varying accounts, some 60,000 to 100,000 blogs are updated regularly). Given this scale, the Archive-It collection can be considered representative at best. For the Berkman Center for Internet & Society’s 2008 study “Mapping Iran’s Online Public: Politics and Culture in the Persian Blogosphere,”5 Morningside Analytics tracked over 200,000 Persian language blogs, including 98,875 blogs monitored daily. While the Berkman Center harvest focused on text analysis rather than full-page rendition, it is evident that a combined approach using social network analysis and robust web-crawling technology could feasibly harvest a more complete picture of the blogosphere.
2011 Egyptian Revolution
Partner: American University in Cairo
Crawling Activity: 2011
Sites crawled: 90
This collection of crawled sites relates to the popular uprising in Egypt that began on January 25, 2011. The collection, curated by the American University in Cairo, began its crawl on February 1, capturing events leading up to the resignation of Hosni Mubarak and the ongoing transformation of Egyptian government and society.
According to the Archive-It collection notes, the following categories of sites were crawled:
• Blogs and Twitter Feeds (25 URLs)
• Documentary Projects (5 URLs)
• Memorial Websites (3 URLs)
• News and Media Coverage (40 URLs)
• Photos and Videos (4 URLs)
• Related Websites (13 URLs)
Capture success for news and media coverage varies significantly. News and breaking-event pages frequently captured headline text and related articles, but the photos, videos, and style sheets needed to render the pages as originally displayed were frequently not captured. In some cases, related sections of news sites were included; in others they were not captured.
A small selection of Wikipedia pages on figures such as Hosni Mubarak and Mohamed ElBaradei shows how articles describing current events undergo rapid and successive changes over time. It should be noted, however, that Wikipedia itself maintains revision histories of its pages at a more granular level than a periodic web harvester can capture.7
With a seed list of 90 URLs, this collection represents a small sampling of potentially relevant sites, though the inclusion of particular sites over others leads to questions of selection and objectivity. Contextual information on page selection would be particularly useful in this type of collection.
This collection was assembled by the “Internet Archive Global Events” effort, an initiative sponsored by the Internet Archive that is responsible for capturing event-based collections such as the Jasmine Revolution – Tunisia 2011 and the earthquake in Haiti. According to the Archive-It site, this collection includes blogs, social media, and news sites about Egypt, Yemen, Libya, Sudan, and other countries documenting the tumultuous events in Northern Africa and the Middle East starting in January 2011. With an impressive 5,178 seed URLs crawled, this collection appears to be significant, with content reportedly contributed by partners at the Library of Congress, Bibliothèque nationale de France, the British Library, and Stanford University.
Of the crawled URLs, 4,741 (92%) point to YouTube channels or individual videos, many of which were not properly configured for capture. With Archive-It, videos hosted by YouTube are best crawled one-by-one, and cannot be viewed in the archived page. Documentation on how to access the archived video is not easily discoverable, though Archive-It is currently testing implementation of a linked video page for viewing.
Of the 437 remaining sites, 133 (30%) were individual or organizational Facebook pages, 71 (16%) were readily identified as blog posts, 49 (11%) linked to Twitter feeds, and 35 (8%) linked to not-for-profit sites. The remainder is a mix of news organization pages and portals, memory sites (http://www.iamjan25.com/, http://1000memories.com/egypt/), video collections, discussion boards, and official sites of journalists and public figures.
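The proportions quoted in the two paragraphs above can be re-derived from the reported counts. The following is a minimal Python sketch that uses only the figures given in the text; the variable and function names are illustrative and do not come from Archive-It or from CRL's assessment.

```python
# Counts as reported in the text above; nothing here is newly measured.
TOTAL_SEEDS = 5178
YOUTUBE_SEEDS = 4741

# Categories identified among the non-YouTube seeds.
remaining_by_type = {
    "Facebook pages": 133,
    "blog posts": 71,
    "Twitter feeds": 49,
    "not-for-profit sites": 35,
}


def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


remaining = TOTAL_SEEDS - YOUTUBE_SEEDS
youtube_share = share(YOUTUBE_SEEDS, TOTAL_SEEDS)
shares = {name: share(count, remaining) for name, count in remaining_by_type.items()}

# Seeds left over after the four named categories: the mix of news pages,
# memory sites, video collections, discussion boards, and official sites.
uncategorized = remaining - sum(remaining_by_type.values())

print(remaining, youtube_share, shares, uncategorized)
```

By this arithmetic, the mixed remainder described in the last sentence amounts to 149 of the 437 non-YouTube seeds, or roughly a third.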
As with other collections, many of the crawls suffered capture problems, particularly with video links and Java-based functionality. News links may capture top-level aggregated pages, but links to detailed articles are often not available.
While the collection description includes the topic of “Government,” government pages and sites that represent the perspective of official regimes are notably absent from the collection. With the exception of the page for the Government of Southern Sudan (http://www.goss.org/), no government sites were included. A small amount of overlap occurs between sites selected for this collection and for the “2011 Egyptian Revolution” collection (above). This occurs most frequently with news organization sites, though the period and frequency of crawls in this collection appear much more extensive than in the American University in Cairo collection.
From the research perspective, given the numerous limitations described above, it is difficult to draw firm conclusions on the utility of the archives for historical research. Yet without these efforts, many of the sites would not be available today in any form (much less the reduced form in which they currently appear). The recommendations below are intended to guide library partners in future web-archiving efforts.
Curatorial Challenges: The selection of sites and the determination of crawl frequency remain among the more time-consuming and challenging aspects of web archiving. The number of selected sites that were essentially “uncrawlable” is quite substantial in the case studies above. Seed URLs that define too wide a capture scope (the entire site of the Bibliotheca Alexandrina appears to have been crawled multiple times for the Egyptian Revolution archive) or too narrow a scope (a single page of a multiple-page PDF document) create inconsistencies in the collection focus. Selectors must be further educated in appropriate URL identification and have a priori knowledge of what types of sites cannot be adequately captured.
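One way a selector might screen a seed list for the scope problems described above is to flag seeds by the shape of their URL before submitting them. The Python sketch below is a hypothetical pre-check, not an Archive-It feature: the function name `seed_scope`, the labels it returns, and the extension list are assumptions made for illustration, and the PDF URL is an invented example.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical list of extensions treated as single-document seeds.
SINGLE_FILE_EXTENSIONS = {".pdf", ".doc", ".jpg", ".png"}


def seed_scope(url):
    """Roughly label a seed URL by how much of a site it is likely to cover."""
    path = urlsplit(url).path
    if path in ("", "/"):
        # A bare host name: the crawl may take in the whole site.
        return "site-wide"
    extension = posixpath.splitext(path)[1].lower()
    if extension in SINGLE_FILE_EXTENSIONS:
        # A single file: the crawl may miss the pages that give it context.
        return "single-document"
    return "section"


seeds = [
    "http://www.iamjan25.com/",                # seed cited in the text
    "http://1000memories.com/egypt/",          # seed cited in the text
    "http://example.org/reports/summary.pdf",  # hypothetical example
]
flagged = {url: seed_scope(url) for url in seeds}
print(flagged)
```

A "site-wide" label is not an error in itself, since a small memorial site may well merit a full crawl. The label only marks seeds that a curator may want to review before scheduling.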
Providing contextualized access to the archived collections offers another curatorial challenge. For the three case studies above, there did not appear to be any guides that indicate the scope of the collection, content selection criteria, or links to particularly notable sites or pages. While Archive-It has improved its search functionality (including the ability to search in vernacular scripts) and provides advanced search capabilities, navigation and use of collections in a cross-temporal archive remain difficult.
Administrative Challenges: Checking the quality of capture and appropriateness of the frequency is another consistent challenge in web-archive collections. Certain sites that had not changed in months (or years) were still included in numerous crawl schedules. While Archive-It recommends reviewing crawl reports and browsing archived documents, such quality review appears to be undertaken with varying vigor by particular partners.
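A partner reviewing crawl frequency could, in principle, detect sites that never change by comparing digests of successive captures. The following Python sketch illustrates the idea under simplifying assumptions: the captured page bodies are already available as byte strings in crawl order, and the function names and the `min_captures` threshold are hypothetical rather than part of any Archive-It report.

```python
import hashlib


def distinct_versions(captures):
    """Count distinct page bodies among captures by comparing SHA-256 digests."""
    return len({hashlib.sha256(body).hexdigest() for body in captures})


def is_overcrawled(captures, min_captures=3):
    """Flag a seed captured at least min_captures times with identical content."""
    captures = list(captures)
    return len(captures) >= min_captures and distinct_versions(captures) == 1


# Hypothetical capture histories for two seeds.
static_site = [b"<html>same page</html>"] * 4
active_blog = [b"<html>post 1</html>", b"<html>post 2</html>", b"<html>post 3</html>"]

print(is_overcrawled(static_site))  # True
print(is_overcrawled(active_blog))  # False
```

Exact digests are a blunt instrument. A page that embeds a timestamp or rotating advertisement will look different on every crawl even when its substantive content is unchanged, so a check like this would only catch the most clear-cut cases.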
Technical Challenges: Many of the technical challenges in harvesting are well-known in the web-archiving community. Archive-It provides documentation and tips for its partners through its Help Wiki, and is diligently pursuing technical solutions in response to developments in technology or changes in access to categories of sites.
Perhaps the biggest challenge of web archiving is to capture not only the content of a site but also the full experience of the web at a given point in time. How can libraries and archives reconstruct the experience so that future researchers of the 2011 Egyptian Revolution can study what was happening on the web (the entire web) during the period leading up to and following the events? The Internet Archive is exploring this scenario through the beta “replay version” of the Wayback Machine.8
Libraries need to explore further how researchers, policy makers, and the commercial sector are using web archives, and what types of resources they require. A 2011 report from the Oxford Internet Institute entitled “Web Archives: The Future(s)” suggests that existing web-archive efforts have not seen wide takeup by the research community. Rather, researchers increasingly view the live web as the archive, with data loss “outweighed for the most part by the otherwise huge volume of data that remains on the web at any given time.”9
Understanding user needs can inform libraries’ efforts in archiving content and in layering tools on top of archives. The Oxford report recommends that the “web archiving community needs to connect the resources they are building with the cutting edge tools being developed by computer scientists, researchers, independent developers, and hackers to study the live web.” Social Network Analysis, data visualization, APIs, and linked data are all means of exploring the live web that, if applied to archived web data, can assist researchers’ quest for understanding and interpreting our own imperfect history.
- http://www.webneveshteha.com, accessed September 20, 2011. Archive-It page at http://wayback.archive-it.org/1749/*/http://webneveshteha.com/
- Archive-It page at http://wayback.archive-it.org/1035/*/http://i.hoder.com/
- http://www.feuer17.blogspot.com/, accessed September 20, 2011.
- http://rooznegaar.blogfa.com, no longer accessible. Press at the time pointed to an Internet Archive capture of this site, but this collection is no longer viewable via the Wayback Machine (the archived page now reads “Page cannot be crawled or displayed due to robots.txt”).
- http://cyber.law.harvard.edu/publications/2008/Mapping_Irans_Online_Public, accessed September 20, 2011.
- See, for instance, http://en.wikipedia.org/w/index.php?title=Mohamed_ElBaradei&action=history versus the Archive-It crawl at: http://wayback.archive-it.org/2358/*/http://en.wikipedia.org/wiki/Mohamed_ElBaradei, accessed September 20, 2011.
- The beta version of the Wayback Machine features a calendar to browse a site on a particular date, from which you may be able to visit linked sites crawled on or about the same time period. See http://www.archive.org/web/web.php, accessed September 20, 2011.
- Meyer, Eric T., et al., “Web Archives: The Future(s),” http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1830025, accessed September 20,