In March of this year, the Center for Research Libraries, working with a number of partner institutions, completed an investigation into the challenges and opportunities relating to the selection, capture, and long-term preservation of Web-based primary resource material. The Political Communications Web Archiving Project (PCWA) was a research and planning initiative funded by a grant from the Andrew W. Mellon Foundation.
The joint planning effort focused on several aspects of Web “harvesting” for archival purposes. One chief focus of the investigation was the curatorial regimes and sustainability issues involved in the selective (thematic) capture of Web-based communications: in this case, sites produced by or for political parties and organizations, protest and social movements, activists, electoral bodies, or non-governmental organizations. Questions addressed included how best to select and “annotate” the Web communications to be archived, what standards should be required of selectors, what “artifactual” characteristics of archived Web content need to be preserved, and how intellectual property restrictions should be addressed. The project also assessed the technical requirements and challenges, informed by the needs and optimal characteristics recommended by the curatorial team.
In examining these issues, the curatorial and technical teams collaborated on a set of investigations using a test bed of data harvested from various regions of the world (Latin America, Southeast Asia, Sub-Saharan Africa, and radical groups in Europe). One case study relating to Africa sheds light on the behavior of producers (the individuals or institutions that create or host sites for consumption) and its potential impact on both the use and the persistence of such sites for future research.
The April 2003 Nigerian presidential elections were the country’s first civilian-run elections in 20 years. Voting for candidates for the Senate and House of Representatives was held on April 12, 2003. A week later, on April 19, voting for the presidential and gubernatorial candidates took place. Incumbent President Olusegun Obasanjo was re-elected with 24.4 million out of 60 million votes. Runner-up General Muhammadu Buhari received 12.1 million votes.
The PCWA project performed a focused Web crawl of 38 sites mounted by Nigerian political parties and candidates around the presidential and gubernatorial elections. Karen Fung, Curator of the African Collection at Stanford University, selected the sites, which included individual candidate pages, party sites (both for the general party platform and those subsites hosted for specific individuals), and overarching election sites such as that of the European Union Election Observation Mission to Nigeria. These sites are listed on Stanford’s “Africa South of the Sahara” Web portal.
The sites were crawled and examined intensively for about one month (April/May), with several subsequent crawls made through December 2003. The sites were harvested by the San Francisco-based Internet Archive (IA), using its own proprietary crawler, and by Cornell University, using the Mercator crawler. The resultant .arc files from IA (each of which packages up to 100 MB of data into a single file) were organized by capture date, with one day’s worth of all 38 sites bundled together in a single .arc file. As the list of captured sites grew, more than one 100 MB .arc file was needed to package one day’s crawl.
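The date-based packaging described above can be illustrated with a short sketch. This is not the Internet Archive's actual tooling: the function name, the record layout, and the example site names are hypothetical, and the sketch splits bundles at site boundaries for simplicity, which is an assumption rather than something the project reported.

```python
from collections import defaultdict

MAX_ARC_BYTES = 100 * 1024 * 1024  # the 100 MB ceiling mentioned in the text


def bundle_by_date(captures, max_bytes=MAX_ARC_BYTES):
    """Group (date, site, size_bytes) records into per-day bundles.

    Returns a dict mapping each capture date to a list of bundles, where
    each bundle is a list of (site, size_bytes) pairs whose combined size
    stays within max_bytes (a single oversized site gets its own bundle).
    """
    by_date = defaultdict(list)
    for date, site, size in captures:
        by_date[date].append((site, size))

    bundles = {}
    for date, items in by_date.items():
        day_bundles, current, current_size = [], [], 0
        for site, size in items:
            # Start a new bundle when adding this site would exceed the cap.
            if current and current_size + size > max_bytes:
                day_bundles.append(current)
                current, current_size = [], 0
            current.append((site, size))
            current_size += size
        if current:
            day_bundles.append(current)
        bundles[date] = day_bundles
    return bundles


# Hypothetical captures: two sites on one day that together exceed 100 MB.
MB = 1024 * 1024
example = [
    ("2003-04-12", "party-a.example", 60 * MB),
    ("2003-04-12", "party-b.example", 50 * MB),
    ("2003-04-19", "party-a.example", 10 * MB),
]
result = bundle_by_date(example)
```

In this made-up example the first day needs two bundles and the second day needs one, which mirrors the observation that a day's crawl eventually outgrew a single file.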
The Nigerian election sites were smaller than those in comparable crawls of other regions. For the 38 sites, the average size was 173 pages per site (compared to 1,433 pages for Southeast Asian sites). This number is inflated, however, by a few exceptionally large sites, such as the 2,345-page abdullahiadamu.com Web site; a review shows that 25 sites had fewer than 100 connected pages. The average size of sites (excluding abdullahiadamu.com) was 2,624,807 bytes (2.5 MB).
Interestingly, the vast majority of sites related to the elections were hosted out-of-country. Of the target sites, 21 were registered in the United States, five in Canada, and five in the U.K., with one each in Sweden and Albania. The reasons are not immediately apparent, but one may speculate that the relatively low prevalence of computers and the limited bandwidth available in Nigeria may be factors. This phenomenon also points to the limitations of domain-specific archiving in capturing all of the content relevant to a national domain.
With regard to the frequency of capture, the project found that most candidates and parties did not actively use their election sites prior to the elections. A survey of sites after capture showed that, of the 16 gubernatorial candidate sites identified, only four showed any content change during the period leading up to the elections. Of the 12 presidential candidate sites, only five showed any content change, with four of those five supporting one candidate, Buhari. Only one of his sites (buhari2003.org) showed changes at every examination.
These crawls also revealed a high rate of disappearance for the target sites over time. Within three months, three of the 38 sites had gone down or no longer contained election content. Within six months, an additional five sites had disappeared or changed content, equating to a 21.6 percent loss. A comparable survey for Latin American sites showed a loss rate of 16 percent in year one, 32 percent in year two, and 56 percent by year three. On the basis of this analysis, it is clear that, on the Political Web, persistence of content will be not only rare but also difficult to predict.
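The cumulative loss arithmetic can be sketched in a few lines. The function name is hypothetical and the choice of denominator is an assumption: measured against all 38 sites, eight losses come to roughly 21.1 percent, while the reported 21.6 percent corresponds to a base of 37 sites.

```python
def cumulative_loss(total_sites, losses_per_period):
    """Return the cumulative percentage of sites lost after each period."""
    lost = 0
    rates = []
    for newly_lost in losses_per_period:
        lost += newly_lost
        rates.append(round(100 * lost / total_sites, 1))
    return rates


# Three sites lost by month three, five more by month six.
rates_of_38 = cumulative_loss(38, [3, 5])  # assumed base: all 38 sites
rates_of_37 = cumulative_loss(37, [3, 5])  # assumed base: 37 sites
```

Either way, roughly one site in five had gone down or dropped its election content within half a year of the election.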
Additional challenges will arise with access to and use of the archived content. Aside from the technical complexity of Web archiving, accurately indexing and presenting content will require significant manual intervention. A technical assessment of the ability to extract meaningful metadata shed light on additional producer behaviors that challenge efficient archiving and description. Of the 38 sites, only five included any descriptive meta tags in the header of their pages (three sites were unavailable at the time of the study). One of the sites for the National Conscience Party included as its descriptive tag: "after years of misrule from the likes of Abacha, Obasanjo, Babangida, and the total collapse of our economy, the NCP is here to rescue the masses..". Meta tags were also frequently misused or misapplied, leading to duplicate tags (such as repeated title tags) on several pages.
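A check of this kind can be approximated with Python's standard `html.parser` module. The sketch below is only an illustration, not the tool the project used: the class name, the function name, and the sample page are hypothetical, and it looks only at `description` and `keywords` meta tags and at repeated `title` elements.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect descriptive meta tags and title elements from one page."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # tag name -> list of content values
        self.titles = []      # text of every <title> element found
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords") and attr.get("content"):
                self.meta.setdefault(name, []).append(attr["content"].strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_head_metadata(html):
    """Summarize the descriptive metadata found in an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "description": parser.meta.get("description", []),
        "keywords": parser.meta.get("keywords", []),
        "titles": [t.strip() for t in parser.titles],
        "duplicate_title": len(parser.titles) > 1,
    }


# Hypothetical page with a repeated title tag and one description tag.
sample_page = """<html><head>
<title>Party Home</title>
<title>Party Home</title>
<meta name="description" content="Example party site">
</head><body><p>Welcome</p></body></html>"""
summary = extract_head_metadata(sample_page)
```

A report like this can flag pages with no descriptive tags or with repeated titles, but it cannot judge whether a description is an objective account of the content, which is where manual review comes in.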
While technical solutions are able to extract a good deal of metadata from and about the various sites, this applies more to structural and administrative metadata and less to objective descriptors of the content itself. On the basis of this survey, it is apparent that most of the descriptive metadata will have to be input manually. Where site-generated metadata are present, they could be incorporated into the record as complementary or supplemental information.
To accomplish this task, the PCWA project recommends that a prospective archive of such material be built around a consortial model with a mix of distributed and centralized activities. Area specialists or trained assistants at various institutions would supply time and expertise to identify and describe sites, determine the frequency of capture, and submit the sites to a centralized harvester and repository for inclusion in the Web archive. These activities and relationships are discussed in depth in the PCWA final report, due to be released in the next few weeks.
For more information on the project, please visit the Political Communications Web Archive Web page.