Web Archiving: Past, Present and Future

or, Yet another Web Archiving Bibliography

We’ve started working with Ian Milligan this Fall as part of the Marshall McLuhan Centenary Fellowship in Digital Sustainability, with research exploring the differences between professionally curated and crowd-sourced web archive collections.

And, as the Internet Archive celebrated 20 years of web archiving this past week (and released some fun and exciting new tools – the beta of Wayback with site search, and the incredible GifCities Geocities Animated GIF Search Engine), it seems like a good opportunity to take a look at the evolution of web archiving initiatives, where we are now, and where we are headed.

Building a community of web archivists

Early web archiving initiatives beginning in the mid-90s included not only the Internet Archive, but also the Australian PANDORA Web Archive, the UK Government Web Archive, and the Library of Congress (see their interactive timeline and map of members here). The formation of the International Internet Preservation Consortium (IIPC) in 2003 began a series of developments in web archiving for cultural and heritage institutions, including the forming of a web archiving community of practice through meetings like the IIPC General Assembly and the Archive-It Partner meetings.

A number of publications have addressed the challenges of ‘how to do web archiving,’ targeted at practitioners and institutions beginning a web archiving program, including:

A summary of this literature is provided in Brenda Reyes Ayala’s Web Archiving Bibliography 2013, sorted into categories: introductions to web archiving, institutional approaches, personal web archiving, challenges, legal issues, practices and standards, digital libraries, quality, and research on web archives.

Connecting web archives researchers

Looking at what’s happened since Reyes Ayala’s bibliography in 2013, ‘Research on Web Archives’ is an area that has drawn increased interest in the past few years, especially in understanding the kinds of researchers that use web archives. In 2010, the JISC-funded project on “Researcher Engagement with Web Archives” (with the Oxford Internet Institute and the Virtual Knowledge Studio at Maastricht University) highlighted the gap between the potential community of researchers using web archives and the actual uses. The final project reports identified challenges and opportunities, and made a number of recommendations for building community, tools and resources, and for developing skills, training and integrating practice. This focus on researchers as users of archives is also seen in Peter Stirling, Philippe Chevallier and Gildas Illien’s 2012 article Web Archives for Researchers: Representations, Expectations and Potential Uses, as well as in the Big UK Domain Data for Arts and Humanities (BUDDAH) project, which began in 2013 and aims to help researchers access and use the UK web domain archive by developing tools and interfaces to support big data analysis of the collection.

Recent conferences and workshops have focused on different aspects of web archives research, and scholars using web archives in their research:

And it’s great to be working with Ian Milligan, who has also participated in many of the above, and plays a central role in other projects connecting researchers: the Archives Unleashed Datathons (March 2016 and June 2016), the Web Archives for Historians site (with Peter Webster), the new Internet Histories journal, and the forthcoming Handbook of Web History.

Beyond journals and conferences, there’s increasing recognition of the value in individual researchers sharing their experiences less formally. Ian’s blog posts are a great resource, describing the iterative process of developing a study and the technical details of the methods used. The BUDDAH case studies (five published in 2015, and another five in 2016) describe the challenges of using web archives data for researchers from various disciplines in the Arts and Humanities, and how their initial research questions and approaches are often not aligned with the scale of big data collections. In the book Digital Research Confidential, two chapters (one by Megan Sapnar Ankerson, and one by Michelle Shumate & Matthew Weber) describe experiences working with web archives and the research methods used, with more detail and candor than is usually found in a methods section.

Where are we going?

Several reports on the state of the art of web archiving have been published in recent years, for example NDSA’s 2013 US Web Archiving Survey and the National Library of New Zealand’s 2015 report on use of the NZ Web Archive by academics. Others have described possible futures of web archiving, such as the Oxford Internet Institute’s 2011 report for IIPC, “Web Archives: The Future(s)”.

Also, while not always included in state of the art reports on web archiving, social media archiving is increasingly recognized as an important source for research, and comes with its own challenges compared to capturing web data with crawlers. The NCSU Social Media Archives Toolkit (2015) includes an environmental scan of current institutional initiatives and of the tools and services available for social media archiving. Sara Day Thomson’s DPC Technology Watch Report on Preserving Social Media (2016) also summarizes strategies and challenges, as well as the tools available and case studies.

There are also collaborative social media archiving projects and communities forming, like Documenting the Now, which is developing a tool for social media collection, engaging with activist communities that use social media, and considering the ethical implications of documenting them. Another tool is the open source Social Feed Manager, developed by George Washington University for researchers and archivists, which facilitates harvesting from Twitter, Flickr, Tumblr, and Weibo. As well, recent discussions among social science researchers have drawn attention to the challenges of social media data sharing, archiving and preservation: Bruns & Weller’s “Twitter as a First Draft of the Present and the Challenges of Preserving It for the Future” (2016), and Weller & Kinder-Kurlanda’s “Uncovering the Challenges in Collection, Sharing and Documentation: The Hidden Data of Social Media Research?” (2015) and “A Manifesto for Data Sharing in Social Media Research” (2016).

A nice summary of the current state of the art is provided in the recent Harvard Library “Web Archiving Environmental Scan” released in March 2016. The report’s findings are summarized as a list of 22 opportunities to address common challenges, and two of these are especially relevant to our work: the call for increased transparency in documenting curatorial decisions, and the call to establish a standard for describing curatorial decisions so that there is “consistent (and machine-actionable) information for researchers”.

Harvard Library’s Web Archiving Environmental Scan – Appendix C: Tools Lifecycle Matrix

The Harvard Library report also includes a “Tools Lifecycle Matrix”, which summarizes the tools available for the different aspects of the archiving, processing and analysis lifecycle (living document here: http://bit.ly/1Zok3WB). I’ve found this useful for keeping track of the wide range of tools available, and hopefully this matrix can continue to be expanded and updated.

Looking ahead from 2016, I think we need to find ways to reconcile the challenges of developing web archives with the inherent interdisciplinarity of web archives work. Web archivists need to understand the practices of researchers using web archives, the communities they come from, and how this impacts the kinds of evidence and documentation that are needed for different forms of scholarship. Also, recognizing that ‘web archivists’ are not a group with a single background, we need a better understanding of the different perspectives that researchers, curators, archivists, librarians, systems designers and developers have on information, and how those perspectives shape the web objects in the archive. I’m also keeping an eye on Jessica Ogden’s blog and the trajectory of her dissertation research studying web archival practice and the production of knowledge of the web. And I hope to address some of these questions in my own PhD work.

Finally, I’ve covered only a few of the developments in web archiving in the short summary here – if there’s something I’ve missed that’s been useful in your work with web archives please let me know!