Christoph and I had the opportunity to attend the Computational Archival Science workshop at IEEE Big Data 2016 in December and present some of our work with Ian on applying Research Object frameworks to Web Archives Research. I’ll be writing a few posts here to give an overview of the presentations and discussion at the workshop.
The big question that arose is: what is Computational Archival Science? ‘Archival science’ can be understood as the study of archival practice and the development of systematic approaches or theory to inform that practice (and it should be noted that some prefer the phrase ‘archival theory’ to ‘archival science’). One interpretation of computational archival science, then, is the study, development, or application of computational methods and techniques in archival practice – crucial when managing digital records at scale. In his overview of the archival profession in the past century, Terry Cook describes his focus on “the twin pillars of the archival profession, appraisal and arrangement/description, as these have been affected by changes in cultures, media and technology” (“What is Past is Prologue”, 1997), and I think computational approaches to appraisal and to arrangement and description (understood expansively) are themes that came up in much of the work presented at the workshop.
Computational Approaches to Appraisal
Two papers addressed aspects of appraisal decisions and the process of ingesting material into archives. In “Appraising Digital Archives with Archivematica”, Shallcross presented work with the Bentley Historical Library to develop new functionality and workflows as part of the ArchivesSpace-Archivematica-DSpace Integration Project. The paper described the development of the Appraisal and Arrangement tab in the Archivematica dashboard, which enables “users to explore and characterize content, identify sensitive data, and preview files to understand the information therein” by viewing directory structures, characterizing file formats, previewing content, and tagging, all within the system.

Taking a different approach, Ruizhu Huang presented on behalf of Xu et al. with “Content-based Comparison for Collections Identification”, which focused on comparing scientific research datasets from genomics. The identity of these complex datasets may not be clear as they evolve and change over time, and the paper proposes a computational framework for comparing datasets across repositories for similar or differing content and metadata.
A common theme for appraisal at scale, then, is addressing the challenge of first knowing what you have. This came up in Huang’s presentation, where it was noted that ‘metadata are not enough’ to determine the unique identity of datasets. Developing methods to address questions of identity and characterization seems to be the first step towards assessing and determining the value of archival materials.
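To make the contrast concrete, here is a minimal sketch of what content-based (rather than metadata-based) comparison can look like: hashing file contents and comparing the digest sets. This is my own illustration, not the framework from the Xu et al. paper, and the directory names are hypothetical.

```python
# A minimal sketch of content-based comparison: hash every file in two
# dataset directories and compare the digest sets. "dataset_a" and
# "dataset_b" are hypothetical local copies of the collections.
import hashlib
from pathlib import Path

def content_digests(root):
    """Map each file's SHA-256 digest to its path relative to root."""
    digests = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[digest] = path.relative_to(root)
    return digests

a = content_digests("dataset_a")
b = content_digests("dataset_b")

shared = set(a) & set(b)   # identical content, wherever it lives
only_a = set(a) - set(b)   # content unique to dataset A
print(f"{len(shared)} shared files, {len(only_a)} unique to A")
```

Even this toy version shows why content comparison can detect identity where metadata comparison fails: two files with different names, dates, or locations still produce the same digest if their bytes match.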
Computational Approaches to Arrangement and Description
Three presentations centered on computational approaches to access, and the challenges of arrangement and description at scale.
Hengchen et al.’s “Exploring archives with probabilistic models” identified the lack of metadata ‘available on a file and document level’ as a core problem and presented an approach using topic modelling. In many ways, topic modelling algorithmically arranges or describes texts, finding connections amongst and between them – but while this is a start to providing access or entrée to a collection, it is only an early step in the archivist’s analytical process (discussed, for example, by Jennifer Meehan in “Making the Leap from the Parts to the Whole”, 2009). Their project adds another layer of complexity, since they deal with translations of the same documents and apply topic modelling across languages.
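For readers unfamiliar with topic modelling, a toy sketch of the general technique might look like the following. This uses gensim’s LDA implementation on made-up token lists; it is not the authors’ cross-lingual pipeline.

```python
# A toy illustration of topic modelling over archival documents using
# gensim's LDA implementation (not the Hengchen et al. pipeline).
from gensim import corpora, models

# In practice these would be tokenized full texts from the collection;
# the documents here are invented for illustration.
documents = [
    ["minutes", "council", "budget", "vote"],
    ["budget", "finance", "report", "vote"],
    ["letter", "family", "travel", "health"],
    ["travel", "diary", "family", "letter"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit a small LDA model; each topic is a weighted list of words that
# can serve as a rough, machine-generated description of the documents.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

The output topics are exactly the kind of machine-generated entrée described above: useful for orienting a researcher in an undescribed collection, but no substitute for the archivist’s own analysis.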
Along similar lines, “Opening Up Dark Digital Archives Through The Use of Analytics to Identify Sensitive Content” by Baron & Borden suggests analytical approaches for ‘triaging’ government electronic records to redact personally identifiable information (PII). They note that some sensitive information can be found by pattern matching (like SSNs), but other information is contextual, both in text and in other forms like images. They call for a research agenda to explore and develop these analytics for isolating sensitive information, beginning with test sets of records, like presidential emails, that are already in the archives. Otherwise, they warn, the wealth of government records will remain dark and inaccessible to the public.
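The distinction Baron & Borden draw is easy to see in code. A structured identifier like a US Social Security number yields to a simple regular expression, while the contextual, relational detail in the same sentence does not. This is my own minimal illustration, not a tool from the paper.

```python
# A sketch of the pattern-matching side of PII triage: structured
# identifiers like US Social Security numbers can be caught with a
# regular expression, while contextual PII cannot.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def flag_ssns(text):
    """Return spans of candidate SSNs for human review."""
    return [(m.start(), m.end(), m.group())
            for m in SSN_PATTERN.finditer(text)]

sample = "Applicant SSN 123-45-6789; contact her sister in Ohio."
print(flag_ssns(sample))  # catches the SSN, but not the family detail
```

The regex flags the number instantly, but nothing in this approach notices that “her sister in Ohio” is also identifying information, which is precisely why the authors argue for a broader research agenda on contextual analytics.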
Sonia Ranade presented “Traces Through Time: A Probabilistic Approach to Connected Archival Data” on the Traces Through Time project at the UK National Archives, also drawing attention to how traditional approaches to description are unsustainable at the scale of digital and digitized records. The project instead explores a ‘probabilistic approach’: “The project’s technical approach was one of identifying links through ‘fuzzy’ comparison of attribute values; evaluating confidence based on statistical techniques and supporting data; and making these links available to researchers.” Since OCR errors or illegible handwriting can lead to poor data quality, fields in the database (like name, age, and date) are treated as probabilities, not absolute values. Connections within the collection are then sought out to determine whether, for example, different names in fact refer to the same individual. I found this approach very intriguing; it reminds me of other projects that aim to use computational approaches – which inherently seem to rely upon unambiguous digital values, data, or categories – in ways that challenge the limits of formal knowledge (e.g. Johanna Drucker and Jerome McGann’s program of Speculative Computing). The way the project embraces uncertainty also seems to operationalize some postmodern ideas about knowledge; it connects, for instance, to Heather MacNeil’s account of probability in “Trusting Records in a Postmodern World”, 2001.
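To give a flavour of what ‘fuzzy’ comparison with a confidence score can mean in practice, here is a minimal sketch using only Python’s standard library. The project’s actual techniques are far more sophisticated; this is just my illustration of the underlying idea.

```python
# A minimal sketch of 'fuzzy' attribute comparison with a confidence
# score, in the spirit of (but not taken from) Traces Through Time.
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Names that differ through OCR errors or spelling variants can still
# be linked, with the score kept as a probability-like confidence
# rather than a hard yes/no identity claim.
score = name_similarity("Jonathan Smyth", "Jonathon Smith")
print(f"confidence: {score:.2f}")  # high, but deliberately not 1.0
```

The key design choice is the one the paper describes: the match score is surfaced to researchers as a confidence value rather than collapsed into a binary decision, so uncertainty travels with the link.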
These projects all share an impulse to seek alternatives to older methods of facilitating access, like the finding aid. They also point to a theme of recognizing archives as data, as Ranade notes:
“At The National Archives we are witnessing a transformation in the nature of our collections: from the archive as static boxes of documents, to the archive as fluid, conceptually interconnected data”.
This shift towards data was the topic of a symposium recently hosted by the Library of Congress, “Collections as Data”. But what does viewing documents and records as data actually entail? These projects are first steps to envisioning new modes of data-centric discovery and access services, which were themes further explored in three papers (including ours) on users and user perspectives on archives, to be discussed in the next post.