The previous post on the Computational Archival Science workshop at IEEE Big Data 2016 focused on the papers that discussed appraisal and arrangement & description. Here I’ll discuss three papers, including our own, that focus on the researchers who use the archives.
“Mining and Analysing One Billion Requests to Linguistic Services” by Büchler et al. analyzes log data from the Leipzig Linguistic Service (LLS), a service that provides API access to Natural Language Processing and the Wortschatz database. In total, the log data show 965 million requests made to the service between September 2006 and July 2014. The corpora of texts included are crawled from the web, as described: “Through a self-developed distributed crawling environment, such as Heritrix, these processes are continuously revised and perform downloads of complete top-level domains as well as daily news in more than 80 languages via RSS feeds.” The authors identify which services were used most, in which combinations or chains, and in which languages (the collection includes significantly more material in English and German). The analysis also revealed limitations of the text corpora, as well as issues with text encoding and with the interoperability of services across different programming languages.
“Computational Provenance in DataONE: Implications for Cultural Heritage Institutions” by Robert Sandusky describes how the PROV data model was extended and implemented as ProvONE in the federation of scientific data repositories, the Data Observation Network for Earth (DataONE). The paper argues that data provenance is a key factor in tracking data’s ‘lineage’ or ‘pedigree’, and that research data repository services must address it in order to support trust in data and linking evidence. The paper describes the development of the ProvONE conceptual model by the DataONE Cyberinfrastructure Working Group and its implementation for provenance metadata in scientific workflows and tools like R and MATLAB. Future work will involve interviews with practicing librarians and archivists to see how data provenance might be applied in cultural heritage domains.
Our paper “Understanding Computational Web Archives Research Using Research Objects” similarly takes a concept from computational scientific research and applies it to social science and humanities work with web archives. The Research Object (RO) framework by Bechhofer et al. was developed to aggregate and link the resources used in scientific work in order to support reproducibility of experiments and reuse of component parts like code or data. Their work also focuses on semantic annotations and linked data that allow for services supporting discovery and automated workflows. In contrast, humanities work doesn’t normally aim to achieve reproducibility – however, we argue that since research with large-scale data from web archives requires computational methods, the Research Object concepts can be used to consider certain aspects of the process. Specifically, we take the RO framework as a starting point to consider how web archives research methods can be documented as more systematic practices and how these concepts can serve as a common vocabulary for discussions of trust in the findings. We believe this will help advance the field and make it easier for new researchers to start working with web archives, as well as to reuse data, tools, or analytical techniques.
We use the Research Object framework to analyze three cases of research with web archives, all completed by our co-author Ian Milligan. For example, one case explores the study “An Open-Source Strategy for Documenting Events” by Ian and Nick Ruest, a collaboration between a librarian and a historian. Their project was scoped around the Canadian federal election in October 2015, which elected the 42nd Canadian Parliament, and they collected and analyzed Twitter data using the popular hashtag #elxn42.
The first diagram above shows the key factors of organizational context, as well as research questions, decisions made in designing the study, and how results were disseminated in publications and presentations. The second diagram outlines and connects the steps of the analysis process, including inputs, specific methods and workflows, and results. In this case, these steps include collection of data from Twitter by the researchers themselves and by partners at Library and Archives Canada, as well as the steps to combine, prepare, and clean data. A few different types of analysis were described in their project, and we show one in detail: the text analysis process for creating word clouds for #elxn42 tweets, showing the most frequent words for each day.
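The daily word-frequency step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used by Milligan and Ruest: the function name `top_words_by_day`, the `(date, text)` input format, and the tiny stop-word list are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real study would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "for", "is", "on", "rt"}

# The collection hashtag appears in every tweet, so it is excluded from the counts.
COLLECTION_TAG = "#elxn42"


def top_words_by_day(tweets, n=3):
    """Return {day: [(word, count), ...]} with the n most frequent words per day.

    `tweets` is an iterable of (iso_date_string, text) pairs. This input
    format is an assumption, not the format used in the original study.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        # Lowercase and drop URLs before tokenizing.
        cleaned = re.sub(r"https?://\S+", " ", text.lower())
        # Keep word-like tokens, preserving a leading # or @.
        tokens = re.findall(r"[#@]?\w+", cleaned)
        counts[day].update(
            t for t in tokens if t not in STOP_WORDS and t != COLLECTION_TAG
        )
    return {day: counter.most_common(n) for day, counter in counts.items()}
```

The per-day lists of `(word, count)` pairs are the kind of input a word cloud tool would take. Recording choices such as the stop-word list and the URL filtering is exactly the sort of methodological detail the workflow diagram is meant to make visible.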
We hope to continue this work with case studies of other researchers using web archives, and to see whether this framework can inform the design of infrastructures that support web archives research. We also want to look more closely at the data sources, or what’s happening on the left side of the workflow diagram, and ask: how did the input data come to be?
Interestingly, the questions raised in our work are similar to those articulated in the ProvONE discussion, which lists questions researchers might have about a dataset (Sandusky, p. 3267):
• Who collected this data?
• When was the data collected?
• Where was the data collected?
• How was the data collected?
• At what point in the research lifecycle was the instance of the data presented in the paper instantiated?
• What procedures were used to clean, normalize, or reduce the data?
• Has this dataset been altered since it was deposited, or since the publication of the paper? If so, by whom, when, and why?
• Are there other data to which this data is related? Is the data reported in this paper a subset of a larger dataset?
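One way to see how these questions translate into practice is to treat each as a field in a provenance record. The sketch below is a hypothetical Python structure, not the ProvONE model itself; every class and field name is an assumption chosen to mirror the questions above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProcessingStep:
    """One cleaning, normalizing, or reducing step applied to the data."""
    description: str      # what was done
    performed_by: str     # by whom
    performed_on: date    # when


@dataclass
class DatasetProvenance:
    """Hypothetical record; field names are illustrative, not ProvONE terms."""
    collected_by: str            # who collected this data?
    collected_on: date           # when was it collected?
    collected_where: str         # where was it collected?
    collection_method: str       # how was it collected?
    lifecycle_stage: str         # point in the research lifecycle
    processing: List[ProcessingStep] = field(default_factory=list)
    deposited_on: Optional[date] = None   # when it was deposited, if at all
    derived_from: Optional[str] = None    # larger or related dataset, if any

    def altered_since_deposit(self) -> List[ProcessingStep]:
        """Return the processing steps performed after the deposit date."""
        if self.deposited_on is None:
            return []
        return [s for s in self.processing if s.performed_on > self.deposited_on]
```

Even a flat record like this makes gaps visible: a field that cannot be filled in marks a question about the dataset that nobody documented.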
These questions focus on aspects of a dataset’s evolution – which processing steps the data has undergone, and whether other related versions have been published or used in this or other research. In web archives work, this processing can be done by different people in different roles – not only researchers but also librarians, archivists, and programmers – and it’s useful to recognize that people coming from different backgrounds or disciplines might have different perspectives on, and relationships with, data. The impact of these different disciplinary perspectives, and the possibilities for education programs, will be discussed more in the next and final post on the workshop.