Prof. Cecilia Aragon: “The Hearts and Minds of Data Science”

Please join us at 4pm on September 21, 2017 for a DCI Lecture in BL 728 (Bissell building, 7th floor) by Prof. Cecilia Aragon from the University of Washington.

Abstract:

Extraordinary advances in our ability to acquire and generate data are transforming the fundamental nature of discovery across domains. Much of the research in the field of data science has focused on automated methods of analyzing data such as machine learning and new database techniques. However, the human aspects of data science, including how to maximize scientific creativity and human insight, how to address ethical concerns, and how to account for societal impacts, are vital to the future of data science. Human-centered data science is a necessary part of the success of 21st-century discovery. I will discuss promising research in this area, describe ongoing initiatives at the UW eScience Institute, and speculate upon future directions for data science.

Bio:
Cecilia Aragon is a Professor in the Department of Human Centered Design & Engineering, Senior Data Science Fellow at the eScience Institute, and Director of the Human Centered Data Science Lab at the University of Washington in Seattle, US. She earned her Ph.D. in computer science from UC Berkeley in 2004. Her research focuses on human-centered data science, an emerging field at the intersection of computer-supported cooperative work (CSCW) and the statistical and computational techniques of data science. She has published over 200 papers in the areas of HCI, CSCW, data science, visual analytics, machine learning, and astrophysics. In 2008, she received the Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the US government on outstanding scientists in the early stages of their careers.

 

Lab website: https://depts.washington.edu/hdsl/

Faculty website: http://faculty.washington.edu/aragon

 


Web Archive Analytics Workshop: Archiving and Accessing Ten Years of Political Websites

In association with the DCI lecture on October 29, Prof. Ian Milligan is offering a 2-hour hands-on workshop on web archive visualization on October 30.

 

This workshop uses the Canadian Political Parties and Political Interest Group collection to trace the web archiving workflow from collection development to analytics. It begins with an introduction by Nicholas Worby, Government Information & Statistics Librarian at the University of Toronto’s Robarts Library, to the Archive-It dashboard and the collection development process, showing attendees how web archiving happens from the perspective of a librarian. With Ian Milligan, a professor of digital history from the University of Waterloo, we then move into the process of accessing, downloading, and interpreting web archival data, from the UK Web Archive’s Shine portal (allowing faceted, n-gram style searches) to the warcbase platform for text and link analysis.

All software used will be open source, and will include warcbase, Shine, and Gephi.


Time and place: Friday, October 30, 10am-noon at the Semaphore Demo Room in Robarts Library, University of Toronto (room 1150).

Participation is open and free, but you need to register by emailing christoph.becker@utoronto.ca!


Ian Milligan: The Challenge of Digital Sources in the Web Age: Common Tensions Across Three Web Histories, 1994-2015

The first DCI lecture in Fall 2015 features Prof. Ian Milligan from the University of Waterloo.

Abstract:

The sheer amount of social, cultural, and political information that is generated and, crucially, preserved every day presents exciting new opportunities to historians. A large amount of this information is held in web archives, which contain billions of web pages. Scholars broaching topics dating back to the mid-1990s will find their projects enhanced by web data – military historians can use forum posts by soldiers, social historians can track aspects of everyday life through blogs and comments, political historians can study changing sentiment, tropes, and link structures, and economic historians can explore the rise and fall of business webpages. Yet this tremendous opportunity is tempered to some degree by the sheer challenge of dealing with all that data: we have more information than ever before, but the scale is overwhelming.

We have several common tensions, however, beyond basic ones of having enough storage and computational power to deal with all of this information. I will focus on two. The first is that while historians largely want to work with content, technological limitations push us towards rich metadata. The second is that without basic understanding of the conceptual structure of the web archive, from crawl structure to the biases, we can generate wildly misleading results – a problem for historians with most digitized sources.

In this talk, I explore these tensions as they have played out over three case studies that I have examined: the Internet Archive’s March-December 2011 Wide Web Scrape (WARC files), the 2009 GeoCities end-of-life torrent (a wget-compiled collection of mirrored websites), and the 2005-present Archive-It collections of Canadian political parties, unions, and organizations (WAT files, which contain derivative data). For each archive, I briefly discuss the usage, technical, and ethical challenges that such collections present for historians: problems of too much data, processing time, and the difficulties in applying cutting-edge natural language processing.


Biography:

Ian Milligan is an assistant professor of digital and Canadian history at the University of Waterloo. There, he is principal investigator of the Web Archives for Historical Research group (https://uwaterloo.ca/web-archive-group/), which is supported by an Ontario Early Researcher Award and SSHRC. Milligan serves as a co-editor of the Programming Historian (programminghistorian.org). He has published several articles looking at the impact of born-digital sources on historians and has a forthcoming co-authored book on digital methods, Exploring Big Historical Data: The Historian’s Macroscope, with Imperial College Press. His first book, Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada, appeared in 2014.

The lecture takes place at 16:00-17:30 on Thursday, October 29, 2015, in Room 728 (7th floor) at the iSchool, Bissell Building, 140 St. George Street.

 

Prof. Milligan is also conducting a 2-hour hands-on workshop on web archive visualization on October 30!


Stephen Abrams: Curation Semiotics – Foundational Theory and Practice

Stephen Abrams speaks in the DCI lecture series in March 2015.

The lecture takes place at 16:00-17:30 on Thursday, March 19, in room 728 (7th floor) at the iSchool, Bissell Building, 140 St. George Street.

NOTE: We will broadcast the event on YouTube; see the corresponding event page!

Curation Semiotics: Foundational Theory and Practice

Digital curation is a complex of actors, policies, practices, and technologies that enables meaningful consumer engagement with content of interest across space and time. The UC Curation Center (UC3) at the California Digital Library (CDL) supports a growing roster of innovative curation services for use by scholars across the 10-campus University of California system. However, recent initiatives in the area of research data curation have led to a significant change in UC3’s target audience. While UC3 continues to support its traditional campus stakeholders – librarians, archivists, and curators – it is now also engaging directly with faculty, researchers, and students.

In response, UC3 has embarked on a comprehensive review of its systems and services to ensure that it is meeting its goals most effectively. In doing so, however, a number of seemingly simple yet deceptively difficult questions cropped up almost immediately. What constitutes the full spectrum of scholarly activities for which curation support may be usefully offered? What does “preservation” mean for the new genre of research objects (or indeed, for “traditional” content)? Curation practitioners can draw upon a number of useful frameworks for specific areas of concern, for example, the Open Archival Information System (OAIS), Trustworthy Repositories Audit and Certification (TRAC), and Preservation Metadata: Implementation Strategies (PREMIS). It is not clear, however, how, or indeed whether, their underlying conceptual models cohere into a comprehensive and unified view of the curation domain. For example, many of the concepts at the heart of these standards, perhaps most problematically “digital object”, remain woefully overloaded and under-formalized.

UC3 has developed a new model of the curation domain to provide a comprehensive, self-consistent conceptual foundation for the planning and evaluation of its activities (https://wiki.ucop.edu/display/Curation/Foundations). While drawing from many prior digital library efforts, it also incorporates relevant concepts from other disciplines. Most notably, the model considers digital content in terms of five semiotic dimensions: semantics, syntactics, empirics, pragmatics, and dynamics. This presentation will examine UC3’s role as a curation services provider within a digital-age research university and the use of its domain model in decision-making processes regarding its programmatic mission, services, and initiatives.

 

Stephen Abrams


Biography

Stephen Abrams is the associate director of the University of California Curation Center (UC3) at the California Digital Library (CDL), with responsibility for strategic planning, innovation, and technical oversight of UC3’s services, systems, and collections, including initiatives for repositories, web archiving, data management planning, and data curation. He has served in leadership, governing, and advisory capacities for many digital library projects and organizations, including DataONE, Federal Agencies Digital Guidelines Initiative, International Internet Preservation Consortium, ISO 19005-1 (PDF/A), Jewish Women’s Archive, JHOVE/JHOVE2, PLANETS, and the Unified Digital Format Registry, and on conference program committees for the iPRES, IS&T Archiving, and Open Repositories conferences. His most recent work focuses on economic cost modeling for long-term sustainability of digital library services and curation domain modeling. Prior to joining the CDL in 2008, Mr. Abrams was the digital library program manager at the Harvard University Library. He holds a BA in Mathematics from Boston University and an ALM in the History of Art and Architecture from Harvard University.


Margaret Hedstrom: CyberInfrastructure for Digital Curation – Some Lessons from SEAD

The DCI lecture on October 30 features Prof. Margaret Hedstrom.

Abstract: Countless examples of standards, tools, and shared practices for digital curation exist, but do these puzzle pieces add up to a scalable infrastructure for Big Data? SEAD (Sustainable Environment: Actionable Data) is building a suite of services for end-to-end capture, sharing, analysis, publishing and preservation of data for researchers in sustainability science. Margaret Hedstrom, SEAD PI, will discuss SEAD’s efforts to align the needs and interests of diverse scientists with an evolving infrastructure for data preservation and access in the “long tail” of scientific research.

Margaret Hedstrom is a Professor at the School of Information, University of Michigan. Her current research interests include digital preservation strategies, sharing and reuse of scientific data, and the role of archives in shaping collective memory. She is PI for SEAD (Sustainable Environment: Actionable Data), an $8 million project funded by the US National Science Foundation that is building cyberinfrastructure and developing new practices for data sharing, preservation, access and reuse. She also heads an NSF-sponsored traineeship (IGERT) at the University of Michigan called “Open Data”, in partnership with faculty and doctoral students in bioinformatics, computer science, information science, materials science, and chemical engineering, that is investigating tools and policies for data sharing and data management. She currently chairs a study committee for the National Research Council, National Academy of Sciences, on Digital Curation Workforce and Education Issues.

The lecture took place on October 30 at 4pm at the Faculty of Information, in Room 728 (7th floor) at the iSchool, Bissell Building, 140 St. George Street.

Impressions

Margaret Hedstrom describes SEAD's focus on the long tail in scientific research.
