Bridging the Gap Between Process and Procedural Provenance for Statistical Data

Timothy McPhillips; Jack Gager; Thomas Thelen; George Charles Alter; Jeremy Iverson; Bertram Ludäscher; Dan Smith

doi:10.2218/ijdc.v19i1.947

Authors

Timothy McPhillips University of Illinois at Urbana-Champaign
Jack Gager Metadata Technology North America
Thomas Thelen University of California at Santa Barbara
George Charles Alter University of Michigan https://orcid.org/0000-0003-3823-4972
Jeremy Iverson Colectica
Bertram Ludäscher University of Illinois at Urbana-Champaign
Dan Smith Colectica https://orcid.org/0000-0001-7492-0246

DOI:

https://doi.org/10.2218/ijdc.v19i1.947

Abstract

We show how two models of provenance can work together to answer basic questions about data provenance, such as “What computed variables were affected by values of variable X?”. Questions like this are central to understanding how data are managed and modified. W3C PROV is a widely used standard for describing the people, activities, and sources that create things like documents, works of art, and data sets. PROV associates processes with inputs and outputs, but it has no way to describe how data are changed within a process: it has no language for program components such as mathematical expressions or joins between data tables. Structured Data Transformation Language (SDTL) was designed to provide machine-actionable representations of data transformation commands in statistical analysis software. SDTL describes the inner workings of programs that are black boxes in PROV. However, SDTL is detailed and verbose, and even simple queries can become very complicated when expressed against it. Structured Data Transformation History (SDTH) bridges the gap between PROV and SDTL. SDTH extends the PROV data model to answer questions about data preparation and management operations that PROV alone cannot answer.
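
The kind of question quoted in the abstract can be illustrated with a minimal sketch. It is not the SDTH data model or any published API: it assumes that provenance has already been reduced to a plain mapping from each computed variable to the variables its command read. The variable names, the `DERIVED_FROM` mapping, and the `affected_by` helper are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical derivation records (not real SDTH output): each computed
# variable maps to the variables that the command producing it read.
DERIVED_FROM = {
    "income_total": ["wages", "bonus"],
    "income_log": ["income_total"],
    "age_group": ["age"],
    "risk_score": ["income_log", "age_group"],
}


def affected_by(source, derived_from):
    """Return every computed variable that depends, directly or
    indirectly, on the variable named by `source`."""
    # Invert the mapping so we can walk from an input to its dependents.
    dependents = {}
    for target, inputs in derived_from.items():
        for var in inputs:
            dependents.setdefault(var, set()).add(target)

    # Breadth-first traversal over the inverted dependency graph.
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(affected_by("wages", DERIVED_FROM)))
# prints ['income_log', 'income_total', 'risk_score']
```

The point of the sketch is only that the question becomes a simple graph traversal once variable-level dependencies are recorded, which is information a process-level description with opaque inputs and outputs does not expose.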

Author Biographies

Timothy McPhillips, University of Illinois at Urbana-Champaign

Senior Research Scientist, School of Information Sciences, University of Illinois at Urbana-Champaign; Affiliate, National Center for Supercomputing Applications (NCSA)

Jack Gager, Metadata Technology North America

President, Metadata Technology North America (MTNA)

Thomas Thelen, University of California at Santa Barbara

Software engineer with interests in mathematics and graph technologies

George Charles Alter, University of Michigan

Research Professor Emeritus, Institute for Social Research, University of Michigan

Jeremy Iverson, Colectica

Co-founder and Partner, Colectica

Bertram Ludäscher, University of Illinois at Urbana-Champaign

Professor and Director, Center for Informatics Research in Science and Scholarship, School of Information Sciences, University of Illinois at Urbana-Champaign

Dan Smith, Colectica

Co-founder and Partner, Colectica
