Bridging the Gap Between Process and Procedural Provenance for Statistical Data

Authors

  • Timothy McPhillips, University of Illinois at Urbana-Champaign
  • Jack Gager, Metadata Technology North America
  • Thomas Thelen, University of California at Santa Barbara
  • George Charles Alter, University of Michigan, https://orcid.org/0000-0003-3823-4972
  • Jeremy Iverson, Colectica
  • Bertram Ludäscher, University of Illinois at Urbana-Champaign
  • Dan Smith, Colectica, https://orcid.org/0000-0001-7492-0246

DOI:

https://doi.org/10.2218/ijdc.v19i1.947

Abstract

We show how two models of provenance can work together to answer basic questions about data provenance, such as “What computed variables were affected by values of variable X?” Questions like this are central to understanding how data are managed and modified. W3C PROV is a widely used standard for describing the people, activities, and sources that create things like documents, works of art, and data sets. PROV associates processes with inputs and outputs, but it has no way to describe how data are changed within a process, and no language for program components such as mathematical expressions or joins of data tables. Structured Data Transformation Language (SDTL) was designed to provide machine-actionable representations of data transformation commands in statistical analysis software. SDTL describes the inner workings of programs that are black boxes in PROV. However, SDTL is detailed and verbose, and even simple queries can be very complicated in SDTL. Structured Data Transformation History (SDTH) bridges the gap between PROV and SDTL. SDTH extends the PROV data model to answer questions about data preparation and management operations that cannot be answered with PROV alone.
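
To make the abstract's example question concrete, the following is a minimal sketch of how “What computed variables were affected by values of variable X?” reduces to a reachability query once derivations are recorded at the variable level. The variable names, the `derived_from` mapping, and the `affected_variables` helper are all hypothetical and chosen for illustration; they are not taken from the paper and do not reflect the actual SDTH, SDTL, or PROV vocabularies.

```python
from collections import deque

# Hypothetical variable-level derivation edges:
# source variable -> variables computed directly from it.
# These names are illustrative only, not from the paper or from SDTH.
derived_from = {
    "income": ["log_income", "income_group"],
    "log_income": ["income_z"],
    "age": ["age_group"],
    "income_group": [],
    "income_z": [],
    "age_group": [],
}


def affected_variables(graph, start):
    """Return every variable reachable from `start` along derivation edges."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(affected_variables(derived_from, "income")))
# prints ['income_group', 'income_z', 'log_income']
```

A process-level record that only links a script to its input and output files cannot support this traversal, because the edges between individual variables are not recorded.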

Author Biographies

  • Timothy McPhillips, University of Illinois at Urbana-Champaign

    Senior Research Scientist, School of Information Sciences, University of Illinois at Urbana-Champaign; Affiliate, National Center for Supercomputing Applications (NCSA)

  • Jack Gager, Metadata Technology North America

    President, Metadata Technology North America (MTNA)

  • Thomas Thelen, University of California at Santa Barbara

    Software engineer with interests in mathematics and graph technologies

  • George Charles Alter, University of Michigan

    Research Professor Emeritus, Institute for Social Research, University of Michigan

  • Jeremy Iverson, Colectica

    Co-founder and Partner, Colectica

  • Bertram Ludäscher, University of Illinois at Urbana-Champaign

    Professor and Director, Center for Informatics Research in Science and Scholarship, School of Information Sciences, University of Illinois at Urbana-Champaign

  • Dan Smith, Colectica

    Co-founder and Partner, Colectica

Published

2025-09-08

Section

Research Papers
