Bridging the Gap Between Process and Procedural Provenance for Statistical Data
DOI: https://doi.org/10.2218/ijdc.v19i1.947
Abstract
We show how two models of provenance can work together to answer basic questions about data provenance, such as “What computed variables were affected by values of variable X?”. Questions like this are central to understanding how data are managed and modified. W3C PROV is a widely used standard for describing the people, activities, and sources that create things like documents, works of art, and data sets. PROV associates processes with their inputs and outputs, but it has no way to describe how data are changed within a process, and no vocabulary for program components such as mathematical expressions or joins between data tables. Structured Data Transformation Language (SDTL) was designed to provide machine-actionable representations of data transformation commands in statistical analysis software. SDTL describes the inner workings of programs that are black boxes in PROV. However, SDTL is detailed and verbose, so even simple queries can become very complicated when expressed against it. Structured Data Transformation History (SDTH) bridges the gap between PROV and SDTL. SDTH extends the PROV data model to answer questions about data preparation and management operations that PROV alone cannot answer.
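The question quoted in the abstract can be read as a reachability query over variable-level derivations. The following is a minimal sketch of that idea in plain Python, not the SDTH data model or the authors' implementation. The variable names, the command strings, and the `affected_variables` helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical variable-level derivations: (source variable, derived variable, command).
# These names and commands are illustrative only; they are not taken from SDTL or SDTH.
DERIVATIONS = [
    ("X", "X_squared", "compute X_squared = X ** 2"),
    ("X_squared", "score", "compute score = X_squared + Y"),
    ("Y", "score", "compute score = X_squared + Y"),
    ("Z", "z_flag", "compute z_flag = Z > 0"),
]


def affected_variables(derivations, source):
    """Return every variable derived directly or indirectly from `source`."""
    # Index the derivation edges by their source variable.
    children = defaultdict(set)
    for src, dst, _command in derivations:
        children[src].add(dst)

    # Breadth-first traversal collects all downstream variables.
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected_by_x = sorted(affected_variables(DERIVATIONS, "X"))
print(affected_by_x)  # ['X_squared', 'score']
```

A process-level description that records only a script with an input file and an output file cannot support this traversal, because the edges between individual variables are not recorded at that level.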
License
Copyright (c) 2025 Timothy McPhillips, Jack Gager, Thomas Thelen, George Charles Alter, Jeremy Iverson, Bertram Ludäscher, Dan Smith

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright for papers and articles published in this journal is retained by the authors, with first publication rights granted to the University of Edinburgh. It is a condition of publication that authors license their paper or article under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.
Funding data
National Science Foundation
Grant numbers: ACI-1640575; OAC 1541450