Capturing Data Provenance from Statistical Software

George Charles Alter; Jack Gager; Pascal Heus; Carson Hunter; Sanda Ionescu; Jeremy Iverson; H.V. Jagadish; Jared Lyle; Alexander Mueller; Sigve Nordgaard; Ornulf Risnes; Dan Smith; Jie Song

doi:10.2218/ijdc.v16i1.763

Authors

George Charles Alter University of Michigan https://orcid.org/0000-0003-3823-4972
Jack Gager Metadata Technology North America Inc.
Pascal Heus Metadata Technology North America Inc.
Carson Hunter Metadata Technology North America Inc.
Sanda Ionescu University of Michigan
Jeremy Iverson Colectica http://orcid.org/0000-0003-3002-9245
H.V. Jagadish University of Michigan
Jared Lyle University of Michigan http://orcid.org/0000-0001-8623-7612
Alexander Mueller University of Michigan
Sigve Nordgaard Norwegian Centre for Research Data
Ornulf Risnes Norwegian Centre for Research Data
Dan Smith Colectica http://orcid.org/0000-0001-7492-0246
Jie Song University of Michigan

Abstract

We have created tools that automate one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. The C²Metadata ("Continuous Capture of Metadata for Statistical Data") Project creates a metadata workflow paralleling the data management process by deriving provenance information from scripts used to manage and transform data. C²Metadata differs from most previous data provenance initiatives by documenting transformations at the variable level rather than describing a sequence of opaque programs. Command scripts for statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogues and codebooks and to create "variable lineages" for auditing software operations. Better data documentation makes research more transparent and expands the discovery and re-use of research data.

