Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse

Lan Li; Nikolaus Parulian; Bertram Ludäscher

doi:10.2218/ijdc.v16i1.771

Authors

Lan Li University of Illinois, Urbana-Champaign https://orcid.org/0000-0003-4499-4126
Nikolaus Parulian University of Illinois, Urbana-Champaign http://orcid.org/0000-0002-6971-0882
Bertram Ludäscher University of Illinois, Urbana-Champaign https://orcid.org/0000-0001-9140-936X

DOI:

https://doi.org/10.2218/ijdc.v16i1.771

Abstract

Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make it challenging to apply the FAIR principles to data cleaning recipes. We then demonstrate how the transparency and reusability of recipes can be improved by analyzing dataflow dependencies within recipes. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which can then be reused individually as data cleaning modules. We have implemented a prototype of this approach as part of an ongoing project to develop open-source companion tools for OpenRefine.
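The idea of detecting independent subworkflows from column-level dependencies can be illustrated with a minimal sketch. It is not the authors' implementation and does not use OpenRefine's recipe format: the `(name, inputs, outputs)` step representation, the function `detect_modules`, and the example step and column names are all hypothetical. The sketch assumes every step reads or writes at least one column, and it treats two steps as belonging to the same module whenever they are connected through shared columns.

```python
from collections import defaultdict


def detect_modules(recipe):
    """Group recipe steps into independent modules (illustrative sketch).

    `recipe` is a list of (name, inputs, outputs) triples, where inputs and
    outputs are lists of column names. This representation is an assumption
    made for the example. Steps connected through shared columns end up in
    the same module; steps with no column in common, directly or indirectly,
    end up in different modules.
    """
    parent = {}

    def find(col):
        # Union-find lookup with path halving.
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]
            col = parent[col]
        return col

    def union(a, b):
        parent[find(a)] = find(b)

    # Link all columns that a single step reads or writes.
    for _, inputs, outputs in recipe:
        cols = list(inputs) + list(outputs)
        for col in cols[1:]:
            union(cols[0], col)

    # Collect steps by the column group they touch, keeping recipe order.
    modules = defaultdict(list)
    for name, inputs, outputs in recipe:
        cols = list(inputs) + list(outputs)
        modules[find(cols[0])].append(name)
    return sorted(modules.values())


# Hypothetical recipe: three groups of steps that never share a column.
recipe = [
    ("trim_name", ["name"], ["name"]),
    ("upper_name", ["name"], ["name"]),
    ("parse_date", ["date"], ["date"]),
    ("split_address", ["address"], ["city", "zip"]),
    ("trim_city", ["city"], ["city"]),
]

for module in detect_modules(recipe):
    print(module)
```

In this example the name steps, the date step, and the address steps form three separate modules, so each group could in principle be reused on another dataset that has the corresponding columns.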

Keywords: Data Cleaning, Provenance, Workflow Analysis

Published

2022-04-18

Issue

Vol. 16 No. 1 (2021)

Section

Conference Papers

License

Copyright for papers and articles published in this journal is retained by the authors, with first publication rights granted to the University of Edinburgh. It is a condition of publication that authors license their paper or article under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.
