Harmonizing the Metadata Among Diverse Climate Change Datasets
One of the critical problems in the curation of research data is the harmonization of its internal metadata schemata. The value of harmonizing such data is well illustrated by the Berkeley Earth project, which successfully integrated into one metadata schema the raw climate datasets from a wide variety of geographical sources and time periods (spanning 250 years). Doing this enabled climate scientists to calculate a more accurate estimate of the recent changes in Earth’s average land surface temperatures and to ascertain the extent to which climate change is anthropogenic.
This paper surveys some of the approaches that have been taken to the integration of data schemata in general and examines some of the specific metadata features of the source surface temperature datasets that were harmonized by Berkeley Earth. The conclusion drawn from this analysis is that the original source data and the Berkeley Earth common format together provide a promising training set on which to apply machine learning methods for replicating the human data integration process. This paper describes research in progress on a domain-independent approach to the metadata harmonization problem that could be applied to other fields of study and be incorporated into a data portal to enhance the discoverability and reuse of data from a broad range of data sources.
Copyright for papers and articles published in this journal is retained by the authors, with first publication rights granted to the University of Edinburgh. It is a condition of publication that authors license their paper or article under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.