Assessing Migration Risk for Scientific Data Formats
AbstractThe majority of information about science, culture, society, economy and the environment is born digital, yet the underlying technology is subject to rapid obsolescence. One solution to this obsolescence, format migration, is widely practiced and supported by many software packages, yet migration has well known risks. For example, newer formats – even where similar in function – do not generally support all of the features of their predecessors, and, where similar features exist, there may be significant differences of interpretation.
There appears to be a conflict between the wide use of migration and its known risks. In this paper we explore a simple hypothesis – that, where migration paths exist, the majority of data files can be safely migrated leaving only a few that must be handled more carefully – in the context of several scientific data formats that are or were widely used. Our approach is to gather information about potential migration mismatches and, using custom tools, evaluate a large collection of data files for the incidence of these risks. Our results support our initial hypothesis, though with some caveats. Further, we found that writing a tool to identify “risky” format features is considerably easier than writing a migration tool.
Copyright for papers and articles published in this journal is retained by the authors, with first publication rights granted to the University of Edinburgh. It is a condition of publication that authors license their paper or article under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.