International Journal of Digital Curation http://www.ijdc.net/ <p>The IJDC publishes research papers, general articles and brief reports on digital curation, research data management and related issues. &nbsp;It complements the International Conference on Digital Curation (IDCC) and includes selected proceedings as Conference Papers.</p> University of Edinburgh en-US International Journal of Digital Curation 1746-8256 <p>Copyright for papers and articles published in this journal is retained by the authors, with first publication rights granted to the University of Edinburgh. It is a condition of publication that authors license their paper or article under a <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International (CC BY 4.0)</a> licence.<br><br><a href="http://creativecommons.org/licenses/by/4.0/" rel="license"><img style="border-width: 0;" src="http://i.creativecommons.org/l/by/2.5/scotland/88x31.png" alt="Creative Commons License"></a></p> Data Management Planning for an Eight-Institution, Multi-Year Research Project http://www.ijdc.net/article/view/799 <p><span style="font-weight: 400;">While data management planning for grant applications has become commonplace alongside articles providing guidance for such plans, examples of data plans as they have been created, implemented, and used for specific projects are only beginning to appear in the scholarly record. This article describes data management planning for an eight-institution, multi-year research project. The project leveraged four data management plans (DMP) in total, one for the funding application and one for each of the three distinct project phases. By understanding researcher roles, development and content of each DMP, team internal and external challenges, and the overall benefits of creating and using the plans, these DMPs provide a demonstration of the utility of this project management tool. </span></p> Kristin A. Briney Abigail Goben Kyle M.L. 
Jones ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-07 2022-09-07 17 1 9 9 10.2218/ijdc.v17i1.799 Reusable, FAIR Humanities Data http://www.ijdc.net/article/view/820 <p class="Abstract">While stakeholders including funding agencies and academic publishers implement more stringent data sharing policies, challenges remain for researchers in the humanities who are increasingly prompted to share their research data.&nbsp;This paper outlines some key challenges of research data sharing in the humanities, and identifies existing work which has been undertaken to explore these challenges. It describes the current landscape regarding publishers’ research data sharing policies, and the impact which strong data policies can have, regardless of discipline.</p> <p class="Abstract">Using Routledge Open Research as a case study, the development of a set of humanities-inclusive Open Data publisher data guidelines is then described. These include practical guidance in relation to data sharing for humanities authors, and a close alignment with the FAIR Data Principles.</p> Rebecca Grant ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-09 2022-09-09 17 1 15 15 10.2218/ijdc.v17i1.820 OpenStack Swift: An Ideal Bit-Level Object Storage System for Digital Preservation http://www.ijdc.net/article/view/782 <p>A bit-level object storage system is a foundational building block of long-term digital preservation (LTDP). To achieve the purposes of LTDP, the system must be able to: preserve the authenticity and integrity of the original digital objects; scale up with dramatically increasing demands for preservation storage; mitigate the impact of hardware obsolescence and software ephemerality; replicate digital objects among distributed data centers at different geographical locations; and to constantly audit and automatically recover from compromised states. 
A realistic and daunting challenge to satisfy these requirements is not only to overcome technological difficulties but also to maintain economic sustainability by implementing and continuously operating such systems in a cost-effective way. In this paper, we present OpenStack Swift, an open-source, mature and widely accepted cloud platform, as a practical and proven solution with a case study at the University of Alberta Library. We emphasize the implementation, application, cost analysis and maintenance of the system, with the purpose of contributing to the community with an exceedingly robust, highly scalable, self-healing and comparatively cost-effective bit-level object storage system for long-term digital preservation.&nbsp;</p> Guanwen Zhang Kenton Good Weiwei Shi ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-10-07 2022-10-07 17 1 19 19 10.2218/ijdc.v17i1.782 Curating for Accessibility http://www.ijdc.net/article/view/837 <p class="AfterHeading12"><span lang="EN-GB">Accessibility of research data to disabled users has received scant attention in literature and practice. In this paper we briefly survey the current state of accessibility for research data and suggest some first steps that repositories should take to make their holdings more accessible. We then describe in depth how those steps were implemented at the Qualitative Data Repository (QDR), a domain repository for qualitative social-science data. The paper discusses accessibility testing and improvements on the repository and its underlying software, changes to the curation process to improve accessibility, as well as efforts to retroactively improve the accessibility of existing collections. We conclude by describing key lessons learned during this process as well as next steps.</span></p> Theresa Anderson Randy D. 
Colón Abigail Goben Sebastian Karcher ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-08-03 2022-08-03 17 1 10 10 10.2218/ijdc.v17i1.837 An Approach for Curating Collections of Historical Documents with the Use of Topic Detection Technologies http://www.ijdc.net/article/view/819 <p class="Abstract" style="margin: 0in -.05pt 5.0pt 0in;"><span lang="EN-GB">Digital curation of materials available in large online repositories is required to enable the reuse of Cultural Heritage resources in specific activities like education or scientific research. </span><span lang="EN-GB">The digitization of such valuable objects is an important task for making them accessible through digital platforms such as Europeana; ensuring the success of transcription campaigns via the Transcribathon platform is therefore highly important for this goal. </span><span lang="EN-GB">Based on impact assessment results, people are more engaged in the transcription process if the content is oriented to specific themes, such as the First World War. Currently, efforts to group related documents into thematic collections are in general hand-crafted and, due to the large ingestion of new material, they are difficult to maintain and update. The current solutions based on text retrieval are not able to support the discovery of related content, since the existing collections are multi-lingual and contain heterogeneous items like postcards, letters, journals, photographs, etc. Technological advances in natural language understanding and in data management have led to the automation of document categorization via automatic topic detection. 
To use existing topic detection technologies on Europeana collections there are several challenges to be addressed: (1) ensure representative and qualitative training data, (2) ensure the quality of the learned topics, and (3) provide efficient and scalable solutions for searching related content based on the automatically detected topics, and for suggesting the most relevant topics on new items. This paper describes each of these challenges and the proposed solutions in more detail, thus offering a novel perspective on how digital curation practices can be enhanced with the help of machine learning technologies.</span></p> Medina Andresel Sergiu Gordea Srdjan Stevanetic Mina Schütz ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-20 2022-09-20 17 1 12 12 10.2218/ijdc.v17i1.819 Synchronic Curation for Assessing Reuse and Integration Fitness of Multiple Data Collections http://www.ijdc.net/article/view/847 <p>Data-driven applications often require using data integrated from different, large, and continuously updated collections. These collections may present gaps, contain overlapping data or conflicting information, or complement each other. Thus, a curation need is to continuously assess whether data from multiple collections are fit for integration and reuse. To assess different large data collections at the same time, we present the Synchronic Curation (SC) framework. SC involves processing steps to map the different collections to a unifying data model that represents research problems in a scientific area. The data model, which includes the collections' provenance and a data dictionary, is implemented in a graph database where collections are continuously ingested and can be queried. SC has a collection analysis and comparison module to track updates, and to identify gaps, changes, and irregularities within and across collections. Assessment results can be accessed through an interactive web-based graph. 
In this paper we introduce SC as an interdisciplinary enterprise, and illustrate its capabilities through its implementation in ASTRIAGraph, a space sustainability knowledge system.</p> Maria Esteva Weijia Xu Nevan Simone Kartik Nagpal Amit Gupta Moriba Jah ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-10-11 2022-10-11 17 1 11 11 10.2218/ijdc.v17i1.847 Building LABDRIVE, a Petabyte Scale, OAIS/ISO 16363 Conformant, Environmentally Sustainable Archive, Tested by Large Scientific Organisations to Preserve their Raw and Processed Data, Software and Documents http://www.ijdc.net/article/view/841 <p>Vast amounts of scientific, cultural, social, business, government and other information are being created every day. There are billions of objects, in a multitude of formats, semantics and associated software. Much, perhaps the majority, of this information is transitory, but there is still an immense amount which should be preserved for the medium and long term – perhaps even indefinitely.</p> <p>Preservation requires that the information continues to be usable, not simply to be printed or displayed. 
Of course, the digital objects (the bits) must be preserved, as must the “metadata” which enables the bits to be understood, which includes the software.</p> <p>Before LABDRIVE, no system could adequately preserve such information, especially in such gigantic volume and variety.&nbsp;</p> <p>In this paper we describe the development of LABDRIVE and its ability to preserve tens or hundreds of petabytes in a way which is conformant to the OAIS Reference Model and capable of being ISO 16363 certified.</p> David Leslie Giaretta Teo Redondo ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-21 2022-09-21 17 1 15 15 10.2218/ijdc.v17i1.841 From Siloed to Reusable http://www.ijdc.net/article/view/827 <p>In the past twenty-five years, cross-institutional communities have come together in the creation and use of open source software and open data standards to build digital collections (Madden, 2012). These librarians, developers, archivists, artists, and researchers recognize that the custom-built architectures and bespoke data structures of earlier digital collections development are unsustainable. Their collaborations have produced now-standard technologies such as Samvera, Fedora, GeoBlacklight, and Islandora 8, as well as RDF and JSON-LD, among other open schemas. A core principle animating these efforts is reusability: data, schemas, and technologies in the open era must be coherent and flexible enough to be reused across multiple digital contexts. The authors of this paper show how reuse guided the migration of the Hopkins Digital Library from an outdated, isolated system to a sustainable, interconnected environment in GeoBlacklight and Islandora, with metadata based in Linked Open Data. 
Three areas of reuse focus this paper: the creation of robust interoperable metadata; the expansion of IIIF functionality to integrate the needs of the Hopkins Geoportal’s users; the development of a broadly re/usable data migration module focused on expanding a diverse community of invested users. In focusing on reusability as an organising principle of digital collections development, this case study shows how one digital curation team produced a platform that meets the changing and specific needs of an individual institution, on the one hand, and participated in and furthered the creative coherence of the open communities supporting the team’s work, on the other.</p> Kathryn Gucer Michelle Janowiecki ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-21 2022-09-21 17 1 10 10 10.2218/ijdc.v17i1.827 Fostering the Adoption of DMP in Small Research Projects through a Collaborative Approach http://www.ijdc.net/article/view/849 <p>In order to promote sound management of research data the European Commission, under the Horizon 2020 framework program, is promoting the adoption of a Data Management Plan (DMP) in research projects. Despite the value of a DMP to make data findable, accessible, interoperable and reusable (FAIR) through time, the development and implementation of DMPs is not yet a common practice in health research. Raising the awareness of researchers in small projects to the benefits of early adoption of a DMP is, therefore, a motivator for others to follow suit. In this paper we describe an approach to engage researchers in the writing of a DMP, in an ongoing project, FrailSurvey, in which researchers are collecting data through a mobile application for self-assessment of fragility. The case study is supported by interviews, a metadata creation session, as well as the validation of recommendations by researchers. 
Alongside the outline of our process, we also describe tools and services that supported the development of the DMP in this small project, particularly since there were no institutional services available to researchers.</p> André Maciel João Aguiar Castro Cristina Ribeiro Marta Almada Luís Midão ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-07 2022-09-07 17 1 14 14 10.2218/ijdc.v17i1.849 Who Writes Scholarly Code? http://www.ijdc.net/article/view/839 <p>This paper presents original research about the behaviours, histories, demographics, and motivations of scholars who code, specifically how they interact with version control systems locally and on the Web. By understanding patrons through multiple lenses – daily productivity habits, motivations, and scholarly needs – librarians and archivists can tailor services for software management, curation, and long-term reuse, raising the possibility of long-term reproducibility of a multitude of scholarship.</p> Sarah Nguyễn Vicky Rampin ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-11-01 2022-11-01 17 1 18 18 10.2218/ijdc.v17i1.839 Automation is Documentation: Functional Documentation of Human-Machine Interaction for Future Software Reuse http://www.ijdc.net/article/view/836 <p class="Abstract">Preserving software and providing access to obsolete software is necessary and will become even more important for work with any kind of born-digital artifact. While the usability and availability of emulation in digital curation and preservation workflows have improved significantly, productive (re)use of preserved obsolete software is a growing concern, due to a lack of (future) operational knowledge. 
In this article we describe solutions to automate and document software usage in such a way that the result is not only instructive but also productive.</p> Jurek Oberhauser Rafael Gieschke Klaus Rechert ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-06 2022-09-06 17 1 11 11 10.2218/ijdc.v17i1.836 DBRepo: a Semantic Digital Repository for Relational Databases http://www.ijdc.net/article/view/825 <p>Data curation is a complex, multi-faceted task. While dedicated data stewards are starting to take care of these activities in close collaboration with researchers for many types of (usually file-based) data in many institutions, this is rarely yet the case for data held in relational databases. Beyond large-scale infrastructures hosting e.g. climate or genome data, researchers usually have to create, build and maintain their database, care about security patches, and feed data into it in order to use it in their research. Data curation, if it happens at all, usually takes place after a project is finished, when data may be exported for digital preservation into file repository systems.</p> <p>We present DBRepo, a semantic digital repository for relational databases in a private cloud setting designed to (1) host research data stored in relational databases right from the beginning of a research project, (2) provide separation of concerns, allowing the researchers to focus on the domain aspects of the data and their work while bringing in experts to handle classic data management tasks, (3) improve findability, accessibility and reusability by offering semantic mapping of metadata attributes, and (4) focus on reproducibility in dynamically evolving data by supporting versioning and precise identification/cite-ability for arbitrary subsets of data.</p> Martin Weise Moritz Staudinger Cornelia Michlits Eva Gergely Kirill Stytsenko Raman Ganguly Andreas Rauber ##submission.copyrightStatement## 
http://creativecommons.org/licenses/by/4.0 2022-09-07 2022-09-07 17 1 11 11 10.2218/ijdc.v17i1.825 OpenCitations: an Open e-Infrastructure to Foster Maximum Reuse of Citation Data http://www.ijdc.net/article/view/818 <p>OpenCitations is an independent not-for-profit infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data through the use of Semantic Web (Linked Data) technologies. OpenCitations collaborates with projects that are part of the Open Science ecosystem and complies with the UNESCO founding principles of Open Science, the I4OC recommendations, and the FAIR data principles that data should be Findable, Accessible, Interoperable and Reusable. Since its data satisfies all the Reuse guidelines provided by FAIR in terms of richness, provenance, usage licenses and domain-relevant community standards, OpenCitations provides an example of a successful open e-infrastructure in which the reusability of data is integral to its mission.</p> Chiara Di Giambattista Ivan Heibi Silvio Peroni David Shotton ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-08-03 2022-08-03 17 1 5 5 10.2218/ijdc.v17i1.818 On the Reusability of Data Cleaning Workflows http://www.ijdc.net/article/view/828 <p>The goal of data cleaning is to make data fit for purpose, i.e., to improve data quality, through updates and data transformations, such that downstream analyses can be conducted and lead to trustworthy results. A transparent and reusable data cleaning workflow can save time and effort through automation, and make subsequent data cleaning on new data less error-prone. However, reusability of data cleaning workflows has received little to no attention in the research community. We identify some challenges and opportunities for reusing data cleaning workflows. 
We present a high-level conceptual model to clarify what we mean by reusability and propose ways to improve reusability along different dimensions. We use the opportunity of presenting at IDCC to invite the community to share their use cases, experiences, and desiderata for the reuse of data cleaning workflows and recipes in order to foster new collaborations and guide future work.</p> Lan Li Bertram Ludäscher ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-27 2022-09-27 17 1 6 6 10.2218/ijdc.v17i1.828 Increasing the Reuse of Data through FAIR-enabling the Certification of Trustworthy Digital Repositories http://www.ijdc.net/article/view/852 <p class="Abstract">The long-term preservation of digital objects, and the means by which they can be reused, are addressed by both the FAIR Data Principles (Findable, Accessible, Interoperable, Reusable) and a number of standards bodies providing Trustworthy Digital Repository (TDR) certification, such as the CoreTrustSeal. Though many of the requirements listed in the <em>Core Trustworthy Data Repositories Requirements 2020–2022 Extended Guidance</em> address the FAIR Data Principles indirectly, there is currently no formal ‘FAIR Certification’ offered by the CoreTrustSeal or other TDR standards bodies. To address this gap the FAIRsFAIR project developed a number of tools and resources that facilitate the assessment of FAIR-enabling practices at the repository level as well as the FAIRness of datasets within them. These include the <em>CoreTrustSeal+FAIRenabling Capability Maturity model</em> (CTS+FAIR CapMat), a FAIR-Enabling <em>Trustworthy Digital Repositories-Capability Maturity Self-Assessment</em> template, and F-UJI, a web-based tool designed to assess the FAIRness of research data objects. The success of such tools and resources ultimately depends upon community uptake. 
This requires a community-wide commitment to develop best practices to increase the reuse of data and to reach consensus on what these practices are.&nbsp; One possible way of achieving community consensus would be through the creation of a network of FAIR-enabling TDRs, as proposed by FAIRsFAIR.</p> Benjamin Jacob Mathers Hervé L’Hours ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-12-01 2022-12-01 17 1 5 5 10.2218/ijdc.v17i1.852 Towards Environmentally Sustainable Long-term Digital Preservation http://www.ijdc.net/article/view/848 <p class="Abstract">ARCHIVER and Pre-Commercial Procurement funding has enabled small to medium enterprises (SMEs) to innovate and deliver new services for EOSC. Within the framework of the <a href="https://www.archiver-project.eu/"><span style="color: windowtext; text-decoration: none; text-underline: none;">ARCHIVER </span></a>pre-commercial procurement tender, between December 2020 and August 2021, three commercial consortia competed to deliver innovative, prototype solutions for long-term data preservation. Two of them were selected to continue with the pilot phase and deliver research-ready solutions for long-term data preservation of research data, therefore filling a gap in the current European Open Science panorama.</p> <p class="Abstract">Digital preservation relies on technological infrastructure (information and communication technology, ICT) that can have environmental impacts. While altering technology usage can reduce the impact of digital preservation practices, this alone is not a strategy for sustainable practice. Moving toward environmentally sustainable digital preservation requires critically examining the motivations and assumptions that shape current practice. 
The use of scalable cloud infrastructures can reduce the environmental impacts of long-term data preservation solutions.</p> Ignacio Peluaga João Fernandes Shreyasvi Natraj ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-10-31 2022-10-31 17 1 6 6 10.2218/ijdc.v17i1.848 Uncommon Commons? Creative Commons Licencing in Horizon 2020 Data Management Plans http://www.ijdc.net/article/view/840 <p class="Abstract" style="margin: 0cm -.05pt 5.0pt 0cm;"><span lang="EN-GB">As policies, good practices and mandates on research data management evolve, more emphasis has been put on the licencing of data, which allows potential re-users to quickly identify what they can do with the data in question. In this paper I analyse a pre-existing collection of 840 Horizon 2020 public data management plans (DMPs) to determine which ones mention Creative Commons licences and, among those that do, which licences are being used. </span></p> <p class="Abstract" style="margin: 0cm -.05pt 5.0pt 0cm;"><span lang="EN-GB">I find that 36% of DMPs mention Creative Commons and that, among those, a number of different approaches towards licencing exist (overall policy per project, licencing decisions per dataset, licencing decisions per partner, licencing decisions per data format, licencing decisions per perceived stakeholder interest), often clad in rather vague language with CC licences being “recommended” or “suggested”. Some DMPs also “kick the can further down the road” by mentioning that “a” CC licence will be used, but not which one. However, among those DMPs that do mention specific CC licences, a clear favourite emerges: the CC-BY licence, which accounts for half of all mentions of a specific licence. 
</span></p> <p class="Abstract" style="margin: 0cm -.05pt 5.0pt 0cm;"><span lang="EN-GB">The fact that 64% of DMPs did not mention Creative Commons at all indicates the need for further training and awareness raising on data management in general and licencing in particular in Horizon Europe. For those DMPs that do mention specific licences, 60% would be compliant with Horizon Europe requirements (CC-BY or CC0). However, it should be carefully monitored whether content similar to the 40% that is currently licenced with non-Horizon Europe compliant licences will in the future move to CC-BY or CC0, or whether such content will simply be kept fully closed by projects (by invoking the “as open as possible, as closed as necessary” principle), which would be an unintended and potentially damaging consequence of the policy. </span></p> Daniel Spichtinger ##submission.copyrightStatement## http://creativecommons.org/licenses/by/4.0 2022-09-20 2022-09-20 17 1 9 9 10.2218/ijdc.v17i1.840