International Journal of Digital Curation <p>The IJDC publishes research papers, general articles and brief reports on digital curation, research data management and related issues. It complements the International Conference on Digital Curation (IDCC) and includes selected proceedings as Conference Papers.</p> University of Edinburgh en-US International Journal of Digital Curation 1746-8256 <p>Copyright for papers and articles published in this journal is retained by the authors, with first publication rights granted to the University of Edinburgh. It is a condition of publication that authors license their paper or article under a <a href="">Creative Commons Attribution 4.0 International (CC BY 4.0)</a> licence.<br><br><a href="" rel="license"><img style="border-width: 0;" src="" alt="Creative Commons License"></a></p> The Data Life Aquatic <p>This paper assesses data consumers’ perspectives on the interoperable and re-usable aspects of the FAIR Data Principles. Taking a domain-specific informatics approach, ten oceanographers were asked to think of a recent search for data and describe their process of discovery, evaluation, and use. The interview schedule, derived from the FAIR Data Principles, included questions about the interoperability and re-usability of data. Through this critical incident technique, findings on data interoperability and re-usability give data curators valuable insights into how real-world users access, evaluate, and use data. Results from this study show that oceanographers utilize tools that make re-use simple, with interoperability seamless within the systems used. 
The processes employed by oceanographers present a good baseline for other domains adopting the FAIR Data Principles.</p> Bradley Wade Bishop Carolyn F Hank Joel T Webster ##submission.copyrightStatement## 2022-01-05 2022-01-05 16 1 10 10 10.2218/ijdc.v16i1.635 Identifying Opportunities for Collective Curation During Archaeological Excavations <p>Archaeological excavations are carried out by interdisciplinary teams that create, manage, and share data as they unearth and analyse material culture. These team-based settings are ripe for collective curation during these data lifecycle stages. However, findings from four excavation sites show that the data interdisciplinary teams create are not well integrated. Knowing this, we recommend opportunities for collective curation to improve use and reuse of the data within and outside of the team.</p> Ixchel Faniel Anne Austin Sarah Whitcher Kansa Eric Kansa Jennifer Jacobs Phoebe France ##submission.copyrightStatement## 2021-04-18 2021-04-18 16 1 17 17 10.2218/ijdc.v16i1.742 Cross-tier Web Programming for Curated Databases: a Case Study <p>Curated databases have become important sources of information across several scientific disciplines, and as the result of manual work of experts, often become important reference works. Features such as provenance tracking, archiving, and data citation are widely regarded as important features for curated databases, but implementing such features is challenging, and small database projects often lack the resources to do so.</p> <p>A scientific database application is not just the relational database itself, but also an ecosystem of web applications to display the data, and applications which allow data curation. 
Supporting advanced curation features requires changing all of these components, and there is currently no way to provide such capabilities in a reusable way.</p> <p>Cross-tier programming languages allow developers to write a web application in a single, uniform language. Consequently, database queries and updates can be written in the same language as the rest of the program, and it should be possible to provide curation features via program transformations. As a step towards this goal, it is important to establish that realistic curated databases can be implemented in a cross-tier programming language.</p> <p>In this article, we describe such a case study: reimplementing the web frontend of a real-world scientific database, the IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb), in the Links cross-tier programming language. We show how programming language features such as language-integrated query simplify the development process and rule out common errors. Through an automated functional correctness evaluation, we show that the Links implementation correctly implements the functionality of the official version. Through a comparative performance evaluation, we show that the Links implementation performs fewer database queries, while the time needed to handle the queries is comparable to the official Java version. 
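As a rough illustration of how a language-integrated query can "rule out common errors", consider a toy query builder (Python used purely for illustration; Links itself checks queries statically at compile time, and the schema, table, and API below are invented, not GtoPdb's): a typoed column name fails in the host language at query-construction time, rather than surfacing later as a runtime SQL error from the database.

```python
# Illustrative sketch only: a toy query builder showing how embedding queries
# in the host language lets errors (e.g. a typoed column name) surface before
# any SQL reaches the database. Schema and API are hypothetical.
SCHEMA = {"ligand": {"ligand_id", "name", "type"}}

def select(table, *columns):
    """Build a SELECT statement, rejecting unknown tables/columns up front."""
    if table not in SCHEMA:
        raise ValueError(f"unknown table: {table}")
    unknown = [c for c in columns if c not in SCHEMA[table]]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    return f"SELECT {', '.join(columns)} FROM {table}"

print(select("ligand", "ligand_id", "name"))  # a valid query string
```

A statically typed cross-tier language goes further than this runtime sketch by rejecting such queries before the program ever runs, but the principle is the same: the query lives inside the language, so the language can check it.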
Furthermore, while there is some overhead to using Links because of its relative immaturity compared to Java, the Links version is usable as a proof-of-concept case study of cross-tier programming for curated databases.</p> Simon Fowler Simon Harding Joanna Sharman James Cheney ##submission.copyrightStatement## 2021-04-19 2021-04-19 16 1 21 21 10.2218/ijdc.v16i1.735 Understanding the Data Management Plan as a Boundary Object through a Multi-stakeholder perspective <div class="WordSection1"> <p class="Abstract">A three-phase Delphi study was used to investigate an emerging community for research data management in Norway and their understanding and application of data management plans (DMPs). The findings reveal visions of what the DMP should be as well as different practice approaches, yet the stakeholders present common goals. This paper discusses the different perspectives on the DMP by applying Star and Griesemer’s theory of boundary objects (Star &amp; Griesemer, 1989). The debate on what the DMP is and the findings presented are relevant to all research communities currently implementing DMP procedures and requirements. The current discussions about DMPs tend to be distant from the active researchers and limited to the needs of funders and institutions rather than to the usefulness for researchers. By analysing the DMP as a boundary object, plastic and adaptable yet with a robust identity (Star &amp; Griesemer, 1989), and by translating between worlds where collaboration on data sharing can take place, we expand the perspectives and include all stakeholders. 
An understanding of the DMP as a boundary object can shift the focus from shaping a DMP which fulfils funders’ requirements to enabling collaboration on data management and sharing across domains using standardised forms.</p> </div> Live Kvale Nils Pharo ##submission.copyrightStatement## 2021-07-04 2021-07-04 16 1 16 16 10.2218/ijdc.v16i1.746 Privacy Impact Assessments for Digital Repositories <p class="AbstractTitle">Trustworthy data repositories ensure the security of their collections. We argue they should also ensure the privacy of researcher and research subject data. We demonstrate the use of a privacy impact assessment (PIA) to evaluate potential privacy risks to researchers using the ICPSR’s Researcher Passport as a case study. We present our workflow and discuss potential privacy risks and mitigations for those risks.</p> <p>[A previous version of this article is available as an <a href="">IDCC2020 Conference Paper</a>]&nbsp;</p> Abraham Mhaidli Libby Hemphill Florian Schaub Jordan Cundiff Andrea K. Thomer ##submission.copyrightStatement## 2022-05-10 2022-05-10 16 1 21 21 10.2218/ijdc.v15i1.753 Doctoral Students' Educational Needs in Research Data Management: Perceived Importance and Current Competencies <div class="WordSection1"> <p class="Abstract">Sound research data management (RDM) competencies are elementary tools used by researchers to ensure integrated, reliable, and re-usable data, and to produce high quality research results. In this study, 35 doctoral students and faculty members were asked to self-rate or rate doctoral students’ current RDM competencies and rate the importance of these competencies. Structured interviews were conducted, using close-ended and open-ended questions, covering research data lifecycle phases such as collection, storing, organization, documentation, processing, analysis, preservation, and data sharing. 
The quantitative analysis of the respondents’ answers indicated a wide gap between doctoral students’ rated/self-rated current competencies and the rated importance of these competencies. In conclusion, two major educational needs were identified in the qualitative analysis of the interviews: to improve and standardize data management planning, including awareness of the intellectual property and agreement issues affecting data processing and sharing; and to improve and standardize data documenting and describing, not only for the researchers themselves but especially for data preservation, sharing, and re-using. Hence, the study informs the development of RDM education for doctoral students.</p> </div> Jukka Rantasaari ##submission.copyrightStatement## 2021-08-09 2021-08-09 16 1 36 36 10.2218/ijdc.v16i1.684 Research Data Management Practices at the University of Namibia: Moving Towards Adoption <p>The management of research data in academic institutions is increasing across most disciplines. In Namibia, the requirement to manage research data, making it available for sharing and preservation and to support research findings, has not yet been mandated. At the University of Namibia (UNAM) there is no institutional research data management (RDM) culture, yet RDM may nevertheless be practiced among its researchers. The extent to which these practices have been adopted is, however, not known. This study investigated the extent of RDM adoption by researchers at UNAM. It identifies current or potential challenges in managing research data, and proposes solutions to some of these challenges that could aid the university as it attempts to encourage the adoption of RDM practices. The investigation used Rogers’ Diffusion of Innovations (DOI) theory, with a focus on the innovation-decision process, as a means to establish where UNAM researchers are in the process of adopting RDM practices. 
The population under study was UNAM faculty members who conduct research as part of their academic duties. Questionnaires were used to gather quantitative data. The study found that some researchers practice RDM to some extent of their own free will, but there are many challenges that hinder these practices. Overall, though, there is a lack of interest in RDM as knowledge of the concept among researchers is relatively low. The study found that most researchers were at the knowledge stage of the innovation-decision process and recommended, among other things, that the university put effort into creating RDM awareness and encouraging data sharing, and that it move forward with infrastructure and policy development so that RDM can be fully adopted by the researchers of the institution.</p> Astridah Njala Samupwa Michelle Kahn ##submission.copyrightStatement## 2022-06-08 2022-06-08 16 1 12 12 10.2218/ijdc.v16i1.769 Futureproofing Visual Effects <p class="Abstract">Digital visual effects (VFX), including computer animation, have become a commonplace feature of contemporary episodic and film production projects. Using various commercial applications and bespoke tools, VFX artists craft digital objects (known as “assets”) to create visual elements such as characters and environments, which are composited together and output as shots.</p> <p class="Abstract">While the shots that make up the finished film or television (TV) episode are maintained and preserved within purpose-built digital asset management systems and repositories by the studios commissioning the projects, the wider VFX network currently has no consistent guidelines or requirements around the digital curation of VFX digital assets and records. 
This includes a lack of guidance about how to effectively futureproof digital VFX and preserve it for the long term.</p> <p class="Abstract">In this paper I provide a case study – a single shot from a 3D animation short film – to illustrate the complexities of digital VFX assets and records and the pipeline environments in which they are generated. I also draw on data collected from interviews with over 20 professional VFX practitioners from award-winning VFX companies, and I undertake socio-technical analysis of VFX using actor-network theory. I explain how high volumes of digital information, rapid technology progression and dependencies on software pose significant preservation challenges.</p> <p>In addition, I outline that by conducting holistic appraisal, selection and disposal activities across their entire digital collections, and by continuing to develop and adopt open formats, the VFX industry has improved capability to preserve first-hand evidence of its work in years to come.</p> Evanthia Samaras ##submission.copyrightStatement## 2021-08-15 2021-08-15 16 1 15 15 10.2218/ijdc.v16i1.689 Assessment, Usability, and Sociocultural Impacts of DataONE <p class="Abstract">DataONE, funded from 2009 to 2019 by the U.S. National Science Foundation, is an early example of a large-scale project that built both a cyberinfrastructure and a culture of data discovery, sharing, and reuse. DataONE used a Working Group model, where a diverse group of participants collaborated on targeted research and development activities to achieve broader project goals. This article summarizes the work carried out by two of DataONE’s working groups: Usability &amp; Assessment (2009-2019) and Sociocultural Issues (2009-2014). 
The activities of these working groups provide a unique longitudinal look at how scientists, librarians, and other key stakeholders engaged in convergence research to identify and analyze practices around research data management through the development of boundary objects, an iterative assessment program, and reflection. Members of the working groups disseminated their findings widely in papers, presentations, and datasets, reaching international audiences through publications in 25 different journals and presentations to over 5,000 people at interdisciplinary venues. The working groups helped inform the DataONE cyberinfrastructure and influenced the evolving data management landscape. By studying working groups over time, the paper also presents lessons learned about the working group model for global large-scale projects that bring together participants from&nbsp;multiple disciplines and communities in convergence research.</p> Robert J. Sandusky Suzie Allard Lynn Baird Leah Cannon Kevin Crowston Amy Forrester Bruce Grant Rachael Hu Robert Olendorf Danielle Pollock Alison Specht Carol Tenopir Rachel Volentine ##submission.copyrightStatement## 2021-04-18 2021-04-18 16 1 48 48 10.2218/ijdc.v16i1.678 Metajelo: a Metadata Package for Journals to Support External Linked Objects <p class="Abstract">We propose a metadata package that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. 
As such, the vocabulary described here complements existing initiatives that specify vocabularies to describe the supplementary materials or the repositories and archives in which they have been deposited. Where possible, it reuses elements of other relevant vocabularies, facilitating coexistence with them. Furthermore, it provides an “at publication” record of reproducibility characteristics of a particular article that has been selected for publication. The proposed metadata package documents the key characteristics that journals care about in the case of supplementary materials that are held by third parties: existence, accessibility, and permanence. It does so in a robust, time-invariant fashion at the time of publication, when the editorial decisions are made. It also allows for better documentation of less accessible (non-public) data by treating it symmetrically from the point of view of the journal, therefore increasing the transparency of what up until now has been very opaque.</p> Lars Vilhuber Carl Lagoze ##submission.copyrightStatement## 2021-10-26 2021-10-26 16 1 22 22 10.2218/ijdc.v16i1.600 Capturing Data Provenance from Statistical Software <p>We have created tools that automate one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. The C<sup>2</sup>Metadata ("Continuous Capture of Metadata for Statistical Data") Project creates a metadata workflow paralleling the data management process by deriving provenance information from scripts used to manage and transform data. C<sup>2</sup>Metadata differs from most previous data provenance initiatives by documenting transformations at the variable level rather than describing a sequence of opaque programs. Command scripts for 
statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogues and codebooks and to create "variable lineages" for auditing software operations. Better data documentation makes research more transparent and expands the discovery and re-use of research data.</p> George Charles Alter Jack Gager Pascal Heus Carson Hunter Sanda Ionescu Jeremy Iverson H.V. Jagadish Jared Lyle Alexander Mueller Sigve Nordgaard Ornulf Risnes Dan Smith Jie Song ##submission.copyrightStatement## 2022-05-18 2022-05-18 16 1 14 14 10.2218/ijdc.v16i1.763 Where There's a Will, There's a Way: In-House Digitization of an Oral History Collection in a Lone-Arranger Situation <p>Analog audio materials present unique preservation and access challenges for even the largest libraries. These challenges are magnified for smaller institutions where budgets, staffing, and equipment limit what can be achieved. Because in-house digital migration of analog audio is often out of reach for smaller institutions, the choice is between finding room in the budget to outsource a project or sitting by and watching important materials decay. Cost is the most significant barrier to audio migration. Audio preservation labs can charge hundreds or even thousands of dollars to migrate analog to digital. Top-tier audio preservation equipment is equally expensive. When faced with the decomposition of an oral history collection recorded on cassette tape, one library decided that where there was a will, there was a way. The College of Education One-Room Schoolhouse Oral History Collection consisted of 247 audio cassettes containing interviews with one-room school house teachers from 68 counties in Kansas. 
The cassette tapes in this collection were between 20 and 40 years old and generally inaccessible for research for fear that the tapes could be damaged during playback. This case study looks at how a single Digital Curation Librarian with no audio digitization experience migrated nearly 200 hours of audio to digital using a $40 audio converter from Amazon and a campus subscription to Adobe Audition. It covers the decision to digitize the collection, the digitization process including audio clean-up, metadata collection and creation, presentation of the collection in CONTENTdm, and final preservation of audio files. The project took 20 months to complete and resulted in significant lessons learned that have informed decisions regarding future audio conversion projects.</p> Mary Elizabeth Downing-Turner ##submission.copyrightStatement## 2021-09-28 2021-09-28 16 1 8 8 10.2218/ijdc.v16i1.744 Improving the Usability of Organizational Data Systems <p>For research data repositories, web interfaces are usually the primary, if not the only, method that data users have to interact with repository systems. Data users often search, discover, understand, access, and sometimes use data directly through repository web interfaces. Given that sub-par user interfaces can reduce the ability of users to locate, obtain, and use data, it is important to consider how repositories’ web interfaces can be evaluated and improved in order to ensure useful and successful user interactions. This paper discusses how usability assessment techniques are being applied to improve the functioning of data repository interfaces at the National Center for Atmospheric Research (NCAR). At NCAR, a new suite of data system tools is being developed, collectively called the NCAR Digital Asset Services Hub (DASH). 
Usability evaluation techniques have been used throughout the NCAR DASH design and implementation cycles in order to ensure that the systems work well together for the intended user base. By applying user studies, paper prototyping, competitive analysis, journey mapping, and heuristic evaluation, the NCAR DASH Search and Repository experiences provide examples of how data systems can benefit from usability principles and techniques. Integrating usability principles and techniques into repository system design and implementation workflows helps to optimize the systems’ overall user experience.</p> Chung-Yi Hou Matthew S. Mayernik ##submission.copyrightStatement## 2021-05-18 2021-05-18 16 1 21 21 10.2218/ijdc.v16i1.592 Software Must be Recognised as an Important Output of Scholarly Research <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Software now lies at the heart of scholarly research. Here we argue that as well as being important from a methodological perspective, software should, in many instances, be recognised as an output of research, equivalent to an academic paper. The article discusses the different roles that software may play in research and highlights the relationship between software and research sustainability and reproducibility. It describes the challenges associated with the processes of citing and reviewing software, which differ from those used for papers. We conclude that whilst software outputs do not necessarily fit comfortably within the current publication model, there is a great deal of positive work underway that is likely to make an impact in addressing this.</p> </div> </div> </div> Caroline Jay Robert Haines Daniel S. Katz ##submission.copyrightStatement## 2021-12-22 2021-12-22 16 1 6 6 10.2218/ijdc.v16i1.745 Leveraging Existing Technology: Developing a Trusted Digital Repository for the U.S. 
Geological Survey <div class="WordSection1"> <p class="Abstract">As Federal Government agencies in the United States pivot to increase access to scientific data (Sheehan, 2016), the U.S. Geological Survey (USGS) has made substantial progress (Kriesberg et al., 2017). USGS authors are required to make federally funded data publicly available in an approved data repository (USGS, 2016b). This type of public data product, known as a USGS data release, serves as a method for publishing reviewed and approved data. In this paper, we present major milestones in the approach the USGS took to transition an existing technology platform to a Trusted Digital Repository. We describe both the technical and the non-technical actions that contributed to a successful outcome. We highlight how initial workflows revealed patterns that were later automated, and the ways in which assessments and user feedback influenced design and implementation. The paper concludes with lessons learned, such as the importance of a community of practice, application programming interface (API)-driven technologies, iterative development, and user-centered design. This paper is intended to offer a potential roadmap for organizations pursuing similar goals.</p> </div> Vivian B. Hutchison Tamar Norkin Madison L. Langseth Drew A. Ignizio Lisa S. Zolly Ricardo McClees-Funinan Amanda Liford ##submission.copyrightStatement## 2021-07-11 2021-07-11 16 1 23 23 10.2218/ijdc.v16i1.741 Data Curation, Fisheries, and Ecosystem-based Management: the Case Study of the Pecheker Database <div class="WordSection1"> <p class="Abstract">The scientific monitoring of the Southern Ocean French fishing industry is based on the use of the Pecheker database. 
Pecheker is dedicated to the digital curation of the data collected in the field by scientific observers, whose analysis allows scientists at the Muséum national d’Histoire naturelle to provide guidelines and advice for the regulation of fishing activity, the protection of fish stocks and the protection of marine ecosystems. The template of Pecheker has been developed to adapt the database to the ecosystem-based management concept. Considering the global context of biodiversity erosion, this modern approach to management aims to take account of the environmental background of fisheries to ensure their sustainable development. Completeness and high quality of the raw data are key elements for an ecosystem-based management database such as Pecheker. Here, we present the development of this database as a case study of fisheries data curation to be shared with the readers. Full code to deploy a database based on the Pecheker template is provided in supplementary materials. Considering the success factors we could identify, we propose a discussion about how the community could build a global fisheries information system based on a network of small databases including interoperability standards.</p> </div> Alexis Martin Charlotte Chazeau Nicolas Gasco Guy Duhamel Patrice Pruvost ##submission.copyrightStatement## 2021-06-07 2021-06-07 16 1 31 31 10.2218/ijdc.v16i1.674 Scaling by Optimising: Modularisation of Data Curation Services in Growing Organisations <p class="Abstract">After a century of theorising and applying management practices, we are entering a new stage in management science: digital management. The management of digital data merges into traditional functions of management and, at the same time, continues to recreate viable solutions and conceptualisations in its established fields, e.g. research data management. 
Yet, one can observe bilateral synergies and mutual enrichment of traditional and data management practices in all fields. The paper at hand addresses a case in point, in which new and old management practices amalgamate to meet a demand for data curation services in academic institutions that grows steadily, at times by leaps and bounds. The idea of modularisation, as known from software engineering, is applied to data curation workflows so that economies of scale and scope can be exploited. While scaling refers to both management science and data science, optimising is understood in the traditional managerial sense, that is, with respect to the cost function. By means of a situation analysis describing how data curation services were extended from one department to the entire institution and an analysis of the factors of influence, a method of modularisation is outlined that converges to an optimal state of curation workflows.</p> Hagen Peukert ##submission.copyrightStatement## 2021-04-26 2021-04-26 16 1 20 20 10.2218/ijdc.v16i1.650 The Role of Data in an Emerging Research Community: <p class="Abstract"><span lang="EN-GB">Open science data benefit society by facilitating convergence across domains that are examining the same scientific problem. While cross-disciplinary data sharing and reuse is essential to the research done by convergent communities, so far little is known about the role data play in how these communities interact. An understanding of the role of data in these collaborations can help us identify and meet the needs of emerging research communities which may predict the next challenges faced by science. This paper represents an exploratory study of one emerging community, the environmental health community, examining how environmental health research groups form, collaborate, and share data. Five key insights about the role of data in emerging research communities are identified and suggestions are made for further research. 
</span></p> Danielle Pollock An Yan Michelle Parker Suzie Allard ##submission.copyrightStatement## 2022-06-13 2022-06-13 16 1 15 15 10.2218/ijdc.v16i1.653 First Line Research Data Management for Life Sciences: a Case Study <div class="WordSection1"> <p class="Abstract">Modern life sciences studies depend on the collection, management and analysis of comprehensive datasets in what has become data-intensive research. Life science research is also characterised by having relatively small groups of researchers. This combination of data-intensive research performed by a few people has led to an increasing bottleneck in research data management (RDM). Parallel to this, there has been an urgent call by initiatives like FAIR and Open Science to openly publish research data, which has put additional pressure on improving the quality of RDM. Here, we reflect on the lessons learnt by DataHub Maastricht, an RDM support group of the Maastricht University Medical Centre (MUMC+) in Maastricht, the Netherlands, in providing first-line RDM support for life sciences. DataHub Maastricht operates with a small core team, and is complemented with disciplinary data stewards, many of whom have joint positions with DataHub and a research group. This organisational model helps create shared knowledge between DataHub and the data stewards, including insights into how to focus support on the most reusable datasets. This model has proven to be very beneficial given limited time and personnel. We found that co-hosting tailored platforms for specific domains, reducing storage costs by implementing tiered storage and promoting cross-institutional collaboration through federated authentication were all effective features to stimulate researchers to initiate RDM. Overall, utilising the expertise and communication channels of the embedded data stewards was also instrumental in our RDM success. 
Looking into the future, we foresee the need to further embed the role of data stewards into the lifeblood of the research organisation, along with policies on how to finance long-term storage of research data. The latter, to remain feasible, needs to be combined with a further formalising of appraisal and reappraisal of archived research data.</p> </div> J. Paul van Schayck Maarten Coonen ##submission.copyrightStatement## 2022-07-22 2022-07-22 16 1 13 13 10.2218/ijdc.v16i1.761 FAIR Forever? Accountabilities and Responsibilities in the Preservation of Research Data <p class="Abstract" style="margin: 0cm -.05pt 5.0pt 0cm;">Digital preservation is a fast-moving and growing community of practice of ubiquitous relevance, but in which capability is unevenly distributed. Within the open science and research data communities, digital preservation has a close alignment to the FAIR principles and is delivered through a complex specialist infrastructure comprising technology, staff and policy. However, capacity erodes quickly, establishing a need for ongoing examination and review to ensure that skills, technology, and policy remain fit for changing purpose.&nbsp;To address this challenge, the Digital Preservation Coalition (DPC) conducted the FAIR Forever study, commissioned by the European Open Science Cloud (EOSC) Sustainability Working Group and funded by the EOSC Secretariat Project in 2020, to assess the current strengths, weaknesses, opportunities and threats to the preservation of research data across EOSC, and the feasibility of establishing shared approaches, workflows and services that would benefit EOSC stakeholders.</p> <p class="Abstract" style="margin: 0cm -.05pt 5.0pt 0cm;">This paper draws from the FAIR Forever study to document and explore its key findings on the identified strengths, weaknesses, opportunities, and threats to the preservation of FAIR data in EOSC, and to the preservation of research data more broadly. 
It begins with the background of the study and an overview of the methodology employed, which involved a desk-based assessment of the emerging EOSC vision, interviews with representatives of EOSC stakeholders, and focus groups with digital preservation specialists and data managers in research organizations. It summarizes key findings on the need for clarity on digital preservation in the EOSC vision and for elucidation of roles, responsibilities, and accountabilities to mitigate risks to data, reputation, and sustainability. It then outlines the recommendations provided in the final report presented to the EOSC Sustainability Working Group.</p> <p class="Abstract">To better ensure that research data can be FAIRer for longer, the recommendations of the study are presented with discussion of how they can be extended and applied to various research data stakeholders in and outside of EOSC, and with suggestions on ways to bring together the research data curation, management, and preservation communities to better ensure FAIRness now and in the long term.</p> Amy Currie William Kilbride 2021-09-30 2021-09-30 16 1 16 16 10.2218/ijdc.v16i1.768 Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse <p>Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make applying the FAIR principles to data cleaning recipes challenging. We then demonstrate how <em>transparency </em>and <em>reusability </em>of recipes can be improved by analyzing dataflow dependencies within recipes. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which can then be reused individually as data cleaning modules. 
We have prototyped this approach as part of an ongoing project to develop open-source companion tools for OpenRefine.</p> <p><strong>Keywords: </strong>Data Cleaning, Provenance, Workflow Analysis</p> Lan Li Nikolaus Parulian Bertram Ludäscher 2022-04-18 2022-04-18 16 1 11 11 10.2218/ijdc.v16i1.771 Towards a Semantic Interoperable Flemish Research Information Space: Development and Implementation of a Flemish Application Profile for Research Datasets <p>In Flanders, Research Performing Organizations (RPOs) are required to provide information on publicly financed research to the Flemish Research Information Space (FRIS), a current research information system and research discovery platform hosted by the Flemish Department of Economics, Science and Innovation. FRIS currently discloses information on researchers, research institutions, publications, and projects. Flemish decrees on Special and Industrial research funding and the Flemish Open Science policy require RPOs to also provide metadata on research datasets to FRIS. To ensure accurate and uniform delivery of information on research datasets to FRIS across all providing institutions, it is necessary to develop a common application profile for research datasets. This article outlines the development of the Flemish application profile for research datasets by the Flemish Open Science Board (FOSB) Working Group Metadata and Standardization. The main challenge was to achieve interoperability among stakeholders, some of which had existing metadata schemes and research information infrastructures in place, while others were still in the early stages of development.</p> Evy Neyens Sadia Vancauwenbergh 2021-12-22 2021-12-22 16 1 17 17 10.2218/ijdc.v16i1.762 How Long Can We Build It? 
Ensuring Usability of a Scientific Code Base <p>Software, and in particular source code, has become an important component of scientific publications and is therefore now subject to research data management. Maintaining source code so that it remains a usable and valuable scientific contribution is a huge ongoing task. Not all code contributions can be actively maintained forever. Eventually, there will be a significant backlog of legacy source code. In this article we analyse the requirements for applying the concept of long-term reusability to source code. We use a simple case study to identify gaps and provide a technical infrastructure, based on emulation, to support automated builds of historic software from source code.</p> Klaus Rechert Jurek Oberhauser Rafael Gieschke 2021-05-17 2021-05-17 16 1 11 11 10.2218/ijdc.v16i1.770