International Journal of Digital Curation 2021-03-05T03:19:30+00:00 IJDC Editorial Team Open Journal Systems <p>The IJDC publishes pre-prints, research papers, general articles and editorials on digital curation, research data management and related issues. It complements the International Conference on Digital Curation (IDCC) and includes selected proceedings as Conference Papers.</p>

An Exploratory Analysis of Social Science Graduate Education in Data Management and Data Sharing 2021-03-05T03:15:19+00:00 Ashley Doonan Dharma Akmon Evan Cosby <p class="Abstract"><span lang="EN-US">Effective data management and data sharing are crucial components of the research lifecycle, yet evidence suggests that many social science graduate programs are not providing training in these areas. This exploratory study assesses how U.S. master’s and doctoral programs in the social sciences include formal, non-formal, and informal training in data management and sharing. We conducted a survey of 150 graduate programs across six social science disciplines, using a mix of closed and open-ended questions focused on the extent to which programs provide such training and exposure. Results from our survey suggested a deficit of formal training in both data management and data sharing, limited non-formal training, and cursory informal exposure to these topics. Building on the survey results, we conducted a syllabus analysis to further explore the formal and non-formal content of graduate programs beyond self-report. Our syllabus analysis drew from an expanded seven social science disciplines for a total of 140 programs. The syllabus analysis supported our prior finding that formal and non-formal inclusion of data management and data sharing training is not common practice. Overall, in both the survey and syllabus studies we found a lack of both formal and non-formal training on data management and data sharing.
Our findings have implications for data repository staff and data service professionals as they consider their methods for encouraging data sharing and prepare for the needs of data depositors. These results can also inform the development and structuring of graduate education in the social sciences, so that researchers are trained early in data management and sharing skills and are able to benefit from making their data </span>available as early in their careers as possible.</p> 2020-07-22T12:16:49+01:00

Towards Continuous Quality Control for Spoken Language Corpora 2021-03-05T03:15:20+00:00 Anne Ferger Hanna Hedeland <div class="WordSection1"> <p class="Abstract">This paper describes the development of a systematic approach to the creation, management and curation of linguistic resources, particularly spoken language corpora. It also presents first steps towards a framework for continuous quality control to be used within external research projects by non-technical users, and discusses various domain- and discipline-specific problems and individual solutions. The creation of spoken language corpora is not only a time-consuming and costly process, but the created resources often represent intangible cultural heritage, containing recordings of, for example, extinct languages or historical events. Since high-quality resources are needed to enable re-use in as many future contexts as possible, researchers need to be provided with the necessary means for quality control. We believe that this includes methods and tools adapted to Humanities researchers as non-technical users, and that these methods and tools need to be developed to support existing tasks and goals of research projects.</p> </div> 2020-07-22T11:31:22+01:00

Tool Selection Among Qualitative Data Reusers 2021-03-05T03:15:18+00:00 Rebecca D.
Frank Kara Suzuka Eric Johnson Elizabeth Yakel <p class="Abstract"><span lang="EN-US">This paper explores the tension between the tools that data reusers in the field of education prefer to use when working with qualitative video data and the tools that repositories make available to data reusers. Findings from this mixed-methods study show that </span><span lang="EN-US">data reusers utilizing qualitative video data did not use repository-based tools. Rather, they </span><span lang="EN-US">valued common, widely available tools that were collaborative and easy to use.</span></p> 2020-08-05T22:56:09+01:00

The Red Queen in the Repository 2021-03-05T03:15:19+00:00 Joakim Philipson <div class="WordSection1"> <p class="Abstract">One of the grand curation challenges is to secure metadata quality in the ever-changing environment of metadata standards and file formats. As the Red Queen tells Alice in <em>Through the Looking-Glass</em>: “Now, here, you see, it takes all the running you can do, to keep in the same place.” That is, some “running” is needed to keep metadata records in a research data repository fit for long-term use and in place. One of the main tools for adapting and keeping pace with the evolution of new standards and formats – and of new versions of standards – in this ever-changing environment is the validation schema. Validation schemas are mainly seen as methods of checking data quality and fitness for use, but they are also important for long-term preservation. We might like to think that our present (meta)data standards and formats are made for eternity, but in reality we know that standards evolve, formats change (some even become obsolete with time), and so do our needs for storage, searching and future dissemination for re-use. Eventually, we come to a point where transformation of our archival records and migration to other formats will be necessary.
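To make the role of validation schemas concrete, here is a minimal sketch of schema-style checking for a repository metadata record. The required fields are illustrative, loosely DataCite-like assumptions, not any repository's actual schema; a real repository would validate against its official XSD or JSON Schema.

```python
# Sketch only: hypothetical required fields, loosely modelled on
# DataCite-style minimal metadata. Real validation would run the record
# through the repository's official schema with an XML or JSON validator.

REQUIRED_FIELDS = {"identifier", "creator", "title", "publisher", "publicationYear"}

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = [f"missing required field: {f}"
              for f in sorted(REQUIRED_FIELDS - record.keys())]
    year = record.get("publicationYear")
    if year is not None and not (isinstance(year, int) and 1000 <= year <= 9999):
        errors.append("publicationYear must be a four-digit integer")
    return errors

record = {"identifier": "10.1234/example", "title": "A dataset", "publicationYear": 2020}
print(validate_record(record))
# -> ['missing required field: creator', 'missing required field: publisher']
```

Run at ingest of the SIPs and again before each transformation or migration, checks of this kind catch exactly the missing elements and invalid values that make later format migration hazardous.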
This could also mean that even if the AIPs (Archival Information Packages) stay the same in storage, the DIPs (Dissemination Information Packages) that we want to extract from the archive are subject to changes of format. Further, in order for archival information packages to be self-sustainable, as required in the OAIS model, it is important to take interdependencies between individual files in the information packages into account. This should be done as early as the ingest and validation of the SIPs (Submission Information Packages), and again at the different points where transformation/migration becomes necessary (from SIP to AIP, from AIP to DIP, etc.), in order to counter obsolescence.</p> <p class="Abstract">This paper investigates possible validation errors and missing elements in metadata records from three general-purpose, multidisciplinary research data repositories – Figshare, Harvard’s Dataverse and Zenodo – and explores the potential effects of these errors on future transformation to AIPs and migration to other formats within a digital archive.</p> </div> 2020-07-22T12:02:11+01:00

Facilitating Access to Restricted Data 2021-03-05T03:15:20+00:00 Allison Rae Bobyak Tyler <div class="WordSection1"> <p class="Abstract">The decision to allow users access to restricted and protected data is based on the development of trust in the user by data repositories. In this article, I propose a model of the process of trust development at restricted data repositories, one which emphasizes increasing levels of trust dependent on prior interactions between repositories and users. I find that repositories develop trust in their users through the interactions of four dimensions – promissory, experience, competence, and goodwill – that consider distinct types of researcher expertise and the role of a researcher’s reputation in the trust process.
However, the processes used by repositories to determine a level of trust corresponding to data access are inconsistent and do not support the sharing of trusted users between repositories to maximize efficient yet secure access to restricted research data. I highlight the role of a researcher’s reputation as an important factor in trust development and trust transference, and discuss the implications of modelling the restricted data access process as a process of trust development.</p> </div> 2020-07-22T11:57:33+01:00

Research Data Management (RDM) at the University of Ghana (UG) 2021-03-05T03:15:08+00:00 Bright Kwaku Avuglah <p>This article explores Research Data Management (RDM) at the University of Ghana (UG). It focuses on institutional awareness and attitudes, and on whether the University Library is officially supporting this emerging strategic interest in research-focused Higher Education Institutions (HEIs). Purposive sampling was used to select information-rich respondents from across the University (i.e. Librarians, Research Administrators, ICT Managers and Senior Researchers), who were interviewed on a range of issues about RDM. Institutional documents were also reviewed to corroborate the primary data and gain a deeper understanding of the research problem. The study shows that while RDM is recognised at the institutional level as a matter of good research practice and integrity, the concept is tenuously understood in the local community. Nevertheless, there was a general appreciation and awareness of the need for RDM and its implications for such critical concerns as security, integrity, continuity and institutional reputation. The library is yet to take a strategic approach to RDM issues, and there is clearly a dearth of RDM expertise within the library system.
The study recommends that the library be proactive in advocating and promoting RDM at UG, but first the librarians must take advantage of the numerous existing opportunities to build their capacity.</p> 2020-12-31T16:05:21+00:00

Design and Implementation of the First Generic Archive Storage Service for Research Data in Germany 2021-03-05T03:15:21+00:00 Felix Bach Björn Schembera Jos van Wezel <p class="Abstract">Research data, as the true valuable good in science, must be saved and subsequently kept findable, accessible and reusable, for reasons of proper scientific conduct, for a time span of several years. However, managing long-term storage of research data is a burden for institutes and researchers. Because of the sheer size of the data and the required retention time, apt storage providers are hard to find.</p> <p class="Abstract">Aiming to solve this puzzle, the bwDataArchive project started development of a long-term research data archive that is reliable, cost-effective and able to store multiple petabytes of data. The hardware consists of data storage on magnetic tape, interfaced with disk caches and nodes for data movement and access. On the software side, the High Performance Storage System (HPSS) was chosen for its proven ability to reliably store huge amounts of data. However, the implementation of bwDataArchive is not dependent on HPSS. For authentication, the bwDataArchive is integrated into the federated identity management for educational institutions in the State of Baden-Württemberg in Germany.</p> <p class="Abstract">The archive features data protection by means of a dual copy at two distinct locations on different tape technologies, data accessibility by common storage protocols, data retention assurance for more than ten years, data preservation with checksums, and data management capabilities supported by a flexible directory structure allowing sharing and publication.
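The checksum-based preservation and dual-copy protection described above can be illustrated with a short sketch. This is not the bwDataArchive implementation; the site names and payloads are invented:

```python
import hashlib

def fixity(data: bytes) -> str:
    """SHA-256 checksum, recorded at ingest and re-verified on each audit."""
    return hashlib.sha256(data).hexdigest()

def audit_copies(expected: str, copies: dict) -> dict:
    """Check each stored copy against the checksum recorded at ingest."""
    return {site: fixity(blob) == expected for site, blob in copies.items()}

original = b"instrument readings, run 42\n"
recorded = fixity(original)          # stored alongside the archive metadata

# Two independent copies, as in a dual-site tape archive; one has silently decayed.
result = audit_copies(recorded, {
    "site_a": original,
    "site_b": b"instrument readings, run 43\n",   # a single corrupted byte
})
print(result)
# -> {'site_a': True, 'site_b': False}
```

With a second intact copy held on a different tape technology, the failing copy can then be repaired from the good one instead of being lost.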
As of September 2019, the bwDataArchive holds over 9 PB and 90 million files and sees a constant increase in usage and users from many communities.</p> 2020-07-22T11:07:01+01:00

Quality and Trust in the European Open Science Cloud 2021-03-05T03:15:14+00:00 Juan Carlos Bicarregui <div class="WordSection1"> <p class="Abstract">The European Open Science Cloud (EOSC) aims to provide a virtual environment offering open and seamless services for the re-use of research data across borders and scientific disciplines. This ambitious vision sets significant challenges that the research community must meet if the benefits of EOSC are to be realised. One of those challenges, which has both technical and cultural aspects, is to determine the <em>“Rules of Participation”</em> that enable users to assess the quality of the data and services provided through EOSC and thereby enable them to trust the data and services they access. This paper discusses some issues relevant to determining the Rules of Participation that will enable EOSC to meet these objectives.</p> </div> 2020-11-03T23:46:10+00:00

Data Practices in Digital History 2021-03-05T03:15:21+00:00 Rongqian Ma Fanghui Xiao <div class="WordSection1"> <p class="Abstract">This paper presents an exploratory research project that investigates data practices in digital history research. Emerging from the 1950s and ’60s in the United States, digital history remains a charged topic among historians, requiring a new research paradigm that includes new concepts and methodologies; an intensive degree of interdisciplinary, inter-institutional, and international collaboration; and experimental forms of research sharing, publishing, and evaluation.
Using mixed methods of interviews and a questionnaire, we identified data challenges in digital history research practices from three perspectives: ontology (e.g., the notion of data in historical research); workflow (e.g., data collection, processing, preservation, presentation and sharing); and challenges. Building on these results, we also provide a critical discussion of the state of the art in digital history research, particularly with respect to metadata, data sharing, digital history training, collaboration, and the transformation of librarians’ roles in digital history projects. We conclude with provisional recommendations for better data practices for participants in digital history, from the perspective of library and information science.</p> </div> 2020-07-22T11:20:12+01:00

A Review of the History, Advocacy and Efficacy of Data Management Plans 2021-03-05T03:19:30+00:00 Nicholas Andrew Smale Kathryn Unsworth Gareth Denyer Elise Magatova Daniel Barr <div> <p class="Abstract">Data management plans (DMPs) have increasingly been encouraged as a key component of institutional and funding body policy. Although DMPs necessarily place an administrative burden on researchers, proponents claim that DMPs have myriad benefits, including enhanced research data quality, increased rates of data sharing, and institutional planning and compliance benefits.</p> </div> <div> <p class="Abstract">In this article, we explore the international history of DMPs and describe institutional and funding body DMP policy. We find that the economic and societal benefits expected from increased rates of data sharing were the original driver for funding bodies mandating DMPs. Today, 86% of UK Research Councils and 63% of US funding bodies require submission of a DMP with funding applications.
Given that no major Australian funding bodies require DMP submission, it is of note that 37% of Australian universities have taken the initiative to internally mandate DMPs. Institutions both within Australia and internationally frequently promote the professional benefits of DMP use, and endorse DMPs as ‘best practice’. We analyse one such typical DMP implementation at a major Australian institution, finding that DMPs have low levels of apparent translational value. Indeed, an extensive literature review suggests there is very limited published systematic evidence that DMP use has any tangible benefit for researchers, institutions or funding bodies.</p> </div> <p>We are therefore led to question why DMPs have become the go-to tool for research data professionals and advocates of good data practice. By delineating multiple use-cases and highlighting the need for DMPs to be fit for intended purpose, we question the view that a good DMP is necessarily one that encompasses the entire data lifecycle of a project. Finally, we summarise recent developments in the DMP landscape, and note a positive shift towards evidence-based research management through more researcher-centric, educative, and integrated DMP services.</p> 2021-03-05T17:27:48+00:00

Long-Term Data Preservation: Data Lifecycle, Standardisation Process, Implementation and Lessons Learned 2021-03-05T03:15:13+00:00 Mirko Albani Iolanda Maggio CEOS Data Stewardship Interest Group <p class="Abstract">Science and Earth Observation data today represent a unique and valuable asset for humankind that should be preserved without time constraints and kept accessible and exploitable by current and future generations.
In Earth Science, knowledge of the past and tracking of its evolution underpin our capability to respond effectively to the global changes that are putting increasing pressure on the environment and on human society. This can only be achieved if long time series of data are properly preserved and made accessible to support international initiatives. Within ESA Member States and beyond, Earth Science data holders are increasingly coordinating data preservation efforts to ensure that the valuable data are safeguarded against loss and kept accessible and usable for current and future generations. This task becomes increasingly challenging in view of the existing 40 years’ worth of Earth Science data stored in archives around the world and the massive increase in data volumes expected over the coming years from, e.g., the European Copernicus Sentinel missions. Long Term Data Preservation (LTDP) aims to keep information discoverable and accessible, in an independent and understandable way and with supporting information that helps ensure authenticity, over the long term. A focal aspect of LTDP is data curation, the management of data throughout its life cycle. Data curation activities enable data discovery and retrieval, maintain data quality, add value, and allow data re-use over time. Data curation includes all the processes involved in data management, such as pre-ingest initiatives, ingest functions, archival storage and preservation, dissemination, and provision of access for a designated community.</p> <p class="Abstract">The paper presents specific aspects of importance during the entire Earth Observation data lifecycle, with respect to evolving data volumes and application scenarios. These particular issues are introduced in the section on ‘Big Data’ and LTDP.
The Data Stewardship Reference Lifecycle section describes how the data stewardship activities can be efficiently organised, while the following section addresses the overall preservation workflow and shows the technical steps to be taken during data curation. Earth Science data curation and preservation should be addressed during all mission stages – from the initial mission planning, throughout the entire mission lifetime, and during the post-mission phase. The Data Stewardship Reference Lifecycle gives a high-level overview of the steps useful for implementing curation and preservation rules on mission data sets, from initial conceptualisation or receipt through the iterative curation cycle.</p> 2020-12-31T15:45:58+00:00

Selecting Efficient and Reliable Preservation Strategies 2021-03-05T03:15:15+00:00 Micah Altman Richard Landau <p class="Abstract">This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, and hierarchical modeling, and then uses empirically calibrated sensitivity analysis to identify effective strategies.</p> <p>Specifically, the framework formally defines an objective function for preservation that maps a set of preservation policies and a risk profile to a set of preservation costs and an expected collection loss distribution. In this framework, a curator’s objective is to select optimal policies that minimize expected loss subject to budget constraints.
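As a toy illustration of this kind of analysis (not the authors' framework), the sketch below estimates by Monte Carlo simulation the probability that every copy of an object is lost over a retention period, assuming independent annual per-copy losses and no repair; all parameter values are invented:

```python
import random

def simulate_loss(p_annual: float, n_copies: int, years: int,
                  trials: int, seed: int = 0) -> float:
    """Estimate the probability that all copies are lost over the retention
    period, under independent annual per-copy losses and no repair."""
    rng = random.Random(seed)
    lost_all = 0
    for _ in range(trials):
        surviving = n_copies
        for _ in range(years):
            surviving -= sum(1 for _ in range(surviving)
                             if rng.random() < p_annual)
        if surviving == 0:
            lost_all += 1
    return lost_all / trials

# Invented parameters: 5% annual loss per copy, 2 copies, 10-year retention.
est = simulate_loss(p_annual=0.05, n_copies=2, years=10, trials=20000)
analytic = (1 - 0.95 ** 10) ** 2   # closed form for this toy no-repair model
print(round(est, 3), "vs analytic", round(analytic, 3))
```

The authors' actual framework goes much further, layering correlated institutional and global risks, repair policies, and costs on top of such per-copy failure processes; it is precisely those dependencies that make simulation, rather than a closed form, necessary.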
To estimate preservation loss under different policy conditions, we develop a statistical hierarchical risk model that includes four sources of risk: the storage hardware; the physical environment; the curating institution; and the global environment. We then employ a general discrete-event-based simulation framework to evaluate the expected loss and the cost of employing varying preservation strategies under specific parameterizations of risk.</p> <p>The framework offers flexibility for the modeling of a wide range of preservation policies and threats. Since this framework is open source and easily deployed in a cloud computing environment, it can be used to produce analyses based on independent estimates of scenario-specific costs, reliability, and risks.</p> <p class="Abstract">We present results summarizing hundreds of thousands of simulations using this framework. This exploratory analysis points to a number of robust and broadly applicable preservation strategies, provides novel insights into specific preservation tactics, and provides evidence that challenges received wisdom.</p> 2020-09-29T23:14:51+01:00

The CODATA-RDA Data Steward School 2021-03-05T03:19:18+00:00 Daniel Bangert Joy Davidson Steve Diggs Marjan Grootveld Hugh Shanahan Shanmugasundaram Venkataraman <p>Given the expected increase in demand for Data Stewards and Data Stewardship skills, it is clear that there is a need to develop training, education and CPD (continuing professional development) in this area.</p> <p>This paper provides a brief introduction to the origin of definitions of Data Stewardship, and notes the present tendency to treat Data Stewardship skills and the FAIR principles as equivalent. It then focuses on one specific training event – the pilot Data Stewardship strand of the CODATA-RDA Research Data Science schools, which by the time of the IDCC meeting will have been held in Trieste in August 2019.
The paper will discuss the overall curriculum for the pilot school, how it matches the FAIR4S framework, and plans for getting feedback from the students.</p> <p>Finally, the paper discusses future plans for the school, in particular how to deepen the integration between the Data Stewardship strand and the Early Career Researcher strand.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2021-03-05T17:27:48+00:00

Extending Support for Publishing Sensitive Research Data at the University of Bristol 2021-03-05T03:15:16+00:00 Zosia Beckles <div class="WordSection1"> <p class="Abstract">The University of Bristol Research Data Service was set up in 2014 to provide support and training for academic staff and postgraduate researchers in all aspects of research data management. As part of this, the data.bris Research Data Repository was developed to provide a publication platform for research data generated at the University of Bristol. The repository was initially launched in 2015 to provide open access to data; since 2017 it has also been possible to publish access-controlled datasets containing sensitive data via this platform.</p> <p class="Abstract">The vast majority (90%) of datasets published are openly accessible, but there has been steady demand for access-controlled release of datasets containing information that is ethically or commercially sensitive. These cases require careful management of additional risk: for example, where datasets contain information on human participants, balancing the risk of re-identification against the need to provide robust data that maximises research value through re-use.
Many groups within the University of Bristol (for example, the Avon Longitudinal Study of Parents and Children) have extensive experience and expertise in this area, but it became apparent that there was a need to provide additional support for researchers who were not able to draw on the experience of these established groups. This practice paper describes the process of setting up a dedicated service to provide training and basic disclosure risk assessments in order to address these skills gaps, and outlines lessons learnt and future directions for the service.</p> </div> 2020-08-07T11:01:01+01:00

Out of the Jar into the World! A Case Study on Storing and Sharing Vertebrate Data 2021-03-05T03:19:20+00:00 Susan Borda <div class="WordSection1"> <p>In 2018, the Deep Blue Repositories and Research Data Services (DBRRDS) team at the University of Michigan Library began working with the University of Michigan Museum of Zoology (UMMZ) to provide a persistent and sustainable (i.e., non-grant-funded, institutionally supported) solution for their part of the National Science Foundation’s (NSF) openVertebrate (oVert) initiative. The objective of oVert is to digitize scientific collections of thousands of vertebrate specimens stored in jars on museum shelves and make the data freely accessible to researchers, students, classrooms, and the general public anywhere in the world. The University of Michigan (U-M) is one of five scanning centers working on oVert and will contribute scans of more than 3,500 specimens from the UMMZ collections (Erickson 2017).</p> <p>In addition to ingesting scans, the project involved developing methods to work around several significant system constraints: Deep Blue Data’s file structure (flat files only, no folders) and the closed use of Specify, UMMZ’s specimen database, for specimen metadata.
DBRRDS had to create a completely new workflow for handling batch deposits at regular intervals, develop scripts to reorganize the data (according to a third-party data model), and augment the metadata using a third-party resource, the Global Biodiversity Information Facility (GBIF).</p> <p class="Abstract">This paper will describe the following aspects of the UMMZ CT Scanning Project partnership in greater detail: data generation, metadata requirements, workflows, code development, lessons learned, and next steps.</p> <p class="Abstract">[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> </div> 2021-03-05T17:27:48+00:00

Piloting a Community of Student Data Consultants that Supports and Enhances Research Data Services 2021-03-05T03:19:17+00:00 Jonathan S Briganti Andrea Ogier Anne M. Brown <div class="WordSection1"> <p class="Abstract">Research ecosystems within university environments are continuously evolving and require more resources and domain specialists to assist with the data lifecycle. Typically, academic researchers and professionals are overcommitted, making it challenging to stay up to date on recent developments in best practices of data management, curation, transformation, analysis, and visualization. Recently, research groups, university core centers, and libraries have been revitalizing these services to fill in the gaps and aid researchers in finding new tools and approaches to make their work more impactful, sustainable, and replicable. In this paper, we report on a student consultation program built within the University Libraries that takes an innovative, student-centered approach to meeting the research data needs in a university environment while also providing students with experiential learning opportunities.
This student program, DataBridge, trains students to work in multi-disciplinary teams and as student consultants to assist faculty, staff, and students with their real-world, data-intensive research challenges. Centering DataBridge in the Libraries gives students the unique opportunity to work across all disciplines, on problems and in domains that they might otherwise not encounter during their college careers. To encourage students from multiple disciplines to participate, we developed a scaffolded curriculum that allows students of any discipline and skill level to quickly develop the essential data science skill sets and begin contributing their own unique perspectives and specializations to the research consultations. These students, mentored by Informatics faculty in the Libraries, provide research support that can ultimately impact the entire research process. Through our pilot phase, we have found that DataBridge enhances the utilization and openness of data created through research, extends the reach and impact of the work beyond the researcher’s specialized community, and creates a network of student “data champions” across the University who see the value in working with the Library. Here, we describe the evolution of the DataBridge program and outline its unique role both in training the data stewards of the future with regard to FAIR data practices and in contributing significant value to research projects at Virginia Tech.
Ultimately, this work highlights the need for innovative, strategic programs that encourage and enable real-world experience of data curation, data analysis, and data publication for current researchers, all while training the next generation of researchers in these best practices.</p> <p class="Abstract">[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> </div> 2021-03-05T17:27:48+00:00

Role of Content Analysis in Improving the Curation of Experimental Data 2021-03-05T03:15:17+00:00 João Daniel Aguiar Castro Cristiana Landeira João Rocha da Silva Cristina Ribeiro <div class="WordSection1"> <p class="Abstract">As researchers increasingly seek tools and specialized support to perform research data management activities, collaboration with data curators can be fruitful. Yet establishing a timely collaboration between researchers and data curators, grounded in sound communication, is often demanding. In this paper we propose manual content analysis as an approach to streamline the data curator workflow. With content analysis, curators can obtain the domain-specific concepts used to describe experimental configurations in scientific publications, making it easier for researchers to understand the notion of metadata and supporting the development of metadata tools. We present three case studies from experimental domains: one related to sustainable chemistry, one to photovoltaic generation, and another to nanoparticle synthesis. In each case, the curator started by performing content analysis on research publications, proceeded to create a metadata template based on the extracted concepts, and then interacted with researchers. The approach was validated by the researchers with a high rate of accepted concepts, 84 per cent. The researchers also provided feedback on how to improve some of the proposed descriptors.
Content analysis has the potential to be a practical, proactive task that can be extended to multiple experimental domains and bridge the communication gap between curators and researchers.</p> <p class="Abstract">[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> </div> 2020-08-06T12:03:27+01:00

Updating the DCC Curation Lifecycle Model 2021-03-05T03:15:13+00:00 Sayeed Choudhury Caihong Huang Carole L. Palmer <p class="Abstract">The DCC Curation Lifecycle Model has played a vital role in the field of data curation for over a decade. During that time, the scale and complexity of data have changed dramatically, along with the contexts of data production and use. This paper reports on a study examining factors impacting data curation practices and presents recommendations for updating the DCC Curation Lifecycle Model. The study was grounded in a review of other lifecycle models and informed by a site visit to the Digital Curation Centre and consultation with expert practitioners and researchers. Framed by contemporary conditions impacting the conduct of research and the provision of data services, the analysis and proposed recommendations account for the prominence of machine-actionable data, the importance of machine learning for data processing and analytics, the growth of integrated research workflows, and escalating concerns with the fairness, accountability, and transparency of data and algorithms.</p> 2020-12-31T15:34:05+00:00

Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL) 2021-03-05T03:15:18+00:00 Roberto Di Cosmo Morane Gruenpeter Bruno Marmol Alain Monteil Laurent Romary Jozefina Sadowska <p class="Abstract">Software has become an indissociable support of technical and scientific knowledge.
The preservation of this universal body of knowledge is as essential as preserving research articles and data sets. In the quest to make scientific results reproducible, and pass knowledge to future generations, we must preserve these three main pillars: research articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation.</p> <p>The collaboration between Software Heritage (SWH), the Center for Direct Scientific Communication (CCSD) and the scientific and technical information services (IES) of the French Institute for Research in Computer Science and Automation (Inria) has resulted in a specified moderation and curation workflow for research software artifacts deposited in HAL, the French global open access repository. The curation workflow was developed to help digital librarians and archivists handle this new and peculiar artifact: software source code. While implementing the workflow, a set of guidelines has emerged from the challenges and the solutions put in place to help all actors involved in the process.</p> 2020-08-05T17:09:19+01:00 ##submission.copyrightStatement## Mutually Assured Preservation: Fostering Active Preservation Practice through Fire Drills 2021-03-05T03:15:12+00:00 Bradley Daigle <p class="Abstract">Sound preservation practice is a series of active engagements with the content one hopes to preserve. In practice, however, this has not always been the case. Both institutions and services—while not actively encouraging passive preservation—neglect the key components in the stewardship of our historical record. In other words, there is much more to preservation than simply choosing a storage solution and placing one’s content there. The materials need to be verified, checked, and tested against expectations within the service. This is accepted practice for many.
However, very few services provide the necessary assurance to test both their own expectations and those of their depositors. Creating a methodology for both depositor and service to be assured that preservation meets expectations is critical. This is happening in very select ways. This paper discusses one such dialogue and its function.</p> <p>&nbsp;</p> 2020-12-31T15:48:23+00:00 ##submission.copyrightStatement## Inter-Organisational Coordination Work in Digital Curation: the Case of Eurobarometer 2021-03-05T03:15:15+00:00 Kristin Eschenfelder Kalpana Shankar <div class="WordSection1"> <p class="Abstract">Open research is predicated upon seamless access to curated research data. Major national and European funding schemes, such as Horizon Europe, strongly encourage or require publicly funded data to be FAIR - that is, Findable, Accessible, Interoperable, Reusable (Wilkinson, 2016). What underpins such initiatives are the many data organizations and repositories working with their stakeholders and each other to establish policies and practices, implement them, and do the curatorial work to increase the availability, discoverability, and accessibility of high quality research data. However, such work has often been invisible and underfunded, necessitating creative and collaborative solutions.</p> <p class="Abstract">In this paper, we briefly describe one such case from social science data: the processing of the Eurobarometer data set. Using content analysis of administrative documents and interviews, we detail how European data archives managed the tensions of curatorial work across borders and jurisdictions from the 1970s to the mid-2000s, the challenges that they faced in distributing work, and the solutions they found.
In particular, we look at the interactions of the Council of European Social Science Data Archives (CESSDA) and social science data organizations (DOs) like UKDA, ICPSR, and GESIS and the institutional and organizational collaborations that made Eurobarometer “too big to fail”. We describe some of the invisible work that they undertook in the past in making data in Europe findable, accessible, and interoperable, and conclude with implications for “frictionless” data access and reuse today.</p> </div> <p>&nbsp;</p> 2020-08-12T12:28:26+01:00 ##submission.copyrightStatement## Identifying Opportunities for Collective Curation During Archaeological Excavations 2021-03-05T03:15:17+00:00 Ixchel Faniel Anne Austin Sarah Whitcher Kansa Eric Kansa Jennifer Jacobs Phoebe France <p>Archaeological excavations are conducted by interdisciplinary teams that create, manage, and share data as they unearth and analyse material culture. These team-based settings are ripe for collective curation during these data lifecycle stages. However, findings from four excavation sites show that the data interdisciplinary teams create are not well integrated. Knowing this, we recommended opportunities for collective curation to improve use and reuse of the data within and outside of the team.</p> 2020-08-06T15:36:05+01:00 ##submission.copyrightStatement## Sustaining Digital Humanities Collections: Challenges and Community-Centred Strategies 2021-03-05T03:15:12+00:00 Katrina Simone Fenlon <div class="WordSection1"> <p class="Abstract">Since the advent of digital scholarship in the humanities, decades of extensive, distributed scholarly efforts have produced a digital scholarly record that is increasingly scattered, heterogeneous, and independent of curatorial institutions. Digital scholarship produces collections with unique scholarly and cultural value—collections that serve as hubs for collaboration and communication, engage broad audiences, and support new research.
Yet, lacking systematic support for digital scholarship in libraries, digital humanities collections are facing a widespread crisis of sustainability. This paper provides outcomes of a multimodal study of sustainability challenges confronting digital collections in the humanities, characterizing institutional and community-oriented strategies for sustaining collections. Strategies that prioritize community engagement with collections and the maintenance of sociotechnical workflows suggest possibilities for novel approaches to collaborative, community-centred sustainability for digital humanities collections.</p> </div> 2020-12-31T15:49:15+00:00 ##submission.copyrightStatement## Cross-tier Web Programming for Curated Databases: A Case Study 2021-03-05T03:15:19+00:00 Simon Fowler Simon Harding Joanna Sharman James Cheney <p class="Abstract">Curated databases have become important sources of information across several scientific disciplines, and as the result of the manual work of experts, often become important reference works. Features such as provenance tracking, archiving, and data citation are widely regarded as important features for curated databases, but implementing such features is challenging, and small database projects often lack the resources to do so.</p> <p class="Abstract">A scientific database application is not just the relational database itself, but also an ecosystem of web applications to display the data, and applications which allow data curation. Supporting advanced curation features requires changing all of these components, and there is currently no way to provide such capabilities in a reusable way.</p> <p class="Abstract">Cross-tier programming languages have been proposed to simplify the creation of web applications, where developers can write an application in a single, uniform language.
Consequently, database queries and updates can be written in the same language as the rest of the program, and at least in principle, it should be possible to provide curation features reusably via program transformations. As a first step towards this goal, it is important to establish that realistic curated databases can be implemented in a cross-tier programming language.</p> <p class="Abstract">In this paper, we describe such a case study: reimplementing the web front end of a real-world scientific database, the IUPHAR/BPS Guide to Pharmacology (GtoPdb), in the Links cross-tier programming language. We show how programming language features such as language-integrated query simplify the development process, and rule out common errors. Through a comparative performance evaluation, we show that the Links implementation performs fewer database queries, while the time needed to handle the queries is comparable to the Java version. Furthermore, while there is some overhead to using Links because of its relative immaturity compared to Java, the Links version is usable as a proof-of-concept case study of cross-tier programming for curated databases.</p> <p>[ This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review. The most up-to-date version of the paper can be found on arXiv <a href=""></a> ]</p> 2020-07-30T16:49:03+01:00 ##submission.copyrightStatement## Data Curator in the Middle: Curating Data for a Diverse Community of Stakeholders 2021-03-05T03:15:13+00:00 Ruth Geraghty <div class="WordSection1"> <p class="Abstract">The Prevention and Early Intervention Research Initiative is an archiving project to preserve the data and reports that were generated by twelve years of philanthropic and state investment into prevention and early intervention approaches in the children and youth sector in Ireland and Northern Ireland.
The investment resulted in an extensive collection of evaluation data and reports, which collectively provide an evidence base for continued investment into PEI programmes that are shown to be effective. In 2016, the Prevention and Early Intervention Research Initiative (PEI-RI) was established to preserve the outputs from these evaluations in the national data archives, as a publicly available evidence base. The political and social significance of this collection is manifest in the range of stakeholder groups that the project is engaging with, including the community and not-for-profit organisations that operated the PEI programmes, the research teams from academic institutions that evaluated these programmes, and representatives from government departments that co-funded many of these programmes with Atlantic.</p> <p class="Abstract">This paper tells the story of the PEI-RI archiving project, describing the steps we’ve taken since 2016 to preserve and promote the PEI data. During the course of the project we realised that it would not be enough to provide access to the data alone, as "[g]enerating and collating the evidence is of no use if it never reaches the commissioners and professionals who need it" (What Works Network, 2014, p. 6). In the second phase of our project we are creating a range of resources for practitioner and decision-maker audiences which provide a pathway to the data using the archival infrastructure.</p> <p class="Abstract">The project provides a case study of curating a digital collection that is intended for multiple stakeholders with different expectations of the archived material. The PEI-RI data curator is located in the middle of a triad of data creators, data consumers and data archives, and is tasked with balancing the interests, expectations and limitations of each.</p> </div> 2020-12-31T15:44:31+00:00 ##submission.copyrightStatement## Archivists Managing Research Data?
A Survey of Irish Organisations 2021-03-05T03:15:11+00:00 Rebecca Grant <div class="WordSection1"> <p class="Abstract">This paper describes a survey undertaken in 2017 to establish which research data management policies and practices were in place at Irish organisations; the extent to which archivists and records managers were employed to manage research data at those organisations; and the impact that archival skills have on research data management at an organisation. The paper describes the survey methods and data analysis, and presents findings including the presence of archivists and records managers at more than half of the surveyed organisations. Next steps for the research are also outlined.</p> </div> 2020-12-31T15:50:53+00:00 ##submission.copyrightStatement## Access Some Areas: Reforming Access Categories for Data in a Social Science Data Archive 2021-03-05T03:15:10+00:00 Laurence Horton Anja Perry <p class="Abstract">In this paper we outline the process of revising data access categories for research data sets in GESIS – a large European social science data archive based in Germany. The challenge is to create a minimal set of workable access conditions that a) facilitate “as open as possible, as closed as necessary” expectations for data reuse; and b) map onto existing legacy access categories and conditions in a data archive.</p> <p class="Abstract">The paper covers the work done in gathering data on data access categories used by data archives in their existing data catalogues, the choices offered to depositors of data in their user agreements, and work done by other data reuse platforms in categorising access to their data.
Finally, we talk through the process of refining a minimal set of data access conditions for the GESIS data archive.</p> <p>&nbsp;</p> 2020-12-31T15:57:09+00:00 ##submission.copyrightStatement## The Road to Partnership: A Stepwise, Iterative Approach to Organisational Collaboration in RDM, Archives and Records Management 2021-03-05T03:15:14+00:00 Michelle Harricharan Carly Manson Kirsten Hylan <p>Research data management (RDM) sits at the confluence of a number of related roles. The shape an RDM confluence takes depends on several factors, including the nature of an organisation and the research that it undertakes. At St George’s, University of London, the UK’s only university dedicated to medical and health sciences education, training and research, RDM has been intricately interwoven with organisational information governance roles since its inception. RDM is represented on our institutional Information Governance Steering Group and our Information Management Team, which consists of information governance, data protection, freedom of information, archives, records management and RDM.</p> <p>This paper reports on how RDM, archives and records management have collaborated using a step-wise, iterative process to streamline and harmonise our guidance and workflows in relation to the stewardship, curation and preservation of research data. As part of this we consistently develop, conduct and evaluate small projects on managing, curating and preserving data.
We present three projects that we collaborated on to transform research data services across each of our departments:</p> <ul> <li>planning for, conducting and reporting on interviews with wet laboratory researchers</li> <li>advocating, building a case for and delivering a university-wide digital preservation system</li> <li>ongoing work to recover, preserve and facilitate access to a unique national health database</li> </ul> <p>Learnings from these projects are used to develop our guidance, improve our activities and integrate our workflows, the outcomes of which may be further evaluated. Learnings are also used to improve our ways of working together. Through deeper integration of our activities and workflows, rather than simply aligning aspects of our work, we are increasingly becoming partners on research data stewardship, curation and preservation. This approach offers several benefits to the organisation as it allows us to build on our related knowledge and skills and deliver outcomes that demonstrate greater value to the organisation and the researchers we support.</p> 2020-10-25T22:32:16+00:00 ##submission.copyrightStatement## Extending the Research Data Toolkit: Data Curation Primers 2021-03-05T03:19:17+00:00 Cynthia Hudson-Vitale Hannah Hadley Jennifer Moore Lisa Johnston Wendy Kozlowski Jake Carlson Mara Blake Joel Herndon <div class="WordSection1"> <p class="Abstract">Niche and proprietary data formats used in cutting-edge research and technology have specific curation considerations and challenges. The increased demand for subject liaisons, library archivists, and digital curators to curate this variety of data types created locally at an institution or organization poses difficulties. Subject liaisons possess discipline knowledge and expertise for a given domain or discipline and digital curation experts know how to properly steward data assets generally. 
Yet, a gap often exists between the expertise available within the organization and local curation needs.</p> <p class="Abstract">While many institutions and organizations have expertise in certain domains and areas, oftentimes the heterogeneous data types received for deposit extend beyond this expertise. Additionally, evolving research methods and new, cutting-edge technology used in research often result in unfamiliar and niche data formats received for deposit. Knowing how to ‘get started’ in curating these file types and formats can be a particular challenge. To address this need, the data curation community have been developing a new set of tools – data curation primers. These primers are evolving documents that detail a specific subject, disciplinary area or curation task, and that can be used as a reference or jump-start to curating research data. This paper will provide background on the data curation primers and their content, detail the process of their development, highlight the data curation primers published to date, emphasize how curators can incorporate these resources into workflows, and show curators how they can get involved and share their own expertise.</p> </div> 2021-03-05T17:27:48+00:00 ##submission.copyrightStatement## Research Data Management Policy and Practice in China 2021-03-05T03:15:09+00:00 Yingshen Huang Andrew Cox Laura Sbaffi <div class="WordSection1"> <p class="Abstract">On April 2, 2018, the State Council of China formally released a national research data management (RDM) policy “Measures for Managing Scientific Data”. The literature shows that university libraries have played an important role in supporting Research Data Management at an institutional level in countries in North America, Europe and Australasia.
The aim of this paper is to capture the current status of RDM in Chinese universities, in particular how university libraries have been involved in taking the agenda forward.</p> <p class="Abstract">This paper uses mixed methods: a website analysis of university policies and services; a questionnaire for university librarians; and semi-structured interviews. Findings from website analysis and questionnaires indicate that research data services (RDS) at a local level in Chinese universities are in their infancy. On the whole there is more evidence of activity in developing data repositories than support services. Despite the existence of a national policy there remain significant barriers to further service development, such as the lag in the creation of local policy, insufficient funding for technical infrastructure, shortages of staff skills in data curation, and language barriers to international data sharing and open science. RDS in Chinese university libraries are still lagging behind those in English-speaking countries and Europe.</p> </div> 2020-12-31T16:02:24+00:00 ##submission.copyrightStatement## Understanding the Data Management Plan as a Boundary Object through a Multi-Stakeholder Perspective 2021-03-05T03:19:16+00:00 Live Kvale Nils Pharo <div class="WordSection1"> <p class="Abstract">A three-phase Delphi study was used to investigate an emerging community for research data management in Norway and their understanding and application of data management plans (DMPs). The findings reveal visions of what the DMP should be as well as different practice approaches, yet the stakeholders present common goals. This paper discusses the different perspectives on the DMP by applying Star and Griesemer’s theory of boundary objects (Star &amp; Griesemer, 1989). The debate on what the DMP is and the findings presented are relevant to all research communities currently implementing DMP procedures and requirements.
The current discussions about DMPs tend to be distant from the active researchers and limited to the needs of funders and institutions rather than to the usefulness for researchers. By analysing the DMP as a boundary object, plastic and adaptable yet with a robust identity (Star &amp; Griesemer, 1989), and by translating between worlds where collaboration on data sharing can take place, we expand the perspectives and include all stakeholders. An understanding of the DMP as a boundary object can shift the focus from shaping a DMP which fulfils funders’ requirements to enabling collaboration on data management and sharing across domains using standardised forms.</p> </div> <p>&nbsp;[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2021-03-05T17:27:48+00:00 ##submission.copyrightStatement## “You say potato, I say potato” Mapping Digital Preservation and Research Data Management Concepts towards Collective Curation and Preservation Strategies 2021-03-05T03:15:16+00:00 Michelle Lindlar Pia Rudnik Sarah Jones Laurence Horton <div class="WordSection1"> <p class="Abstract">This paper explores models, concepts and terminology used in the Research Data Management and Digital Preservation communities. In doing so we identify several overlaps and mutual concerns where the advancements of one professional field can apply to and assist another. By focusing on what unites rather than divides us, and by adopting a more holistic approach we advance towards collective curation and preservation strategies.</p> </div> <p>&nbsp;</p> 2020-08-09T21:15:29+01:00 ##submission.copyrightStatement## Building the Picture Behind a Dataset 2021-03-05T03:15:10+00:00 Frances Madden Jan Ashton Jez Cope <p>As part of the European Commission-funded FREYA project, the British Library wanted to explore the possibility of developing provenance information in datasets derived from the British Library’s collections.
Provenance information is defined in this context as ‘information relating to the origin, source and curation of the datasets’. Provenance information is also identified within the FAIR principles as an important aspect of being able to reuse and understand research datasets.&nbsp;According to the FAIR principles, the aim is to understand how to cite and acknowledge the dataset as well as to understand how the dataset was created and has been processed. There is also reference to the importance of this metadata being machine readable. By enhancing the metadata of these datasets with additional persistent identifiers and metadata, a fuller picture of the datasets and their content could be understood. This also adds to the veracity and understanding of the dataset by its end users.</p> 2020-12-31T15:55:43+00:00 ##submission.copyrightStatement## Privacy Impact Assessments for Digital Repositories 2021-03-05T03:19:21+00:00 Abraham Mhaidli Libby Hemphill Florian Schaub Cundiff Jordan Andrea K. Thomer <p class="AbstractTitle">Trustworthy data repositories ensure the security of their collections. We argue they should also ensure the security of researcher and human subject data. Here we demonstrate the use of a privacy impact assessment (PIA) to evaluate potential privacy risks to researchers using the ICPSR’s Open Badges Research Credential System as a case study.
We present our workflow and discuss potential privacy risks and mitigations for those risks.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]&nbsp;</p> 2021-03-05T17:27:48+00:00 ##submission.copyrightStatement## Finding a Repository with the Help of Machine-Actionable DMPs: Opportunities and Challenges 2021-03-05T03:19:19+00:00 Simon Oblasser Tomasz Miksa Asanobu Kitamoto <p class="Abstract">Finding a suitable repository to deposit research data is a difficult task for researchers since the landscape consists of thousands of repositories and automated tool support is limited. Machine-actionable DMPs can improve the situation since they contain relevant context information in a structured and machine-friendly way and therefore enable automated support in repository recommendation.</p> <p class="Abstract">This work describes the current practice of repository selection and the available support today. We outline the opportunities and challenges of using machine-actionable DMPs to improve repository recommendation. By linking the use case of repository recommendation to the ten principles for machine-actionable DMPs, we show how this vision can be realized. A filterable and searchable repository registry that provides rich metadata for each indexed repository record is a key element in the architecture described. Taking repository registries as an example, we show that, by mapping machine-actionable DMP content and data policy elements to their filter criteria and querying their APIs, a ranked list of repositories can be suggested.</p> <p>&nbsp;[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2021-03-05T17:27:48+00:00 ##submission.copyrightStatement## Do Open Data Badges Influence Author Behaviour?
A Case Study at Springer Nature 2021-03-05T03:15:11+00:00 Rebecca Pearce Rebecca Grant <div class="WordSection1"> <p class="Abstract">Digital badges have previously been shown to incentivise journal authors to share their data openly. In this paper we introduce an Open data badging project at the Springer Nature journal <em>BMC Microbiology</em>. The development of the Open data badge is described, as well as the challenges of developing standard badging criteria and ensuring authors’ awareness of the badges. Next steps for the badging project are outlined, which are based on the experiences of the team assessing the badges, the number of badges awarded at the journal to date, and the results of an author survey.</p> </div> 2020-12-31T15:53:16+00:00 ##submission.copyrightStatement## CiTAR - Preserving Software-based Research 2021-03-05T03:15:12+00:00 Klaus Rechert Oleg Stobbe Oleg Zharkow Rafael Gieschke Dennis Wehrle <div class="WordSection1"> <p class="Abstract">In contrast to books or published articles, pure digital output of research projects is more fragile and, thus, more difficult to preserve, to make available, and to reuse within a wider research community.
Not only does a fast-growing format diversity in research data sets require additional software preservation, but today’s computer-assisted research disciplines also increasingly devote significant resources to creating new digital resources and software-based methods.</p> <p class="Abstract">In order to adopt the FAIR data principles, especially to ensure re-usability of a wide variety of research outputs, novel ways for preservation of software and additional digital resources are required, as well as their integration into existing research data management strategies.</p> <p class="Abstract">This article addresses preservation challenges and preservation options of containers and virtual machines to encapsulate software-based research methods as portable and preservable software-based research resources, and provides a preservation plan as well as an implementation.</p> </div> <p>&nbsp;</p> 2020-12-31T15:47:23+00:00 ##submission.copyrightStatement## Sustaining Software Preservation Efforts Through Use and Communities of Practice 2021-03-05T03:15:18+00:00 Fernando Rios Monique Lassere Judd Ethan Ruggill Ken S. McAllister <p class="Abstract">The brief history of software preservation efforts illustrates one phenomenon repeatedly: not unlike spinning a plate on a broomstick, it is easy to get things going, but difficult to keep them stable and moving. Within the context of video games and other forms of cultural heritage (where most software preservation efforts have lately been focused), this challenge has several characteristic expressions, some technical (e.g., the difficulty of capturing and emulating protected binary files and proprietary hardware), and some legal (e.g., providing archive users with access to preserved games in the face of variously threatening end user licence agreements).
In other contexts, such as the preservation of research-oriented software, there can be additional challenges, including insufficient awareness and training on unusual (or even unique) software and hardware systems, as well as a general lack of incentive for preserving “old data.” We believe that in both contexts, there is a relatively accessible solution: the fostering of communities of practice. Such groups are designed to bring together like-minded individuals to discuss, share, teach, implement, and sustain special interest groups—in this case, groups engaged in software preservation.</p> <p class="Abstract">In this paper, we present two approaches to sustaining software preservation efforts via community. The first is emphasizing within the community of practice the importance of “preservation through use,” that is, preserving software heritage by staying familiar with how it feels, looks, and works. The second approach for sustaining software preservation efforts is to convene direct and adjacent expertise to facilitate knowledge exchange across domain barriers to help address local needs; a sufficiently diverse community will be able (and eager) to provide these types of expertise on an as-needed basis. 
We outline here these sustainability mechanisms, then show how the networking of various domain-specific preservation efforts can be converted into a cohesive, transdisciplinary, and highly collaborative software preservation team.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-08-02T22:04:02+01:00 ##submission.copyrightStatement## Data Sources and Persistent Identifiers in the Open Science Research Graph of OpenAIRE 2021-03-05T03:15:09+00:00 Jochen Schirrwagen Alessia Bardi Andreas Czerniak Aenne Loehden Najla Rettberg Mike Mertens Paolo Manghi <p>In this article, we give an overview of the data source typologies used in OpenAIRE and provide an outline on the role of persistent identifiers in the aggregation, curation and provision workflows that lead to the generation of the Research Graph in OpenAIRE.</p> 2020-12-31T16:03:50+00:00 ##submission.copyrightStatement## Data Communities: Empowering Researcher-Driven Data Sharing in the Sciences 2021-03-05T03:19:21+00:00 Rebecca Springer Danielle Cooper <p>There is a growing perception that science can progress more quickly, more innovatively, and more rigorously when researchers share data with each other. However, many scientists are not engaging in data sharing and remain skeptical of its relevance to their work. As organizations and initiatives designed to promote STEM data sharing multiply – within, across, and outside academic institutions – there is a pressing need to decide strategically on the best ways to move forward. In this paper, we propose a new mechanism for conceptualizing and supporting STEM research data sharing.&nbsp;Successful data sharing happens within&nbsp;<em>data communities</em>, formal or informal groups of scholars who share a certain type of data with each other, regardless of disciplinary boundaries.
Drawing on the findings of four large-scale qualitative studies of research practices conducted by Ithaka S+R, as well as the scholarly literature, we identify what constitutes a data community and outline its most important features by studying three success stories, investigating the circumstances under which intensive data sharing is already happening. We contend that stakeholders who wish to promote data sharing – librarians, information technologists, scholarly communications professionals, and research funders, to name a few – should work to identify and empower&nbsp;<em>emergent data communities</em>. These are groups of scholars for whom a relatively straightforward technological intervention, usually the establishment of a data repository, could kickstart the growth of a more active data sharing culture. We conclude by offering recommendations for ways forward.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2021-03-05T17:27:48+00:00 ##submission.copyrightStatement## Three Approaches to Documenting Database Migrations 2021-03-05T03:15:17+00:00 Andrea K. Thomer Alexandria Jane Rayburn Allison R. B. Tyler <div class="WordSection1"> <p class="Abstract">Database migration is a crucial aspect of digital collections management, yet there are few best practices to guide practitioners in this work. There is also limited research on the patterns of use and processes motivating database migrations. In the “Migrating Research Data Collections” project, we are developing these best practices through a multi-case study of database and digital collections migration. We find that a first and fundamental problem faced by collection staff is a sheer lack of documentation about past database migrations. 
We contribute a discussion of ways information professionals can reconstruct missing documentation, and three approaches that others might take for documenting migrations going forward.</p> </div> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-08-06T13:56:03+01:00 ##submission.copyrightStatement## Complementary Data as Metadata: Building Context for the Reuse of Video Records of Practice 2021-03-05T03:15:14+00:00 Allison Rae Bobyak Tyler Kara Suzuka Elizabeth Yakel <p class="AfterHeading12">Data reuse is often dependent on context external to the data. At times, this context is actually additional data that helps data reusers better assess and/or understand the target data upon which they are focused. We refer to these data as complementary data and define these as data external to the target data which could be used as evidence in their own right. In this paper, we focus specifically on video records of practice (VROP) in education. Records of practice are a type of data that more broadly document events surrounding teaching and learning. Video records of practice are an interesting case of data reuse as they can be extensive (e.g., days or weeks of video of a classroom), result in large file sizes, and require both metadata and other complementary data in order for reusers to understand the events depicted in the video. Through our mixed methods study, consisting of a survey of data reusers in 4 repositories and 44 in-depth interviews, we identified the types of complementary data that assist reusers of video records of practice for teaching and/or research. While there were similarities in the types of complementary data identified as important to have when reusing VROP, the rationales and motivations for seeking out particular complementary data differed depending on whether the intended use was for teaching or research.
<span lang="EN-US">While metadata is an important and valuable means of describing data for reuse, data’s meaning is often constructed through comparison, verification, or elucidation in reference to other data.</span></p> <p>&nbsp;</p> 2020-11-01T19:09:35+00:00 ##submission.copyrightStatement## Embedding Analytics within the Curation of Scientific Workflows 2021-03-05T03:15:10+00:00 Gerard Weatherby Michael Robert Gryk <div class="WordSection1"> <p class="Abstract">This paper reports on the ongoing activities and curation practices of the National Center for Biomolecular NMR Data Processing and Analysis. Over the past several years, the Center has been developing and extending computational workflow management software for use by a community of biomolecular NMR spectroscopists. In previous work, we refactored the workflow system to utilize the PREMIS framework for reporting retrospective provenance, for sharing workflows between scientists, and for supporting data reuse. In this paper, we report on our recent efforts to embed analytics within the workflow execution and within provenance tracking. Important metrics for each of the intermediate datasets are included within the corresponding PREMIS intellectual object, which allows for both inspection of the operation of individual actors and visualization of the changes throughout a full processing workflow.</p> <p class="Abstract">These metrics can be viewed within the workflow management system or through standalone metadata widgets. Our approach is to support a hybrid of automated workflow execution as well as manual intervention and metadata management.
In this combination, the workflow system and metadata widgets encourage the domain experts to be avid curators of the data they create, fostering both computational reproducibility and scientific data reuse.</p> </div> <p>&nbsp;</p> 2020-12-31T15:58:09+00:00 ##submission.copyrightStatement## Towards a Risk Catalogue for Data Management Plans 2021-03-05T03:15:11+00:00 Franziska Weng Stella Thoben <div class="WordSection1"> <p class="Abstract">Although data management and its careful planning are not new topics, there is little literature on risk mitigation in data management plans (DMPs). We consider it a problem that DMPs do not include a structured approach for the identification or mitigation of risks, because such an approach would instil confidence and trust in the data and its stewards, and foster the successful conduct of data-generating projects, which are often funded research projects. In this paper, we present a lightweight approach for identifying general risks in DMPs. We introduce an initial version of a generic risk catalogue for funded research and similar projects. By analysing a selection of 13 DMPs for projects from multiple disciplines published by the Research Ideas and Outcomes (RIO) journal, we demonstrate that our approach is applicable to DMPs and transferable to multiple institutional constellations. As a result, the effort for integrating risk management in data management planning can be reduced.</p> </div> 2020-12-31T15:54:32+00:00 ##submission.copyrightStatement## Co-Creating Autonomy: Group Data Protection and Individual Self-determination within a Data Commons 2021-03-05T03:15:15+00:00 Janis Wong Tristan Henderson <div class="WordSection1"> <p class="Abstract">Recent privacy scandals such as Cambridge Analytica and the Nightingale Project show that data sharing must be carefully managed and regulated to prevent data misuse.
Data protection law, legal frameworks, and technological solutions tend to focus on controller responsibilities as opposed to protecting data subjects from the beginning of the data collection process. Using a case study of how data subjects can be better protected during data curation, we propose that a co-created data commons can protect individual autonomy over personal data through collective curation and rebalance power between data subjects and controllers.</p> </div> <p>&nbsp;</p> 2020-08-11T22:49:00+01:00 ##submission.copyrightStatement##