International Journal of Digital Curation 2020-12-02T03:27:09+00:00 IJDC Editorial Team Open Journal Systems <p>The IJDC publishes pre-prints, research papers, general articles and editorials on digital curation, research data management and related issues. &nbsp;It complements the International Conference on Digital Curation (IDCC) and includes selected proceedings.</p> An Exploratory Analysis of Social Science Graduate Education in Data Management and Data Sharing 2020-12-02T02:52:14+00:00 Ashley Doonan Dharma Akmon Evan Cosby <p class="Abstract"><span lang="EN-US">Effective data management and data sharing are crucial components of the research lifecycle, yet evidence suggests that many social science graduate programs are not providing training in these areas. The current exploratory study assesses how U.S. master’s and doctoral programs in the social sciences include formal, non-formal, and informal training in data management and sharing. We conducted a survey of 150 graduate programs across six social science disciplines, and used a mix of closed and open-ended questions focused on the extent to which programs provide such training and exposure. Results from our survey suggested a deficit of formal training in both data management and data sharing, limited non-formal training, and cursory informal exposure to these topics. Utilizing the results of our survey, we conducted a syllabus analysis to further explore the formal and non-formal content of graduate programs beyond self-report. Our syllabus analysis drew from an expanded seven social science disciplines for a total of 140 programs. The syllabus analysis supported our prior findings that formal and non-formal inclusion of data management and data sharing training is not common practice. Overall, in both the survey and the syllabus analysis we found a lack of both formal and non-formal training on data management and data sharing.
Our findings have implications for data repository staff and data service professionals as they consider their methods for encouraging data sharing and prepare for the needs of data depositors. These results can also inform the development and structuring of graduate education in the social sciences, so that researchers are trained early in data management and sharing skills and are able to benefit from making their data&nbsp;</span>available as early in their careers as possible.</p> 2020-07-22T12:16:49+01:00 ##submission.copyrightStatement## Towards Continuous Quality Control for Spoken Language Corpora 2020-12-02T02:53:01+00:00 Anne Ferger Hanna Hedeland <div class="WordSection1"> <p class="Abstract">This paper describes the development of a systematic approach to the creation, management and curation of linguistic resources, particularly spoken language corpora. It also presents first steps towards a framework for continuous quality control to be used within external research projects by non-technical users, and discusses various domain- and discipline-specific problems and individual solutions. The creation of spoken language corpora is a time-consuming and costly process, and the created resources often represent intangible cultural heritage, containing recordings of, for example, extinct languages or historical events. Since high quality resources are needed to enable re-use in as many future contexts as possible, researchers need to be provided with the necessary means for quality control. We believe that this includes methods and tools adapted to Humanities researchers as non-technical users, and that these methods and tools need to be developed to support existing tasks and goals of research projects.</p> </div> 2020-07-22T11:31:22+01:00 ##submission.copyrightStatement## Tool Selection Among Qualitative Data Reusers 2020-12-02T02:51:37+00:00 Rebecca D.
Frank Kara Suzuka Eric Johnson Elizabeth Yakel <p class="Abstract"><span lang="EN-US">This paper explores the tension between the tools that data reusers in the field of education prefer to use when working with qualitative video data and the tools that repositories make available to data reusers. Findings from this mixed-methods study show that </span><span lang="EN-US">data reusers utilizing qualitative video data did not use repository-based tools. Rather, they </span><span lang="EN-US">valued common, widely available tools that were collaborative and easy to use.</span></p> <p>&nbsp;</p> 2020-08-05T22:56:09+01:00 ##submission.copyrightStatement## The Red Queen in the Repository 2020-12-02T02:52:22+00:00 Joakim Philipson <div class="WordSection1"> <p class="Abstract">One of the grand curation challenges is to secure metadata quality in the ever-changing environment of metadata standards and file formats. As the Red Queen tells Alice in <em>Through the Looking-Glass</em>: “Now, here, you see, it takes all the running you can do, to keep in the same place.” That is, some “running” is needed to keep metadata records in a research data repository in place and fit for long-term use. One of the main tools for adapting to and keeping pace with the evolution of standards, formats and versions of standards in this ever-changing environment is the validation schema. Validation schemas are mainly seen as methods of checking data quality and fitness for use, but are also important for long-term preservation. We might like to think that our present (meta)data standards and formats are made for eternity, but in reality we know that standards evolve, formats change (some even become obsolete with time), and so do our needs for storage, searching and future dissemination for re-use. Eventually, we come to a point where transformation of our archival records and migration to other formats will be necessary.
This could also mean that even if the AIPs, the Archival Information Packages stay the same in storage, the DIPs, the Dissemination Information Packages that we want to extract from the archive are subject to change of format. Further, in order for archival information packages to be self-sustainable, as required in the OAIS model, it is important to take interdependencies between individual files in the information packages into account. This should be done already by the time of ingest and validation of the SIPs, the Submission Information Packages, and along the line at different points of necessary transformation/migration (from SIP to AIP, from AIP to DIP etc.), in order to counter obsolescence.</p> <p class="Abstract"><br>This paper investigates possible validation errors and missing elements in metadata records from three general purpose, multidisciplinary research data repositories – Figshare, Harvard’s Dataverse and Zenodo, and explores the potential effects of these errors on future transformation to AIPs and migration to other&nbsp;formats within a digital archive.</p> </div> <p>&nbsp;</p> 2020-07-22T12:02:11+01:00 ##submission.copyrightStatement## Facilitating Access to Restricted Data 2020-12-02T02:52:40+00:00 Allison Rae Bobyak Tyler <div class="WordSection1"> <p class="Abstract">The decision to allow users access to restricted and protected data is based on the development of trust in the user by data repositories. In this article, I propose a model of the process of trust development at restricted data repositories, a model which emphasizes the increasing levels of trust dependent on prior interactions between repositories and users. I find that repositories develop trust in their users through the interactions of four dimensions – promissory, experience, competence, and goodwill – that consider distinct types of researcher expertise and the role of a researcher’s reputation in the trust process. 
However, the processes used by repositories to determine a level of trust corresponding to data access are inconsistent and do not support the sharing of trusted users between repositories to maximize efficient yet secure access to restricted research data. I highlight the role of a researcher’s reputation as an important factor in trust development and trust transference, and discuss the implications of modelling the restricted data access process as a process of trust development.</p> </div> 2020-07-22T11:57:33+01:00 ##submission.copyrightStatement## Design and Implementation of the first Generic Archive Storage Service for Research Data in Germany 2020-12-02T02:53:48+00:00 Felix Bach Björn Schembera Jos van Wezel <p class="Abstract">Research data, as the true valuable good in science, must be saved and subsequently kept findable, accessible and reusable for a time span of several years for reasons of proper scientific conduct. However, managing long-term storage of research data is a burden for institutes and researchers. Because of the sheer size of the data and the required retention time, apt storage providers are hard to find.</p> <p class="Abstract">Aiming to solve this puzzle, the bwDataArchive project started development of a long-term research data archive that is reliable, cost effective and able to store multiple petabytes of data. The hardware consists of data storage on magnetic tape, interfaced with disk caches and nodes for data movement and access. On the software side, the High Performance Storage System (HPSS) was chosen for its proven ability to reliably store huge amounts of data. However, the implementation of bwDataArchive is not dependent on HPSS.
For authentication the bwDataArchive is integrated into the federated identity management for educational institutions in the State of Baden-Württemberg in Germany.</p> <p class="Abstract">The archive features data protection by means of a dual copy at two distinct locations on different tape technologies, data accessibility by common storage protocols, data retention assurance for more than ten years, data preservation with checksums, and data management capabilities supported by a flexible directory structure allowing sharing and publication. As of September 2019, the bwDataArchive holds over 9 PB and 90 million files and sees a constant increase in usage and users from many communities.</p> 2020-07-22T11:07:01+01:00 ##submission.copyrightStatement## Quality and Trust in the European Open Science Cloud 2020-12-02T02:50:06+00:00 Juan Carlos Bicarregui <div class="WordSection1"> <p class="Abstract">The European Open Science Cloud (EOSC) has the objective to provide a virtual environment offering open and seamless services for the re-use of research data across borders and scientific disciplines. This ambitious vision sets significant challenges that the research community must meet if the benefits of EOSC are to be realised. One of those challenges, which has both technical and cultural aspects, is to determine the “<em>Rules of Participation”</em> that enable users to assess the quality of the data and services provided through EOSC and thereby enable them to trust the data and services they access. This paper discusses some issues relevant to determining the Rules of Participation that will enable EOSC to meet these objectives.</p> </div> <p>&nbsp;</p> 2020-11-03T23:46:10+00:00 ##submission.copyrightStatement## Data Practices in Digital History 2020-12-02T02:53:14+00:00 Rongqian Ma Fanghui Xiao <div class="WordSection1"> <p class="Abstract">This paper presents an exploratory research project that investigates data practices in digital history research. 
Emerging from the 1950s and ’60s in the United States, digital history remains a charged topic among historians, requiring a new research paradigm that includes new concepts and methodologies, an intensive degree of interdisciplinary, inter-institutional, and international collaboration, and experimental forms of research sharing, publishing, and evaluation. Using mixed methods of interviews and a questionnaire, we identified data challenges in digital history research practices from three perspectives: ontology (e.g., the notion of data in historical research); workflow (e.g., data collection, processing, preservation, presentation and sharing); and challenges. Extending from the results, we also provide a critical discussion of the state of the art in digital history research, particularly with respect to metadata, data sharing, digital history training, collaboration, and the transformation of librarians’ roles in digital history projects. We conclude with provisional recommendations of better data practices for participants in digital history, from the perspective of library and information science.</p> </div> 2020-07-22T11:20:12+01:00 ##submission.copyrightStatement## A Review of the History, Advocacy and Efficacy of Data Management Plans 2020-12-02T03:27:09+00:00 Nicholas Andrew Smale Kathryn Unsworth Gareth Denyer Elise Magatova Daniel Barr <div> <p class="Abstract">Data management plans (DMPs) have increasingly been encouraged as a key component of institutional and funding body policy. Although DMPs necessarily place administrative burden on researchers, proponents claim that DMPs have myriad benefits, including enhanced research data quality, increased rates of data sharing, and institutional planning and compliance benefits.</p> </div> <div> <p class="Abstract">In this article, we explore the international history of DMPs and describe institutional and funding body DMP policy.
We find that economic and societal benefits from presumed increased rates of data sharing were the original driver for funding bodies to mandate DMPs. Today, 86% of UK Research Councils and 63% of US funding bodies require submission of a DMP with funding applications. Given that no major Australian funding bodies require DMP submission, it is of note that 37% of Australian universities have taken the initiative to internally mandate DMPs. Institutions both within Australia and internationally frequently promote the professional benefits of DMP use, and endorse DMPs as ‘best practice’. We analyse one such typical DMP implementation at a major Australian institution, finding that DMPs have low levels of apparent translational value. Indeed, an extensive literature review suggests there is very limited published systematic evidence that DMP use has any tangible benefit for researchers, institutions or funding bodies.</p> </div> <p>We are therefore led to question why DMPs have become the go-to tool for research data professionals and advocates of good data practice. By delineating multiple use-cases and highlighting the need for DMPs to be fit for intended purpose, we question the view that a good DMP is necessarily one that encompasses the entire data lifecycle of a project.
Finally, we summarise recent developments in the DMP landscape, and note a positive shift towards evidence-based research management through more researcher-centric, educative, and integrated DMP services.</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Selecting Efficient and Reliable Preservation Strategies 2020-12-02T02:50:18+00:00 Micah Altman Richard Landau <p class="Abstract">This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, and hierarchical modeling, and then uses empirically calibrated sensitivity analysis to identify effective strategies.</p> <p>Specifically, the framework formally defines an objective function for preservation that maps a set of preservation policies and a risk profile to a set of preservation costs and an expected collection loss distribution. In this framework, a curator’s objective is to select optimal policies that minimize expected loss subject to budget constraints. To estimate preservation loss under different policy conditions, we develop a statistical hierarchical risk model that includes four sources of risk: the storage hardware; the physical environment; the curating institution; and the global environment. We then employ a general discrete event-based simulation framework to evaluate the expected loss and the cost of employing varying preservation strategies under specific parameterization of risks.</p> <p>The framework offers flexibility for the modeling of a wide range of preservation policies and threats.
Since this framework is open source and easily deployed in a cloud computing environment, it can be used to produce analysis based on independent estimates of scenario-specific costs, reliability, and risks.</p> <p class="Abstract">We present results summarizing hundreds of thousands of simulations using this framework. This exploratory analysis points to a number of robust and broadly applicable preservation strategies, provides novel insights into specific preservation tactics, and provides evidence that challenges received wisdom.</p> 2020-09-29T23:14:51+01:00 ##submission.copyrightStatement## The CODATA-RDA Data Steward School 2020-12-02T03:26:58+00:00 Daniel Bangert Joy Davidson Steve Diggs Marjan Grootveld Hugh Shanahan Shanmugasundaram Venkataraman <p>Given the expected increase in demand for Data Stewards and Data Stewardship skills, it is clear that there is a need to develop training, education and CPD (continuous professional development) in this area.</p> <p>This paper provides a brief introduction to the origin of definitions of Data Stewardship. It also notes the present tendency towards equivalence between Data Stewardship skills and FAIR principles. It then focuses on one specific training event – the pilot Data Stewardship strand of the CODATA-RDA Research Data Science schools that by the time of the IDCC meeting will have been held in Trieste in August 2019.
The paper will discuss the overall curriculum for the pilot school, how it matches with the FAIR4S framework, and plans for getting feedback from the students.</p> <p>Finally, the paper discusses future plans for the school, in particular how to deepen the integration between the Data Stewardship strand and the Early Career Researcher strand.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Extending Support for Publishing Sensitive Research Data at the University of Bristol 2020-12-02T02:50:49+00:00 Zosia Beckles <div class="WordSection1"> <p class="Abstract">The University of Bristol Research Data Service was set up in 2014 to provide support and training for academic staff and postgraduate researchers in all aspects of research data management. As part of this, the data.bris Research Data Repository was developed to provide a publication platform for research data generated at the University of Bristol. Initially launched in 2015 to provide open access to data, since 2017 it has also been possible to publish access-controlled datasets containing sensitive data via this platform.</p> <p class="Abstract">The vast majority (90%) of datasets published are openly accessible, but there has been steady demand for access-controlled release of datasets containing information that is ethically or commercially sensitive. These cases require careful management of additional risk: for example, where datasets contain information on human participants, balancing the risk of re-identification with the need to provide robust data that maximises research value through re-use.
Many groups within the University of Bristol (for example, the Avon Longitudinal Study of Parents and Children) have extensive experience and expertise in this area, but it became apparent that there was a need to provide additional support for researchers who were not able to draw on the experience of these established groups. This practice paper describes the process of setting up a dedicated service to provide training and basic disclosure risk assessments in order to address these skills gaps, and outlines lessons learnt and future directions for the service.</p> </div> 2020-08-07T11:01:01+01:00 ##submission.copyrightStatement## Out of the Jar into the World! A Case Study on Storing and Sharing Vertebrate Data 2020-12-02T03:26:59+00:00 Susan Borda <div class="WordSection1"> <p>In 2018, the Deep Blue Repositories and Research Data Services (DBRRDS) team at the University of Michigan Library began working with the University of Michigan Museum of Zoology (UMMZ) to provide a persistent and sustainable (i.e., non-grant funded, institutionally supported) solution for their part of the National Science Foundation’s (NSF) openVertebrate (oVert) initiative. The objective of oVert is to digitize the scientific collections of thousands of vertebrate specimens stored in jars on museum shelves and make the data freely accessible to researchers, students, classrooms, and the general public anywhere in the world. The University of Michigan (U-M) is one of five scanning centers working on oVert and will contribute scans of more than 3,500 specimens from the UMMZ collections (Erickson 2017).</p> <p>In addition to ingesting scans, the project involved developing methods to work around several significant system constraints: Deep Blue Data’s file structure (flat files only, no folders) and the closed use of Specify, UMMZ’s specimen database, for specimen metadata.
DBRRDS had to create a completely new workflow for handling batch deposits at regular intervals, develop scripts to reorganize the data (according to a third-party data model) and augment the metadata using a third-party resource, Global Biodiversity Information Facility (GBIF).</p> <p class="Abstract">This paper will describe the following aspects of the UMMZ CT Scanning Project partnership in greater detail: data generation, metadata requirements, workflows, code development, lessons learned, and next steps.</p> <p class="Abstract">[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> </div> <p>&nbsp;</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Piloting a Community of Student Data Consultants that Supports and Enhances Research Data Services 2020-12-02T03:26:57+00:00 Jonathan S Briganti Andrea Ogier Anne M. Brown <div class="WordSection1"> <p class="Abstract">Research ecosystems within university environments are continuously evolving and requiring more resources and domain specialists to assist with the data lifecycle. Typically, academic researchers and professionals are overcommitted, making it challenging to be up-to-date on recent developments in best practices of data management, curation, transformation, analysis, and visualization. Recently, research groups, university core centers, and Libraries are revitalizing these services to fill in the gaps to aid researchers in finding new tools and approaches to make their work more impactful, sustainable, and replicable. In this paper, we report on a student consultation program built within the University Libraries, that takes an innovative, student-centered approach to meeting the research data needs in a university environment while also providing students with experiential learning opportunities. 
This student program, DataBridge, trains students to work in multi-disciplinary teams and as student consultants to assist faculty, staff, and students with their real-world, data-intensive research challenges. Centering DataBridge in the Libraries allows students the unique opportunity to work across all disciplines, on problems and in domains that some students may not interact with during their college careers. To encourage students from multiple disciplines to participate, we developed a scaffolded curriculum that allows students from any discipline and skill level to quickly develop the essential data science skill sets and begin contributing their own unique perspectives and specializations to the research consultations. These students, mentored by Informatics faculty in the Libraries, provide research support that can ultimately impact the entire research process. Through our pilot phase, we have found that DataBridge enhances the utilization and openness of data created through research, extends the reach and impact of the work beyond the researcher’s specialized community, and creates a network of student “data champions” across the University who see the value in working with the Library. Here, we describe the evolution of the DataBridge program and outline its unique role in both training the data stewards of the future with regard to FAIR data practices, and in contributing significant value to research projects at Virginia Tech. 
Ultimately, this work highlights the need for innovative, strategic programs that encourage and enable real-world experience of data curation, data analysis, and data publication for current researchers, all while training the next generation of researchers in these best practices.</p> <p class="Abstract">[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> </div> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Role of Content Analysis in Improving the Curation of Experimental Data 2020-12-02T02:51:24+00:00 João Daniel Aguiar Castro Cristiana Landeira João Rocha da Silva Cristina Ribeiro <div class="WordSection1"> <p class="Abstract">As researchers are increasingly seeking tools and specialized support to perform research data management activities, collaboration with data curators can be fruitful. Yet, establishing a timely collaboration between researchers and data curators, grounded in sound communication, is often demanding. In this paper we propose manual content analysis as an approach to streamline the data curator workflow. With content analysis, curators can obtain domain-specific concepts used to describe experimental configurations in scientific publications, making it easier for researchers to understand the notion of metadata and supporting the development of metadata tools. We present three case studies from experimental domains, one related to sustainable chemistry, one to photovoltaic generation and another to nanoparticle synthesis. The curator started by performing content analysis in research publications, proceeded to create a metadata template based on the extracted concepts, and then interacted with researchers. The approach was validated by the researchers with a high rate of accepted concepts, 84 per cent. Researchers also provided feedback on how to improve some proposed descriptors.
Content analysis has the potential to be a practical, proactive task, which can be extended to multiple experimental domains and bridge the communication gap between curators and researchers.</p> <p class="Abstract">[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> </div> 2020-08-06T12:03:27+01:00 ##submission.copyrightStatement## Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL) 2020-12-02T02:51:47+00:00 Roberto Di Cosmo Morane Gruenpeter Bruno Marmol Alain Monteil Laurent Romary Jozefina Sadowska <p class="Abstract">Software has become an indissociable support of technical and scientific knowledge. The preservation of this universal body of knowledge is as essential as preserving research articles and data sets. In the quest to make scientific results reproducible, and pass knowledge to future generations, we must preserve these three main pillars:&nbsp; research articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation.</p> <p>The collaboration between Software Heritage (SWH), the Center for Direct Scientific Communication (CCSD) and the scientific and technical information services (IES) of The French Institute for Research in Computer Science and Automation (Inria) has resulted in a specified moderation and curation workflow for research software artifacts deposited in HAL, the French global open access repository. The curation workflow was developed to help digital librarians and archivists handle this new and peculiar artifact - software source code.
While implementing the workflow, a set of guidelines has emerged from the challenges and the solutions put in place to help all actors involved in the process.</p> 2020-08-05T17:09:19+01:00 ##submission.copyrightStatement## Inter-Organisational Coordination Work in Digital Curation: the Case of Eurobarometer 2020-12-02T02:50:27+00:00 Kristin Eschenfelder Kalpana Shankar <div class="WordSection1"> <p class="Abstract">Open research is predicated upon seamless access to curated research data. Major national and European funding schemes, such as Horizon Europe, strongly encourage or require publicly funded data to be FAIR, that is, Findable, Accessible, Interoperable, Reusable (Wilkinson, 2016). What underpins such initiatives are the many data organizations and repositories working with their stakeholders and each other to establish policies and practices, implement them, and do the curatorial work to increase the availability, discoverability, and accessibility of high quality research data. However, such work has often been invisible and underfunded, necessitating creative and collaborative solutions.</p> <p class="Abstract">In this paper, we briefly describe one such case from social science data: the processing of the Eurobarometer data set. Using content analysis of administrative documents and interviews, we detail how European data archives managed the tensions of curatorial work across borders and jurisdictions from the 1970s to the mid-2000s, the challenges that they faced in distributing work, and the solutions they found. In particular, we look at the interactions of the Council of European Social Science Data Archives (CESSDA) and social science data organizations (DO) like UKDA, ICPSR, and GESIS and the institutional and organizational collaborations that made Eurobarometer “too big to fail”.
We describe some of the invisible work that they undertook in the past in making data in Europe findable, accessible, and interoperable, and conclude with implications for “frictionless” data access and reuse today.</p> </div> <p>&nbsp;</p> 2020-08-12T12:28:26+01:00 ##submission.copyrightStatement## Identifying Opportunities for Collective Curation During Archaeological Excavations 2020-12-02T02:51:02+00:00 Ixchel Faniel Anne Austin Sarah Whitcher Kansa Eric Kansa Jennifer Jacobs Phoebe France <p>Archaeological excavations are carried out by interdisciplinary teams that create, manage, and share data as they unearth and analyse material culture. These team-based settings are ripe for collective curation during these data lifecycle stages. However, findings from four excavation sites show that the data interdisciplinary teams create are not well integrated. Knowing this, we recommend opportunities for collective curation to improve use and reuse of the data within and outside of the team.</p> 2020-08-06T15:36:05+01:00 ##submission.copyrightStatement## Cross-tier Web Programming for Curated Databases: a Case Study 2020-12-02T02:52:08+00:00 Simon Fowler Simon Harding Joanna Sharman James Cheney <p class="Abstract">Curated databases have become important sources of information across several scientific disciplines, and as the result of manual work of experts, often become important reference works. Features such as provenance tracking, archiving, and data citation are widely regarded as important features for curated databases, but implementing such features is challenging, and small database projects often lack the resources to do so.</p> <p class="Abstract">A scientific database application is not just the relational database itself, but also an ecosystem of web applications to display the data, and applications which allow data curation.
Supporting advanced curation features requires changing all of these components, and there is currently no way to provide such capabilities in a reusable way.</p> <p class="Abstract">Cross-tier programming languages have been proposed to simplify the creation of web applications, where developers can write an application in a single, uniform language. Consequently, database queries and updates can be written in the same language as the rest of the program, and at least in principle, it should be possible to provide curation features reusably via program transformations. As a first step towards this goal, it is important to establish that realistic curated databases can be implemented in a cross-tier programming language.</p> <p class="Abstract">In this paper, we describe such a case study: reimplementing the web front end of a real world scientific database, the IUPHAR/BPS Guide to Pharmacology (GtoPdb), in the Links cross-tier programming language. We show how programming language features such as language-integrated query simplify the development process and rule out common errors. Through a comparative performance evaluation, we show that the Links implementation performs fewer database queries, while the time needed to handle the queries is comparable to the Java version. Furthermore, while there is some overhead to using Links because of its immaturity compared to Java, the Links version is usable as a proof-of-concept case study of cross-tier programming for curated databases.</p> <p>[ This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review. 
The most up-to-date version of the paper can be found on arXiv <a href=""></a> ]</p> 2020-07-30T16:49:03+01:00 ##submission.copyrightStatement## The Road to Partnership: a Stepwise, Iterative Approach to Organisational Collaboration in RDM, Archives and Records Management 2020-12-02T02:50:13+00:00 Michelle Harricharan Carly Manson Kirsten Hylan <p>Research data management (RDM) sits at the confluence of a number of related roles. The shape an RDM confluence takes depends on several factors including the nature of an organisation and the research that it undertakes. At St George’s, University of London, the UK’s only university dedicated to medical and health sciences education, training and research, RDM has been intricately interwoven with organisational information governance roles since its inception. RDM is represented on our institutional Information Governance Steering Group and our Information Management Team consisting of information governance, data protection, freedom of information, archives, records management and RDM.</p> <p>This paper reports on how RDM, archives and records management have collaborated using a step-wise, iterative process to streamline and harmonise our guidance and workflows in relation to the stewardship, curation and preservation of research data. As part of this we consistently develop, conduct and evaluate small projects on managing, curating and preserving data. 
We present three projects that we collaborated on to transform research data services across each of our departments:</p> <ul> <li>planning for, conducting and reporting on interviews with wet laboratory researchers</li> <li>advocating, building a case for and delivering a university-wide digital preservation system</li> <li>ongoing work to recover, preserve and facilitate access to a unique national health database</li> </ul> <p>Learnings from these projects are used to develop our guidance, improve our activities and integrate our workflows, the outcomes of which may be further evaluated. Learnings are also used to improve our ways of working together. Through deeper integration of our activities and workflows, rather than simply aligning aspects of our work, we are increasingly becoming partners on research data stewardship, curation and preservation. This approach offers several benefits to the organisation as it allows us to build on our related knowledge and skills and deliver outcomes that demonstrate greater value to the organisation and the researchers we support.</p> 2020-10-25T22:32:16+00:00 ##submission.copyrightStatement## Understanding the Data Management Plan as a Boundary Object through a Multi-stakeholder perspective 2020-12-02T03:26:56+00:00 Live Kvale Nils Pharo <div class="WordSection1"> <p class="Abstract">A three-phase Delphi study was used to investigate an emerging community for research data management in Norway and their understanding and application of data management plans (DMPs). The findings reveal visions of what the DMP should be as well as different practice approaches, yet the stakeholders present common goals. This paper discusses the different perspectives on the DMP by applying Star and Griesemer’s theory of boundary objects (Star &amp; Griesemer, 1989). The debate on what the DMP is and the findings presented are relevant to all research communities currently implementing DMP procedures and requirements. 
The current discussions about DMPs tend to be distant from the active researchers and limited to the needs of funders and institutions rather than to the usefulness for researchers. By analysing the DMP as a boundary object, plastic and adaptable yet with a robust identity (Star &amp; Griesemer, 1989), and by translating between worlds where collaboration on data sharing can take place we expand the perspectives and include all stakeholders. An understanding of the DMP as a boundary object can shift the focus from shaping a DMP which fulfils funders’ requirements to enabling collaboration on data management and sharing across domains using standardised forms.</p> </div> <p>&nbsp;[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## “You say potato, I say potato“ Mapping Digital Preservation and Research Data Management Concepts towards Collective Curation and Preservation Strategies 2020-12-02T02:50:39+00:00 Michelle Lindlar Pia Rudnik Sarah Jones Laurence Horton <div class="WordSection1"> <p class="Abstract">This paper explores models, concepts and terminology used in the Research Data Management and Digital Preservation communities. In doing so we identify several overlaps and mutual concerns where the advancements of one professional field can apply to and assist another. By focusing on what unites rather than divides us, and by adopting a more holistic approach we advance towards collective curation and preservation strategies.</p> </div> <p>&nbsp;</p> 2020-08-09T21:15:29+01:00 ##submission.copyrightStatement## Privacy Impact Assessments for Digital Repositories 2020-12-02T03:27:01+00:00 Abraham Mhaidli Libby Hemphill Florian Schaub Cundiff Jordan Andrea K. Thomer <p class="AbstractTitle">Trustworthy data repositories ensure the security of their collections. We argue they should also ensure the security of researcher and human subject data. 
Here we demonstrate the use of a privacy impact assessment (PIA) to evaluate potential privacy risks to researchers using the ICPSR’s Open Badges Research Credential System as a case study. We present our workflow and discuss potential privacy risks and mitigations for those risks.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]&nbsp;</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Finding a Repository with the Help of Machine-Actionable DMPs: Opportunities and Challenges 2020-12-02T03:26:58+00:00 Simon Oblasser Tomasz Miksa Asanobu Kitamoto <p class="Abstract">Finding a suitable repository to deposit research data is a difficult task for researchers since the landscape consists of thousands of repositories and automated tool support is limited. Machine-actionable DMPs can improve the situation since they contain relevant context information in a structured and machine-friendly way and therefore enable automated support in repository recommendation.</p> <p class="Abstract">This work describes the current practice of repository selection and the available support today. We outline the opportunities and challenges of using machine-actionable DMPs to improve repository recommendation. By linking the use case of repository recommendation to the ten principles for machine-actionable DMPs, we show how this vision can be realized. A filterable and searchable repository registry that provides rich metadata for each indexed repository record is a key element in the architecture described. 
Using the example of repository registries, we show that, by mapping machine-actionable DMP content and data policy elements to their filter criteria and querying their APIs, a ranked list of repositories can be suggested.</p> <p>&nbsp;[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Sustaining Software Preservation Efforts Through Use and Communities of Practice 2020-12-02T02:52:02+00:00 Fernando Rios Monique Lassere Judd Ethan Ruggill Ken S. McAllister <p class="Abstract">The brief history of software preservation efforts illustrates one phenomenon repeatedly: not unlike spinning a plate on a broomstick, it is easy to get things going, but difficult to keep them stable and moving. Within the context of video games and other forms of cultural heritage (where most software preservation efforts have lately been focused), this challenge has several characteristic expressions, some technical (e.g., the difficulty of capturing and emulating protected binary files and proprietary hardware), and some legal (e.g., providing archive users with access to preserved games in the face of variously threatening end user licence agreements). In other contexts, such as the preservation of research-oriented software, there can be additional challenges, including insufficient awareness and training on unusual (or even unique) software and hardware systems, as well as a general lack of incentive for preserving “old data.” We believe that in both contexts, there is a relatively accessible solution: the fostering of communities of practice. Such groups are designed to bring together like-minded individuals to discuss, share, teach, implement, and sustain special interest groups—in this case, groups engaged in software preservation.</p> <p class="Abstract">In this paper, we present two approaches to sustaining software preservation efforts via community. 
The first is emphasizing within the community of practice the importance of “preservation through use,” that is, preserving software heritage by staying familiar with how it feels, looks, and works. The second approach for sustaining software preservation efforts is to convene direct and adjacent expertise to facilitate knowledge exchange across domain barriers to help address local needs; a sufficiently diverse community will be able (and eager) to provide these types of expertise on an as-needed basis. We outline here these sustainability mechanisms, then show how the networking of various domain-specific preservation efforts can be converted into a cohesive, transdisciplinary, and highly collaborative software preservation team.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-08-02T22:04:02+01:00 ##submission.copyrightStatement## Data Communities: Empowering Researcher-Driven Data Sharing in the Sciences 2020-12-02T03:27:00+00:00 Rebecca Springer Danielle Cooper <p>There is a growing perception that science can progress more quickly, more innovatively, and more rigorously when researchers share data with each other. However, many scientists are not engaging in data sharing and remain skeptical of its relevance to their work. As organizations and initiatives designed to promote STEM data sharing multiply – within, across, and outside academic institutions – there is a pressing need to decide strategically on the best ways to move forward. In this paper, we propose a new mechanism for conceptualizing and supporting STEM research data sharing.&nbsp;Successful data sharing happens within&nbsp;<em>data communities</em>, formal or informal groups of scholars who share a certain type of data with each other, regardless of disciplinary boundaries. 
Drawing on the findings of four large-scale qualitative studies of research practices conducted by Ithaka S+R, as well as the scholarly literature, we identify what constitutes a data community and outline its most important features by studying three success stories, investigating the circumstances under which intensive data sharing is already happening. We contend that stakeholders who wish to promote data sharing – librarians, information technologists, scholarly communications professionals, and research funders, to name a few – should work to identify and empower&nbsp;<em>emergent data communities</em>. These are groups of scholars for whom a relatively straightforward technological intervention, usually the establishment of a data repository, could kickstart the growth of a more active data sharing culture. We conclude by offering recommendations for ways forward.</p> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-12-02T03:58:27+00:00 ##submission.copyrightStatement## Three Approaches to Documenting Database Migrations 2020-12-02T02:51:14+00:00 Andrea K. Thomer Alexandria Jane Rayburn Allison R. B. Tyler <div class="WordSection1"> <p class="Abstract">Database migration is a crucial aspect of digital collections management, yet there are few best practices to guide practitioners in this work. There is also limited research on the patterns of use and processes motivating database migrations. In the “Migrating Research Data Collections” project, we are developing these best practices through a multi-case study of database and digital collections migration. We find that a first and fundamental problem faced by collection staff is a sheer lack of documentation about past database migrations. 
We contribute a discussion of ways information professionals can reconstruct missing documentation, and three approaches that others might take for documenting migrations going forward.</p> </div> <p>[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]</p> 2020-08-06T13:56:03+01:00 ##submission.copyrightStatement## Complementary Data as Metadata: Building Context for the Reuse of Video Records of Practice 2020-12-02T02:50:11+00:00 Allison Rae Bobyak Tyler Kara Suzuka Elizabeth Yakel <p class="AfterHeading12">Data reuse is often dependent on context external to the data. At times, this context is actually additional data that helps data reusers better assess and/or understand the target data upon which they are focused. We refer to these data as complementary data and define these as data external to the target data which could be used as evidence in their own right. In this paper, we specifically focus on video records of practice in education. Records of practice are a type of data that more broadly document events surrounding teaching and learning. Video records of practice are an interesting case of data reuse as they can be extensive (e.g., days or weeks of video of a classroom), result in large file sizes, and require both metadata and other complementary data in order for reusers to understand the events depicted in the video. Through our mixed methods study, consisting of a survey of data reusers in 4 repositories and 44 in-depth interviews, we identified the types of complementary data that assist reusers of video records of practice for teaching and/or research. While there were similarities in the types of complementary data identified as important to have when reusing video records of practice, the rationales and motivations for seeking out particular complementary data differed depending on whether the intended use was for teaching or research. 
<span lang="EN-US">While metadata is an important and valuable means of describing data for reuse, data’s meaning is often constructed through comparison, verification, or elucidation in reference to other data.</span></p> <p>&nbsp;</p> 2020-11-01T19:09:35+00:00 ##submission.copyrightStatement## Extending the Research Data Toolkit: Data Curation Primers 2020-12-02T02:50:22+00:00 Cynthia Hudson-Vitale Hannah Hadley Jennifer Moore Lisa Johnston Wendy Kozlowski Jake Carlson Mara Blake Joel Herndon <p>Niche and proprietary data formats used in cutting-edge research and technology have specific curation considerations and challenges. The increased demand for subject liaisons, library archivists, and digital curators to curate this variety of data types created locally at an institution or organization poses difficulties. Subject liaisons possess discipline knowledge and expertise for a given domain or discipline and digital curation experts know how to properly steward data assets generally. Yet, a gap often exists between the expertise available within the organization and local curation needs.</p> <p>While many institutions and organizations have expertise in certain domains and areas, oftentimes the heterogeneous data types received for deposit extend beyond this expertise. Additionally, evolving research methods and new, cutting-edge technology used in research often result in unfamiliar and niche data formats received for deposit. Knowing how to ‘get-started’ in curating these file types and formats can be a particular challenge. To address this need, the data curation community have been developing a new set of tools - data curation primers. These primers are evolving documents that detail a specific subject, disciplinary area or curation task, and that can be used as a reference or jump-start to curating research data. 
This paper will provide background on the data curation primers and their content, detail the process of their development, highlight the data curation primers published to date, emphasize how curators can incorporate these resources into workflows, and show curators how they can get involved and share their own expertise.</p> 2020-08-19T16:38:55+01:00 ##submission.copyrightStatement## Co-Creating Autonomy: Group Data Protection and Individual Self-determination within a Data Commons 2020-12-02T02:50:32+00:00 Janis Wong Tristan Henderson <div class="WordSection1"> <p class="Abstract">Recent privacy scandals such as Cambridge Analytica and the Nightingale Project show that data sharing must be carefully managed and regulated to prevent data misuse. Data protection law, legal frameworks, and technological solutions tend to focus on controller responsibilities as opposed to protecting data subjects from the beginning of the data collection process. Using a case study of how data subjects can be better protected during data curation, we propose that a co-created data commons can protect individual autonomy over personal data through collective curation and rebalance power between data subjects and controllers.</p> </div> <p>&nbsp;</p> 2020-08-11T22:49:00+01:00 ##submission.copyrightStatement##