Data Curation and Preservation

From GRDI2020

This is one of the "GRDI challenges".


Introduction

The notions "long-term preservation" and "data curation" vary between contexts and communities. As a common terminology basis we distinguish between three curation levels that build upon each other. These three levels are directly derived from the three aspects of digital objects distinguished by the preservation community (Thibodeau, 2002):

  1. Bit Preservation: physical - signs inscribed on a medium
  2. Content Preservation: logical - processable units
  3. Data Curation: conceptual - epistemic objects, what we deal with in the real world

Like the three aspects of digital objects they are derived from, these curation levels describe independent aspects, yet they build upon each other and can influence and support each other. Arguably, global data infrastructures should cover both bit preservation and content preservation.
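Bit preservation, the first level above, is commonly supported by fixity checks: a checksum is recorded for each file at ingest and recomputed later to detect silent corruption or loss. The following is a minimal sketch of that idea, assuming files on a local file system and SHA-256 as the checksum; the function names (sha256_of, build_manifest, verify_manifest) are hypothetical and not taken from any tool named on this page.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every regular file below root (at ingest time)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum changed."""
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

A check like this only shows that the bits are unchanged. It says nothing about whether the content is still processable or understandable, which is what the other two levels address.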

The following issues therefore need to be tackled to achieve data curation and preservation. How can we be sure that the important information we collect will be usable and understandable in the future? In particular, how can we fund our information resources in the long term? How can we share the costs and efforts required for sustainability? How can we decide what to preserve? (cf. HLEG on Scientific Data)

A task force commissioned by the Research Libraries Group provided one of the first comprehensive analyses of the challenges associated with digital preservation (1996). At that time, it was not yet widely known that, for example, invaluable data from NASA moon missions or the GBP 2.5 million BBC Domesday project were lost due to lack of data preservation. Following the 1996 report, numerous preservation initiatives were inaugurated, and by the publication of the UNESCO Guidelines for the Preservation of Digital Heritage (2003) at the latest, preservation challenges were widely acknowledged and addressed.

Today, there are numerous preservation initiatives at the local, national, and international level. The community is fairly well established, as indicated by the following exemplary list of existing standards and tools. Initially, the topics were often dominated by cultural heritage institutions (archives and libraries), while today the community is more diverse: commercial companies often have an inherent interest in preserving their corporate knowledge, and they are often obliged by law to retain data (e.g., banks are required by Basel II to retain data; pharmaceutical companies have to retain data about their medicines for decades after the last person who took the respective medicine has died). Increasingly, research funders require research data to be retained for at least 10 years so that it can be verified and reused in future research.

All in all, financial and legal constraints drive a more structured and international approach to data curation and preservation. Many of the issues are more organisational than technological, and they might need to be addressed in an on-going manner. By definition, there is no one-shot solution to digital preservation and curation. Nevertheless, there are still various (technological) issues to be resolved in order to establish a stable, decentralised ecosystem for digital preservation and curation.


Standards

  1. OAIS (ISO standard)
  2. TRAC (nascent ISO standard)
  3. PREMIS and METS (community standards, hosted by an influential organisation, in this case the Library of Congress)

Tools

  1. JHOVE - format validation and metadata extraction
  2. CRiB - format conversion
  3. PRONOM - format registry (community service, hosted by an influential organisation, in this case the UK National Archives)
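Format validation and format registries both depend on being able to tell which format a file is actually in, independent of its file name. One common approach is to compare the first bytes of a file against known signatures ("magic numbers"). The sketch below illustrates that general idea only: the signature table is a tiny hand-written sample, not data from any registry, and identify_format is a hypothetical helper that does not use the API of any tool listed above.

```python
from typing import Optional

# Illustrative sample of well-known leading byte signatures.
# A real registry holds far more formats and far richer signature rules.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"PK\x03\x04": "ZIP container",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None
```

A matching signature only suggests a format. Validation in the stricter sense also checks that the rest of the file conforms to the format specification.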

10-year vision

(from a user perspective) 10 years from now, there will be defined pathways to preservation and curation, and researchers will at all times be aware of the consequences of their data management actions:

  • How likely is it that the data will remain available?
  • How comprehensible, retrievable, and reusable will the data be?
  • How much will it cost, in terms of both money and time?


(from an infrastructure perspective) 10 years from now, there will be interacting infrastructures:

  • supporting each other in different aspects of preservation and curation (e.g. bit-preservation, format conversion, data validation)
  • with interoperable paths between them (costs or data loss between them will be transparent and can be accounted for)
  • ensuring scalability, availability, and recovery through, e.g., mechanisms for disaster recovery, infrastructure monitoring, and planning that are supra-institutional (e.g. inter-/national, community-wide)


(from a policy and funder perspective) 10 years from now, there will be:

  • an international community of practice across disciplinary borders that ensures that experience is passed on and that funds are put to best use
  • competency clusters forming around focus areas (i.e. a modular, international ecosystem rather than local, e.g. discipline-specific or regional, all-in-one solutions) that ensure sustainable services and economies of scale
  • a nascent commercial market for various aspects of preservation and curation, e.g. curation consultants, providers of migration services, insurance against data loss (cf. reinsurance)


State of the art

One of the first comprehensive analyses of the challenges associated with digital preservation was provided by a Task Force commissioned by the Research Libraries Group (1996). At that time it was not yet widely known that, for example, invaluable data from NASA moon missions or the GBP 2.5 million BBC Domesday project were lost due to lack of data preservation. Following the 1996 report, numerous preservation initiatives were inaugurated, and by the publication of the UNESCO Guidelines for the Preservation of Digital Heritage (2003) at the latest, preservation challenges were widely acknowledged and addressed.

Today there are numerous preservation initiatives on a local, national, and international level. The community is fairly well established, as the exemplary list of existing standards and tools above indicates. While the topics were initially often dominated by cultural heritage institutions (archives and libraries), today the community is more diverse: commercial companies often have an inherent interest in preserving their corporate knowledge, and they are often obliged by law to retain data (e.g. banks are required by Basel II to retain data; pharmaceutical companies have to retain data about their medicines for decades after the last person who took the respective medicine has died). Also, research funders increasingly require research data to be retained for at least 10 years so that it can be verified and reused in future research.

All in all, financial and legal constraints drive a more structured and international approach to data curation and preservation. Many of the issues are more of an organisational than a technical nature, and they may need to be addressed in an ongoing manner - there is by definition no one-shot solution to digital preservation and curation. Nevertheless, there are still various (technical) issues to be resolved in order to establish a stable, decentralised ecosystem for digital preservation and curation.



Challenges and Recommendations

Data Curation and Preservation interacts with many other topics, particularly with Data Storage on a Bit-Preservation level and Data Use - Virtual Research Environments on a Curation level. The following hence represents a sample of pressing challenges, mostly on the content preservation level.



A related recommendation with regard to Bit Preservation is "Define reference SLAs to cluster data retention requirements".
