Define reference SLAs to cluster data retention requirements

From GRDI2020

This is a GRDI recommendation.


Context and challenges

Data centres are ready to take on some responsibility for data retention. Services could include the following.

  • bit preservation of data for a specific period of time (e.g., 10 years for good scientific practice and legal requirements)
  • mechanisms to ensure bit preservation (e.g., redundant data storage, migration to fresh media, backups, and integrity checks every few years)
  • different modes of accessibility (e.g., immediate access via a REST interface to online storage, or tape backup with a retrieval time of at most a few minutes)
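
The integrity checks mentioned above are commonly implemented as checksum comparisons across replicas. The following is a minimal sketch of that idea, not a description of any particular data centre's procedure. The function names `sha256_of` and `check_replicas`, and the choice of SHA-256 as the digest, are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_replicas(replicas: list[Path], expected_digest: str) -> dict[Path, bool]:
    """Report, per replica, whether it exists and still matches the digest
    recorded at ingest time (hypothetical helper, for illustration only)."""
    return {
        replica: replica.is_file() and sha256_of(replica) == expected_digest
        for replica in replicas
    }
```

A replica that is missing or whose digest has changed is reported as `False` and could then be restored from one of the intact copies.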

However, to offer this as a sustainable service, data centres need to leverage economies of scale and cluster data retention requirements into a select set of service offerings and service level agreements (SLAs). Currently, few business models exist for such data retention services (e.g., those of some communities and of the World Data Centers). Such service offers could stretch over many years (e.g., medical data might be legally required to be retained for more than 70 years after a medicine has been discontinued), which carries a level of financial risk that data centres often find difficult to cover from their institutional budgets. Commercial offerings like Amazon S3 offer no long-term guarantee that data will remain available (e.g., data might be lost if a data centre is damaged or if Amazon goes bankrupt).

Recommendation

Initiate a concerted effort among data centres, together with select user communities and with the support of policy makers and funding organisations, to define a set of SLAs for data retention and bit preservation. The outcome could be a set of reference SLAs, which data centres can implement, users can reference, and policy makers can support.

Stakeholders and Impact

Enforcing the basic validity criteria for data will make data reuse more efficient for the community itself and also for possible future inter- or trans-disciplinary activities. This involves a common effort by all stakeholders.

  • All stakeholders must work in concert to define reference SLAs.
  • Data centres need to transparently document their service and how it varies from reference SLAs. For example, if the SLA ensures "bit preservation", does that include data replication to geographically separate locations?
  • Users need to define their expectations and choose adequate SLAs accordingly. This could include retention schedules that define whether data can be disposed of after, for example, 10 years.
  • Policy makers have to define their expectations for good scientific practice, and funders need to account for the costs of data retention that specific research activities might entail.
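
To make the points above concrete, a reference SLA could be represented as a small machine-readable record, so that a data centre can document how its offering deviates from the reference and a user can derive a disposal date from the retention schedule. The sketch below is only one possible shape. The class `RetentionSLA`, its fields, the helper functions, and all values are hypothetical and are not taken from any agreed reference SLA.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class RetentionSLA:
    """Hypothetical SLA record; field names and units are assumptions."""
    name: str
    retention_years: int
    replica_count: int
    geo_replicated: bool
    max_retrieval_minutes: int


def deviations(reference: RetentionSLA, offered: RetentionSLA) -> dict[str, tuple]:
    """Map each differing term to a (reference value, offered value) pair."""
    return {
        f.name: (getattr(reference, f.name), getattr(offered, f.name))
        for f in fields(reference)
        if f.name != "name"
        and getattr(reference, f.name) != getattr(offered, f.name)
    }


def disposal_date(deposited: date, sla: RetentionSLA) -> date:
    """Earliest date on which data may be disposed of under the SLA."""
    target_year = deposited.year + sla.retention_years
    try:
        return deposited.replace(year=target_year)
    except ValueError:
        # Deposited on 29 February and the target year is not a leap year.
        return deposited.replace(year=target_year, day=28)


reference = RetentionSLA("reference-10y", 10, 2, True, 60)
offered = RetentionSLA("centre-x-10y", 10, 2, False, 60)
print(deviations(reference, offered))  # {'geo_replicated': (True, False)}
print(disposal_date(date(2020, 2, 29), offered))  # 2030-02-28
```

Here the offered service differs from the reference only in geographic replication, which is exactly the kind of variation a data centre would need to document.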