Implement data lifecycle management into and across data infrastructures
From GRDI2020
This is a GRDI recommendation. This recommendation is cited by Data Curation and Preservation.
Context and Challenges
As Francine Berman <ref>Francine Berman: Got data?: a guide to data preservation in the information age. Commun. ACM 51(12): 50-56 (2008)</ref> argues, not all data can be kept.
And not all data need to be kept. Some data may be unique and irrecoverable (e.g. interviews with survivors of the Shoah cannot be recreated once the survivors have died). Other data may become obsolete after several years, when scientific instruments have improved; may be re-created at little cost if lost; or may only be relevant during the lifetime of a specific project.
Recommendation
Develop frameworks for data lifecycle management in data infrastructures that acknowledge that not all data are born equal. For example, active data is managed for live usage, valuable data is versioned and preserved for reuse, and superseded or redundant data is disposed of. Data lifecycle management in data infrastructures needs to ensure reliability (i.e. valuable data are reliably retained), while at the same time keeping costs down and efficiency up (e.g. disposed data no longer occupies storage space and does not need to be migrated to fresh media).
This includes:
- data value assessment
- retention schedules: How long is data to be preserved (e.g. a minimum period imposed by legal or similar requirements), and what happens after that period has passed (e.g. re-evaluation, disposal)? These schedules need to be implemented in APIs and administrative data infrastructure processes, as well as in metadata.
- from redundancy to replication: While replication may be good for ensuring reliable data preservation (e.g. "lots of copies keep stuff safe"), uncontrolled redundancy may just increase costs, and fragmentation for the user.
- monitoring of usage patterns and quality assurance in lifecycle stages: e.g. when moving from "active use" to "archived", data may need to be validated and frozen for citability.
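As an illustration of how a retention schedule and a stage transition might be expressed in metadata and code, the following is a minimal sketch and not a reference implementation. All names (`Stage`, `Action`, `RetentionPolicy`, `DataSet`, `retention_action`, `archive`) are hypothetical and chosen for this example only. It assumes a retention period counted in days from creation, and it assumes that "freezing" a dataset for citability can be approximated by recording a SHA-256 checksum of its content.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Hypothetical lifecycle stages, following the recommendation text."""
    ACTIVE = "active"
    ARCHIVED = "archived"
    DISPOSED = "disposed"


class Action(Enum):
    """What a retention check can recommend for a dataset."""
    KEEP = "keep"
    RE_EVALUATE = "re-evaluate"
    DISPOSE = "dispose"


@dataclass
class RetentionPolicy:
    min_retention_days: int  # minimum period, e.g. due to legal requirements
    after_expiry: Action     # what happens once that period has passed


@dataclass
class DataSet:
    identifier: str
    created: date
    policy: RetentionPolicy
    stage: Stage = Stage.ACTIVE
    checksum: Optional[str] = None  # set when the dataset is frozen


def retention_action(ds: DataSet, today: date) -> Action:
    """Return KEEP within the retention period, else the policy's action."""
    expiry = ds.created + timedelta(days=ds.policy.min_retention_days)
    if today < expiry:
        return Action.KEEP
    return ds.policy.after_expiry


def archive(ds: DataSet, payload: bytes) -> None:
    """Move an active dataset to 'archived' and freeze it via a checksum.

    Assumption: a real infrastructure would also run domain-specific
    validation here; this sketch only records a fixity value.
    """
    if ds.stage is not Stage.ACTIVE:
        raise ValueError(f"{ds.identifier} is not in active use")
    ds.checksum = hashlib.sha256(payload).hexdigest()
    ds.stage = Stage.ARCHIVED


if __name__ == "__main__":
    policy = RetentionPolicy(min_retention_days=3650,
                             after_expiry=Action.RE_EVALUATE)
    ds = DataSet("survey-001", date(2010, 1, 1), policy)
    print(retention_action(ds, date(2015, 1, 1)).value)
    print(retention_action(ds, date(2021, 1, 1)).value)
    archive(ds, b"example content")
    print(ds.stage.value, ds.checksum)
```

Keeping the policy as plain data attached to each dataset is one possible design choice: it allows the same schedule to be exposed through an API, evaluated by administrative processes, and stored alongside other metadata.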
Stakeholders
Computer scientists need to extend current data infrastructures with capabilities for lifecycle management. Lifecycle management is a feature offered by various commercial data management software vendors (see e.g. Wikipedia), and SNIA has also built up relevant experience; this existing work should be embraced and made interoperable with public data infrastructures.
Last but certainly not least, the use of data lifecycle management needs to be embedded in the research workflow, and researchers need to be trained in it.
References
<references />