Data Provenance and Trust
From GRDI2020
This is a GRDI challenge.
Introduction
Provenance refers to the source of data, as well as how those data have been updated over time. Essentially, provenance is a key aspect in ensuring the authenticity of the data, which in turn builds user trust. As such, it can serve a number of important functions.
- Explanation: Users might be particularly interested in or wary of specific portions of a derived data set. Provenance supports “drilling down” to examine the sources and the evolution of data elements of interest, enabling a deeper understanding of the data.
- Verification: Derived data can appear suspect because of possible bugs in data processing and manipulation, because the data might be stale, or even because of malicious manipulation. Provenance enables auditing how data were produced, either to verify their correctness or to identify the source data or processing nodes responsible for erroneous or outdated results.
- Repeatability / Reuse: Having found outdated or incorrect source data, or buggy processing nodes, we might want to correct the errors and propagate the corrections forward to all “downstream” data that are affected. Provenance helps to recompute only those data elements that are affected by the corrections. Moreover, researchers can reuse existing data in different contexts, using different algorithms for analysis, merging them with other data, or applying wholly new research questions. The version trees created through data reuse are essentially established and described through provenance.
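The "drilling down" and "downstream" ideas in the list above can be illustrated with a small dependency graph. This is a minimal sketch, not a reference to any existing provenance system: the class ProvenanceGraph, its method names, and the file names are all hypothetical, and provenance is assumed to be recorded simply as "this item was derived from these sources".

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Toy provenance store (hypothetical): records which items were derived from which sources."""

    def __init__(self):
        self._derived_from = defaultdict(set)  # item -> its direct sources
        self._feeds_into = defaultdict(set)    # item -> items directly derived from it

    def record(self, derived, sources):
        """Record that `derived` was computed from each item in `sources`."""
        for source in sources:
            self._derived_from[derived].add(source)
            self._feeds_into[source].add(derived)

    @staticmethod
    def _reach(start, edges):
        """Breadth-first search: every node reachable from `start` along `edges`."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def sources_of(self, item):
        """Drill down: all direct and indirect sources of `item`."""
        return self._reach(item, self._derived_from)

    def affected_by(self, item):
        """Downstream set: everything that must be recomputed if `item` is corrected."""
        return self._reach(item, self._feeds_into)


graph = ProvenanceGraph()
graph.record("cleaned.csv", ["raw_a.csv", "raw_b.csv"])
graph.record("summary.csv", ["cleaned.csv"])
graph.record("map.png", ["raw_c.csv"])

print(sorted(graph.affected_by("raw_a.csv")))   # ['cleaned.csv', 'summary.csv']
print(sorted(graph.sources_of("summary.csv")))  # ['cleaned.csv', 'raw_a.csv', 'raw_b.csv']
```

In this example, correcting raw_a.csv requires recomputing only cleaned.csv and summary.csv. map.png is untouched because its recorded provenance does not include the corrected file.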
There has been a large body of very interesting work in provenance over the past two decades. Nevertheless, there are still many limitations and open areas. Specifically, the primary focus has been on modelling and capturing provenance: How is provenance information represented? How is it generated and maintained across heterogeneous systems and organisational/policy domains? There has been considerably less work on querying provenance: What can we do with provenance information after we’ve captured it?
How can we make informed judgements about whether certain data are authentic and can be trusted? How can we judge which repositories we can trust? How can appropriate access and use of resources be granted or controlled? How can data producers be rewarded for publishing data? How can we know who has deposited what data and who is reusing them—or who has the right to access data that is restricted in some way? (from: HLEG on Scientific Data)
10-year vision
Ten years from now, there will be provenance standards for data and tools that enable complete and consistent tracking of digital objects as they are generated and processed in the research lifecycle, both manually and by automatic means. From a policy perspective, this enables the verification and reproducibility of research results. From a user perspective, it enables users to understand and trust the authenticity of digital objects, and hence to reuse research data in other contexts.
Challenges and Recommendations
Part of provenance tracking is recording events, when they occur and who triggered them.
Recommendation 1: To describe events, infrastructures need to agree on standards (a) for describing services at various levels of verbosity (e.g., software versions), (b) for obtaining service descriptions automatically within a workflow, and (c) for embedding human-generated provenance information.
Recommendation 2: To describe who triggered an event, global user references need to be established to ensure that a user (who might work in multiple institutions, change affiliation over time, and work across multiple research environments with distinct user management mechanisms) is adequately identified in accordance with data protection laws. Current research information systems (CRIS) might or might not be a path to implementing this recommendation.
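As an illustration of what a recorded event might contain, the sketch below combines the three elements named above: what happened, when it occurred, and who triggered it. It is not an agreed standard. The field names, the ProvenanceEvent and ServiceDescription classes, and the identifier formats ("user:0001", "dataset:42") are hypothetical, and the optional version field only hints at how a service description could vary in verbosity.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ServiceDescription:
    """Describes the service that performed an event (hypothetical schema)."""
    name: str
    version: Optional[str] = None  # a more verbose description includes the software version


@dataclass(frozen=True)
class ProvenanceEvent:
    """One provenance event: what happened, to which object, when, and triggered by whom."""
    action: str                  # e.g. "ingest", "derive", "correct"
    object_id: str               # identifier of the digital object concerned
    agent_id: str                # global user reference, assumed stable across institutions
    service: ServiceDescription
    occurred_at: str = field(
        # Default: the current time in UTC, as an ISO 8601 string
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    note: Optional[str] = None   # human-generated provenance information

    def to_json(self) -> str:
        """Serialise the event, including the nested service description."""
        return json.dumps(asdict(self), sort_keys=True)


event = ProvenanceEvent(
    action="derive",
    object_id="dataset:42",
    agent_id="user:0001",
    service=ServiceDescription(name="cleaner", version="2.1.0"),
    note="Removed duplicate rows after manual inspection.",
)
print(event.to_json())
```

Keeping the user reference as an opaque identifier, rather than a name or e-mail address, is one possible way to separate identification from personal details. Whether that satisfies data protection requirements is outside the scope of this sketch.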
The implementation of provenance needs to be ensured across systems and organisational domains by way of certification frameworks, and will be an essential part of data curation (see: Identify and promote best practices for data curation as part of good scientific practice).