Data Interoperability
From GRDI2020
Introduction
“The ability of two or more systems or components to exchange information and to use the information that has been exchanged” (IEEE definition)
As data is shared between technological and organisational contexts, a set of challenges may arise, which can be grouped into three levels:
- syntactic - data can be read at the machine level. This includes the identification and retrieval of data. (Metaphor: Take a book from the bookshelf and browse it.)
- structural - data can be interpreted at a logical level. This includes the interpretation of the data format by machines, and it enables automatic services like format conversion. (Metaphor: Recognise pages, chapters and other structural entities in a book, even if the book is written in a foreign language.)
- semantic - data is meaningful for machine and/or human consumption. This may include the comprehension of the data's content and context through metadata and references to related documents, and it may enable services like (semi-)automatic data fusion or information extraction. (Metaphor: Understand the content and purpose of the book. If it is a cookbook, the consumer might be able, with the proper tools and skills, to make a recipe.)
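The three levels can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the payload, the station names, the metadata table and the function names are invented and do not come from GRDI2020. It only aims to show where each level begins and ends.

```python
import csv
import io

# Hypothetical payload and metadata, invented for this illustration.
RAW = b"station,temp\nA1,21.5\nB2,70.7\n"
METADATA = {"A1": "celsius", "B2": "fahrenheit"}  # assumed unit per station


def syntactic(raw: bytes) -> str:
    """Syntactic level: the bytes can be retrieved and read as text."""
    return raw.decode("utf-8")


def structural(text: str) -> list[dict[str, str]]:
    """Structural level: the format (CSV) is recognised and split into records."""
    return list(csv.DictReader(io.StringIO(text)))


def semantic(rows: list[dict[str, str]], metadata: dict[str, str]) -> dict[str, float]:
    """Semantic level: metadata gives the numbers meaning, so they become comparable."""
    result = {}
    for row in rows:
        value = float(row["temp"])
        if metadata[row["station"]] == "fahrenheit":
            value = (value - 32.0) * 5.0 / 9.0  # normalise to Celsius
        result[row["station"]] = round(value, 1)
    return result


readings = semantic(structural(syntactic(RAW)), METADATA)
print(readings)
```

Without the metadata table the structural step still succeeds, but the two temperatures would be compared as if they shared a unit, which is the gap the semantic level is meant to close.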
Ultimately, data is worthless without semantics, and all levels of interoperability need to be covered, for large-scale federated systems as much as for a single interaction between two agents. Interoperability can be achieved either through shared standards or by mediating between different data types. The two approaches are conceptually different: standards (global-as-view) enable a single interpretation across the whole ecosystem, whereas mediation (local-as-view) focuses on the relationship between specific sources. Both methods may incur information loss when converting away from a native format, so the chosen approach needs to be modelled with the expected use in mind and agreed upon by all stakeholders.
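The contrast between the two approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any particular system: the field names, the common schema and the `rename` helper are all hypothetical.

```python
def rename(record: dict, mapping: dict) -> dict:
    """Translate field names; fields without a mapping are dropped (information loss)."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Two hypothetical sources describing the same person with different field names.
record_a = {"surname": "Byrne", "forename": "Mary", "age_years": 34}
record_b = {"last": "Byrne", "first": "Mary"}

# Shared standard: every source is mapped once to a common schema (assumed here).
A_TO_COMMON = {"surname": "family_name", "forename": "given_name", "age_years": "age"}
B_TO_COMMON = {"last": "family_name", "first": "given_name"}

# Mediation: a direct mapping that only describes the relationship between A and B.
A_TO_B = {"surname": "last", "forename": "first"}

via_standard_a = rename(record_a, A_TO_COMMON)
via_standard_b = rename(record_b, B_TO_COMMON)
mediated = rename(record_a, A_TO_B)

# Fields of source A that the A-to-B mapping cannot express.
lost = set(record_a) - set(A_TO_B)
print(via_standard_a, via_standard_b, mediated, lost)
```

With a shared schema, each new source needs one mapping; with pairwise mediation, each new pair of sources needs its own. In this sketch the mediated record loses the age field because source B has no place for it.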
How can we implement interoperability within disciplines and move to an overarching multi-disciplinary way of understanding and using data? How can we find unfamiliar but relevant data resources beyond simple keyword searches, but involving a deeper probing into the data? How can automated tools find the information needed to tackle unfamiliar data? (from: HLEG on Scientific Data)
10-year vision
Ten years from now, data infrastructures will embed mechanisms that assist in translating and cleaning up data between distinct formats and data paradigms (e.g. relational data vs. XML vs. file-based). These mechanisms will be automatic where feasible, and otherwise able to learn interactively as users guide them along interoperability paths.
For example, in the digitisation of the Irish 1901 and 1911 census records, data needs to be converted from digitised images to full text to tabular data in order to be useful and searchable in a tailored manner. OCR will need to be supplemented with manual correction, and information extraction from the plain text may subsequently be improved by "teaching" the extraction tool the template and context of the census files.
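The step from plain text to tabular data might look roughly like the following Python sketch. The line layout, the regular expression and the sample lines are assumptions invented for illustration; the actual census forms and any real extraction tool will differ. Lines that do not fit the template are set aside for manual correction rather than guessed at.

```python
import re

# Hypothetical template for one OCR'd line: "Surname, Forename, age, occupation".
# The real census layout is not specified in the text; this pattern is an assumption.
TEMPLATE = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forename>[^,]+),\s*(?P<age>\d+),\s*(?P<occupation>.+)$"
)


def extract(lines: list[str]) -> tuple[list[dict], list[str]]:
    """Turn template-conforming lines into records; collect the rest for review."""
    records, rejected = [], []
    for line in lines:
        match = TEMPLATE.match(line.strip())
        if match:
            record = match.groupdict()
            record["age"] = int(record["age"])
            records.append(record)
        else:
            rejected.append(line)
    return records, rejected


# Invented sample lines; the second contains a typical OCR slip (letter O for zero).
sample = ["Murphy, John, 42, Farmer", "Murphy, Ellen, 4O, Housekeeper"]
records, rejected = extract(sample)
print(records)
print(rejected)
```

In this sketch, "teaching" the tool would amount to refining the template as rejected lines are corrected by hand.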
Challenges and Recommendations
The following recommendations address this challenge:
- Support disciplinary data infrastructures in building networks of boundary objects
- Research and deploy services to cover a mediation function
- Facilitate interoperability across preservation infrastructures
Here, preservation infrastructure means the infrastructure that provides digital preservation, that is, long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span for which the information is required. Solving this challenge is also a prerequisite for solving issues such as developing easy-to-use service interfaces.