Data Interoperability

From GRDI2020

This is a GRDI challenge.


Introduction

“The ability of two or more systems or components to exchange information and to use the information that has been exchanged” (IEEE definition)

When data is shared across technological and organisational contexts, challenges arise that can be grouped into three levels:

  • syntactic - data can be read at the machine level. This includes the identification and retrieval of data. (Metaphor: Take a book from the bookshelf and browse it.)
  • structural - data can be interpreted at a logical level. This includes the interpretation of the data format by machines, and it enables automatic services like format conversion. (Metaphor: Recognise pages, chapters and other structural entities in a book, even if the book is written in a foreign language.)
  • semantic - data is meaningful for machine and/or human consumption. This may include the comprehension of the data's content and context through metadata and references to related documents, and it may enable services like (semi-)automatic data fusion or information extraction. (Metaphor: Understand the content and purpose of the book. If it is a cookbook, the consumer might be able, with the proper tools and skills, to prepare a recipe.)
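
The three levels can be illustrated with a minimal Python sketch. The sample payload, the column names and the UNITS metadata table are hypothetical and were invented for illustration only; real infrastructures would rely on agreed metadata schemas rather than a hard-coded dictionary.

```python
import csv
import io

# Hypothetical raw payload as it might arrive from another system.
RAW = b"station,temp\nDublin,12.5\n"

# Hypothetical metadata describing what the "temp" column means.
UNITS = {"temp": "degrees Celsius"}


def read_syntactic(payload):
    """Syntactic level: the bytes can be retrieved and decoded."""
    return payload.decode("utf-8")


def read_structural(text):
    """Structural level: the format (CSV) is recognised and parsed into records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_semantic(rows, units):
    """Semantic level: values are typed and annotated using metadata."""
    return [
        {"station": row["station"], "temp": float(row["temp"]), "unit": units["temp"]}
        for row in rows
    ]


text = read_syntactic(RAW)
rows = read_structural(text)
records = read_semantic(rows, UNITS)
```

After the structural step the temperature is still just the string "12.5"; only the semantic step turns it into a number with a stated unit.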

Ultimately, data is worthless without semantics, and all levels of interoperability need to be covered, for large-scale federated systems as much as for a single interaction between two agents. Interoperability can be achieved either through shared standards or by mediating between different data types. The two approaches are conceptually different: standards (global-as-view) enable a single interpretation across the whole ecosystem, whereas mediation (local-as-view) focuses on the relationship between specific sources. Both methods may incur information loss when converting away from a native format, so whichever approach is chosen needs to be modelled with the expected use in mind and agreed upon by all stakeholders.
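
The difference between the two approaches, and the information loss either can incur, can be sketched in a few lines of Python. Both source records, the field names and the agreed "standard" schema are hypothetical; the reference year 1901 is used only so that the two records describe the same person.

```python
# Two hypothetical sources describing the same person with different fields.
source_a = {"surname": "Murphy", "forename": "Mary", "age": "34"}
source_b = {"full_name": "Mary Murphy", "birth_year": 1867}


# Approach 1: shared standard. Every source converts into one agreed schema.
def a_to_standard(rec):
    # The forename/surname boundary is lost once the names are joined.
    return {"name": f"{rec['forename']} {rec['surname']}", "age": int(rec["age"])}


def b_to_standard(rec, reference_year):
    return {"name": rec["full_name"], "age": reference_year - rec["birth_year"]}


# Approach 2: mediation. A mapping is written for one specific pair of sources.
def mediate_a_to_b(rec, reference_year):
    return {
        "full_name": f"{rec['forename']} {rec['surname']}",
        # Deriving a birth year from an age is only an approximation.
        "birth_year": reference_year - int(rec["age"]),
    }


standard_a = a_to_standard(source_a)
standard_b = b_to_standard(source_b, 1901)
mediated = mediate_a_to_b(source_a, 1901)
```

With a shared standard, each new source needs one converter; with mediation, each pair of sources that must interact needs its own mapping.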

How can we implement interoperability within disciplines and move to an overarching multi-disciplinary way of understanding and using data? How can we find unfamiliar but relevant data resources beyond simple keyword searches, but involving a deeper probing into the data? How can automated tools find the information needed to tackle unfamiliar data? (from: HLEG on Scientific Data)


10-year vision

Ten years from now, mechanisms embedded in the data infrastructure will assist in translating and cleaning up data between distinct formats and data paradigms (e.g. relational data vs. XML vs. file-based). These mechanisms will be automatic where feasible, and otherwise capable of learning interactively as users guide them along interoperability paths.
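
As a small illustration of translation between paradigms, the following Python sketch flattens an XML document into tabular rows and writes them out as CSV. The XML content, the element names and the helper functions xml_to_rows and rows_to_csv are hypothetical; the sketch assumes a flat, regular structure, which real data often lacks.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML input; the second record has no <age> element.
XML_DATA = """<people>
  <person><name>Mary Murphy</name><age>34</age></person>
  <person><name>John Byrne</name></person>
</people>"""


def xml_to_rows(xml_text, columns):
    """Turn each <person> element into a flat row, using '' for missing fields."""
    root = ET.fromstring(xml_text)
    return [
        {col: person.findtext(col, default="") for col in columns}
        for person in root.findall("person")
    ]


def rows_to_csv(rows, columns):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


table = xml_to_rows(XML_DATA, ["name", "age"])
csv_text = rows_to_csv(table, ["name", "age"])
```

The missing age becomes an empty cell, which is one place where clean-up or user guidance would be needed in practice.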

For example, in the digitisation of the Irish 1901 and 1911 census records, data needs to be converted from digitised images to full text and then to tabular data in order to be useful and searchable in a tailored manner. OCR will need to be supplemented with manual correction, and information extraction from the plain text may subsequently be improved by "teaching" the extraction tool the template and the context of census files.
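
A minimal Python sketch of template-based extraction with a manual-correction queue is given below. The line layout, the sample lines and the TEMPLATE pattern are hypothetical and do not reflect the actual census forms; the sketch only shows how lines that do not fit a known template can be routed to a human instead of being silently mis-parsed.

```python
import re

# Hypothetical template for one line of transcribed text: "Surname, Forename, Age".
TEMPLATE = re.compile(
    r"^(?P<surname>[A-Za-z']+),\s*(?P<forename>[A-Za-z]+),\s*(?P<age>\d{1,3})$"
)

# Hypothetical OCR output; the second line has a typical OCR error ("l" for "1").
OCR_LINES = ["Murphy, Mary, 34", "Byrne, John, 4l"]


def extract(lines):
    """Split lines into parsed records and lines that need manual correction."""
    parsed, needs_review = [], []
    for line in lines:
        match = TEMPLATE.match(line.strip())
        if match:
            record = match.groupdict()
            record["age"] = int(record["age"])
            parsed.append(record)
        else:
            needs_review.append(line)
    return parsed, needs_review


parsed, needs_review = extract(OCR_LINES)
```

Corrections made by a person to the review queue could then be used to refine the template, which is one simple reading of "teaching" the tool.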


Challenges and Recommendations

The following recommendations address this challenge:

By preservation infrastructure we mean the infrastructure that provides digital preservation, that is, long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span for which the information is required. Solving it is also a prerequisite for solving issues such as developing easy-to-use service interfaces.

