Data Discovery
From GRDI2020
This is a GRDI challenge.
Introduction
Without mechanisms for discovery, a data infrastructure is merely a storage area that only lets you retrieve data you already know about from a location you already know. The public web was initially a dark place like this; only with the advent of search engines did it open up to other user communities.
A Google-like search interface may be a first step towards data discovery across data infrastructures, but in a scientific environment keyword-based search may be insufficient. Users may want to search for number ranges in specific measurements, for molecule structures or for other specialised concepts. For this reason, a single data infrastructure may offer several data discovery mechanisms, each driven by specialised user requirements.
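To illustrate why keyword search alone falls short, the following minimal Python sketch shows a search over number ranges in measurements. The `Measurement` record, its field names, the dataset identifiers and the sample values are all hypothetical and chosen only for illustration; a real infrastructure would run such a query against indexed metadata rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical metadata record describing one measured value."""
    dataset_id: str
    quantity: str
    value: float
    unit: str


def search_range(records, quantity, low, high):
    """Return records of the given quantity with low <= value <= high."""
    return [
        r for r in records
        if r.quantity == quantity and low <= r.value <= high
    ]


# Invented sample data, not taken from any real collection.
RECORDS = [
    Measurement("ds-001", "temperature", 271.3, "K"),
    Measurement("ds-002", "temperature", 295.0, "K"),
    Measurement("ds-003", "salinity", 35.1, "PSU"),
    Measurement("ds-004", "temperature", 310.4, "K"),
]

hits = search_range(RECORDS, "temperature", 290.0, 300.0)
print([r.dataset_id for r in hits])  # prints ['ds-002']
```

A keyword query for "temperature" would return three of these records; only the range query narrows the result to the one measurement the user is actually looking for.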
How do we deal with the various ‘filters’ that different disciplines use when choosing and describing data? What about differences in these attitudes within disciplines, or from one time to another? (from: HELG on Scientific Data)
10-year vision
Ten years from now, there will be ways to instantiate dedicated discovery mechanisms for specific communities or even specific research questions. This gives those communities different entry points and perspectives on research data without losing grounding: even where discovery mechanisms abstract from the actual data, for example through visualisations, transparency through direct access to the underlying data items should be guaranteed.
While the predicted rise of "semantic search" and of more localised or specialised search engines has not yet become a reality for the public web at large, applying such technologies to research data (which may be quality controlled, conform to standards and form consistent collections within fields) appears obvious and is already being done in many initiatives. Technologies might include the following:
- visualisation
- structural query mechanisms (e.g. through XPath)
- semantic annotations, enhancing with controlled vocabularies (depending on the community context)
- recommender systems (e.g. through data usage statistics and information retrieval techniques such as clustering)
These mechanisms may only need to trawl through dedicated subsets of all available research data (e.g. only XML/TEI files that conform to a specific schema), and they aim to create more dedicated ways of exploring, analysing and interacting with the data. Analysis and visualisation support research lifecycles and may be the key to innovation, but they need to be sufficiently tailored to a research domain and must be modelled by the researcher in order not to be misleading.
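As a rough sketch of a structural query restricted to a dedicated subset, the Python example below first checks whether a document is TEI at all and then runs an XPath expression against it. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample document, the helper names `is_tei` and `find_referenced_persons`, and the choice of query are assumptions made for illustration; checking the root element's namespace is also only a stand-in for real schema validation.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the "tei" prefix is a local choice for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, kept deliberately small.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Sent by <persName ref="#p1">Ada</persName> to
         <persName>an unknown recipient</persName> from
         <placeName ref="#l1">London</placeName>.</p>
    </body>
  </text>
</TEI>"""


def is_tei(xml_text):
    """Cheap subset filter: is the root element a TEI element?

    This only inspects the root tag; it does not validate against a schema.
    """
    root = ET.fromstring(xml_text)
    return root.tag == "{http://www.tei-c.org/ns/1.0}TEI"


def find_referenced_persons(xml_text):
    """Return (ref, name) pairs for every persName that carries a ref attribute."""
    root = ET.fromstring(xml_text)
    nodes = root.findall(".//tei:persName[@ref]", TEI_NS)
    return [(node.get("ref"), node.text) for node in nodes]


if is_tei(SAMPLE):
    print(find_referenced_persons(SAMPLE))  # prints [('#p1', 'Ada')]
```

The query returns only the person name that is linked to an identifier, something a keyword search over the plain text could not distinguish from the unlinked mention.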
Challenges and Recommendations
"Discovery" needs to be established as an application-level mechanism (as opposed to "infrastructure service"), to enable users to build their own discovery tool for search, analysis and visualisation.
From a technical point of view, this may be within reach (e.g. with the availability of an API for processing research data and existing platforms like ManyEyes). Analysis of data usage (i.e. access statistics) may also be useful.
In addition, the attitude of users needs to change: users need to be more pro-active in designing their own environments for search, analysis and visualisation. Translators may need to be equipped accordingly.
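One simple way access statistics could feed discovery is a co-access recommender: datasets that are often used in the same session are suggested together. The Python sketch below is a plain co-occurrence count, not a clustering method, and the session log, the dataset identifiers and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical access log: each set holds the datasets touched in one session.
SESSIONS = [
    {"ds-a", "ds-b", "ds-c"},
    {"ds-a", "ds-b"},
    {"ds-b", "ds-d"},
    {"ds-a", "ds-c"},
]


def co_access_counts(sessions):
    """Count how often each unordered pair of datasets shares a session."""
    counts = Counter()
    for session in sessions:
        # Sorting makes each pair appear in one canonical order.
        for first, second in combinations(sorted(session), 2):
            counts[(first, second)] += 1
    return counts


def recommend(dataset_id, sessions, top_n=3):
    """Rank other datasets by how often they were accessed with dataset_id."""
    scores = Counter()
    for (first, second), n in co_access_counts(sessions).items():
        if first == dataset_id:
            scores[second] += n
        elif second == dataset_id:
            scores[first] += n
    # Highest count first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:top_n]]


print(recommend("ds-a", SESSIONS))  # prints ['ds-b', 'ds-c']
```

Raw counts favour popular datasets, so a real deployment would probably normalise the scores and would need to consider the privacy of usage logs.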