- The New Science Paradigm
Some areas of science are currently facing a hundred- to a thousand-fold increase in volumes of data
compared to the volumes generated only a decade ago. This data is coming from satellites, telescopes,
high-throughput instruments, sensor networks, accelerators, supercomputers, simulations, and so on [1]. The
availability and use of huge datasets present both new opportunities and new challenges
for scientific research.
Often referred to as a data deluge, massive datasets are revolutionizing the way research is carried out and
resulting in the emergence of a new, fourth paradigm of science based on data-intensive computing
[2]. This data-dominated science will lead to a new data-centric way of conceptualizing, organizing and
carrying out research activities. This could yield new approaches to problems that
were previously considered extremely hard or, in some cases, even impossible to solve, and could also lead to
serendipitous discoveries.
The new availability of huge amounts of data, along with advanced tools of exploratory data analysis, data
mining/machine learning and data visualization, offers a whole new way of understanding the world. One
view put forward is that in the new data-rich environment correlation supersedes causation, and science can
advance even without coherent models, unified theories, or really any mechanistic explanation at all [3].
To exploit these huge volumes of data, new techniques and technologies are needed. A
new type of e-infrastructure, the Research Data Infrastructure, must be developed for harnessing the
accumulating data and knowledge produced by research communities, optimizing the movement of data
across scientific disciplines, enabling large increases in multi- and interdisciplinary science while reducing
duplication of effort and resources, and integrating research data with published literature.
To make this happen, several breakthroughs must be achieved in the fields of research data modelling, management and tools.
Global Research Data Infrastructures: The GRDI2020 Vision
- Research Data Infrastructures
Research Data Infrastructures can be defined as managed networked environments for digital
research data consisting of services and tools that support: (i) the whole research cycle, (ii) the movement of scientific data across scientific disciplines, (iii) the creation of open linked data spaces by
connecting data sets from diverse disciplines, (iv) the management of scientific workflows, (v) the interoperation between scientific data and literature, and (vi) an Integrated Science Policy Framework.
Research data infrastructures are not systems in the traditional sense of the term; they are networks
that enable locally controlled and maintained digital data and library systems to interoperate more
or less seamlessly. Genuine research data infrastructures should be ubiquitous, reliable, and widely
shared resources operating on national and transnational scales.
A research data infrastructure should include organizational practices, technical infrastructure and social forms that collectively provide for the smooth operation of collaborative scientific
work across multiple geographic locations. All three should be objects of design and engineering; a
data infrastructure will fail if any one of these three elements is ignored [4].
Another school of thought considers a (data) infrastructure to be a fundamentally relational concept:
it becomes an infrastructure in relation to organized (research) practices [5]. This relational property
concerns that which lies between: between communities and collections of data and publications, as mediated by services and tools. According to this school of thought, the exact
sense of the term (data) infrastructure and its “betweenness” are both theoretical and empirical
questions.
In Star and Ruhleder’s “Steps toward an ecology of infrastructure: Design and access for large information spaces” [6], a (data) infrastructure emerges with the following dimensions:
- Embeddedness: the infrastructure is “sunk” into, inside of, other structures, social arrangements and technologies.
- Transparency: the infrastructure is transparent to use, in the sense that it does not have to be reinvented each time or assembled for each task, but invisibly supports those tasks.
- Reach or scope: the infrastructure reaches beyond a single event or one-site practice.
- Learned as part of membership: the taken-for-grantedness of artefacts and organizational arrangements is a sine qua non of membership in a community of practice. Strangers and outsiders encounter the infrastructure as a target object to be learned about. New participants acquire a naturalized familiarity with its objects as they become members.
- Links with conventions of practice: the infrastructure both shapes and is shaped by the conventions of a community of practice.
- Embodiment of standards: modified by scope and often by conflicting conventions, the infrastructure takes on transparency by plugging into other infrastructures and tools in a standardized fashion.
- Builds on an installed base: the infrastructure does not grow de novo; it wrestles with the “inertia of the installed base” and inherits strengths and limitations from that base.
- Becomes visible upon breakdown: the normally invisible quality of the working infrastructure becomes visible when breaks occur: the server is down, the bridge washes out, there is a power blackout. Even when there are back-up mechanisms or procedures, their existence further highlights the now-visible infrastructure.
Research data infrastructures should be science- and engineering-driven and, when coupled with high
performance computational systems, should increase the overall capacity and scope of scientific research.
Optimization for specific applications may be necessary to support the entire research cycle, but work in this
area is mature in many problem domains.
Science is a global undertaking, and research data are both national and global assets. There is a need for
a seamless infrastructure to facilitate the collaborative arrangements necessary to meet the intellectual and practical
challenges the world faces.
Therefore, there is a need for global research data infrastructures that can interconnect the components
of a distributed worldwide science ecosystem by overcoming language, policy, methodology, and social
barriers. Advances in technology should enable the development of global research data infrastructures
that reduce the geographic, temporal, social, and national barriers to discovering, accessing, and using data.
Their ultimate goal should be to enable researchers to make the best use of the world’s growing wealth of data.
The next generation of global research data infrastructures is facing two main challenges:
- To effectively and efficiently support data-intensive science
- To effectively and efficiently support multidisciplinary/interdisciplinary science
Data-Intensive Science
By data-intensive science we mean any scientific research activity whose progress is heavily dependent
on careful thought about how to use data. Such research activities are characterized by:
- increasing volumes and sources of data,
- complexity of data and data queries,
- complexity of data processing,
- high dynamicity of data,
- high demand for data,
- complexity of the interaction between researchers and data, and
- importance of data for a large range of end-user tasks.
Fundamentally, data-intensive disciplines face two major challenges [7]:
- Managing and processing exponentially growing data volumes, often arriving in time-sensitive streams from arrays of sensors and instruments, or as outputs from simulations; and
- Significantly reducing data analysis cycles so that researchers can make timely decisions.
Multidisciplinary – Interdisciplinary Science
By a multidisciplinary approach to a research problem we mean an approach that draws appropriately
from multiple disciplines in order to redefine the problem outside of normal boundaries and reach solutions
based on a new understanding of complex situations.
The multidisciplinary approach faces several barriers, both behavioural and technological.
Among the major technological barriers we identify those that must be overcome when moving data, information, and knowledge between disciplines. When the interpretative context is lost, there is a risk that representations will be interpreted in different
ways. This can lead to a phenomenon called “ontological
drift”: the intended meaning becomes distorted as the information object moves across semantic boundaries
(semantic distortion) [8].
A closely related concept is the interdisciplinary approach to a research problem. It involves the connection and integration of expertise belonging to different disciplines for the purpose of solving a common
research problem.
Again, the barriers faced by an interdisciplinary approach are of two types: behavioural and technological.
Among the major technological barriers we identify the need to integrate data, information, and knowledge created by different disciplines. In fact, one of the major barriers to be overcome concerns the integration of activities that rest on different ontological foundations.
The requirements described above, imposed by data-intensive and multidisciplinary/interdisciplinary science, are
the motivations behind building the theoretical foundations of the next generation of data infrastructures. To
make this happen, a considerable number of difficult data, application, system, organizational, and policy
challenges must be successfully tackled.
The breakthrough technologies needed to address many of the critical problems in data-intensive, multidisciplinary/interdisciplinary computing will come from collaborative efforts involving many domain application
disciplines as well as computer science, engineering and mathematics.