Research mechanisms to execute user software close to the data

From GRDI2020


This is a GRDI recommendation.


Context and Challenges

Network bandwidth is often the limiting factor. Data storage and the environments used to analyse or process the data (e.g., bulk format conversion) therefore need to be co-located to avoid excessive data transfers.

One approach would be to allow helper services to execute directly at the storage site. However, this could raise serious security issues for an archive, which must preserve the integrity of its digital objects.

Programming environments like Hadoop allow Java-based functionality to be deployed on the data infrastructure, but none is sufficiently open to users. More user-oriented frameworks such as workflow systems (e.g., myExperiment, Yahoo Pipes) lack integration with the data infrastructure. Grid environments are often impractical because they lack openness and usability.

Recommendation

Research mechanisms to execute service workflows and/or user software close to the data without, for example, weakening security or putting excessive load on the storage systems.

Stakeholders and Impact

The impact for users could be considerable. Today, even where storage infrastructures are available, users often still have to provide their own compute infrastructure (e.g., to collate, pre-process, or analyse data). Executing software close to the data could essentially free users from such plumbing work. For now, this remains a topic of computer science research.
