Application of scientific workflows to data-intensive domains

Basic information

Ph.D. Student: Rubén Salado Cid
Advisor: José Raúl Romero
Started on: March 2016
Keywords: scientific workflows, big data, data science

Thesis proposal

It is estimated that only 20% of data are currently useful, and not even all of that fraction is properly structured. Additionally, as the amount of available information doubles every two years, the need to extract knowledge from any sort of data, both structured and unstructured, has made data science an extremely fast-growing area. The development of new computing resources (processing units or storage systems) or the improvement of existing ones, together with the use of novel techniques and theories from multiple areas (mathematics, statistics or computer science), has enabled data to be processed and analyzed in less computational time. In this context, a wide range of scientific domains (such as bioinformatics, geoinformatics or genomics) have begun to use these techniques to perform data-intensive research.

However, developing and deploying these data-intensive applications requires advanced technical knowledge, so scientists have to become familiar with concepts that are not directly related to their own domain and daily work. The aim of this Ph.D. thesis is to bring the processing of large amounts of data closer to scientists, regardless of their computing skills. To this end, it is necessary to develop new mechanisms that make data science techniques and technologies easier to use for non-experts in computing, for example by reusing and adapting generic data-intensive techniques to specific application domains, enabling the high-level definition of complex data processing applications, or managing their execution transparently on the most suitable computing platforms. This involves studying and applying knowledge from different research areas, such as data science, big data, workflow management systems, eScience, distributed computing and software engineering.

The main objectives of this thesis are the following:

  • To study the current state of the art in the following fields of knowledge: data-intensive computing, big data, data science, eScience, scientific workflows and workflow management systems.
  • To specify a domain-specific language (DSL) to enable the definition of computational scientific experiments in terms of workflows at a high abstraction level.
  • To study the integration of high-performance and cloud computing platforms (e.g., Globus Toolkit, Apache Hadoop, IBM Bluemix) into workflow management systems.
  • To develop the techniques required to build a system for the automatic generation of custom workflow management systems, providing ready-to-use elements to define and execute scientific workflows. Here, model-driven engineering provides a suitable conceptual framework for achieving the required interoperability and flexibility.
  • To apply the results obtained from previous research to other specific domains, such as education or evolutionary computation, by generating custom workflow tools as proofs of concept.

Funds

The development of this thesis is being supported by:

  • Spanish Ministry of Science and Competitiveness, project TIN-2014-55252-P.

Publications associated with this thesis

National Conferences

  1. R. Salado-Cid, J.R. Romero and S. Ventura. Metaherramienta para la generación de aplicaciones científicas basadas en workflows. Actas de las X Jornadas de Ciencia e Ingeniería de Servicios (JCIS'14), pp. 96-105. 2014
  2. R. Salado-Cid, G. Luque and J.R. Romero. Sistema de gestión de flujos de trabajo para la definición visual de aplicaciones basadas en algoritmos evolutivos. Actas de la XVI Conferencia de la Asociación Española para la Inteligencia Artificial (CAEPIA'15), pp. 261-270. 2015
  3. R. Salado-Cid and J.R. Romero. Lenguaje específico para el modelado de flujos de trabajo aplicados a ciencia de datos. Actas de las XXI Jornadas en Ingeniería del Software y Bases de Datos (JISBD'16), pp. 227-240. 2016

International Chapters

  1. R. Salado-Cid, A. Ramirez and J.R. Romero. On the need of opening the Big Data landscape to everyone: challenges and new trends. Chapter in Digital Marketplaces Unleashed, 2017

International Conferences

  1. R. Salado-Cid and J.R. Romero. Enabling the definition and reuse of multi-domain workflow-based data analysis. Proceedings of the 16th International Conference on Intelligent Systems Design and Applications (ISDA'16). 2016