EMERGING TRENDS IN DATA ANALYSIS

Status:
In progress
PI:
Sebastián Ventura
Reference:
TIN2017-83445-P
Members:

Rafael Barbudo
Alberto Cano
Krzysztof Cios
Carlos García
Eva L. Gibaja
Jorge González
José María Luna
María Luque
Gabriella Melki
José María Moyano
Mykola Pechenizkiy
Alexandro Provetti
Oscar G. Reyes
Cristóbal Romero
José Raúl Romero
Marie Sacksick
Amelia Zafra
Duration:
January 2018 – December 2020
Budget:
57,112 €
Project:

SUMMARY

The main objective of Project EMERalD (EMERging trends in Data analysis) is to develop data analysis methodologies for solving complex problems in biomedicine and education. Solutions will be reached by designing new algorithms or adapting existing ones to the characteristics of each problem, following a systematic approach to its resolution and validating the results on real datasets. Last but not least, new intuitive tools will be developed for working with the proposed methodologies, so that specialists in the application domain with no background in programming or in existing tools (Weka, R, …) can easily use them.

Regarding the Data Mining paradigms involved in this project, and based on the nature of the problems to be tackled, both predictive models (classification and regression) and descriptive models (pattern mining and extraction of subsets of interest) will be developed. In both cases, either conventional or flexible data representations (multi-instance and/or multi-target) will be considered, depending on the problem at hand. In the specific case of predictive models, interpretability is a challenge, so special emphasis will be placed on white box models (either obtained directly or derived from accurate black box models).

As a natural evolution of the research group's previous work on scalability and parallel models for multi-core platforms and GPUs, the project will focus on MapReduce-based models using the Spark framework, implemented in traditional programming languages (Java) as well as in some of those that have become more popular for data analysis (Python, Julia and R). The project will also focus on the development of Deep Learning models (which have shown exceptional behavior in a multitude of problems), combined with more flexible data representations when applicable.

The practical component of this project is notable, as demonstrated by the importance given to developing solutions to the problems raised. In the educational field, early prediction models will be developed and applied to different sets of students. Additionally, different models for self-assessment and peer review will be developed, as well as models to recommend didactic materials for students with similar characteristics. In the biomedical field, models for the early diagnosis of melanoma will be developed, and different temporal patterns of hypertension will be analyzed and related to the different pathologies that might cause them. Furthermore, we will analyze patterns related to complications that occur after removing a colorectal cancer and, finally, we will analyze which gene expression factors are responsible for the appearance of different tumors.

Several national research groups work on subjects related to this project. These groups meet periodically at the Taller de Minería de Datos y Aprendizaje (TAMIDA), as well as at the Conferencia de la Asociación Española para la Inteligencia Artificial (AEPIA). The list of research groups that work in data mining is available at the REDMIDAS website. Our team, as well as several of the research groups detailed below, belongs to the Red de Excelencia en Big Data y Análisis de Datos Escalable (BigDADE). The following research groups have published works on topics related to this project:

  • SCI2S research group (University of Granada), headed by Francisco Herrera, which works on multi-instance learning and association rule mining, and has more recently published several works in the big data area.
  • SIMIDAT research group (University of Jaén), headed by María José del Jesus, which has worked on association rule mining, subgroup discovery and, more recently, multi-label learning.
  • CIG research group (Universidad Politécnica de Madrid), headed by Concepción Bielza and Pedro Larrañaga, which is mainly focused on multi-label learning, among other topics.
  • ML research group (University of Oviedo, in Gijón), headed by Antonio Bahamonde, which has extensive experience in multi-label learning.
  • MINERVA research group (University of Sevilla), headed by José C. Riquelme, which has published several works on association rule mining.
  • IDBIS research group (University of Granada), headed by Juan Carlos Cubero, which has worked on association rule mining.
  • MIDAS research group (Universidad Politécnica de Madrid), headed by Ernestina Menasalvas and Alejandro Rodríguez, which focuses on big data analytics.
  • LIDIA research group (University of A Coruña), headed by Amparo Alonso, which has worked on artificial neural networks and, more recently, in the big data area.

Many of the research groups listed above have also worked on problems similar to the ones proposed in this project. The ML, LIDIA and MINERVA groups have worked on educational problems, and the SIMIDAT, LIDIA and MIDAS groups have worked on clinical and/or biomedical data mining, although these are not the only fields of application in which they have worked. Our team has already collaborated with many of these groups on several occasions, and we hope to establish new collaborations with them and with other groups (especially international ones) during the development of this project, in order to form competitive consortia for applying to international research calls (such as H2020).