Ph.D. Student: Francisco Solano Padillo Ruz
Advisors: Jose Maria Luna, Sebastián Ventura
Defended on: July 2020
Keywords: pattern mining, associative classification, big data, MapReduce
Digital version: PDF
Technological innovation over the last decades has driven exponential growth in both the quantity of data being generated and its complexity. Discovering high-level information and knowledge from these large, complex quantities of data has become significantly more ambitious and challenging. Additionally, as technology advances and hardware improves, more and more data can be stored, so the quantity of data to deal with has increased like never before. Big Data is the term increasingly used to encompass the set of techniques focused on tackling the problems derived from the management and analysis of very large quantities of data.
Many different techniques have been proposed over the years to extract hidden, interesting and previously unknown information from large quantities of data. Nevertheless, all of them can be categorized into two main groups: descriptive tasks, which depict intrinsic and important properties of the data; and predictive tasks, which predict an output variable for unseen data. Classification based on association rule mining, generally known as Associative Classification (AC), integrates a descriptive task into the process of generating a classifier. Several studies have shown that AC algorithms are able to obtain accurate and interpretable results efficiently by leveraging association rule discovery methods in the training phase. This makes it possible to obtain all the hidden relationships among the attribute values, which may be missed by other, less exhaustive methodologies. Furthermore, AC also makes it possible to update and tune a subset of rules without having to redraw the whole tree, as happens in decision tree approaches. Last but not least, the main advantage of AC over other techniques is the final model representation, which is formed by simple rules that end users can easily understand and interpret.
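As a rough illustration of the AC idea described above, the following minimal Python sketch mines class association rules from a toy dataset by exhaustive enumeration, ranks them by confidence and support, and classifies a record with the first matching rule. It is a generic sketch and not one of the algorithms developed in the thesis; the dataset, the thresholds and the names `mine_rules` and `classify` are assumptions made only for this example.

```python
from collections import Counter
from itertools import combinations

# Toy, made-up dataset: each record is (set of attribute=value items, class label).
RECORDS = [
    ({"outlook=sunny", "windy=no"}, "play"),
    ({"outlook=sunny", "windy=yes"}, "stay"),
    ({"outlook=rainy", "windy=yes"}, "stay"),
    ({"outlook=rainy", "windy=no"}, "play"),
    ({"outlook=sunny", "windy=no"}, "play"),
]


def mine_rules(records, min_support=0.2, min_confidence=0.8, max_len=2):
    """Enumerate rules 'antecedent -> class' and keep those above both thresholds."""
    n = len(records)
    antecedent_counts = Counter()  # records containing the antecedent
    rule_counts = Counter()        # records containing the antecedent AND the class
    for items, label in records:
        for size in range(1, max_len + 1):
            for antecedent in combinations(sorted(items), size):
                antecedent_counts[antecedent] += 1
                rule_counts[(antecedent, label)] += 1

    rules = []
    for (antecedent, label), count in rule_counts.items():
        support = count / n
        confidence = count / antecedent_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, label, support, confidence))

    # Rank: higher confidence first, then higher support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def classify(rules, items, default):
    """Return the class of the first (best-ranked) rule whose antecedent matches."""
    for antecedent, label, _support, _confidence in rules:
        if set(antecedent) <= items:
            return label
    return default


rules = mine_rules(RECORDS)
prediction = classify(rules, {"outlook=rainy", "windy=no"}, default="stay")
```

Because the model is just an ordered list of rules, a single rule can be inspected, edited or removed without rebuilding the rest, which is the interpretability and maintainability argument made above.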
This Doctoral Thesis aims at solving the challenging problem of AC and its application to very large datasets. The main contributions of this Ph.D. thesis are summarized in the following points:
- The state of the art in AC has been studied and analyzed, and a new tool has been proposed that covers the whole taxonomy of algorithms and provides many different measures. The goal of this tool is twofold: 1) to unify comparisons, since existing works compare using very different measures; 2) to provide a single tool that includes at least one algorithm from each category of the taxonomy.
- AC has been analyzed on very large quantities of data. In this regard, many different platforms for distributed computing have been studied and different proposals have been developed on them. These proposals make it possible to deal with very large data efficiently by spreading the load across different compute nodes.
- As one of the most important parts of AC is to extract high-quality rules, a novel grammar-guided genetic programming algorithm has been proposed which obtains interesting association rules with regard to different metrics and in different kinds of data, including truly Big Data datasets. This proposal has proved to obtain very good results in terms of both quality and interpretability, while providing a very flexible way of representing the solutions and making it possible to introduce subjective knowledge into the search process. Then, a novel algorithm has been proposed for AC using a non-trivial adaptation of the aforementioned algorithm to obtain the rules forming the classifier. This methodology is also based on grammar-guided genetic programming, enabling the user to constrain not only the form of the rules but also the final form of the classifier. Results have shown that this algorithm obtains very accurate classifiers while maintaining a good level of interpretability.
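The last two contributions can be pictured with a small Python sketch: a context-free grammar constrains the form of candidate rules, and each rule is evaluated over data partitions with a map step (local counts per partition) followed by a reduce step (summing the counts). This is only an illustration under stated assumptions. The grammar, the partitions and every function name are hypothetical, the map and reduce steps run in plain Python rather than on a distributed platform, and the evolutionary operators of genetic programming (selection, crossover, mutation) are omitted entirely.

```python
import random
from functools import reduce

# Hypothetical grammar: non-terminals are keys, each maps to a list of productions.
# For every non-terminal, the first production listed is non-recursive.
GRAMMAR = {
    "<rule>": [["<antecedent>", "=>", "<class>"]],
    "<antecedent>": [["<condition>"], ["<condition>", "AND", "<antecedent>"]],
    "<condition>": [["outlook=sunny"], ["outlook=rainy"], ["windy=yes"], ["windy=no"]],
    "<class>": [["play"], ["stay"]],
}


def derive(symbol, rng, depth=0, max_depth=3):
    """Randomly expand a symbol into terminal tokens, bounding the recursion depth."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        productions = productions[:1]  # force the non-recursive production
    tokens = []
    for child in rng.choice(productions):
        tokens.extend(derive(child, rng, depth + 1, max_depth))
    return tokens


def parse(tokens):
    """Turn a token list into (antecedent item set, class label)."""
    split = tokens.index("=>")
    antecedent = frozenset(t for t in tokens[:split] if t != "AND")
    return antecedent, tokens[split + 1]


def map_partition(rule, partition):
    """Map step: local counts (covered, correct, total) for one data partition."""
    antecedent, label = rule
    covered = sum(1 for items, _ in partition if antecedent <= items)
    correct = sum(1 for items, y in partition if antecedent <= items and y == label)
    return covered, correct, len(partition)


def reduce_counts(left, right):
    """Reduce step: element-wise sum of two count tuples."""
    return tuple(a + b for a, b in zip(left, right))


def evaluate(rule, partitions):
    """Global support and confidence of a rule from per-partition counts."""
    mapped = (map_partition(rule, p) for p in partitions)
    covered, correct, total = reduce(reduce_counts, mapped, (0, 0, 0))
    support = correct / total if total else 0.0
    confidence = correct / covered if covered else 0.0
    return support, confidence


# Made-up data split into two partitions, standing in for blocks on different nodes.
PARTITIONS = [
    [({"outlook=sunny", "windy=no"}, "play"), ({"outlook=sunny", "windy=yes"}, "stay")],
    [({"outlook=rainy", "windy=yes"}, "stay"), ({"outlook=rainy", "windy=no"}, "play"),
     ({"outlook=sunny", "windy=no"}, "play")],
]

candidate = parse(derive("<rule>", random.Random(0)))
support, confidence = evaluate(candidate, PARTITIONS)
```

Changing the productions of `GRAMMAR` is what restricts or extends the shape of the rules in this sketch; a contradictory antecedent such as `outlook=sunny AND outlook=rainy` is still derivable here and simply covers no records.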
The development of this thesis has been supported by:
- Spanish Ministry of Science and Competitiveness, project TIN-2014-55252-P.
- Spanish Ministry of Science and Competitiveness, project TIN-2017-83445-P.
PUBLICATIONS ASSOCIATED WITH THIS THESIS
- F. Padillo, J.M. Luna and S. Ventura. LAC: Library for associative classification. Knowledge-Based Systems, 193: 105432 (2020). DOI: 10.1016/j.knosys.2019.105432.
- F. Padillo, J.M. Luna and S. Ventura. A Grammar-Guided Genetic Programing Algorithm for Associative Classification in Big Data. Cognitive Computation, 11(3): 331-346 (2019). DOI: 10.1007/s12559-018-9617-2.
- F. Padillo, J.M. Luna, F. Herrera and S. Ventura. Mining association rules on Big Data through MapReduce genetic programming. Integrated Computer-Aided Engineering, 25(1): 31-48 (2018). DOI: 10.3233/ICA-170555.
- F. Padillo, J.M. Luna and S. Ventura. Associative Classification in Big Data through a G3P Approach. In 4th International Conference on Internet of Things, Big Data and Security (IoTBDS), 94-102 (2019). DOI: 10.5220/0007688400940102.
- F. Padillo, J.M. Luna and S. Ventura. An evolutionary algorithm for mining rare association rules: A Big Data approach. In 2017 IEEE Congress on Evolutionary Computation (CEC), 2007-2014 (2017). DOI: 10.1109/CEC.2017.7969547.