Evaluating Associative Classification Algorithms for Big Data

Abstract

Background. Associative classification combines two distinct fields, classification and association rule mining, to build accurate and interpretable classifiers from association rules. A major problem in this field is that existing proposals do not scale well to truly Big Data. This work therefore proposes adaptations of two well-known associative classification algorithms (CBA and CPAR) for different Big Data platforms (Spark and Flink). CBA was selected among the state-of-the-art algorithms because, in an experimental study, it obtained the best trade-off between interpretability and accuracy. Because the exhaustive search performed by CBA can lead to long runtimes, CPAR was selected as a second approach: it also obtains very accurate classifiers, in less time, by means of a fast greedy algorithm.

Results. An experimental study was performed on 30 datasets, and the results were analyzed by means of non-parametric tests. The results showed that CBA-Spark and CBA-Flink obtained highly interpretable classifiers but were more time consuming than CPAR-Spark and CPAR-Flink. The study also demonstrated that the proposals were able to run on truly Big Data (file sizes up to 200 GBytes). Finally, the analysis of different quality metrics revealed no statistical difference between the two approaches.

Conclusions. The experimental study revealed that sequential algorithms cannot be used on large quantities of data and that approaches such as CBA-Spark, CBA-Flink, CPAR-Spark or CPAR-Flink are required. CBA proved to be very useful when the main goal is to obtain highly interpretable results. However, when the runtime has to be minimized, CPAR should be used. No statistical difference was found between the two proposals in terms of the quality of the results, except for the interpretability of the final classifiers, where CBA was statistically better than CPAR.

Additional Information

Results for the accuracy measure:

Datasets CBA CBA2 CMAR CPAR C45 Ripper Core OneR
appendicitis 0.896 0.886 0.868 0.869 0.867 0.82 0.877 0.82
australian 0.866 0.868 0.871 0.864 0.864 0.81 0.832 0.855
banana 0.57 0.612 0.019 0.752 0.752 0.643 0.631 0.702
breast 0.653 0.737 0.643 0.737 0.693 0.617 0.754 0.69
cleveland 0.537 0.539 0.485 0.56 0.556 0.441 0.512 0.543
contraceptive 0.448 0.461 0.1 0.559 0.546 0.523 0.434 0.409
flare 0.67 0.675 0.446 0.754 0.703 0.672 0.646 0.614
german 0.753 0.737 0.642 0.741 0.694 0.667 0.698 0.717
hayes-roth 0.537 0.538 0.194 0.538 0.538 0.85 0.481 0.388
heart 0.83 0.826 0.837 0.833 0.807 0.693 0.7 0.715
iris 0.933 0.953 0.94 0.947 0.96 0.947 0.947 0.94
lymphography 0.769 0.79 0.776 0.782 0.751 0.798 0.688 0.75
magic 0.815 0.774 0.449 0.853 0.773 0.844 0.744 0.735
mammographic 0.82 0.819 0.763 0.84 0.836 0.795 0.79 0.828
monk-2 0.97 0.972 0.972 0.972 0.972 1 0.913 0.806
mushroom 0.99 1 0.998 0.994 1 1 0.799 0.984
page-blocks 0.97 0.962 0.897 0.96 0.943 0.962 0.903 0.936
phoneme 0.805 0.751 0.444 0.823 0.799 0.822 0.754 0.766
pima 0.727 0.751 0.521 0.772 0.764 0.701 0.742 0.747
post-operative 0.556 0.59 0.589 0.622 0.6 0.4 0.69 0.703
saheart 0.67 0.697 0.35 0.727 0.695 0.601 0.707 0.692
spectfheart 0.798 0.854 0.776 0.836 0.836 0.76 0.795 0.795
splice 0.938 0.73 0.887 0.947 0.905 0.928 0.519 0.613
tae 0.25 0.252 0 0 0 0.595 0.437 0.252
tic-tac-toe 1 1 0.975 0.978 0.865 0.977 0.689 0.7
titanic 0.74 0.741 0.067 0.776 0.776 0.705 0.783 0.776
vehicle 0.69 0.695 0.52 0.702 0.637 0.715 0.375 0.551
wine 0.938 0.989 0.983 0.989 0.961 0.91 0.944 0.826
winequality-white 0.45 0.45 0 0.555 0.486 0.538 0.446 0.484
wisconsin 0.96 0.965 0.963 0.956 0.956 0.966 0.934 0.927
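
The abstract states that the results were analyzed by means of non-parametric tests, without naming them. As one possible illustration, the following minimal sketch computes the rank sums of a Wilcoxon signed-rank test for the CBA and CPAR accuracy columns of the first eight datasets above. The choice of test, the function name signed_rank_sums and the rounding to three decimals are assumptions made for this example, not details taken from the study.

```python
def signed_rank_sums(a, b, decimals=3):
    """Return (W+, W-) of the Wilcoxon signed-rank test for paired samples."""
    # Round so that differences equal up to float noise are treated as ties.
    diffs = [round(x - y, decimals) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Accuracy of CBA and CPAR on appendicitis .. german (values from the table above).
cba = [0.896, 0.866, 0.57, 0.653, 0.537, 0.448, 0.67, 0.753]
cpar = [0.869, 0.864, 0.752, 0.737, 0.56, 0.559, 0.754, 0.741]

print(signed_rank_sums(cba, cpar))  # (7.0, 29.0)
```

The smaller of the two sums is the test statistic, which would then be compared against the critical value for the number of non-zero differences. A full analysis would use all 30 datasets.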

Results for the kappa measure:

Datasets CBA CBA2 CMAR CPAR C45 Ripper Core OneR
appendicitis 0.54 0.547 0.516 0.481 0.452 0.461 0.58 0.447
australian 0.72 0.727 0.738 0.72 0.732 0.622 0.661 0.709
banana 0.045 0.147 0.01 0.498 0.499 0.328 0.219 0.383
breast 0.65 0.289 0.167 0.205 0.242 0.171 0.295 0.163
cleveland 0.28 0.291 0.238 0.219 0.208 0.206 0.138 0.17
contraceptive 0.127 0.159 0.056 0.302 0.317 0.272 0.074 0.066
flare 0.56 0.577 0.373 0.679 0.672 0.588 0.545 0.493
german 0.244 0.33 0.105 0.264 0.294 0.243 0.005 0.152
hayes-roth 0.28 0.284 0.152 0.284 0.284 0.732 0.193 0.056
heart 0.66 0.641 0.667 0.656 0.615 0.408 0.375 0.421
iris 0.91 0.928 0.902 0.916 0.892 0.91 0.913 0.902
lymphography 0.55 0.595 0.571 0.563 0.505 0.613 0.331 0.515
magic 0.24 0.428 0.213 0.667 0.644 0.647 0.432 0.424
mammographic 0.63 0.636 0.582 0.677 0.693 0.589 0.576 0.652
monk-2 0.944 0.945 0.945 0.945 0.945 1 0.823 0.602
mushroom 0.99 1 0.995 0.988 1 1 0.524 0.967
page-blocks 0.57 0.778 0.021 0.801 0.82 0.811 0.115 0.64
phoneme 0.21 0.49 0.166 0.572 0.548 0.605 0.417 0.449
pima 0.46 0.48 0.232 0.462 0.483 0.393 0.36 0.357
post-operative -0.17 -0.161 -0.097 -0.148 -0.029 -0.221 -0.043 -0.029
saheart 0.1 0.163 0.12 0.293 0.333 0.187 0.25 0.21
spectfheart 0.2 0.565 0.229 0.343 0.411 0.39 0 0
splice 0.89 0.552 0.822 0.914 0.902 0.882 0.001 0.377
tae 0.25 0 0 0 0 0.341 0.174 0
tic-tac-toe 1 1 0.945 0.952 0.706 0.949 0.313 0.34
titanic 0.25 0.255 0.045 0.433 0.433 0.373 0.455 0.433
vehicle 0.69 0.59 0.415 0.604 0.597 0.615 0.17 0.4
wine 0.98 0.982 0.973 0.982 0.866 0.86 0.91 0.726
winequality-white 0 0.009 0 0.289 0.313 0.319 0.04 0.162
wisconsin 0.91 0.92 0.917 0.903 0.915 0.927 0.854 0.841
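
For reference, accuracy and Cohen's kappa can both be derived from a confusion matrix. The sketch below is a minimal illustration; the function name and the example matrices are hypothetical and are not taken from the experiments.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total  # accuracy
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (total * total)
    if expected == 1.0:
        return observed, 0.0  # degenerate single-class case (convention assumed here)
    return observed, (observed - expected) / (1 - expected)


# A reasonably good two-class classifier (hypothetical counts).
acc, kappa = accuracy_and_kappa([[40, 10], [5, 45]])
print(round(acc, 3), round(kappa, 3))

# A classifier that always predicts the majority class (hypothetical counts).
acc_major, kappa_major = accuracy_and_kappa([[90, 0], [10, 0]])
print(round(acc_major, 3), round(kappa_major, 3))
```

The second call shows why the two tables can disagree: a classifier that always predicts the majority class can reach high accuracy with a kappa of 0. This pattern is consistent with rows such as spectfheart, where Core and OneR reach 0.795 accuracy but a kappa of 0, although the tables alone do not confirm that this is the cause.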

Results for complexity:

Datasets CBA CBA2 CPAR C45 Ripper
appendicitis 18.11 11.41 33.61 8 46.2
australian 343.63 327.93 670.96 444 1439.2
banana 0 8.28 370.25 594 297.6
breast 131.79 159.53 559.67 504 2282
cleveland 81.97 131.05 322.05 440 3411.2
contraceptive 22.25 45.44 1146.82 1260 18631.9
flare 3 65.82 1233.16 2349 44761.5
german 754.64 968.18 2734.47 3844 3630
hayes-roth 4 4 58.09 24 159.6
heart 164.21 148.44 307.26 341 294
iris 13.7 5.68 49.95 9 10.2
lymphography 119.84 94.81 256.86 240 192
magic 760.81 1195.78 28644.7 784700 20997.9
mammographic 44.83 38.05 263.56 50 2373.3
monk-2 4 4 56.14 12 8
mushroom 34.13 53.97 132.72 630 6
page-blocks 525.47 708.7 1438.25 17325 5611.2
phoneme 20.92 77.04 4225.44 24864 8522.5
pima 92.1 96.94 557.19 308 1800
post-operative 49.2 87.99 111.34 32 832.5
saheart 28.64 21.96 232.78 60 1236.6
spectfheart 0 67.03 287.86 286 155.2
splice 597.58 198 1964.57 69185 5959.5
tae 0 0 0 0 1320
tic-tac-toe 27 27 490.03 8700 679.4
titanic 4.2 4.2 117.81 6 103.2
vehicle 241.45 193.29 2173.12 9898 4429.6
wine 15.16 15.16 79.25 91 26
winequality-white 0 75.58 16805.92 80375 115628.8
wisconsin 123.48 87.26 157.08 220 117.6

Results for time:

Datasets CBA CBA2
appendicitis 623.66 1265.29
australian 35850.5 39850.18
banana 1563 1658.86
breast 1300.5 1552.12
cleveland 1340.61 1489.87
contraceptive 1816.87 1919.75
flare 3892.43 37613.55
german 229999 164956.76
hayes-roth 561.41 578.84
heart 2500 2625.54
iris 3789 49144
lymphography 9260.17 9907.56
magic 13716.98 108535.81
mammographic 15897 11971.96
monk-2 9000 9980.83
mushroom 52899 28932.19
page-blocks 75585 108800
phoneme 3295.25 4543.03
pima 1276.21 1336.04
post-operative 954.12 1730.27
saheart 957.54 983.08
spectfheart 798759 806434
splice 87800 15584.62
tae 900 1100
tic-tac-toe 1513.13 2115.06
titanic 3563.22 2012.97
vehicle 204390.61 2001850.57
wine 15509.69 159143.47
winequality-white 4497.72 25046.86
wisconsin 1812.8 1949.86

Algorithms

CBA

Pseudocode for this algorithm can be found in this link. Pseudocode for the generation of rules:

Generation of rules, step 1:

Generation of rules, step 2:
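
As a rough illustration of how class association rules can be generated over partitioned data, the following sketch counts antecedents and rules per partition and then merges the counts, in the spirit of a map/reduce job. It is not the CBA-Spark or CBA-Flink implementation: plain Python lists stand in for platform partitions, antecedents are enumerated by brute force up to max_len items instead of the level-wise candidate generation used by CBA, and all names (count_partition, generate_cars, the toy data) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_partition(rows, max_len=2):
    """Map step: count antecedents and (antecedent, class) pairs in one partition."""
    cond_counts, rule_counts = Counter(), Counter()
    for items, label in rows:
        for size in range(1, max_len + 1):
            for condset in combinations(sorted(items), size):
                cond_counts[condset] += 1
                rule_counts[(condset, label)] += 1
    return cond_counts, rule_counts


def generate_cars(partitions, min_sup, min_conf, max_len=2):
    """Reduce step: merge partition counts and keep rules passing both thresholds."""
    cond_total, rule_total, n = Counter(), Counter(), 0
    for rows in partitions:
        cond_counts, rule_counts = count_partition(rows, max_len)
        cond_total.update(cond_counts)  # Counter.update adds the counts
        rule_total.update(rule_counts)
        n += len(rows)
    rules = []
    for (condset, label), hits in rule_total.items():
        support = hits / n
        confidence = hits / cond_total[condset]
        if support >= min_sup and confidence >= min_conf:
            rules.append((condset, label, support, confidence))
    # Rank by confidence, then support; shorter antecedents first is used here
    # as a stand-in for CBA's "generated earlier" tie-break (an assumption).
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


# Toy data: two partitions of (items, class) pairs.
partitions = [
    [({"outlook=sunny", "windy=no"}, "play"), ({"outlook=sunny", "windy=yes"}, "stay")],
    [({"outlook=rain", "windy=yes"}, "stay"), ({"outlook=sunny", "windy=no"}, "play")],
]
for rule in generate_cars(partitions, min_sup=0.5, min_conf=0.9):
    print(rule)
```

Because the per-partition counts are simple sums, they can be merged in any order, which is what makes this step straightforward to distribute.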

CPAR

Pseudocode for this algorithm can be found in this link.
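
The greedy search in CPAR grows a rule one literal at a time, guided by the FOIL gain of each candidate literal. The sketch below shows that measure in isolation. It is a simplification under stated assumptions: examples are unweighted (CPAR itself works with weighted examples and may keep several literals whose gain is close to the best), base-2 logarithms are used, and the function names and candidate counts are hypothetical.

```python
import math


def foil_gain(pos, neg, pos_after, neg_after):
    """FOIL gain of adding a literal to the current rule.

    pos/neg: positive and negative examples covered by the current rule.
    pos_after/neg_after: examples still covered once the literal is added.
    """
    if pos_after == 0:
        return 0.0  # a literal that keeps no positive example is useless
    before = math.log2(pos / (pos + neg))
    after = math.log2(pos_after / (pos_after + neg_after))
    return pos_after * (after - before)


def best_literal(pos, neg, candidates):
    """candidates maps a literal to its (pos_after, neg_after) counts."""
    return max(candidates, key=lambda lit: foil_gain(pos, neg, *candidates[lit]))


# Hypothetical counts for three candidate literals on a 50/50 rule.
candidates = {"A": (40, 10), "B": (20, 0), "C": (50, 50)}
for literal, counts in candidates.items():
    print(literal, round(foil_gain(50, 50, *counts), 2))
print(best_literal(50, 50, candidates))  # A
```

Literal B yields a pure rule but covers fewer positive examples, so its gain is lower than that of A; the base of the logarithm only rescales the gains and does not change which literal is selected.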