ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data

This website contains additional material for the paper "ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data", published in Soft Computing.

Table of Contents
  1. Abstract
  2. Software
  3. Data sets
  4. Algorithms
  5. Experimental Settings
  6. Results
  7. References

Abstract

Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm for data whose class labels are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the class distribution of the data, which leads to better classification performance on balanced and, especially, unbalanced data. Third, its runtime is lower than CAIM's. The algorithm is parameter-free and self-adapts to the problem complexity and the class distribution of the data. ur-CAIM was compared with nine well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results obtained were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, but especially on unbalanced data, which is its most significant advantage.
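For context, ur-CAIM builds on the CAIM criterion of Kurgan and Cios [3], which scores a candidate interval scheme from its quanta matrix: CAIM = (1/n) * sum over intervals r of (max_r^2 / M_+r), where n is the number of intervals, max_r is the largest class count in interval r, and M_+r is the total number of instances in interval r. Below is a minimal Java sketch of that criterion only, not of the full ur-CAIM method; the quanta-matrix layout and names are our own illustration.

    public class CaimCriterion {

        // quanta[c][r] holds the number of training instances of class c
        // whose attribute value falls into interval r (the quanta matrix).
        // Returns the CAIM score of the scheme; higher is better.
        static double caim(int[][] quanta) {
            int numClasses = quanta.length;
            int numIntervals = quanta[0].length;
            double sum = 0.0;
            for (int r = 0; r < numIntervals; r++) {
                int max = 0, total = 0;
                for (int c = 0; c < numClasses; c++) {
                    max = Math.max(max, quanta[c][r]);
                    total += quanta[c][r];
                }
                if (total > 0) {
                    sum += (double) (max * max) / total; // max_r^2 / M_+r
                }
            }
            return sum / numIntervals; // average over the n intervals
        }

        public static void main(String[] args) {
            // Toy quanta matrix: two classes, three candidate intervals.
            int[][] quanta = { {8, 1, 0}, {2, 7, 9} };
            System.out.println(caim(quanta)); // (6.4 + 6.125 + 9) / 3 = 7.175
        }
    }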

Software

The ur-CAIM algorithm is publicly available for download and use in the WEKA software tool via its package manager (requires WEKA ≥ 3.7.5).
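Once installed, the discretizer can be applied programmatically like any other WEKA filter through the standard weka.filters.Filter API. In the sketch below, the filter class name weka.filters.supervised.attribute.urCAIM and the input file are assumptions for illustration; the package documentation gives the exact class the package installs.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;

    public class UrCaimExample {
        public static void main(String[] args) throws Exception {
            // Load a data set and mark the last attribute as the class.
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // The class name below is an assumption; check the package
            // documentation for the exact filter class it installs.
            Filter urCaim = (Filter) Class
                .forName("weka.filters.supervised.attribute.urCAIM")
                .getDeclaredConstructor().newInstance();
            urCaim.setInputFormat(data);

            // Apply the supervised discretization to the whole data set.
            Instances discretized = Filter.useFilter(data, urCaim);
            System.out.println(discretized.numAttributes() + " attributes processed");
        }
    }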

Data sets Information

The data sets employed were collected from the UCI Machine Learning Repository and the KEEL repository websites. These data sets differ widely in complexity, number of classes, number of attributes, number of instances, and imbalance ratio (IR), the ratio of the size of the majority class to the size of the minority class; for example, a data set with 900 majority-class and 100 minority-class instances has IR = 9. The data sets are available to download (balanced and unbalanced).

Table: Balanced data sets information

Data set Instances Attributes Real Integer Nominal Classes
abalone 4174 8 7 0 1 28
arrhythmia 452 279 0 206 73 16
glass 214 9 9 0 0 7
heart 270 13 1 4 8 2
ionosphere 351 33 32 0 1 2
iris 150 4 4 0 0 3
jm1 10885 21 13 8 0 2
madelon 2600 500 0 500 0 2
mc1 9466 38 10 28 0 2
mfeat-factors 2000 216 0 216 0 10
mfeat-fourier 2000 76 76 0 0 10
mfeat-karhunen 2000 64 64 0 0 10
mfeat-zernike 2000 47 47 0 0 10
pc2 5589 36 13 23 0 2
penbased 10992 16 16 0 0 10
pendigits 10992 16 0 16 0 10
pima 768 8 8 0 0 2
satimage 6435 36 0 36 0 7
segment 2310 19 19 0 0 7
sonar 208 60 60 0 0 2
spambase 4597 57 57 0 0 2
spectrometer 531 102 100 0 2 48
texture 5500 40 40 0 0 11
thyroid 7200 21 6 0 15 3
vowel 990 13 11 0 2 11
waveform 5000 40 40 0 0 3
winequality-red 1599 11 11 0 0 11
winequality-white 4898 11 11 0 0 11

Table: Unbalanced data sets information

Data set Instances Attributes Real Integer Nominal IR
abalone19 4174 8 7 0 1 129.44
autos 159 25 15 0 10 16
balance 625 4 4 0 0 5.88
contraceptive 1473 9 6 0 3 1.89
dermatology 366 34 0 34 0 5.55
ecoli-0_vs_1 281 7 7 0 0 39.14
ecoli-0-1_vs_2-3-5 280 6 6 0 0 13
ecoli-0-1_vs_5 336 7 7 0 0 10.59
ecoli-0-1-3-7_vs_2-6 332 6 6 0 0 12.28
ecoli-0-1-4-6_vs_5 244 7 7 0 0 9.17
ecoli-0-1-4-7_vs_2-3 240 6 6 0 0 11
ecoli-0-1-4-7_vs_5-6 202 7 7 0 0 9.1
ecoli-0-2-3-4_vs_5 224 7 7 0 0 9.18
ecoli-0-2-6-7_vs_3-5 205 7 7 0 0 9.25
ecoli-0-3-4_vs_5 257 7 7 0 0 9.28
ecoli-0-3-4-6_vs_5 200 7 7 0 0 9
ecoli-0-3-4-7_vs_5-6 203 6 6 0 0 9.15
ecoli-0-4-6_vs_5 220 6 6 0 0 10
ecoli-0-6-7_vs_5 220 7 7 0 0 1.86
ecoli1 336 7 7 0 0 3.36
ecoli2 336 7 7 0 0 5.46
ecoli3 336 7 7 0 0 8.6
ecoli4 336 7 7 0 0 15.8
glass 214 9 9 0 0 8.44
glass-0-1-2-3_vs_4 214 9 9 0 0 2.06
glass-0-1-4-6_vs_2 214 9 9 0 0 3.2
glass-0-1-5_vs_2 205 9 9 0 0 11.06
glass-0-1-6_vs_2 172 9 9 0 0 9.12
glass-0-1-6_vs_5 192 9 9 0 0 10.29
glass-0-4_vs_5 184 9 9 0 0 19.44
glass-0-6_vs_5 92 9 9 0 0 9.22
glass0 108 9 9 0 0 11
glass1 214 9 9 0 0 1.82
glass2 214 9 9 0 0 11.59
glass6 214 9 9 0 0 6.38
habermanImb 306 3 0 3 0 2.78
hayes-roth 132 4 0 4 0 1.7
iris0 150 4 4 0 0 2
led7digit 443 7 7 0 0 10.97
new-thyroid 215 5 4 1 0 4.84
new-thyroid1 215 5 4 1 0 5.14
new-thyroid2 215 5 4 1 0 5.14
page-blocks-1-3 548 10 10 0 0 164
page-blocks0 5472 10 4 6 0 8.79
pageblocks 472 10 4 6 0 15.86
penbased 1100 16 16 0 0 1.95
pimaImb 768 8 8 0 0 1.87
shuttle-c0-vs-c4 1829 9 0 9 0 13.87
shuttle-c2-vs-c4 129 9 0 9 0 20.5
thyroid 720 21 6 0 15 36.94
vehicle1 846 18 0 18 0 2.9
vehicle2 846 18 0 18 0 2.88
vehicle3 846 18 0 18 0 2.99
vowel0 988 13 10 3 0 9.98
wine 178 13 13 0 0 1.5
wisconsinImb 683 9 0 9 0 1.86
yeast 1484 8 8 0 0 23.15
yeast-0-2-5-6 1004 8 8 0 0 9.14
yeast-0-2-5-7-9 1004 8 8 0 0 9.14
yeast-0-3-5-9_vs_7-8 506 8 8 0 0 9.12
yeast-0-5-6-7-9_vs_4 528 8 8 0 0 9.35
yeast-1_vs_7 1484 8 8 0 0 2.46
yeast-1-2-8-9_vs_7 947 8 8 0 0 30.57
yeast-1-4-5-8_vs_7 693 8 8 0 0 22.1
yeast-2_vs_4 459 7 7 0 0 14.3
yeast1 514 8 8 0 0 9.08
yeast3 1484 8 8 0 0 8.1
yeast4 1484 8 8 0 0 28.1
yeast5 1484 8 8 0 0 32.73
yeast6 1484 8 8 0 0 41.4

Algorithms used in the experimental study

The algorithms used in the experiments are publicly available in the WEKA and KEEL software tools. These algorithms were also used in the survey on discretization methods by Garcia et al. [17].

Discretization Algorithms

EW (Equal Width binning)
EF (Equal Frequency binning) [1]
IEM (Information Entropy Maximization) [2]
CAIM (Class-Attribute Interdependence Maximization) [3]
Ameva [4]
M-χ2 (Modified Chi2) [5]
HDD (Hypercube Division-based Discretization) [6]
CACC (Class-Attribute Contingency Coefficient) [7]
IDD (Interval Distance-based Discretization) [8]
ur-CAIM (the proposed method)

Classification Algorithms

Naive Bayes [9]
SVM (LIBSVM) [10]
KNN (k-Nearest Neighbors) [11]
AdaBoost [12]
JRip (RIPPER) [13]
PART [14]
C4.5 [15]
Random Forest [16]

Experimental Settings

Discretization Algorithms Parameters
EW
  Number of intervals = number of distinct attribute values / (3 * number of classes); see the sketch after this list.
EF
  Number of intervals = number of distinct attribute values / (3 * number of classes).
IEM
  parameter-free
CAIM
  parameter-free
Ameva
  parameter-free
M-χ2
  parameter-free
HDD
  R coefficient: 0.8 (Adjusts the difference between the local maximum CAIM heuristic and the expected value).
CACC
  parameter-free
IDD
  Neighborhood: 3 (Size of the neighborhood).
  WindowSize: 3 (Size of the window used to compute votes).
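To illustrate the EW/EF interval rule above, here is a minimal Java sketch using WEKA's standard unsupervised Discretize filter. Note that the rule is defined per attribute, whereas this filter takes a single global bin count, so the sketch computes the rule for the first attribute only; the input file is a placeholder.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class EqualWidthExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Number of intervals = distinct attribute values / (3 * number
            // of classes), computed here for attribute 0 as an illustration.
            int distinct = data.attributeStats(0).distinctCount;
            int bins = Math.max(1, distinct / (3 * data.numClasses()));

            Discretize ew = new Discretize();
            ew.setBins(bins);                 // equal width by default
            // ew.setUseEqualFrequency(true); // switch to EF instead
            ew.setInputFormat(data);
            Instances out = Filter.useFilter(data, ew);
            System.out.println("Used " + bins + " bins per attribute");
        }
    }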
Classification Algorithms Parameters
NaiveBayes
  parameter-free
SVM
  Type: C-SVM (The type of SVM to use).
  coef0: 0 (The SVM coefficient).
  cost: 1 (The cost parameter for C-SVM).
  KernelType: radial basis function (RBF) (The type of kernel to use).
  degree: 3 (The degree of the kernel).
  eps: 0.001 (The tolerance of the termination criterion).
KNN
  Number of neighbors: 3 (The number of k-neighbors).
  Distance function: Euclidean.
AdaBoost
  Number of iterations: 10 (The number of iterations to be performed).
  Resampling: false.
JRip
  Pruning: true.
  Folds: 3 (Determines the amount of data used for pruning. One fold is used for pruning, the rest for growing the rules).
  Optimizations: 2 (The number of optimization runs).
  minNo: 2 (The minimum total weight of the instances in a rule).
PART
  minNumObj: 2 (The minimum number of instances per rule).
  ConfidenceFactor: 0.25 (The confidence factor used for pruning).
  numFolds: 3 (Determines the amount of data used for pruning. One fold is used for pruning, the rest for growing the rules).
  Pruning: true.
C4.5 (configured in WEKA as J48; see the sketch after this list)
  minNumObj: 2 (The minimum number of instances per leaf).
  ConfidenceFactor: 0.25 (The confidence factor used for pruning).
  numFolds: 3 (Determines the amount of data used for pruning. One fold is used for pruning, the rest for growing the tree).
  Pruning: true.
RandomForest
  maxDepth: unlimited (The maximum depth of the trees).
  numTrees: 10 (The number of trees to be generated).
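As an illustration of how these settings map onto a concrete toolkit, the Java sketch below configures WEKA's J48 (its C4.5 implementation) with the values listed above; the 10-fold cross-validation and the input file are our own illustrative choices, not a statement of the paper's evaluation protocol.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class C45Example {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // C4.5 settings from the table above (WEKA's J48 implementation).
            J48 c45 = new J48();
            c45.setMinNumObj(2);            // minimum instances per leaf
            c45.setConfidenceFactor(0.25f); // pruning confidence
            c45.setNumFolds(3);             // folds for reduced-error pruning
            c45.setUnpruned(false);         // pruning enabled

            // Illustrative 10-fold cross-validation of the configuration.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c45, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }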

Results

Full results for each discretization and classification algorithm, and for each data set, are available to download in CSV format (.csv). The discretized data sets are available to download for each discretization method (.zip):

Algorithm Balanced Data Unbalanced Data
EW download download
EF download download
IEM download download
CAIM download download
Ameva download download
M-χ2 download download
HDD download download
CACC download download
IDD download download
ur-CAIM download download

References

[1] D. Chiu, A. Wong, and B. Cheung, Information Discovery through Hierarchical Maximum Entropy Discretization and Synthesis, in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, eds., MIT Press, 1991.
[2] U.M. Fayyad and K.B. Irani, Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), Chambéry, France (1993) 1022-1029.
[3] L.A. Kurgan and K.J. Cios. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering 16:2 (2004) 145-153.
[4] L. Gonzalez-Abril, F.J. Cuberos, F. Velasco, and J.A. Ortega. Ameva: An autonomous discretization algorithm. Expert Systems with Applications 36 (2009) 5327-5332.
[5] F.E.H. Tay and L. Shen. A Modified Chi2 Algorithm for Discretization. IEEE Transactions on Knowledge and Data Engineering 14:2 (2002) 666-670.
[6] P. Yang, J.-S. Li, and Y.-X. Huang. HDD: a hypercube division-based algorithm for discretisation. International Journal of Systems Science 42:4 (2011) 557-566.
[7] C.J. Tsai, C.-I. Lee, and W.-P. Yang. A discretization algorithm based on Class-Attribute Contingency Coefficient. Information Sciences 178:3 (2008) 714-731.
[8] F.J. Ruiz, C. Angulo, and N. Agell, IDD: A Supervised Interval Distance-Based Method for Discretization, IEEE Transactions on Knowledge and Data Engineering 20:9 (2008) 1230-1238.
[9] G. John and P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (1995) 338-345.
[10] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011) 1-27.
[11] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13 (1967) 21-27.
[12] Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, Proceedings of the 13th International Conference on Machine Learning, (1996) 148-156.
[13] W.W. Cohen, Fast effective rule induction, Proceedings of the 12th International Conference on Machine Learning, (1995) 115-123.
[14] E. Frank and I.H. Witten, Generating accurate rule sets without global optimization, Proceedings of the 15th International Conference on Machine Learning, (1998) 144-151.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[16] L. Breiman, Random forests, Machine Learning, 45 (2001) 5-32.
[17] S. Garcia, J. Luengo, J. Saez, V. Lopez, and F. Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning, IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750.