Weighted Data Gravitation Classification for Standard and Imbalanced Data

From KDIS Research Group - Wiki


This website contains additional material to the paper titled Weighted Data Gravitation Classification for Standard and Imbalanced Data published in IEEE Transactions on Cybernetics.


Abstract

Gravitation is a fundamental interaction; applying its concept and effects to data classification yields a novel classification technique. The principle of data gravitation classification (DGC) is simple: a data sample is assigned to a class by comparing the gravitation that the different classes exert on it. However, computing this gravitation is not trivial, because data attributes are not equally relevant to distance computation, some attributes may be noisy or irrelevant, and classes may be imbalanced. This paper presents a gravitation-based classification algorithm that improves previous gravitation models and overcomes some of their issues. The proposed algorithm, called DGC+, employs a matrix of weights describing the importance of each attribute in the classification of each class, which is used to weight the distance between data samples. It improves classification performance by considering both global and local data information, especially in borderline decisions. The proposal is evaluated and compared with other well-known instance-based classification techniques on 35 standard and 44 imbalanced data sets. The results show the strong performance of the proposed gravitation model and are validated using several non-parametric statistical tests.
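The gravitation principle described above can be sketched in a few lines of code. This is a minimal illustration, not the paper's exact formulation: the inverse-squared-distance force, the per-attribute weight vector, and the epsilon guard are simplifying assumptions made for the demo.

```java
// Minimal sketch of the gravitation principle behind DGC-style classifiers:
// each class pulls a test sample with a force that decays with the squared,
// attribute-weighted distance to its training samples; the class exerting
// the largest total gravitation wins.  (Illustrative only; DGC+ learns one
// weight vector per class via CMA-ES and uses its own gravitation function.)
public class GravitationSketch {

    // Attribute-weighted squared Euclidean distance; w holds one weight
    // per attribute (in DGC+ these weights are class-specific).
    static double weightedSqDist(double[] a, double[] b, double[] w) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += w[i] * diff * diff;
        }
        return d;
    }

    // Total gravitation of one class on sample x, summed over that class's
    // training samples (epsilon avoids division by zero on exact matches).
    static double gravitation(double[] x, double[][] classSamples, double[] w) {
        double g = 0.0;
        for (double[] t : classSamples) {
            g += 1.0 / (weightedSqDist(x, t, w) + 1e-12);
        }
        return g;
    }

    public static void main(String[] args) {
        double[][] class0 = { {0.0, 0.0}, {0.1, 0.1} };
        double[][] class1 = { {1.0, 1.0}, {0.9, 1.1} };
        double[] w = {1.0, 1.0};           // uniform weights for the demo
        double[] x = {0.2, 0.1};           // test sample near class 0
        double g0 = gravitation(x, class0, w);
        double g1 = gravitation(x, class1, w);
        System.out.println(g0 > g1 ? 0 : 1); // prints 0
    }
}
```

Attribute weighting is what separates DGC+ from a plain inverse-distance rule: an irrelevant attribute can be assigned a near-zero weight for a given class, so it stops distorting that class's distances.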

Software

The DGC+ algorithm is available to download and run using JCLEC. To run the configuration example with the standard Iris data set, just type:

   java -jar jclec4-DGC+.jar Iris.cfg

To run the configuration example with the imbalanced Iris data set, just type:

   java -jar jclec4-DGC+.jar Iris0.cfg

The configuration file can be edited to run the algorithm on any other data set, or with different seeds, numbers of generations, population sizes, etc.

Use the <evaluator type="net.sf.jclec.problem.classification.dgc.DGCEvaluator"/> for standard problems.

Use the <evaluator type="net.sf.jclec.problem.classification.dgc.DGCEvaluatorAUC"/> for imbalanced problems.
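For orientation, such a configuration file might look roughly as follows. This is an illustrative sketch, not the shipped Iris.cfg: only the evaluator type is taken from this page, while the other element names follow common JCLEC conventions and should be checked against the bundled configuration files.

```xml
<!-- Hypothetical fragment; verify element names against the real Iris.cfg -->
<process>
  <rand-gen-factory seed="123456"/>          <!-- seed for stochastic runs -->
  <population-size>50</population-size>      <!-- candidate classifiers -->
  <max-of-generations>100</max-of-generations>
  <!-- evaluator taken from this page; use DGCEvaluatorAUC for imbalanced data -->
  <evaluator type="net.sf.jclec.problem.classification.dgc.DGCEvaluator"/>
</process>
```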

WEKA plugin

The DGC+ algorithm can also be run within the WEKA software tool. Download the package and import it using the WEKA package manager (requires WEKA 3.7.5).

Algorithms used in the experimental study

Algorithms for standard datasets (stand-alone runnable jar files):

Algorithms for imbalanced datasets (stand-alone runnable jar files):

Data sets

The data sets employed have been selected from the KEEL repository website (standard and imbalanced). They vary widely in complexity, number of classes, number of attributes, number of instances, and imbalance ratio. The data sets are available to download (standard and imbalanced).
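The imbalance ratio (IR) reported in the imbalanced data sets table is the conventional one: the number of instances of the majority class divided by the number of instances of the minority class. A minimal sketch (the two-class counts below are illustrative):

```java
// Imbalance ratio: majority-class count divided by minority-class count.
// An IR of 1 means perfectly balanced classes; Abalone19's IR of 129.44
// means the majority class has ~129 instances per minority instance.
public class ImbalanceRatio {
    static double imbalanceRatio(int[] classCounts) {
        int max = classCounts[0], min = classCounts[0];
        for (int c : classCounts) {
            if (c > max) max = c;
            if (c < min) min = c;
        }
        return (double) max / min;
    }

    public static void main(String[] args) {
        // Iris0: 100 majority vs 50 minority instances -> IR = 2.0
        System.out.println(imbalanceRatio(new int[]{100, 50})); // prints 2.0
    }
}
```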

Table: Standard data sets information
Data set # Instances # Attributes # Classes
Appendicitis 106 7 2
Australian 690 14 2
Balance 625 4 3
Banana 5300 2 2
Bupa 345 6 2
Car 1728 6 4
Contraceptive 1473 9 3
Dermatology 366 34 6
Ecoli 336 7 8
Flare 1066 11 6
German 1000 20 2
Glass 214 9 7
Haberman 306 3 2
Hayes-Roth 160 4 3
Heart 270 13 2
Hepatitis 155 19 2
Ionosphere 351 33 2
Iris 150 4 3
Lymphography 148 18 4
Monk-2 432 6 2
New-thyroid 215 5 3
Nursery 12960 8 5
Page-blocks 5472 10 5
Phoneme 5404 5 2
Pima 768 8 2
Saheart 462 9 2
Sonar 208 60 2
Tae 151 5 3
Thyroid 7200 21 3
Tic-tac-toe 958 9 2
Vehicle 846 18 4
Vowel 990 13 11
Wine 178 13 3
Yeast 1484 8 10
Zoo 101 16 7


Table: Imbalanced data sets information
Data set # Instances # Attributes Imbalance ratio
Abalone19 4174 8 129.44
Abalone9-18 731 8 16.4
Ecoli-0-1-3-7_vs_2-6 281 7 39.14
Ecoli-0_vs_1 220 7 1.86
Ecoli1 336 7 3.36
Ecoli2 336 7 5.46
Ecoli3 336 7 8.6
Ecoli4 336 7 15.8
Glass-0-1-2-3_vs_4-5-6 214 9 3.2
Glass-0-1-6_vs_2 192 9 10.29
Glass-0-1-6_vs_5 184 9 19.44
Glass0 214 9 2.06
Glass1 214 9 1.82
Glass2 214 9 11.59
Glass4 214 9 15.47
Glass5 214 9 22.78
Glass6 214 9 6.38
Haberman 306 3 2.78
Iris0 150 4 2
New-Thyroid1 215 5 5.14
New-Thyroid2 215 5 5.14
Page-Blocks-1-3_vs_4 472 10 15.86
Page-Blocks0 5472 10 8.79
Pima 768 8 1.87
Segment0 2308 19 6.02
Shuttle-C0-vs-C4 1829 9 13.87
Shuttle-C2-vs-C4 129 9 20.5
Vehicle0 846 18 3.25
Vehicle1 846 18 2.9
Vehicle2 846 18 2.88
Vehicle3 846 18 2.99
Vowel0 988 13 9.98
Wisconsin 683 9 1.86
Yeast-0-5-6-7-9_vs_4 528 8 9.35
Yeast-1-2-8-9_vs_7 947 8 30.57
Yeast-1-4-5-8_vs_7 693 8 22.1
Yeast-1_vs_7 459 7 14.3
Yeast-2_vs_4 514 8 9.08
Yeast-2_vs_8 482 8 23.1
Yeast1 1484 8 2.46
Yeast3 1484 8 8.1
Yeast4 1484 8 28.1
Yeast5 1484 8 32.73
Yeast6 1484 8 41.4

Gravitation functions

Different gravitation functions have been considered and evaluated experimentally. The linear transformation (first equation) achieved the best overall performance on the data sets in the experimental study.

Image:eq1.gif

Image:eq2.gif

Image:eq3.gif

Image:eq4.gif

Image:eq5.gif

Image:eq6.gif

Results

Standard data sets

These tables show the average accuracy and Cohen's kappa rate obtained by running the algorithms on the standard data sets with stratified 10-fold cross-validation. For stochastic methods, all experiments are repeated with 10 different seeds, and the averages over the test folds are reported.
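Cohen's kappa rate, reported alongside accuracy below, corrects the observed agreement for the agreement expected by chance, which makes it more informative than plain accuracy when class distributions are skewed. A self-contained sketch of the computation from a confusion matrix (the example matrix is illustrative):

```java
// Cohen's kappa from a confusion matrix cm (rows = actual, cols = predicted):
//   kappa = (po - pe) / (1 - pe)
// where po is observed accuracy and pe is the chance agreement implied by
// the row and column marginals.  kappa = 1 means perfect agreement,
// kappa = 0 means no better than chance.
public class Kappa {
    static double kappa(int[][] cm) {
        int k = cm.length, n = 0;
        double diag = 0.0;
        int[] rowSum = new int[k], colSum = new int[k];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++) {
                n += cm[i][j];
                rowSum[i] += cm[i][j];
                colSum[j] += cm[i][j];
                if (i == j) diag += cm[i][j];
            }
        double po = diag / n;
        double chance = 0.0;
        for (int i = 0; i < k; i++) chance += (double) rowSum[i] * colSum[i];
        double pe = chance / ((double) n * n);
        return (po - pe) / (1.0 - pe);
    }

    public static void main(String[] args) {
        int[][] cm = { {45, 5}, {10, 40} }; // 85% accuracy
        System.out.println(kappa(cm));       // prints 0.7 (up to rounding)
    }
}
```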

Accuracy

Image:DGC-R1.png

Image:bonferroni_accuracy_standard.png

Figure: Bonferroni--Dunn test for accuracy and standard data sets

Image:DGC-R2.png

Cohen's kappa rate

Image:DGC-R3.png

Image:bonferroni_kappa_standard.png

Figure: Bonferroni--Dunn test for kappa rate and standard data sets

Image:DGC-R4.png

Imbalanced data sets

These tables show the area under the ROC curve (AUC) and Cohen's kappa rate obtained by running the algorithms on the imbalanced data sets with stratified 10-fold cross-validation. For stochastic methods, all experiments are repeated with 10 different seeds, and the averages over the test folds are reported.
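For a binary problem, the AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (the Mann-Whitney statistic), which is why it is robust to class imbalance. The pairwise estimate below is one common way to compute it and is a sketch, not necessarily the exact estimator used in the experiments:

```java
// Pairwise (Mann-Whitney) estimate of the AUC for binary classification:
// the fraction of (positive, negative) score pairs ranked correctly,
// counting ties as half.  O(P*N); fine for an illustration.
public class Auc {
    static double auc(double[] posScores, double[] negScores) {
        double pairs = 0.0;
        for (double p : posScores)
            for (double q : negScores) {
                if (p > q) pairs += 1.0;       // correctly ranked pair
                else if (p == q) pairs += 0.5; // tie counts as half
            }
        return pairs / (posScores.length * negScores.length);
    }

    public static void main(String[] args) {
        double[] pos = {0.9, 0.8, 0.4};
        double[] neg = {0.5, 0.3, 0.2};
        System.out.println(auc(pos, neg)); // prints 0.888... (8 of 9 pairs)
    }
}
```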

Area under the curve

Image:DGC-R5.png

Image:bonferroni_auc_imbalanced.png

Figure: Bonferroni--Dunn test for the area under the curve and imbalanced data sets

Image:DGC-R6.png

Cohen's kappa rate

Image:DGC-R7.png

Image:bonferroni_kappa_imbalanced.png

Figure: Bonferroni--Dunn test for kappa and imbalanced data sets

Image:DGC-R8.png

Two spirals data set predictions

These figures show the predictions of the best-ranked algorithms from the experimental study on the two spirals data set with Gaussian noise.

Figure: two spirals data set (top), followed by the predictions of DGC+, DGC, KNN, KNN-A, DW-KNN, and SSMA+SFLSDE

Convergence analysis

This figure shows the convergence of the best fitness value on the data sets Sonar (2 classes, 60 attributes), Vowel (11 classes, 13 attributes), Nursery (5 classes, 8 attributes), and Dermatology (6 classes, 34 attributes).

Attribute-class weighting outputs

These figures show the attribute-class weights learned by CMA-ES for the data sets analysed in the previous section. The weights obtained for all the data sets are in this file.


Time performance: DGC+ vs DGC

The following table shows the original number of training instances, the number of artificial data particles created by DGC, and the evaluation time of a candidate classifier for both DGC+ and DGC on the data sets from the experimental study. Times are averaged over 100 evaluations.

The computational complexity of the gravitation calculation is similar for both methods. However, the original DGC builds artificial data particles to speed up classification at some cost in accuracy, whereas DGC+ computes gravitation from all training instances. The speed advantage of a DGC classifier therefore depends on how much the data can be reduced, which in turn depends on the original data distribution: some data sets, such as banana, page-blocks, or thyroid, reduce to far fewer particles, while others cannot be simplified at all. Consequently, a data-reduction preprocessing step is recommended before DGC+ classification whenever the number of instances pushes the computation time beyond the researcher's target.
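Since the gravitation evaluation sums over every point a class holds, its cost is linear in that count, so the speed-up DGC can obtain from its particle reduction is bounded by the ratio of training instances to particles. The sketch below makes that estimate explicit; it assumes roughly equal per-point cost for both methods and ignores constant overheads, so measured speed-ups in the table can fall well below it.

```java
// Back-of-the-envelope upper bound on DGC's speed-up over DGC+:
// evaluation cost is linear in the number of points summed over, so the
// best case is (training instances) / (data particles).  Assumes equal
// per-point cost for both methods; real timings include fixed overheads.
public class SpeedupEstimate {
    static double expectedSpeedup(int trainInstances, int particles) {
        return (double) trainInstances / particles;
    }

    public static void main(String[] args) {
        // banana: 4770 training instances reduced to 16 particles
        System.out.println(expectedSpeedup(4770, 16)); // prints 298.125
    }
}
```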

Table: Time performance: DGC+ vs DGC
Data set Train instances DGC particles DGC+ time (ms) DGC time (ms)
Appendicitis 95 82 0.3248 0.2869
Australian 621 558 3.4040 3.1751
Balance 562 225 1.2662 0.5935
Banana 4770 16 8.1707 0.0784
Bupa 310 178 0.8770 0.5057
Car 1555 914 4.2234 3.1972
Contraceptive 1325 1027 5.9410 3.9952
Dermatology 323 321 5.3921 5.3739
Ecoli 302 234 1.3645 1.0586
Flare 959 339 5.6916 1.9710
German 900 897 8.2059 8.1356
Glass 191 100 0.9352 0.5338
Haberman 275 76 0.5011 0.1745
Hayes-roth 144 61 0.3507 0.1619
Heart 243 241 1.2681 1.2623
Hepatitis 74 73 0.5570 0.5551
Ionosphere 316 308 5.1187 4.8494
Iris 135 50 0.3381 0.1433
Lymphography 133 132 1.0340 1.0160
Monk-2 388 214 1.0772 0.6516
New-thyroid 193 62 0.5152 0.2236
Nursery 11664 9141 107.3627 61.6712
Page-blocks 4924 328 33.0672 1.7722
Phoneme 4863 361 16.2879 0.8975
Pima 691 632 2.2839 2.1357
Saheart 415 405 1.7139 1.7108
Sonar 187 187 8.0320 7.9831
Tae 135 63 0.3959 0.1983
Thyroid 6480 1179 87.0068 12.2842
Tic-tac-toe 863 863 3.9907 3.9885
Vehicle 761 755 6.5558 6.4839
Vowel 891 612 7.3149 4.4131
Wine 160 160 0.8732 0.8741
Yeast 1335 667 7.0530 3.2389
Zoo 89 47 0.6748 0.4077