APARM

From KDIS Research Group - Wiki

Jump to: navigation, search

Contents


[edit] Abstract

This paper treats the first approximation to the extraction of association rules by employing ant programming, a technique that has recently reported very promising results in mining classification rules. In particular, two different algorithms are presented, both guided by a context-free grammar, specifically suited to association rule mining, which defines the search space. The first proposal follows a single-objective approach in which a novel fitness function is used to evaluate the individuals mined. In contrast, the second algorithm considers individual evaluation from a Pareto-based point of view, measuring the confidence and support of the rules mined and assigning them a ranking fitness. Both algorithms are verified over 15 varied data sets, comparing their results to other association rule mining algorithms from several paradigms such as exhaustive search, genetic algorithms, and genetic programming. The results obtained are very promising, and they indicate that ant programming is a good technique for the association task of data mining, lacking of the drawbacks that exhaustive methods present.

[edit] Additional material

[edit] Influence of the alpha parameter

In this section, the influence of the \alpha parameter is studied, checking that good values are those comprised in the interval [0, 1]. If the exponent of the heuristic information belongs to this interval, the parameter does not seem to be very sensitive. In fact, the algorithm is able of obtaining good results just employing the pheromone information, if the number of generations is enough. This way, even in the case of not using any heuristic information, results are good.

On the other hand, it can be seen that the value of the heuristic exponent should not be higher than the value of the pheromone exponent. In the experiments performed, when alpha is set to 5.0, beta being 1.0, results vary drastically, usually obtaining much lower performance regarding one of the measures of interest, or both.

Notice that these results have been obtained by running the GBAP-ARM algorithm, but a similar behavior could be argued for the MOGBAP-ARM algorithm.

Sonar dataset

Alpha Support Confidence Coverage
0.0 0.8050 0.9485 97.11
0.1 0.8067 0.9478 97.11
0.2 0.8033 0.9404 97.11
0.4 0.8033 0.9421 97.11
1.0 0.8043 0.9474 97.11


Soybean dataset

Alpha Support Confidence Coverage
0.0 0.9685 0.9878 100
0.1 0.9688 0.9881 100
0.2 0.9659 0.9906 100
0.4 0.9685 0.9897 100
1.0 0.9674 0.9905 100


Nursery dataset

Alpha Support Confidence Coverage
0.0 0.7683 0.9915 100
0.1 0.7683 0.9915 100
0.2 0.7683 0.9915 100
0.4 0.7683 0.9915 100
1.0 0.7683 0.9915 100

[edit] Influence of the gamma parameter

We have performed a sensitivity analysis over four datasets: soybean, segment, nursery and sonar. Below are the results obtained for each of them varying the gamma parameter.

Sonar dataset

Gamma Support Confidence Coverage
0 0.8079 0.9269 97.11
1 0.8052 0.9415 97.11
2 0.7903 0.9503 96.15
3 0.7867 0.9608 96.15
4 0.7848 0.9629 96.15
5 0.7802 0.9687 96.15


Soybean dataset

Gamma Support Confidence Coverage
0 0.9685 0.9836 100
1 0.9688 0.9894 100
2 0.9568 0.9950 100
3 0.9049 0.9985 100
4 0.9262 0.9983 100
5 0.8836 0.9995 100


Segment dataset

Gamma Support Confidence Coverage
0 0.9807 0.9924 100
1 0.9804 0.9965 100
2 0.9732 0.9968 100
3 0.9582 0.9985 100
4 0.9345 0.9982 100
5 0.9368 0.9985 100


Nursery dataset

Gamma Support Confidence Coverage
0 0.7910 0.8862 100
1 0.7683 0.9915 100
2 0.7683 0.9915 100
3 0.7641 0.9923 100
4 0.7641 0.9923 100
5 0.7641 0.9923 100

[edit] Influence of the number of ants

In a previous work, we proposed the use of AP for the classification task. Then, we employed more than 20 ants does not significantly means that the algorithm behaves better. Because of that experience, even when the current work refers to a different mining task, we checked this behavior for the association rule mining algorithms presented here, confirming that they behave similarly. Here we present the results obtained for GBAP-ARM over three datasets, where the number of ants is varied from 1 to 100, checking that the significant improvement regarding support and confidence performance takes place when selecting 20 ants.

Sonar dataset

Ants Support Confidence Coverage
1 0.7745 0.8928 99.51
5 0.8031 0.9201 98.07
10 0.8033 0.9287 97.11
20 0.8016 0.9449 97.11
50 0.8028 0.9471 95.67
100 0.8028 0.9471 95.67


Soybean dataset

Ants Support Confidence Coverage
1 0.9077 0.9602 100
5 0.9592 0.9787 100
10 0.9664 0.9850 100
20 0.9696 0.9881 100
50 0.9696 0.9907 100
100 0.9701 0.9913 100

Nursery dataset

Ants Support Confidence Coverage
1 0.7633 0.8572 100
5 0.7649 0.9870 100
10 0.7649 0.9904 100
20 0.7684 0.9916 100
50 0.7684 0.9916 100
100 0.7684 0.9916 100

[edit] Complexity of the algorithms

The complexity of both GBAP-ARM and MOGBAP-ARM algorithm has been computed, checking that there exists a linear dependency regarding the number of individuals employed, the number of generations and the number of instances in the data set. Furthermore, there is no dependency regarding the number of attributes in the data set.

Execution time versus number of individuals
Execution time versus number of generations
Execution time versus number of instances
Execution time versus number of attributes

[edit] Comprehensibility in the knowledge discovery process

As it has been pointed out previously, the algorithms presented in this paper generate a set of association rules in the form of IF-THEN rules. The procedure is carried out automatically from the dataset or problem addressed and the parameter configuration employed, the output being the best set of rules achieved during the evolutionary process, fulfilling the grammar. Notice that the only operator considered in the grammar employed for the experimentation has been the AND operator. Nevertheless, the grammar can be modified to take into account other kind of logical operators, so that other kind of rules can be mined.

The rules obtained have several desirable properties: they are simple, intuitive, easily understandable and they provide representative information. As an example, some rules obtained by the GBAP-ARM algorithm over the soybean and the contraceptive method choice (cmc) datasets are displayed.

Sample rules for the soybean dataset:

IF (class != rhizoctonia-root-rot) THEN (mycelium = present)
IF (int-discolor != black AND class != diaporthe-pod-&-stem-blight) THEN (sclerotia = abset)


Sample rules for the cmc dataset:

IF (Meida_exposure != 1 AND standar-of-living-index != 1) THEN (Husbands_education!=1)
IF (Wifes_education!=1) THEN (Husbands_occupation!=4)
IF (Number_of_children_ever_born!=(-inf, 0.5]) THEN (Husbands_education!=1)

[edit] Running example

Since both AP algorithms proposed here are evolutionary algorithms that mimic the behavior of ants in nature, they are probabilistic algorithms that apply a transition rule to select the next movement of a given ant until reaching a solutions. This involves a random component and, therefore, it is not easy to provide a running example of the algorithms since they are not deterministic, as opposed to exhaustive methods such as Apriori or FP-Growth, which look for frequent item sets first and then extract association rules matching the minimum support and confidence thresholds.

Anyway, we include here a sample simulation of how a given ant is created during the first generation of the algorithm, to make easy the comprehension of the algorithm. To do so, we focus on the single objective algorithm proposed, GBAP-ARM, using the weather dataset as example.

[edit] Execution parameters

Parameter Value
Dataset weather
Random seed 1492
Number of ants 20
Number of generations 100
Maximum number of derivations 10
Initial pheromone amount 1.0
Evaporation rate 0.05
Minimum pheromone amount 0.1
Alpha 0.4
Beta 1.0
Minimum support 0

[edit] Grammar and space of states initialization

First of all, the grammar is initialized, as well as the data structure that represents the space of states.

[edit] NON TERMINAL SYMBOLS

<Consequent>
<Rule>
<Condition>
<Antecedent>

[edit] TERMINAL SYMBOLS

normal
sunny
cool
overcast
mild
-->
high
rainy
outlook
play
no
humidity
!=
hot
=
windy
temperature
AND
yes

[edit] PRODUCTION RULES

For each production rule available, the number on the left indicates the number of derivations, and the number on the right, how many final states can be reached from this production rule in that number of derivations.

<Rule> := --> <Antecedent> <Consequent> 
	0 -> 0
	1 -> 0
	2 -> 0
	3 -> 0
	4 -> 0
	5 -> 576
	6 -> 0
	7 -> 13824
	8 -> 0
	9 -> 331776
       10 -> 0
<Antecedent> := AND <Antecedent> <Condition> 
	0 -> 0
	1 -> 0
	2 -> 0
	3 -> 0
	4 -> 576
	5 -> 0
	6 -> 13824
	7 -> 0
	8 -> 331776
	9 -> 0
       10 -> 7962624
<Antecedent> := <Condition> 
	0 -> 0
	1 -> 0
	2 -> 24
<Consequent> := <Condition> 
	0 -> 0
	1 -> 0
	2 -> 24
<Condition> := = outlook rainy 
	0 -> 0
	1 -> 1
<Condition> := != outlook rainy 
	0 -> 0
	1 -> 1
<Condition> := = outlook sunny 
	0 -> 0
	1 -> 1
<Condition> := != outlook sunny 
	0 -> 0
	1 -> 1
<Condition> := = outlook overcast 
	0 -> 0
	1 -> 1
<Condition> := != outlook overcast 
	0 -> 0
	1 -> 1
<Condition> := = temperature cool 
	0 -> 0
	1 -> 1
<Condition> := != temperature cool 
	0 -> 0
	1 -> 1
<Condition> := = temperature mild 
	0 -> 0
	1 -> 1
<Condition> := != temperature mild 
	0 -> 0
	1 -> 1
<Condition> := = temperature hot 
	0 -> 0
	1 -> 1
<Condition> := != temperature hot 
	0 -> 0
	1 -> 1
<Condition> := = humidity normal 
	0 -> 0
	1 -> 1
<Condition> := != humidity normal 
	0 -> 0
	1 -> 1
<Condition> := = humidity high 
	0 -> 0
	1 -> 1
<Condition> := != humidity high 
	0 -> 0
	1 -> 1
<Condition> := = windy yes 
	0 -> 0
	1 -> 1
<Condition> := != windy yes 
	0 -> 0
	1 -> 1
<Condition> := = windy no 
	0 -> 0
	1 -> 1
<Condition> := != windy no 
	0 -> 0
	1 -> 1
<Condition> := = play yes 
	0 -> 0
	1 -> 1
<Condition> := != play yes 
	0 -> 0
	1 -> 1
<Condition> := = play no 
	0 -> 0
	1 -> 1
<Condition> := != play no 
	0 -> 0
	1 -> 1

[edit] Generation 1

Now we present how an ant is created in the first generation of the algorithm, from the initial state, applying the transition rule to select each movement until reaching a final state or solution.

[edit] Create 1st ant

[edit] 1st movement

Current state: <Rule>

Available States:

--> <Antecedent> <Consequent>  --> cardinality probability: 1.0

Roulette:

Roulette point Destination state Transition probability
1.0 --> <Antecedent> <Consequent> 1.0

Random number generated: 0.6689774990340402

Node selected: --> <Antecedent> <Consequent>
[edit] 2nd movement

Current state:

--> <Antecedent> <Consequent> 

Available States:

--> AND <Antecedent> <Condition> <Consequent>  --> cardinality probability: 0.9983361064891847
--> <Condition> <Consequent> --> cardinality probability: 0.0016638935108153079

Roulette:

Roulette point Destination state Transition probability
0.5636225441427081 --> AND <Antecedent> <Condition> <Consequent> 0.5636225441427081
1.0 --> <Condition> <Consequent> 0.4363774558572918

Random number generated: 0.5652586514323275

Node selected:

--> <Condition> <Consequent>
[edit] 3rd movement

Current state:

--> <Condition> <Consequent> 

Available States:

--> != temperature mild <Consequent>  --> cardinality probability: 0.041666666666666664
--> != windy no <Consequent> --> cardinality probability: 0.041666666666666664
--> = temperature hot <Consequent> --> cardinality probability: 0.041666666666666664
--> != outlook overcast <Consequent> --> cardinality probability: 0.041666666666666664
--> = humidity normal <Consequent> --> cardinality probability: 0.041666666666666664
--> != humidity high <Consequent> --> cardinality probability: 0.041666666666666664
--> = play no <Consequent> --> cardinality probability: 0.041666666666666664
--> = outlook rainy <Consequent> --> cardinality probability: 0.041666666666666664
--> = outlook sunny <Consequent> --> cardinality probability: 0.041666666666666664
--> != play no <Consequent> --> cardinality probability: 0.041666666666666664
--> != play yes <Consequent> --> cardinality probability: 0.041666666666666664
--> = temperature mild <Consequent> --> cardinality probability: 0.041666666666666664
--> = outlook overcast <Consequent> --> cardinality probability: 0.041666666666666664
--> = windy no <Consequent> --> cardinality probability: 0.041666666666666664
--> != windy yes <Consequent> --> cardinality probability: 0.041666666666666664
--> = play yes <Consequent> --> cardinality probability: 0.041666666666666664
--> != temperature cool <Consequent> --> cardinality probability: 0.041666666666666664
--> = humidity high <Consequent> --> cardinality probability: 0.041666666666666664
--> = temperature cool <Consequent> --> cardinality probability: 0.041666666666666664
--> != outlook sunny <Consequent> --> cardinality probability: 0.041666666666666664
--> != temperature hot <Consequent> --> cardinality probability: 0.041666666666666664
--> != humidity normal <Consequent> --> cardinality probability: 0.041666666666666664
--> != outlook rainy <Consequent> --> cardinality probability: 0.041666666666666664
--> = windy yes <Consequent> --> cardinality probability: 0.041666666666666664

(Notice that here all the available next states have the same cardinality probability since all of them allow to reach the same number of solutions)

Roulette:

Roulette point Destination state Transition probability
0.044391299540119 --> != temperature mild <Consequent> 0.044391299540119
0.08395731937637962 --> != windy no <Consequent> 0.039566019836260624
0.11759963343732152 --> = temperature hot <Consequent> 0.03364231406094189
0.16613539676754333 --> != outlook overcast <Consequent> 0.0485357633302218
0.20821785298374212 --> = humidity normal <Consequent> 0.042082456216198806
0.25030030919994095 --> != humidity high <Consequent> 0.042082456216198806
0.2870835394738635 --> = play no <Consequent> 0.03678323027392253
0.32386676974778605 --> = outlook rainy <Consequent> 0.03678323027392253
0.3606500000217086 --> = outlook sunny <Consequent> 0.03678323027392253
0.40718276595592984 --> != play no <Consequent> 0.04653276593422124
0.4439659962298524 --> != play yes <Consequent> 0.03678323027392253
0.483532016066113 --> = temperature mild <Consequent> 0.039566019836260624
0.5171743301270549 --> = outlook overcast <Consequent> 0.03364231406094189
0.5615656296671739 --> = windy no <Consequent> 0.044391299540119
0.6059569292072929 --> != windy yes <Consequent> 0.044391299540119
0.6524896951415142 --> = play yes <Consequent> 0.04653276593422124
0.701025458471736 --> != temperature cool <Consequent> 0.0485357633302218
0.7431079146879348 --> = humidity high <Consequent> 0.042082456216198806
0.7767502287488767 --> = temperature cool <Consequent> 0.03364231406094189
0.823282994683098 --> != outlook sunny <Consequent> 0.04653276593422124
0.8718187580133198 --> != temperature hot <Consequent> 0.0485357633302218
0.9139012142295185 --> != humidity normal <Consequent> 0.042082456216198806
0.9604339801637398 --> != outlook rainy <Consequent> 0.04653276593422124
1.0 --> = windy yes <Consequent> 0.039566019836260624

Random number generated: 0.28323309899780924

Node selected:

--> = play no <Consequent>
[edit] 4th movement

Current state:

--> = play no <Consequent> 

Available States:

--> = play no <Condition>  --> cardinality probability: 1.0 

Roulette:

Roulette point Destination state Transition probability
1.0 --> = play no <Condition> 1.0

Random number generated: 0.8882236642325438

Node selected:

--> = play no <Condition>
[edit] 5th movement

Current state:

--> = play no <Condition> 

Available States:

--> = play no = outlook overcast  --> cardinality probability: 0.041666666666666664 
--> = play no = outlook sunny --> cardinality probability: 0.041666666666666664
--> = play no != play no --> cardinality probability: 0.041666666666666664
--> = play no != temperature mild --> cardinality probability: 0.041666666666666664
--> = play no != outlook rainy --> cardinality probability: 0.041666666666666664
--> = play no != outlook overcast --> cardinality probability: 0.041666666666666664
--> = play no = temperature hot --> cardinality probability: 0.041666666666666664
--> = play no != windy no --> cardinality probability: 0.041666666666666664
--> = play no = humidity high --> cardinality probability: 0.041666666666666664
--> = play no = play yes --> cardinality probability: 0.041666666666666664
--> = play no != temperature hot --> cardinality probability: 0.041666666666666664
--> = play no = windy no --> cardinality probability: 0.041666666666666664
--> = play no = humidity normal --> cardinality probability: 0.041666666666666664
--> = play no = temperature mild --> cardinality probability: 0.041666666666666664
--> = play no != play yes --> cardinality probability: 0.041666666666666664
--> = play no = outlook rainy --> cardinality probability: 0.041666666666666664
--> = play no = windy yes --> cardinality probability: 0.041666666666666664
--> = play no = temperature cool --> cardinality probability: 0.041666666666666664
--> = play no != humidity normal --> cardinality probability: 0.041666666666666664
--> = play no != outlook sunny --> cardinality probability: 0.041666666666666664
--> = play no != humidity high --> cardinality probability: 0.041666666666666664
--> = play no = play no --> cardinality probability: 0.041666666666666664
--> = play no != windy yes --> cardinality probability: 0.041666666666666664
--> = play no != temperature cool --> cardinality probability: 0.041666666666666664

Roulette:

Roulette point Destination state Transition probability
0.033642314060941876 --> = play no = outlook overcast 0.033642314060941876
0.07042554433486439 --> = play no = outlook sunny 0.03678323027392252
0.11695831026908561 --> = play no != play no 0.04653276593422122
0.1613496098092046 --> = play no != temperature mild 0.04439129954011898
0.20788237574342583 --> = play no != outlook rainy 0.04653276593422122
0.25641813907364763 --> = play no != outlook overcast 0.048535763330221776
0.2900604531345895 --> = play no = temperature hot 0.033642314060941876
0.3296264729708501 --> = play no != windy no 0.03956601983626061
0.3717089291870489 --> = play no = humidity high 0.04208245621619879
0.41824169512127013 0.4667774584514919 0.048535763330221776
0.5111687579916109 --> = play no = windy no 0.04439129954011898
0.5532512142078096 --> = play no = humidity normal 0.04208245621619879
0.5928172340440703 --> = play no = temperature mild 0.03956601983626061
0.6296004643179928 --> = play no != play yes 0.03678323027392252
0.6663836945919153 --> = play no = outlook rainy 0.03678323027392252
0.7059497144281759 --> = play no = windy yes 0.03956601983626061
0.7395920284891178 --> = play no = temperature cool 0.033642314060941876
0.7816744847053165 --> = play no != humidity normal 0.04208245621619879
0.8282072506395377 --> = play no != outlook sunny 0.04653276593422122
0.8702897068557365 --> = play no != humidity high 0.04208245621619879
0.907072937129659 --> = play no = play no 0.03678323027392252
0.951464236669778 --> = play no != windy yes 0.04439129954011898
1.0 --> = play no != temperature cool 0.048535763330221776

Random number generated: 0.7832965968756976

Node selected:

--> = play no != outlook sunny 

New ant created:

--> <Antecedent> <Consequent>  ----> --> <Condition> <Consequent>  ----> --> = play no <Consequent>  ----> --> = play no <Condition>  ----> --> = play no != outlook sunny
[edit] Evaluate ant

The ant is evaluated using the fitness measure proposed, which is the harmonic mean between confidence and support.

support=0.14285714285714285

confidence=0.4

fitness=0.21052631578947364

[edit] Add created individual to the list ants

[edit] Create more ants following the same procedure until creating 20 ants

[edit] Sort ants in descending order by fitness, including them in the list externalPopulation if there is no other equivalent individual, until it reaches the limit size

Ant number Rule encoded Fitness
1 --> != outlook sunny != play no support=0.5 , confidence=0.7777777777777778 , fitness=0.6086956521739131
2 --> != temperature hot = play yes support=0.5 , confidence=0.7 , fitness=0.5833333333333334
3 --> != play no != windy yes support=0.42857142857142855 , confidence=0.6666666666666666 , fitness=0.5217391304347826
4 --> AND != outlook sunny != temperature cool = humidity high support=0.2857142857142857 , confidence=0.6666666666666666 , fitness=0.4
5 --> AND != temperature cool = humidity high != windy yes support=0.2857142857142857 , confidence=0.5714285714285714 , fitness=0.38095238095238093
6 --> != humidity normal != outlook sunny support=0.2857142857142857 , confidence=0.5714285714285714 , fitness=0.38095238095238093
7 --> != windy yes = humidity normal support=0.2857142857142857 , confidence=0.5, fitness=0.36363636363636365
8 --> = outlook rainy = play yes support=0.21428571428571427 , confidence=0.6, fitness=0.3157894736842105
9 --> = humidity high = windy yes support=0.21428571428571427 , confidence=0.42857142857142855 , fitness=0.2857142857142857
10 --> = play yes = humidity high support=0.21428571428571427 , confidence=0.3333333333333333 , fitness=0.2608695652173913
11 --> AND != outlook sunny != outlook rainy != humidity high support=0.14285714285714285 , confidence=0.5 , fitness=0.22222222222222224
12 --> AND != temperature cool != temperature mild = play yes support=0.14285714285714285 , confidence=0.5 , fitness=0.22222222222222224
13 --> = play no != outlook sunny support=0.14285714285714285 , confidence=0.4 , fitness=0.21052631578947364
14 --> = windy no = temperature cool support=0.14285714285714285 , confidence=0.25 , fitness=0.18181818181818182
15 = play yes = temperature hot support=0.14285714285714285 , confidence=0.2222222222222222 , fitness=0.17391304347826086
16 --> AND AND = outlook overcast = temperature cool = windy yes != humidity high support=0.07142857142857142 , confidence=1.0 , fitness=0.13333333333333333
17 --> AND = temperature mild = play no = windy yes support=0.07142857142857142 , confidence=0.5 , fitness=0.125
18 --> AND AND != windy no = humidity normal != temperature mild != outlook overcast support=0.07142857142857142 , confidence=0.5 , fitness=0.125
19 --> AND AND = windy yes != temperature mild = play no != temperature cool support=0.0 , confidence=0.0 , fitness=0.0
20 --> AND AND != temperature hot = windy yes != outlook overcast = outlook sunny support=0.0 , confidence=0.0 , fitness=0.0

[edit] For each individual in list ants, reinforce pheromone amount in their path if their fitness>0.5

[edit] Evaporation and normalization

[edit] ants <- externalPopulation individuals

...

[edit] Repeat until last generation

[edit] Return individuals in externalPopulation list as association rules discovered

Personal tools