G3PARM

Abstract

This paper presents G3PARM (Grammar Guided Genetic Programming for Association Rule Mining), a proposal for the extraction of association rules that makes the extracted knowledge more expressive and flexible. The algorithm allows a context-free grammar to be adapted and applied to each specific problem or domain, and eliminates the problems raised by discretization. The proposal keeps the best individuals (those that exceed certain support and confidence thresholds) obtained over the generations in an auxiliary population of fixed size n. G3PARM obtains solutions within specified time limits and does not require the large amounts of memory that exhaustive search algorithms in the field of association rules do.

Our approach is compared to exhaustive search (Apriori and FP-Growth) and genetic (QuantMiner and ARMGA) algorithms for mining association rules, and the mined rules are analysed. Finally, a series of experiments contrasts the scalability of our algorithm. The proposal obtains a small set of rules with high support and confidence, over 90% and 99% respectively. Moreover, the resulting set of rules satisfies nearly all the dataset instances. These results illustrate that our proposal is highly promising for the discovery of association rules in different types of datasets.

Paper versions

Draft

J.M. Luna, J.R. Romero and S. Ventura. Design and Behaviour Study of a Grammar Guided Genetic Programming Algorithm for Mining Association Rules. Knowledge and Information Systems Journal. 2011

Paper published online

J.M. Luna, J.R. Romero and S. Ventura. Design and Behaviour Study of a Grammar Guided Genetic Programming Algorithm for Mining Association Rules. Knowledge and Information Systems Journal. 2012, Volume 32, Issue 1, pp 53-76

G3PARM Algorithm

The G3PARM algorithm is a proposal to obtain association rules for any domain or problem. This algorithm makes use of grammar guided genetic programming (G3P) to define expressive and understandable individuals. These individuals are defined by using a context-free grammar (CFG) that establishes the syntax constraints for each one, allowing both categorical and quantitative attributes to be defined and obtaining feasible solutions without requiring large amounts of memory. The algorithm also obtains numerical rules directly, eliminating the problem of choosing the number of bins in a discretization step.

Encoding

Each problem solution is composed of two distinct components: (a) a genotype, represented by a derivation syntax-tree, and (b) a phenotype, which represents the rule. The phenotype represents the complete rule, consisting of an antecedent and a consequent. Both the antecedent and the consequent of each rule are formed by a series of conditions on attribute values, all of which must be satisfied.

Figure 1 shows the CFG through which the population individuals are encoded. Individuals are defined as a derivation syntax-tree where the root is the symbol S, the internal nodes contain only non-terminal symbols and the leaf nodes contain only tokens. Each individual is a sentence generated by the grammar and represented by a derivation syntax-tree. To generate the sentence, the derivation syntax-tree is obtained by applying a series of derivation steps.

Figure 1: Context-free grammar expressed in Extended BNF notation

The use of a CFG allows individuals to be obtained for each specific problem or domain. The chromosome encodes its expression using a pre-order traversal of the parse tree. It should be noted that the terminal grammar symbol ‘name’ is instantiated from the feasible attributes, and the value assigned to each attribute is chosen from among its available values. For example, using the well-known weather dataset, a set of available values is obtained for each attribute as shown in Table 1.

Table 1: Metadata defined in the weather dataset

Figure 2 shows a sample representation of an individual in the G3PARM algorithm using the weather dataset (see Table 1). The syntax-tree structure is obtained according to the grammar defined. As mentioned above, the phenotype represents the rule and is obtained by eliminating the non-terminal symbols of the genotype.

Figure 2: How to represent the individuals
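To make the genotype-to-phenotype mapping concrete, the following minimal Java sketch (with hypothetical class and method names, not taken from the actual G3PARM source) shows how a rule string can be recovered from a derivation syntax-tree by a pre-order traversal that keeps only the terminal tokens:

// Illustrative sketch only: names are hypothetical, not the published G3PARM code.
// It shows how a phenotype (the rule text) can be recovered from a derivation
// syntax-tree by keeping only the terminal symbols in pre-order.
import java.util.ArrayList;
import java.util.List;

class SyntaxNode {
    String symbol;                     // grammar symbol stored in this node
    boolean terminal;                  // true for tokens, false for non-terminals
    List<SyntaxNode> children = new ArrayList<>();

    SyntaxNode(String symbol, boolean terminal) {
        this.symbol = symbol;
        this.terminal = terminal;
    }

    /** Pre-order traversal that drops non-terminals and keeps the tokens. */
    String decodePhenotype() {
        StringBuilder rule = new StringBuilder();
        collectTerminals(this, rule);
        return rule.toString().trim();
    }

    private static void collectTerminals(SyntaxNode node, StringBuilder out) {
        if (node.terminal) {
            out.append(node.symbol).append(' ');
        }
        for (SyntaxNode child : node.children) {
            collectTerminals(child, out);
        }
    }
}

For the individual of Figure 2, such a traversal yields the rule text directly from the leaves of the tree.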

The generation of an individual is carried out from the start symbol of the grammar. From this symbol, the algorithm searches for a feasible solution through the random application of production rules belonging to set P, until a valid derivation chain is reached. The maximum number of derivations is determined by a value provided in the algorithm configuration parameters.

To carry out the derivation of the non-terminal symbols that appear in the grammar, we use the concept of cardinality, defined as the number of derivation steps necessary to reach only terminal symbols. If a non-terminal symbol can be derived in several ways, its cardinality is the sum of the cardinalities of each of its possible derivations. In the derivation process, a production whose number of derivation steps lies between the minimum and the maximum allowed is randomly selected. Once a production is chosen, the maximum number of derivation steps is decreased by one. The higher the cardinality of a production, the higher the probability of choosing it, always respecting the maximum number of derivations.
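The following Java sketch illustrates one possible reading of this cardinality-guided derivation step; all names are hypothetical and the handling of the derivation budget is simplified with respect to the published algorithm:

// Hypothetical sketch of cardinality-guided derivation, not the published code.
// Productions whose expansion still fits in the remaining derivation budget are
// selected with probability proportional to their cardinality (assumed >= 1).
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class Production {
    String text;        // e.g. "<antecedent> ::= <cmp> AND <antecedent>"
    int cardinality;    // derivation steps needed to reach only terminal symbols
    int minDerivations; // minimum additional derivation steps this production requires

    Production(String text, int cardinality, int minDerivations) {
        this.text = text;
        this.cardinality = cardinality;
        this.minDerivations = minDerivations;
    }
}

class DerivationChooser {
    private final Random random = new Random();

    /** Picks a feasible production, weighting the choice by cardinality. */
    Production choose(List<Production> alternatives, int remainingDerivations) {
        List<Production> feasible = new ArrayList<>();
        int totalCardinality = 0;
        for (Production p : alternatives) {
            if (p.minDerivations <= remainingDerivations) {   // respect the budget
                feasible.add(p);
                totalCardinality += p.cardinality;
            }
        }
        if (feasible.isEmpty()) {
            return null;            // nothing fits: the caller must backtrack or fail
        }
        if (totalCardinality <= 0) {
            return feasible.get(0); // degenerate case: fall back to the first option
        }
        int ticket = random.nextInt(totalCardinality);
        for (Production p : feasible) {
            ticket -= p.cardinality;
            if (ticket < 0) {
                return p;
            }
        }
        return feasible.get(feasible.size() - 1);  // unreachable safeguard
    }
}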

Genetic operators

In this section, two new genetic operators, developed to generate new individuals in a given generation of the EA process, are presented. With these genetic operators, we try to obtain rules with higher support than the original ones. The support of an association rule depends on the frequency of its attributes, so the greater the frequency of occurrence of rule attributes, the greater the probability of increasing the support of the entire rule. However, increasing the frequency of occurrence of a rule attribute does not always imply an increase in the support of the entire rule.

Crossover

The pseudocode of this genetic operator is shown in Algorithm 1, and Figure 3 shows this process graphically. Like crossover operators in most GP approaches, the main idea of this operator is to obtain two new individuals from two different parents. In this proposal, the attribute with the lowest frequency of occurrence in one parent is swapped for the highest one in the other parent. This enables us to obtain at least one individual whose attributes appear more frequently than at least one of the parents.

Figure 3: Example of crossover operation

The operator creates new individuals by exchanging the derivation subtrees of two parents (Parent1 and Parent2) from two selected nodes with the same non-terminal symbol in each of them. To select a node from Parent1, the attribute of the rule with the highest support is chosen for exchange, and the non-terminal symbol Comparison (<cmp>) from which this attribute is derived is taken. To select a node from Parent2, the attribute of the rule with the lowest support is chosen to be exchanged and the non-terminal symbol Comparison is taken. In this example, we consider that the frequencies of occurrence of the attributes are: 25 for ‘att1 != value1’, 53 for ‘att2 >= value2’, 51 for ‘att3 > value3’, 27 for ‘att4 < value4’, and 17 for ‘att5 = value5’.
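The Java fragment below is a simplified, hypothetical sketch of this idea. For brevity it flattens each parent rule into a list of conditions instead of manipulating the actual <cmp> derivation subtrees, but the selection criterion (highest frequency in one parent, lowest in the other) is the one described above:

// Simplified, hypothetical sketch of the crossover idea; identifiers are illustrative.
// In G3PARM the swap happens between <cmp> derivation subtrees of the two parents;
// here a rule is flattened into a list of conditions to keep the sketch short.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Condition {
    String text;       // e.g. "att2 >= value2"
    double frequency;  // frequency of occurrence of this single condition

    Condition(String text, double frequency) {
        this.text = text;
        this.frequency = frequency;
    }
}

class CrossoverSketch {
    /** Swaps the most frequent condition of offspring1 with the least frequent of offspring2. */
    static void cross(List<Condition> offspring1, List<Condition> offspring2) {
        Condition best = offspring1.stream()
                .max(Comparator.comparingDouble((Condition c) -> c.frequency)).orElseThrow();
        Condition worst = offspring2.stream()
                .min(Comparator.comparingDouble((Condition c) -> c.frequency)).orElseThrow();
        int i = offspring1.indexOf(best);
        int j = offspring2.indexOf(worst);
        offspring1.set(i, worst);   // offspring1 gives away its strongest condition...
        offspring2.set(j, best);    // ...which now strengthens offspring2
    }

    public static void main(String[] args) {
        List<Condition> p1 = new ArrayList<>(List.of(
                new Condition("att1 != value1", 25), new Condition("att2 >= value2", 53)));
        List<Condition> p2 = new ArrayList<>(List.of(
                new Condition("att4 < value4", 27), new Condition("att5 = value5", 17)));
        cross(p1, p2);   // "att2 >= value2" and "att5 = value5" change places
    }
}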

Mutation

The mutation operator process is shown graphically in Figure 4 and its pseudocode is shown in Algorithm 2. The main idea of this genetic operator is to try to obtain an individual with greater support. To this end, the attribute with the lowest frequency of occurrence is mutated in an attempt to obtain an attribute with a higher frequency. Within the subtree of this attribute, a node is selected randomly and the next action depends on the symbol type. There are two possibilities: if the node is a non-terminal symbol (e.g. Comparison), a new derivation is performed from this symbol; if the node selected is a terminal symbol, its value is changed at random. In this example, we consider that the frequencies of attribute occurrence are 53 for ‘att1 != value1’ and 25 for ‘att2 = value2’.

Figure 4: Example of mutation operation
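A comparable simplified sketch of the mutation step is given below; identifiers are illustrative and the condition is handled as plain text rather than as a derivation subtree:

// Hypothetical sketch of the mutation idea, not the actual G3PARM code. The subtree
// of the least frequent attribute is altered: a non-terminal node triggers a fresh
// derivation (here, a new comparison operator), a terminal node gets a new random value.
import java.util.List;
import java.util.Random;

class MutationSketch {
    private static final Random RANDOM = new Random();

    /** Replaces either the operator or the value of the weakest condition at random. */
    static String mutate(String weakestCondition, List<String> operators, List<String> values) {
        String[] parts = weakestCondition.split(" ");   // e.g. {"att2", "=", "value2"}
        if (RANDOM.nextBoolean()) {
            // non-terminal case: re-derive the comparison, i.e. draw a new operator
            parts[1] = operators.get(RANDOM.nextInt(operators.size()));
        } else {
            // terminal case: keep the structure and draw a new value for the attribute
            parts[2] = values.get(RANDOM.nextInt(values.size()));
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String mutated = mutate("att2 = value2",
                List.of("=", "!="), List.of("value2", "value6", "value7"));
        System.out.println(mutated);   // e.g. "att2 != value2" or "att2 = value7"
    }
}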

Dealing with categorical attributes

Categorical attributes are commonly used in most algorithms for mining association rules. A categorical attribute A of an association rule is an attribute that has a discrete unordered domain D. In our proposal, an expression of the form A = u or A != u is allowed, where A is a categorical attribute and u is a value in the domain D of A. The expression A != u indicates that A takes any value in D\{u}.

For a domain D of a categorical attribute, the support of any value u in this domain might be very low. Using the operator != it is possible to obtain a higher support for this attribute. For example, for a categorical attribute A in a domain D = {a, b, c, d}, the support of A = a is 0.14, whereas the support of A != a will be 0.86.
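Assuming that every record holds exactly one value of A and that there are no missing values, the support of A != u is simply the complement of the support of A = u, as the following tiny sketch shows for the example above:

// Minimal arithmetic sketch of the "!=" advantage for categorical attributes.
// Assumption: each record holds exactly one value of A and none are missing, so
// support(A != u) = 1 - support(A = u).
class CategoricalSupportSketch {
    public static void main(String[] args) {
        double supportEquals = 0.14;                 // support of "A = a" from the text
        double supportNotEquals = 1.0 - supportEquals;
        System.out.printf("support(A = a)  = %.2f%n", supportEquals);
        System.out.printf("support(A != a) = %.2f%n", supportNotEquals);  // 0.86
    }
}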

Dealing with numerical attributes

Numerical attributes are usually defined within a wide range of numerical values. It is useless to work on all possible numerical values, as done with categorical values, because in most cases, a given numerical value will not appear frequently and each numerical attribute might have an enormous number of values appearing in the dataset. A classical way to deal with numerical attributes is to divide their domains into different intervals by the discretization pre-processing operation. This process may create new problems, e.g., the greater the number of intervals, the greater the execution time since the number of items per record soars. There is a tradeoff between faster execution time with fewer intervals and reducing information loss with more intervals.

The algorithm proposed uses a new alternative to deal with numerical attributes: different equal-width points are taken in the range of values and these points are used as the values of the numerical attribute. The upper and lower bounds of the range are not taken into account, in order to avoid rules that always occur and are therefore uninteresting, since they do not provide new information. For example, for a numerical attribute A, if we obtain five equal-width points from the range of values [1000, 2000] (i.e., 1000, 1250, 1500, 1750 and 2000) and discard the two bounds, the possible values are 1250, 1500 and 1750. Using the CFG (see Figure 1), for this example we could obtain expressions such as A >= 1250, A < 1250, A > 1250, etc. A rule always occurs if we take the uppermost (value 2000) or lowermost (value 1000) bound of the range (e.g., A >= 1000).
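A minimal sketch of how these equal-width points could be generated (not the published code) is shown below; with five points over [1000, 2000] it produces exactly the interior values 1250, 1500 and 1750:

// Sketch (not the published code) of generating the equal-width points of a numerical
// attribute: the range is split by n points including the bounds, and the bounds
// themselves are discarded to avoid trivially true rules.
import java.util.ArrayList;
import java.util.List;

class EqualWidthPointsSketch {
    /** Returns the interior points obtained from n equal-width points over [lower, upper]. */
    static List<Double> interiorPoints(double lower, double upper, int n) {
        double width = (upper - lower) / (n - 1);    // n points define n - 1 equal steps
        List<Double> points = new ArrayList<>();
        for (int i = 1; i < n - 1; i++) {            // skip i = 0 and i = n - 1 (the bounds)
            points.add(lower + i * width);
        }
        return points;
    }

    public static void main(String[] args) {
        // With five points over [1000, 2000] the usable values are 1250, 1500 and 1750.
        System.out.println(interiorPoints(1000, 2000, 5));
    }
}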

One of the main problems when dividing the range of values into intervals is the correct definition of the width of each interval. If the intervals are small, the support for any single interval may be low and rules involving this attribute may not be found; on the other hand, if the intervals are big, the resulting rules might be uninteresting. Our proposal avoids this problem since it allows interval rules to be obtained with different interval widths within the same attribute. For example, using the attribute A mentioned above, we can obtain the interval A > 1250 AND A < 1500 for (1250, 1500) or A > 1250 AND A <= 1500 for (1250, 1500].

Unlike the discretization method, our proposal deals directly with numerical attributes and does not divide them into intervals by assigning a categorical value to each interval. This way of handling numerical attributes avoids the problem of obtaining small intervals with possibly low support, for which rules involving the attribute might not be found. Therefore, for any distance between points in the range of an attribute’s values, it is possible to obtain support that exceeds the minimum support threshold. In addition, it is not necessary to perform a previous discretization step (data pre-processing) to use the algorithm, which is another advantage of our proposal.

Evaluation

Evaluation is a process that comprises different activities. Firstly, the individual must be decoded by finding the association rule that corresponds to the genotype. This process consists of building the expression of the syntax tree by means of a depth-first traversal, removing the non-terminal symbols that appear, and then verifying that the individual does not use the same attribute in both the antecedent and the consequent. If the antecedent and consequent of an individual are not disjoint itemsets, a fitness value of 0 is assigned.
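The sketch below illustrates this evaluation step with hypothetical names; for the purposes of the example, the fitness of a valid individual is taken to be the support of its rule, in line with the ranking by support described in the next section:

// Illustrative evaluation sketch; names are hypothetical. An individual whose
// antecedent and consequent share an attribute receives fitness 0; otherwise, for
// this sketch, fitness is taken to be the support of the rule.
import java.util.HashSet;
import java.util.Set;

class EvaluationSketch {
    /** Returns 0 if the two itemsets are not disjoint, otherwise the rule support. */
    static double evaluate(Set<String> antecedentAttrs, Set<String> consequentAttrs,
                           double ruleSupport) {
        Set<String> overlap = new HashSet<>(antecedentAttrs);
        overlap.retainAll(consequentAttrs);              // attributes used on both sides
        return overlap.isEmpty() ? ruleSupport : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(Set.of("outlook", "humidity"), Set.of("play"), 0.35)); // 0.35
        System.out.println(evaluate(Set.of("outlook"), Set.of("outlook"), 0.35));          // 0.0
    }
}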

Algorithm

The algorithm proposed follows the classical generational scheme. It uses two populations with a fixed size, one for the individuals obtained throughout generations and the other one (the auxiliary population) for the individuals that exceed a certain minimum quality threshold. This auxiliary population is updated in each generation keeping the best individuals throughout the evolutionary process. The pseudocode of the algorithm is shown in Algorithm 4.

The algorithm starts by producing the population, randomly generating individuals from the CFG defined in Figure 1 without exceeding the maximum number of derivations. In the initial generation, the auxiliary population is empty. A set of individuals is then selected via binary tournament from the merging of the current population and the auxiliary population. This selector works by taking two individuals at random from this merged population and, after comparing them, keeping the better of the two. The selected individuals act as parents with a certain crossover probability, and a new set of individuals is obtained. The next step is to mutate the set of individuals resulting from the crossover step, again with a certain probability, as with the crossover operator.

Once the new population has been obtained by crossover and mutation, the auxiliary population is updated by merging the previous auxiliary population with the current population. Individuals are then ranked according to their support and those that are equal are eliminated. The G3PARM algorithm considers two individuals to be equal if, despite having different genotypes, they are composed of the same attributes. For example, the rules IF (Condition1 AND Condition2) THEN Condition3 and IF (Condition2 AND Condition1) THEN Condition3 are considered equal. If two individuals have the same attributes, the one with the lowest fitness is removed. If they have the same attributes and the same fitness, the least restrictive one is removed; e.g., the rule IF A >= 3 THEN B < 4 is removed because the rule IF A > 3 THEN B < 4 is more restrictive and both have the same fitness. When comparing two conditions, the more restrictive one is the one that represents the smaller set of values. Using the rules above, for an attribute A with the range of values [A-initial, A-final], the condition A >= 3 represents the interval [3, A-final], whereas the condition A > 3 stands for (3, A-final], so the latter represents the smaller interval and is therefore the more restrictive.

From the resulting set, the individuals that exceed the support and confidence thresholds are selected. The algorithm terminates once it reaches a certain number of generations, returning the individuals of the auxiliary population.
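The following Java skeleton summarizes this generational scheme; all identifiers are hypothetical, the genetic operators are left as stubs, and the elimination of equal individuals is omitted for brevity:

// High-level sketch of the generational scheme described above. All identifiers are
// hypothetical and the genetic operators are stubs: the point is only the control
// flow (selection, crossover, mutation, update of the auxiliary population).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class G3ParmLoopSketch {
    static final int POP_SIZE = 50, AUX_SIZE = 20, GENERATIONS = 100;
    static final double MIN_SUPPORT = 0.70, MIN_CONFIDENCE = 0.90;

    record Rule(String phenotype, double support, double confidence) { }

    public static void main(String[] args) {
        List<Rule> population = generateFromGrammar(POP_SIZE);   // random CFG derivations
        List<Rule> auxiliary = new ArrayList<>();                // empty at generation 0

        for (int g = 0; g < GENERATIONS; g++) {
            List<Rule> parents = tournamentSelection(merge(population, auxiliary));
            List<Rule> offspring = mutate(crossover(parents));   // both applied with a probability

            // Update the auxiliary population: merge, keep only rules above both thresholds,
            // rank by support and truncate to the fixed size (duplicate removal omitted here).
            List<Rule> candidates = merge(auxiliary, offspring);
            candidates.removeIf(r -> r.support() < MIN_SUPPORT || r.confidence() < MIN_CONFIDENCE);
            candidates.sort(Comparator.comparingDouble(Rule::support).reversed());
            auxiliary = new ArrayList<>(candidates.subList(0, Math.min(AUX_SIZE, candidates.size())));

            population = offspring;                              // next generation
        }
        auxiliary.forEach(r -> System.out.println(r.phenotype()));
    }

    // Stubs standing in for the components sketched in the previous sections.
    static List<Rule> generateFromGrammar(int n) { return new ArrayList<>(); }
    static List<Rule> merge(List<Rule> a, List<Rule> b) {
        List<Rule> out = new ArrayList<>(a); out.addAll(b); return out;
    }
    static List<Rule> tournamentSelection(List<Rule> pool) { return pool; }
    static List<Rule> crossover(List<Rule> parents) { return parents; }
    static List<Rule> mutate(List<Rule> individuals) { return individuals; }
}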

Additional material

Experimental study

In order to obtain the optimal parameters that allow the best results to be reached with the evolutionary proposals, a series of experiments were carried out. For the G3PARM algorithm, the best results are obtained with a population size of 50 individuals, 100 generations, 70% crossover probability, 14% mutation probability, a maximum derivation number of 24, an external population of size 20, a 90% external confidence threshold and a 70% external support threshold. Because most algorithms for mining association rules produce rules with only one attribute in the consequent, we restrict the G3PARM algorithm to obtain rules with a single attribute in the consequent.

For the ARMGA algorithm, the optimal parameters are: a population size of 50 individuals, 100 generations, 100% selection probability, a 90% crossover probability, 1% mutation probability, a maximum rule length of 4 attributes and a 90% confidence threshold. ARMGA does not use a support threshold. Finally, the optimal parameters of the QuantMiner algorithm are: a population size of 250 individuals, 100 generations, 40% mutation probability and 50% crossover probability. Like G3PARM and ARMGA, this algorithm uses a 90% confidence threshold and a 70% support threshold. For the sake of a fair comparison with G3PARM, QuantMiner has been configured to return 20 rules at most, those with the best fitness from the final population.

For the Apriori and FP-Growth algorithms, the same support threshold is used as in G3PARM and QuantMiner (70%), and the same confidence threshold as in the aforementioned evolutionary algorithms (90%).

The results reported for each evolutionary algorithm are the average results obtained by running it ten times, using a different seed each time.

Datasets

Dataset                                 # Instances   # Attributes   Type of Attributes
Automobile Performance                          392              8   Numerical
Cpu_act                                        8192             22   Numerical
Credit                                         1000             21   Categorical and Numerical
House 16H                                     22784             17   Numerical
Wisconsin Breast Cancer                         683             11   Categorical and Numerical
Wisconsin Diagnostic Breast Cancer              569             31   Categorical and Numerical

Software

Runnable jar file

Configuration file

How to execute the algorithm

java -jar G3PARM.jar G3PARM.cfg

The configuration file must be modified with the desired dataset.

WEKA plugin

The G3PARM algorithm can also be run within the WEKA software tool. Download the package and import it using the WEKA package manager (requires WEKA 3.7.5).


Results

Analysis of nominal datasets

As mentioned above, real-world datasets usually consist of numerical values, so exhaustive search algorithms for ARM, which are based on categorical attributes, cannot be used directly on them. To use algorithms like Apriori or FP-Growth, the numerical data must therefore be discretized. Equal-width and equal-frequency methods are two well-known unsupervised discretization proposals. Notice that the equal-frequency discretization method requires some previous information about the data distribution, so in this analysis the equal-width method is used. It determines the minimum and maximum values of each attribute and then divides the range into a number of discrete, equal-width intervals.
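A small sketch of the equal-width method used here (hypothetical code, shown only to fix the idea): each value is mapped to the index of the equal-width interval that contains it:

// Sketch of equal-width discretization, as applied before running Apriori and
// FP-Growth: the attribute range is split into k intervals of identical width and
// each value is replaced by the index of the interval containing it.
class EqualWidthDiscretizerSketch {
    /** Maps a value in [min, max] to an interval index in {0, ..., k - 1}. */
    static int binIndex(double value, double min, double max, int k) {
        double width = (max - min) / k;
        int index = (int) ((value - min) / width);
        return Math.min(index, k - 1);   // the maximum value falls in the last interval
    }

    public static void main(String[] args) {
        // Four intervals over [0, 100]: 12.5 -> 0, 55 -> 2, 100 -> 3.
        System.out.println(binIndex(12.5, 0, 100, 4));
        System.out.println(binIndex(55, 0, 100, 4));
        System.out.println(binIndex(100, 0, 100, 4));
    }
}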

In this analysis, several experiments were carried out with 4, 5, 6, 7 and 8 intervals per numerical attribute. Table 2 shows the results obtained by the four algorithms (Apriori, FP-Growth, ARMGA and G3PARM) with these intervals, where Average supp is the average support of the rule set; Average conf is the average confidence of the rule set; %Instances is the percentage of the dataset instances covered by the rules (expressed on a per unit basis); Average Nrule is the average number of rules in the rule set; HH-N is the House 16H dataset discretized into N intervals; CPU-N is the Cpu_act dataset discretized into N intervals; W-N is the Wisconsin dataset discretized into N intervals; and Cr-N is the Credit dataset discretized into N intervals.

Analysing the results presented in Table 2 (the best results for each measure are set in bold typeface), notice that the G3PARM algorithm obtains rules with better support than the others, and better confidence than Apriori and FP-Growth. The use of the operator ‘!=’ allows higher support to be obtained in datasets with infrequent attributes, as mentioned above. The ARMGA algorithm obtains rules with high confidence since it tries to maximize the relative confidence regardless of the support. However, the G3PARM algorithm obtains rules with confidence close to that of ARMGA; for example, notice that on the Cpu_act and Credit datasets, G3PARM obtains rules with maximum confidence. Furthermore, G3PARM obtains rules that cover 100% of the dataset instances with few rules (a maximum of 20, given by the maximum auxiliary population size). Only on the Wisconsin dataset discretized into 8 intervals does G3PARM obtain a set of rules without completing the auxiliary population; on this dataset it obtains an average of 19.8 rules. It should be noted that the results shown are the averages obtained by running each algorithm ten times with a different seed each time.

Moreover, it should be noted that ARMGA returns, from the regular population, those rules that satisfy a 90% confidence threshold (not requiring any support threshold). In every dataset, this algorithm returns the complete regular population (i.e., the 50 individuals previously established in the configuration stage) because all of them satisfy the confidence threshold. Finally, notice that ARMGA does not compare rules to determine the most restrictive one as G3PARM does. As shown in Table 2, on the Wisconsin and Credit datasets, Apriori and FP-Growth obtain fewer rules than the others because few rules exceed the support threshold. However, with the same support threshold, the G3PARM algorithm obtains more rules by using the operator ‘!=’. The ARMGA algorithm does not have any support threshold, so it obtains rules with very low support. Only on the Cpu_act dataset does ARMGA obtain rules with high support, because this dataset has many frequent attributes.

In order to analyse the results obtained, a series of statistical tests were carried out. The Friedman test is used to compare the results and to determine precisely whether there are significant differences among the algorithms. This test first ranks the j-th of k algorithms on the i-th of N datasets, then calculates the average rank of each algorithm over all the datasets, and finally computes the Friedman statistic. If the Friedman test rejects the null-hypothesis, indicating that there are significant differences, then a Bonferroni–Dunn test is performed to reveal these differences. The performance of G3PARM is evaluated by comparing it to the other algorithms in terms of average support, average confidence and the percentage of instances covered. The average ranking for each algorithm is also shown in Table 2.
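For reference, the standard formulas behind these tests (not quoted verbatim from the paper) are the Friedman statistic, its Iman–Davenport extension, which follows the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, and the Bonferroni–Dunn critical difference, where R_j is the average rank of the j-th of the k algorithms over the N datasets and q_alpha is the critical value of the Bonferroni–Dunn test:

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],
\qquad
F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F},
\qquad
CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}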

Table 2: Results obtained by the algorithms by discretizing the datasets

The Friedman average ranking statistic, distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, is 171.000 for the average support measure, 59.593 for the average confidence measure and 39.461 for the coverage (percentage of instances covered) measure. None of them fall within the critical interval [0, 2.766]. Thus, we reject the null-hypothesis that all algorithms perform equally well for these three measures. The Bonferroni–Dunn test is then used to reveal the differences in performance, the critical difference (CD) value being 0.868 for p = 0.1, 0.977 for p = 0.05 and 1.198 for p = 0.01. The results indicate that, for the support measure, at a significance level of p = 0.01 (i.e., with a probability of 99%), there are significant differences between G3PARM and the other algorithms, the performance of G3PARM being statistically better. Focusing on the confidence measure, at a significance level of p = 0.01, there are significant differences between G3PARM and the Apriori and FP-Growth algorithms; on the other hand, there are no significant differences between G3PARM and ARMGA for this measure. Finally, for the coverage measure, at a significance level of p = 0.01 there are significant differences between G3PARM and ARMGA. Concerning the exhaustive search algorithms for ARM (Apriori and FP-Growth), G3PARM remains the control algorithm and is also competitive, obtaining the best ranking.

Analysis of numerical datasets

As mentioned above, an advantage of our proposal is that it provides understandable rules, allows a context-free grammar to be adapted and applied to each specific problem or domain, and eliminates the problems raised by discretization. Since exhaustive search algorithms for ARM like FP-Growth or Apriori are based on categorical attributes, they cannot be used on the original dataset if it has numerical attributes. Therefore, we compare ARMGA, QuantMiner and G3PARM using real-world datasets without any previous pre-processing step to demonstrate that our proposal performs very well when applied directly to the original datasets. Table 3 shows the results obtained in terms of average support, average confidence, percentage of the dataset instances covered by the rules, and the average number of rules obtained by the different algorithms (ARMGA, QuantMiner and G3PARM). HH is the House 16H dataset; CPU is the Cpu_act dataset; W is the Wisconsin dataset; Cr is the Credit dataset; MPG is the Automobile Performance dataset; and WDiag is the Wisconsin Diagnostic dataset.

Table 3: Results obtained by the algorithms with the original datasets

On analysing the results presented, note that the G3PARM algorithm obtains rules with better support than the QuantMiner or ARMGA algorithms. The ARMGA algorithm does not have a support threshold, so it obtains rules with very low support but higher confidence. As previously commented, only on the Cpu_act dataset does ARMGA obtain rules with high support, because this dataset has many frequent attributes. Furthermore, in most cases, ARMGA obtains rules that cover 100% of the dataset instances, while G3PARM obtains very close results. Focusing on the number of rules mined, the three algorithms obtain rule sets of homogeneous size; only when applying G3PARM to the Cpu_act dataset does the resulting set comprise fewer than 19 rules. Moreover, note that ARMGA returns the complete regular population (50 rules), because all of them satisfy the confidence threshold; this population size was previously established during the set-up. On the other hand, QuantMiner returns the best 20 rules from the final population, i.e., those with the highest fitness value. However, if this algorithm is not constrained to return only the best rules, it may return a rule set as large as its population. Finally, neither ARMGA nor QuantMiner compares rules to determine the most restrictive one as G3PARM does.

Because of the small number of datasets, the Friedman rank test is used to analyse and compare the results obtained. A Bonferroni–Dunn test is performed to reveal the differences that exist when the Friedman rank test rejects the null-hypothesis. Using numerical attributes, G3PARM and the other algorithms are compared in terms of average support, average confidence and the percentage of instances covered. The average ranking for each algorithm is also shown in Table 3.

The Friedman average ranking statistic, distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, is 31.000 for the average support measure, 1 for the average confidence measure and 0.333 for the coverage measure. The Friedman rank test shows that, for k = 3 algorithms and N = 6 datasets, the critical value is 9.000 with alpha = 0.01. The support and confidence measures lead us to reject the null-hypothesis that all algorithms perform equally well for these two measures. The Bonferroni–Dunn test is used to reveal the differences in performance: the CD value is 1.131 for p = 0.1 and 1.293 for p = 0.05. Finally, focusing on the coverage measure, there are no significant differences between the three algorithms.

The results indicate that, for the support measure, at a significance level of p = 0.05 (i.e., with a probability of 95%), there are significant differences between G3PARM and the other algorithms, the performance of G3PARM being statistically better. Focusing on the confidence measure, there are significant differences between QuantMiner and the ARMGA algorithm at a significance level of p = 0.05, ARMGA being statistically better than QuantMiner. On the other hand, ARMGA performs best on the confidence measure, although G3PARM obtains values close to those of ARMGA.

Analysis of scalability

A set of different experiments were also carried out to analyse the computation time of the algorithms using the House 16H dataset discretized in four intervals. Figure 5 shows the relation between the runtime and the number of instances. The Y axis represents time in milliseconds, whereas the X axis stands for the percentage of instances using all attributes. In the same way, Figure 6 shows the relation between the runtime and the number of attributes. The Y axis represents time in milliseconds and the X axis, the number of attributes using 100% of the instances.

Figure 5: Relation between the runtime and the percentage of instances over the discretized House 16H dataset

Figure 6: Relation between the runtime and the number of attributes over the discretized House 16H dataset

Figure 5 clearly shows how the runtime of the exhaustive search algorithms for ARM increases as the size of the dataset increases, in contrast to the runtime of the genetic algorithms. In the EA-based proposals, the execution time remains almost constant.

Figure 6 shows how the exhaustive search algorithms for ARM spend a large amount of time mining association rules when the number of attributes is high. By contrast, the results plotted in these figures show that the genetic algorithms scale quite linearly for the dataset used in the experiment. G3PARM and ARMGA scale better than FP-Growth and Apriori as the number of attributes increases, because the greater the number of attributes, the greater the combinatorial explosion involved in obtaining frequent itemsets. Genetic algorithms are not affected by this combinatorial explosion as the exhaustive search algorithms for ARM are; their computation time increases with the number of evaluations of each individual, independently of the number of attributes.

Several experiments were also carried out to analyse the computation time of the algorithms using the House 16H dataset without any pre-processing step. Like the figures mentioned above, Figures 7 and 8 show the relation between the runtime and the number of instances and attributes, respectively. Examining these figures, we can see how the runtime scales quite linearly when using EAs over numerical attributes. The results plotted in these figures show that QuantMiner and the G3PARM algorithm scale better than ARMGA. However, the EA proposals require more computation time with numerical attributes than with categorical attributes.

Figure 7: Relation between the runtime and the percentage of instances over the original House 16H dataset

Figure 8: Relation between the runtime and the number of attributes over the original House 16H dataset
