Speeding up the evaluation phase of GP classification algorithms on GPUs


This website contains additional material for the paper titled "Speeding up the evaluation phase of GP classification algorithms on GPUs".

Full paper: A. Cano, A. Zafra, and S. Ventura. Speeding up the evaluation phase of GP classification algorithms on GPUs. Soft Computing, 16(2):187-202, 2012.

Abstract

The efficiency of evolutionary algorithms has become a widely studied problem, since it is one of their major weaknesses. Specifically, when these algorithms are applied to the classification task, their computational time grows excessively as problem complexity increases. This paper proposes an efficient, scalable and massively parallel evaluation model using the NVIDIA CUDA GPU programming model to speed up the fitness calculation phase and greatly reduce the computational time. Experimental results show that our model significantly reduces the computational time compared to the sequential approach, reaching a speedup of up to 820X. Moreover, the model is able to scale to multiple GPU devices and can be easily extended to any evolutionary algorithm.

Kernels

This section details the implementation of the two kernels used by each of the studied methods.
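All of the kernels below index the result array as result[blockIdx.x * numberInstances_A + instance], where numberInstances_A is the instance count padded so that each individual's result row starts on an aligned boundary for coalesced accesses. The padding arithmetic can be sketched on the host side as follows (the helper name and the block size value are assumptions for illustration, not taken from the paper):

```c
/* Assumed block size; the actual THREADS_EVAL_BLOCK value is not given
 * on this page. */
enum { THREADS_EVAL_BLOCK = 128 };

/* Pad the instance count up to a multiple of the block size so that each
 * individual's row in the result array starts on an aligned boundary. */
static int alignInstances(int numberInstances)
{
    int remainder = numberInstances % THREADS_EVAL_BLOCK;
    if (remainder == 0)
        return numberInstances;
    return numberInstances + (THREADS_EVAL_BLOCK - remainder);
}
```

For example, the Iris dataset's 150 instances would pad up to 256 under this assumed block size.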

Falco et al. Algorithm

/**
 * Evaluation GPU kernel for Falco Algorithm
 * 
 * @param result            per-instance hit/fail array (one row per individual)
 * @param instancesData     instance attribute values
 * @param instancesClass    instance class labels
 * @param numberAttributes  number of attributes
 * @param numberInstances   number of instances
 * @param numberInstances_A aligned number of instances
 * @param classifiedClass   class currently being classified
 */
__global__ void kernelFalco(unsigned char* result, float* instancesData, int* instancesClass, int numberAttributes, int numberInstances, int numberInstances_A, int classifiedClass) 
{
	int instance = blockDim.y * blockIdx.y + threadIdx.y;
	
	if(instance < numberInstances)
	{
		int resultMemPosition = blockIdx.x * numberInstances_A + instance;
		
		if(covers(instance, &EXPR[MAX_EXPR_LEN * blockIdx.x], instancesData, numberInstances_A))
		{
			if(classifiedClass == instancesClass[instance])
				result[resultMemPosition] = 0; // HIT
			else
				result[resultMemPosition] = 1; // FAIL
		}
		else
		{
			if(classifiedClass != instancesClass[instance])
				result[resultMemPosition] = 0; // HIT
			else
				result[resultMemPosition] = 1; // FAIL	
		}
	}
}
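The device function covers(), shared by the evaluation kernels, interprets an individual's expression over one instance; its code is not listed on this page. As a rough host-side illustration only — the postfix encoding, opcode set, and all names here are invented for the example and are not the paper's actual representation — such an interpreter might look like:

```c
#include <stdbool.h>

/* Illustrative opcodes for a postfix-encoded rule (hypothetical). */
enum { OP_END = 0, OP_ATTR_LT = 1, OP_AND = 2, OP_OR = 3 };

typedef struct { int op; int attr; float threshold; } Token;

/* Evaluate a postfix expression over one instance's attribute values,
 * using a small boolean stack. */
static bool covers_sketch(const float *instance, const Token *expr)
{
    bool stack[32];
    int sp = 0;
    for (int i = 0; expr[i].op != OP_END; i++) {
        switch (expr[i].op) {
        case OP_ATTR_LT:   /* push (attribute < threshold) */
            stack[sp++] = instance[expr[i].attr] < expr[i].threshold;
            break;
        case OP_AND: sp--; stack[sp-1] = stack[sp-1] && stack[sp]; break;
        case OP_OR:  sp--; stack[sp-1] = stack[sp-1] || stack[sp]; break;
        }
    }
    return stack[0];
}
```

In the real kernels the expression is read from constant memory (EXPR) and the instance data is laid out for coalesced access, which this sketch does not attempt to model.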

/**
 * Reduction GPU Confusion Matrix kernel for Falco Algorithm
 * 
 * @param result            per-instance hit/fail array (one row per individual)
 * @param fitness           output fitness value per individual
 * @param numberInstances   number of instances
 * @param numberInstances_A aligned number of instances
 */
__global__ void MC_kernelFalco(unsigned char* result, jfloat* fitness, int numberInstances, int numberInstances_A) 
{
	__shared__ int MC[THREADS_EVAL_BLOCK];
	
	MC[threadIdx.y] = 0;
	
	int base = blockIdx.x*numberInstances_A + threadIdx.y;
	int top =  blockIdx.x*numberInstances_A + numberInstances - base;
	
	// Performs the reduction of the thread corresponding values
	for(int i = 0; i < top; i+=THREADS_EVAL_BLOCK)
	{
		MC[threadIdx.y] += result[base + i];
	}
	
	__syncthreads();

	// Calculates the final amount
	if(threadIdx.y == 0)
	{
		int fails = 0;
		for(int i = 0; i < THREADS_EVAL_BLOCK; i++)
		{
			fails += MC[i];
		}
       		// Set the fitness to the individual
		fitness[blockIdx.x] = 2*fails;
	}
}
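For reference, the quantity the two kernels above compute for each individual can be written sequentially: every instance contributes 0 on a hit and 1 on a fail, and the fitness is twice the fail count, as in the reduction kernel. A plain-C sketch of the per-individual computation (the function name is ours):

```c
/* Sequential reference for kernelFalco + MC_kernelFalco: result holds one
 * byte per instance (0 = hit, 1 = fail) and the fitness is 2 * fails. */
static float falcoFitness(const unsigned char *result, int numberInstances)
{
    int fails = 0;
    for (int i = 0; i < numberInstances; i++)
        fails += result[i];
    return 2.0f * fails;
}
```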

Tan et al. Algorithm

/**
 * Evaluation GPU kernel for Tan Algorithm
 * 
 * @param result            per-instance confusion matrix codes (one row per individual)
 * @param instancesData     instance attribute values
 * @param instancesClass    instance class labels
 * @param numberAttributes  number of attributes
 * @param numberInstances   number of instances
 * @param numberInstances_A aligned number of instances
 * @param classifiedClass   class currently being classified
 */
__global__ void kernelTan(unsigned char* result, float* instancesData, int* instancesClass, int numberAttributes, int numberInstances, int numberInstances_A, int classifiedClass) 
{
	int instance = blockDim.y * blockIdx.y + threadIdx.y;
	
	if(instance < numberInstances)
	{
		int resultMemPosition = blockIdx.x * numberInstances_A + instance;
		
		if(covers(instance, &EXPR[MAX_EXPR_LEN * blockIdx.x], instancesData, numberInstances_A))
		{
			if(classifiedClass == instancesClass[instance])
				result[resultMemPosition] = 0; // TRUE POSITIVE
			else
				result[resultMemPosition] = 2; // FALSE POSITIVE
		}
		else
		{
			if(classifiedClass != instancesClass[instance])
				result[resultMemPosition] = 1; // TRUE NEGATIVE
			else
				result[resultMemPosition] = 3; // FALSE NEGATIVE	
		}
	}
}

/**
 * Reduction GPU Confusion Matrix kernel for Tan Algorithm
 * 
 * @param result            per-instance confusion matrix codes (one row per individual)
 * @param fitness           output fitness value per individual
 * @param numberInstances   number of instances
 * @param numberInstances_A aligned number of instances
 * @param w1                weight applied to false negatives
 * @param w2                weight applied to false positives
 */
__global__ void MC_kernelTan(unsigned char* result, jfloat* fitness, int numberInstances, int numberInstances_A, float w1, float w2) 
{
	__shared__ int MC[THREADS_EVAL_BLOCK*4];
	
	MC[threadIdx.y] = 0;
	MC[threadIdx.y+THREADS_EVAL_BLOCK] = 0;
	MC[threadIdx.y+THREADS_EVAL_BLOCK*2] = 0;
	MC[threadIdx.y+THREADS_EVAL_BLOCK*3] = 0;
	
	// Ensure all counters are zeroed before any thread increments them
	__syncthreads();
	
	int base = blockIdx.x*numberInstances_A + threadIdx.y;
	int top =  blockIdx.x*numberInstances_A + numberInstances - base;
	
	// Performs the reduction of the thread corresponding values
	for(int i = 0; i < top; i+=THREADS_EVAL_BLOCK)
	{
		MC[threadIdx.y*4 + result[base + i]]++;
	}
	
	__syncthreads();
	
	// Each of the first four threads accumulates one confusion matrix entry:
	// MC[0] true positives, MC[1] true negatives, MC[2] false positives,
	// MC[3] false negatives (threadIdx.y-indexed to avoid concurrent writes)
	if(threadIdx.y < 4)
	{
		for(int i = 4; i < THREADS_EVAL_BLOCK*4; i+=4)
		{
			MC[threadIdx.y] += MC[i + threadIdx.y];
		}
	}
	
	// Make the reduced counters visible to thread 0 below
	__syncthreads();
	
	if(threadIdx.y == 0)
	{
		float se, sp;
		int tp = MC[0], tn = MC[1], fp = MC[2], fn = MC[3];
		
		if(tp + fn == 0)
			se = 1;
		else
			se = (float) tp / (tp + w1*fn);
		
		if(tn + fp == 0)
			sp = 1;
		else
			sp = (float) tn / (tn + w2*fp);
		
		// Set the fitness to the individual
		fitness[blockIdx.x] = se * sp;
	}
}
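Sequentially, the Tan fitness computed by the two kernels above is the product of a weighted sensitivity and specificity over the confusion matrix built from the per-instance codes (0 = true positive, 1 = true negative, 2 = false positive, 3 = false negative). A plain-C reference sketch (the function name is ours):

```c
/* Sequential reference for kernelTan + MC_kernelTan: count the confusion
 * matrix entries from the per-instance codes, then combine sensitivity
 * and specificity with the algorithm weights w1 and w2. */
static float tanFitness(const unsigned char *result, int numberInstances,
                        float w1, float w2)
{
    int mc[4] = {0, 0, 0, 0};   /* tp, tn, fp, fn, using the codes above */
    for (int i = 0; i < numberInstances; i++)
        mc[result[i]]++;
    int tp = mc[0], tn = mc[1], fp = mc[2], fn = mc[3];
    float se = (tp + fn == 0) ? 1.0f : (float) tp / (tp + w1 * fn);
    float sp = (tn + fp == 0) ? 1.0f : (float) tn / (tn + w2 * fp);
    return se * sp;
}
```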

Bojarczuk et al. Algorithm

/**
 * Evaluation GPU kernel for Bojarczuk Algorithm
 * 
 * @param result            per-class, per-instance confusion matrix codes
 * @param instancesData     instance attribute values
 * @param instancesClass    instance class labels
 * @param numberAttributes  number of attributes
 * @param numberInstances   number of instances
 * @param numberInstances_A aligned number of instances
 * @param numClasses        number of classes in the dataset
 */
__global__ void kernelBojarczuk(unsigned char** result, float* instancesData, int* instancesClass, int numberAttributes, int numberInstances, int numberInstances_A, int numClasses) 
{
	int instance = blockDim.y * blockIdx.y + threadIdx.y;
	
	if(instance < numberInstances)
	{
		int resultMemPosition = blockIdx.x * numberInstances_A + instance;
		
		if(covers(instance, &EXPR[MAX_EXPR_LEN * blockIdx.x], instancesData, numberInstances_A))
		{
			for(int i = 0; i < numClasses; i++)
					result[i][resultMemPosition] = 2; // FALSE POSITIVE
			result[instancesClass[instance]][resultMemPosition] = 0; // TRUE POSITIVE
		}
		else
		{			
			for(int i = 0; i < numClasses; i++)
					result[i][resultMemPosition] = 1; // TRUE NEGATIVE
			result[instancesClass[instance]][resultMemPosition] = 3; // FALSE NEGATIVE			
		}
	}
}

/**
 * Reduction GPU Confusion Matrix kernel for Bojarczuk Algorithm
 * 
 * @param result            per-class, per-instance confusion matrix codes
 * @param numberInstances   number of instances
 * @param se                output sensitivity per individual
 * @param sp                output specificity per individual
 * @param bestClass         output best class per individual
 * @param numberInstances_A aligned number of instances
 * @param numClasses        number of classes in the dataset
 */
__global__ void MC_kernelBojarczuk(unsigned char** result, int numberInstances, jfloat* se, jfloat* sp, jint* bestClass, int numberInstances_A, int numClasses) 
{
	__shared__ int MC[THREADS_EVAL_BLOCK*4];
	
	float seAux, spAux;	

	int base = blockIdx.x*numberInstances_A + threadIdx.y;
	int top =  blockIdx.x*numberInstances_A + numberInstances - base;
	
	if(threadIdx.y == 0)
	{
		se[blockIdx.x] = -1;
		sp[blockIdx.x] = 1;
	}
	
	// Checks the best class for the individual
	for(int j = 0; j < numClasses; j++)
	{
		MC[threadIdx.y] = 0;
		MC[threadIdx.y+THREADS_EVAL_BLOCK] = 0;
		MC[threadIdx.y+THREADS_EVAL_BLOCK*2] = 0;
		MC[threadIdx.y+THREADS_EVAL_BLOCK*3] = 0;
		
		// Ensure all counters are zeroed before any thread increments them
		__syncthreads();
		
		// Performs the reduction of the thread corresponding values
		for(int i = 0; i < top; i+=THREADS_EVAL_BLOCK)
		{
			MC[threadIdx.y*4 + result[j][base + i]]++;
		}
		
		__syncthreads();
		
		// Each of the first four threads accumulates one confusion matrix
		// entry: MC[0] true positives, MC[1] true negatives, MC[2] false
		// positives, MC[3] false negatives (threadIdx.y-indexed to avoid
		// concurrent writes)
		if(threadIdx.y < 4)
		{
			for(int i = 4; i < THREADS_EVAL_BLOCK*4; i+=4)
			{
				MC[threadIdx.y] += MC[i + threadIdx.y];
			}
		}
		
		// Make the reduced counters visible to thread 0 below
		__syncthreads();
		
		if(threadIdx.y == 0)
		{		
			int tp = MC[0], tn = MC[1], fp = MC[2], fn = MC[3];
			
			if(tp + fn == 0)
				seAux = 1;
			else
				seAux = (float) tp / (tp + fn);
	
			if(tn + fp == 0)
				spAux = 1;
			else
				spAux = (float) tn / (tn + fp);
	
			// If the current class is the best so far for this individual, keep it
			if(seAux*spAux >= se[blockIdx.x]*sp[blockIdx.x])
			{
				se[blockIdx.x] = seAux;
				sp[blockIdx.x] = spAux;
				bestClass[blockIdx.x] = j;
			}
		}
		
		__syncthreads();
	}
}
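Sequentially, the best-class search performed by MC_kernelBojarczuk amounts to: for each candidate class, build the confusion matrix from that class's result row (codes 0..3 as above), compute unweighted sensitivity and specificity, and keep the class maximizing their product. A plain-C sketch (the function name and output-parameter style are ours):

```c
/* Sequential reference for MC_kernelBojarczuk: result[j] is the code row
 * for candidate class j; returns the best class and writes its se/sp. */
static int bojarczukBestClass(unsigned char **result, int numberInstances,
                              int numClasses, float *seOut, float *spOut)
{
    float bestProduct = -1.0f;   /* matches the kernel's se = -1 initializer */
    int best = 0;
    for (int j = 0; j < numClasses; j++) {
        int mc[4] = {0, 0, 0, 0};   /* tp, tn, fp, fn */
        for (int i = 0; i < numberInstances; i++)
            mc[result[j][i]]++;
        int tp = mc[0], tn = mc[1], fp = mc[2], fn = mc[3];
        float se = (tp + fn == 0) ? 1.0f : (float) tp / (tp + fn);
        float sp = (tn + fp == 0) ? 1.0f : (float) tn / (tn + fp);
        if (se * sp >= bestProduct) {   /* >= as in the kernel: ties go to later classes */
            bestProduct = se * sp;
            best = j;
            *seOut = se;
            *spOut = sp;
        }
    }
    return best;
}
```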

Datasets

The datasets employed have been selected from the UCI machine learning repository and the KEEL repository websites. They vary widely in complexity, number of classes, number of features and number of instances.

Dataset        # Instances   # Attributes   # Classes
Iris                   150              4           3
New-thyroid            215              5           3
Ecoli                  336              7           8
Contraceptive         1473              9           3
Thyroid               7200             21           3
Penbased             10992             16          10
Shuttle              58000              9           7
Connect-4            67557             42           3
Kddcup              494020             41          23
Poker              1025010             10          10

Results

The execution times and speedups of the three classification algorithms on the problems considered are shown in the three following tables. The columns indicate, from left to right: the dataset, the execution time of the native sequential version coded in Java (in seconds), and the speedup of the proposed model using JNI with one CPU thread (1C), two CPU threads (2C), four CPU threads (4C), one GTX 285 GPU, two GTX 285 GPUs, one GTX 480 GPU, and two GTX 480 GPUs. The results correspond to executions of the algorithms with a population of 200 individuals and 100 generations.

Falco et al. Algorithm

Dataset        Java (s)     1C     2C     4C   1x GTX 285   2x GTX 285   1x GTX 480   2x GTX 480
Iris                2.0   0.48   0.94   1.49         2.96         4.68         2.91         8.02
New-thyroid         4.0   0.54   1.03   1.99         4.61         9.46         5.18        16.06
Ecoli              13.7   0.49   0.94   1.38         6.36        10.92         9.05        17.56
Contraceptive      26.6   1.29   2.52   3.47        31.43        55.29        50.18        93.64
Thyroid           103.0   0.60   1.15   2.31        37.88        69.66        75.70       155.86
Penbased         1434.1   1.15   2.26   4.37       111.85       207.99       191.67       391.61
Shuttle          1889.5   1.03   2.02   3.87        86.01       162.62       182.19       356.17
Connect-4        1778.5   1.09   2.14   3.87       116.46       223.82       201.57       392.86
Kddcup         154183.0   0.91   1.77   3.30       136.82       251.71       335.78       653.60
Poker          108831.6   1.25   2.46   4.69       209.61       401.77       416.30       820.18

Tan et al. Algorithm

Dataset        Java (s)     1C     2C     4C   1x GTX 285   2x GTX 285   1x GTX 480   2x GTX 480
Iris                2.6   0.44   0.80   1.01         2.94         5.44         4.90         9.73
New-thyroid         6.0   0.77   1.43   1.78         7.13        12.03         9.15        21.74
Ecoli              22.5   0.60   1.16   2.09         9.33        16.26        14.18        25.92
Contraceptive      39.9   1.28   2.44   3.89        40.00        64.52        60.60       126.99
Thyroid           208.5   1.10   2.11   2.66        64.06       103.77       147.74       279.44
Penbased          917.1   1.15   2.23   4.25        86.58       148.24       177.78       343.24
Shuttle          3558.0   1.09   2.09   3.92        95.18       161.84       222.96       431.80
Connect-4        1920.6   1.35   2.62   4.91       123.56       213.59       249.81       478.83
Kddcup         185826.6   0.87   1.69   3.20        83.82       138.53       253.83       493.14
Poker          119070.4   1.27   2.46   4.66       158.76       268.69       374.66       701.41

Bojarczuk et al. Algorithm

Dataset        Java (s)     1C     2C     4C   1x GTX 285   2x GTX 285   1x GTX 480   2x GTX 480
Iris                0.5   0.50   0.96   1.75         2.03         4.21         2.39         5.84
New-thyroid         0.8   0.51   1.02   1.43         2.99         5.85         3.19         9.45
Ecoli               1.3   0.52   1.02   1.15         3.80         8.05         5.59        11.39
Contraceptive       5.4   1.20   2.40   2.47        14.58        31.81        26.41        53.86
Thyroid            27.0   0.56   1.11   2.19        25.93        49.53        56.23       120.50
Penbased           42.6   0.96   1.92   3.81        18.55        36.51        68.01       147.55
Shuttle           222.5   1.18   2.35   4.66        34.08        67.84       117.85       253.13
Connect-4         298.9   0.69   1.35   2.65        42.60        84.92       106.14       214.73
Kddcup           3825.8   0.91   1.78   3.42        35.53        71.09       155.62       352.25
Poker            6527.2   1.18   2.32   4.45        39.02        77.87       185.81       399.85

The results in the tables show that, in some cases, the external CPU evaluation is inefficient for certain datasets such as Iris, New-thyroid or Ecoli. This is because the time taken to transfer the data from the Java virtual machine memory to native memory exceeds the time of simply performing the evaluation inside the Java virtual machine. However, in all cases, regardless of dataset size, the native GPU evaluation is always considerably faster. Looking at the smallest datasets, such as Iris, New-thyroid and Ecoli, their speedup is acceptable; Ecoli in particular performs up to 25X faster. Speeding up these small datasets is of limited value given their short run times, but it is worthwhile for larger datasets. On the other hand, for complex datasets the speedup is greater, because the model can take full advantage of the GPU multiprocessors by offering them many instances to parallelize. Notice that the Kddcup and Poker datasets perform up to 653X and 820X faster, respectively. The scalability of the proposal is almost perfect, since doubling the number of threads or graphics devices almost halves the execution time. The figure summarizes the average speedup depending on the number of instances.

[Figure: average GPU speedup of the model depending on the number of instances]

The significant improvements obtained in all problem domains, both small and complex datasets, stem from the fact that our proposal is a hybrid model that parallelizes over both the individuals and the instances. A great speedup is achieved not only by classifying a large number of instances, but also by using a large enough population. The classification of small datasets does not require many individuals, but high-dimensional problems usually require a large population to provide genetic diversity. Therefore, the greatest speedup is achieved by maximizing both parameters. These results show that the proposed model achieves a high speedup with the algorithms employed. Specifically, the best speedup is 820X, obtained with the Falco et al. algorithm on the Poker dataset, reducing the execution time from 30 hours to only two minutes.
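As a quick sanity check on that final claim, converting the best speedup back into wall-clock time:

```c
/* The sequential Poker run under the Falco et al. algorithm takes
 * 108831.6 s (about 30.2 hours); at the best speedup of 820.18x, the
 * parallel run takes 108831.6 / 820.18, roughly 132.7 s, which is just
 * over two minutes. */
static double speedupToSeconds(double serialSeconds, double speedup)
{
    return serialSeconds / speedup;
}
```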

The results for the BioHEL system are shown in the following table, where the first column indicates the execution time of the serial version (in seconds), the second column shows the speedup of the CUDA version on an NVIDIA GTX 285 GPU, and the third column shows the speedup of the CUDA version on an NVIDIA GTX 480 GPU. These results show that for datasets with few instances, the CUDA version of BioHEL runs slower than the serial version. However, the speedup grows with the number of instances, reaching up to 34 times over the serial version.

BioHEL

Dataset        Serial (s)   CUDA GTX 285   CUDA GTX 480
Iris                  0.5           0.64           0.66
New-thyroid           0.9           0.93           1.32
Ecoli                 3.7           1.14           6.82
Contraceptive         3.3           3.48           3.94
Thyroid              26.4           2.76           8.70
Penbased            147.9           5.22          20.26
Shuttle             418.4          11.54          27.84
Connect-4           340.4          10.18          12.24
Kddcup              503.4          14.95          28.97
Poker              3290.9          11.93          34.02

CUDA Profiler Information

CUDA profiler 3.2 evaluation results for the Poker dataset (one million instances) and the Tan et al. algorithm fitness function, using two NVIDIA GTX 480 GPUs, 128 individuals and 100 iterations.

References

  • I. De Falco, A. Della Cioppa, and E. Tarantino. Discovering interesting classification rules with genetic programming. Applied Soft Computing, 1(4):257-269, 2001.
  • Celia C. Bojarczuk, Heitor S. Lopes, Alex A. Freitas, and Edson L. Michalkiewicz. A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets. Artificial Intelligence in Medicine, 30(1):27-48, 2004.
  • K. C. Tan, A. Tay, T. H. Lee, and C. M. Heng. Mining multiple comprehensible classification rules using genetic programming. In Proceedings of the 2002 Congress on Evolutionary Computation (CEC '02), volume 2, pages 1302-1307, Washington, DC, USA, 2002. IEEE Computer Society.
  • María A. Franco, Natalio Krasnogor, and Jaume Bacardit. Speeding up the evaluation of evolutionary learning systems using GPGPUs. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO '10), pages 1039-1046, New York, NY, USA, 2010. ACM.