An n-spheres based synthetic data generator for supervised classification

Javier Sánchez-Monedero, Pedro Antonio Gutiérrez, María Pérez-Ortiz and César Hervás-Martínez

Department of Computer Science and Numerical Analysis, University of Córdoba, Rabanales Campus, C2 building, 14004 - Córdoba, Spain.

e-mail: jsanchezm at uco.es, pagutierrez at uco.es, i82perom at uco.es, chervas at uco.es

This page provides the Matlab source code of the synthetic data generator presented in the conference article cited below.

Abstract of the paper

Synthetic datasets can be useful in a variety of situations, specifically when new machine learning models and training algorithms are developed or when trying to find the weaknesses of a specific method. In contrast to real-world data, synthetic datasets provide a controlled environment for analysing specific critical issues such as outlier tolerance, the influence of data dimensionality and class imbalance, among others. In this paper, a framework for synthetic data generation is developed with special attention to pattern order in the space, data dimensionality, class overlapping and data multimodality. Variables such as position, width and overlapping of data distributions in the n-dimensional space are controlled by considering them as n-spheres. The method is tested in the context of ordinal regression, a classification paradigm in which the categories are ordered. The contribution of the paper is full control over data topology and over a set of relevant statistical properties of the data.
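The idea summarised in the abstract can be illustrated with a short sketch. The Python code below is not the authors' Matlab generator and does not reproduce its algorithm. It only assumes that each class is an isotropic Gaussian with standard deviation sigma, centred on a point of the K-dimensional space, and that the centres lie along a diagonal so that the classes keep an order. All function names, the diagonal layout and the default spacing of 1 between consecutive centres are hypothetical choices made for illustration.

```python
import math
import random


def make_class_centres(n_classes, k, spacing=1.0):
    """Place one centre per class along the main diagonal of R^k.

    Hypothetical layout: consecutive centres are `spacing` apart
    (Euclidean distance), so class order maps to position in space.
    """
    step = spacing / math.sqrt(k)  # per-coordinate step giving distance `spacing`
    return [[c * step] * k for c in range(n_classes)]


def sample_class(centre, sigma, n_points, rng):
    """Draw n_points from an isotropic Gaussian N(centre, sigma^2 * I)."""
    return [[rng.gauss(mu, sigma) for mu in centre] for _ in range(n_points)]


def generate_dataset(n_classes=4, k=2, sigma=0.25, n_per_class=100, seed=0):
    """Return (X, y): points in R^k and ordinal labels 0..n_classes-1."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in enumerate(make_class_centres(n_classes, k)):
        X.extend(sample_class(centre, sigma, n_per_class, rng))
        y.extend([label] * n_per_class)
    return X, y


# Example: 4 ordered classes in 3 dimensions, 50 points each.
X, y = generate_dataset(n_classes=4, k=3, sigma=0.125, n_per_class=50)
print(len(X), len(X[0]), sorted(set(y)))
```

In this sketch a larger sigma or a smaller spacing increases the overlap between neighbouring classes. Several modes per class could be emulated by giving a class more than one centre.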

@INPROCEEDINGS{SanchezMonedero2013iwann,
author = {J. S\'anchez-Monedero and P.A. Guti\'errez and M. P\'erez-Ortiz and
C. Herv\'as-Mart\'inez},
title = {An {\itshape n}-Spheres Based Synthetic Data Generator for Supervised
Classification},
booktitle = {Advances in Computational Intelligence. 12th International Work-Conference
on Artificial Neural Networks, IWANN 2013},
year = {2013},
editor = {Ignacio Rojas and Gonzalo Joya and Joan Cabestany},
volume = {7902},
series = {Lecture Notes in Computer Science},
pages = {613--621},
publisher = {Springer},
isbn = {978-3-642-38678-7},
location = {Heidelberg}
}

Please send bug reports and feedback to jsanchezm at uco dot es. All the code is licensed under GPLv3, except for the bundled external tools (plot2svg and export_fig).

Matlab source code download: synthetic_datasets_nspheres_v0.1.tar.gz.

 

Example figures of generated datasets with different numbers of dimensions (K), standard deviations (sigma) and numbers of modes per class

K = 1, sigma = 0.125

Example image representing synthetic datasets

K = 1, sigma = 0.125

Example image representing synthetic datasets

K = 2, sigma = 0.250

Example image representing synthetic datasets

K = 3, sigma = 0.125 (without displaying the n-spheres)

Example image representing synthetic datasets

K = 3, sigma = 0.125 (including n-spheres)

Example image representing synthetic datasets

K = 3, sigma = 0.250 (without displaying the n-spheres)

Example image representing synthetic datasets

K = 3, sigma = 0.250 (including n-spheres)

Example image representing synthetic datasets
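To give a rough feeling for how the two sigma values used in the figures change class overlap, the following Python sketch evaluates a textbook quantity: the Bayes error of two equiprobable isotropic Gaussian classes with the same sigma, which equals Phi(-d / (2 * sigma)), where d is the distance between the centres and Phi is the standard normal CDF. The distance d = 1 is an assumption made only for this example; it is not taken from the generator, and the function name is hypothetical.

```python
import math


def bayes_error_two_gaussians(distance, sigma):
    """Bayes error of two equiprobable isotropic Gaussians with equal sigma.

    Equals Phi(-distance / (2 * sigma)), written with erfc:
    Phi(-x) = 0.5 * erfc(x / sqrt(2)).
    """
    return 0.5 * math.erfc(distance / (2.0 * sigma * math.sqrt(2.0)))


# Assumed centre distance of 1.0 (illustrative only).
for sigma in (0.125, 0.250):
    err = bayes_error_two_gaussians(1.0, sigma)
    print(f"sigma = {sigma:.3f}: Bayes error = {err:.6f}")
```

Because the Gaussians are isotropic, the result depends only on the distance between centres and on sigma, not on the number of dimensions K.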