Graph-Based Approaches for Over-sampling in the context of Ordinal Regression

María Pérez-Ortiz¹, Pedro Antonio Gutiérrez¹, César Hervás-Martínez¹ and Xin Yao²

1. Department of Computer Science and Numerical Analysis, University of Córdoba, Rabanales Campus, C2 building, 14004 - Córdoba, Spain.

2. Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.

This page provides supplementary material for the paper entitled "Graph-Based Approaches for Over-sampling in the context of Ordinal Regression", accepted for publication in IEEE Transactions on Knowledge and Data Engineering.

http://dx.doi.org/10.1109/TKDE.2014.2365780

Abstract of the paper

The classification of patterns into naturally ordered labels is referred to as ordinal regression or ordinal classification. Usually, this classification setting is by nature highly imbalanced, because there are classes in the problem that are a priori more probable than others. Although standard over-sampling methods can improve the classification of minority classes in ordinal classification, they tend to introduce severe errors in terms of the ordinal label scale, given that they do not take the ordering into account. A specific ordinal over-sampling method is developed in this paper for the first time in order to improve the performance of machine learning classifiers. The method proposed includes ordinal information by approaching over-sampling from a graph-based perspective. The results presented in this paper show the good synergy of a popular ordinal regression method (a reformulation of support vector machines) with the graph-based proposed algorithms, and the possibility of improving both the classification and the ordering of minority classes. A cost-sensitive version of the ordinal regression method is also introduced and compared with the over-sampling proposals, showing in general lower performance for minority classes.

The additional information provided for the experimental section is the following:

Datasets: The performance of the methods is analysed using a set of 30 publicly available datasets. For the experimental setup, each dataset was partitioned using a stratified 30-holdout procedure (30 stratified train/test splits). The complete set of data partitions, in Weka, LibSVM and MATLAB formats, is available here: 30HoldoutOrdinalImbalancedDatasets.zip
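
For readers who wish to build comparable partitions for their own data, the following is a minimal Python sketch of a repeated stratified holdout. It is not the code used to generate the partitions distributed above: the function names, the 75%/25% train/test proportion and the use of the partition index as random seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, train_fraction=0.75, seed=0):
    """Split pattern indices into train/test, preserving class proportions.

    train_fraction and seed are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        if len(members) > 1:
            # Keep at least one pattern of the class on each side.
            n_train = min(max(n_train, 1), len(members) - 1)
        else:
            n_train = len(members)
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)


def repeated_holdout(labels, n_partitions=30, train_fraction=0.75):
    """Generate n_partitions independent stratified holdout splits."""
    return [
        stratified_holdout(labels, train_fraction, seed=s)
        for s in range(n_partitions)
    ]


# Hypothetical imbalanced ordinal labels: 40, 8 and 4 patterns per class.
labels = [1] * 40 + [2] * 8 + [3] * 4
partitions = repeated_holdout(labels)
```

Each element of `partitions` is a pair of index lists, so the minority class (here, class 3) is represented in both the training and the test set of every split.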
All the source code can be downloaded here: matlabCodeOrdinalOversampling.zip. The zip file includes a README explaining the different files. The package mainly consists of the following files:

toy_example.m: Code for configuring and running the methods on a single partition of a nonlinear ordinal toy dataset (the dataset files are toy-train.mat and toy-test.mat).

OGONI.m: Over-sample patterns using the OGO-NI method (ordinal graph-based over-sampling via neighbourhood information using a probability function for the intra-class edges).
OGOSP.m: Over-sample patterns using the OGO-SP method (ordinal graph-based over-sampling via shortest paths using a probability function for the intra-class edges).
OGOISP.m: Over-sample patterns using the OGO-ISP method (ordinal graph-based over-sampling via interior shortest paths).
computeNumberOfPatternstoOversample.m: Compute the imbalance ratio for all of the classes in the dataset and the number of patterns to be synthetically created for each class.
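
To illustrate the kind of bookkeeping the last file in the list above describes, the following Python sketch computes a per-class imbalance ratio and the number of synthetic patterns to create. It is an assumption-laden illustration, not a port of computeNumberOfPatternstoOversample.m: here the imbalance ratio of a class is taken to be the size of the largest class divided by the size of that class, and every class is over-sampled up to a hypothetical `target_ratio` of the largest class. The definition used in the paper and in the MATLAB package may differ.

```python
from collections import Counter


def oversampling_plan(labels, target_ratio=1.0):
    """Return {class: (imbalance_ratio, n_synthetic)} for each class.

    Assumed definitions (not taken from the MATLAB package):
      imbalance_ratio = size of largest class / size of this class
      n_synthetic     = patterns needed to reach target_ratio * largest class
    """
    counts = Counter(labels)
    n_max = max(counts.values())
    n_target = round(target_ratio * n_max)

    plan = {}
    for label in sorted(counts):
        n = counts[label]
        imbalance_ratio = n_max / n
        # Classes already at or above the target receive no synthetic patterns.
        n_synthetic = max(0, n_target - n)
        plan[label] = (imbalance_ratio, n_synthetic)
    return plan


# Hypothetical class distribution: 40, 8 and 4 patterns.
labels = [1] * 40 + [2] * 8 + [3] * 4
plan = oversampling_plan(labels)
```

With these assumed definitions, the majority class receives no synthetic patterns, and the smaller a class is, the more patterns are requested for it. Lowering `target_ratio` below 1 gives a partial rebalancing instead of a full one.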