datapro4j

The data processing library for Java

 

 

The programmer's guide

 

Revision: 1

 

 

 

 

 

 

Please cite this document as:

 

J.R. Romero, J.M. Luna, S. Ventura (2012). datapro4j: the data processing library for Java. Dept. of Computer Science and Numerical Analysis, University of Córdoba (Spain). Available for download from http://www.uco.es/grupos/kdis/datapro4j

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Knowledge Discovery and Intelligent Systems

University of Córdoba, Spain

http://www.uco.es/grupos/kdis

July 2012


 

 

 

CONTACT INFO

 

José Raúl Romero, PhD

Dept. Computer Science and Numerical Analysis

University of Córdoba, Spain

 

Email: jrromero@uco.es

Web: http://www.jrromero.net/en

 

 

 

PARTICIPANTS (BY ALPHABETICAL ORDER)

 

     de la Torre López, José. BSc. [JTL]

     Luna, José María, MSc. [JML]

     Orozco Borrego, Mario. BSc. [MOB]

     Ramírez Quesada, Aurora. MSc. [ARQ]

 

 

 

PROJECT HISTORY

 

Version | Date | Description | Participants
0.1 | July 2011 | Initial version. Intruder algorithms. | ARQ, JTL, JML, JRR
0.2 | September 2011 | Strategies and columns | MOB, JML, JRR
0.3 | April 2012 | Refactoring, performance improvements and testing | ARQ, JML, JRR
0.4 | Under development | Weka wrappers for preprocessing, association, clustering and classification | JRR
0.5 | Under development | New dataset sources from relational databases and NoSQL databases | JRR

 

 

 

 

DOCUMENT HISTORY

 

Revision | Date | Description | Author
1 | July 17, 2012 | Initial version of this document | JRR

 

 

 

 

 

 

 

 

 

 

 

 


 

TABLE OF CONTENTS

 

 

TABLE OF FIGURES
Introduction
Purpose
Scope
License
Overview
To-do list
Package es::uco::kdis::datapro
Package es::uco::kdis::datapro::algorithm
Package es::uco::kdis::datapro::algorithm::base
Class DatasetStrategy
Package es::uco::kdis::datapro::algorithm::intruder
Class AverageAttack
Class BandwagonAttack
Class DatasetStatistics
Class IntruderAttack
Class LoveHateAttack
Class RandomAttack
Class ReverseBandwagonAttack
Class SegmentAttack
Package es::uco::kdis::datapro::algorithm::preprocessing
Package es::uco::kdis::datapro::algorithm::preprocessing::discretization
Class EqualFrequencyDiscretization
Class EqualWidthDiscretization
Class MDLPDiscretize
Package es::uco::kdis::datapro::algorithm::preprocessing::instance
Class RemoveDuplicates
Class RemovePercentage
Package es::uco::kdis::datapro::algorithm::validation
Class KFolds
Package es::uco::kdis::datapro::dataset
Class Dataset
Class FileDataset
Class InstanceIterator
Interface IIterator
Package es::uco::kdis::datapro::dataset::Column
Class ColumnAbstraction
Class ColumnImpl
Enumeration ColumnType
Class BinaryColumn
Class BinaryColumnImpl
Class CategoricalColumnImpl
Class DateColumn
Class DateColumnImpl
Class IntegerColumn
Class IntegerColumnImpl
Class NominalColumn
Class NominalColumnImpl
Class NumericalColumn
Class NumericalColumnImpl
Class RangeColumn
Class RangeColumnImpl
Package es::uco::kdis::datapro::dataset::Source
Class ArffDataset
Class CsvDataset
Class ExcelDataset
Class KeelDataset
Package es::uco::kdis::datapro::datatypes
Class InvalidValue
Class EmptyValue
Class MissingValue
Class NullValue
Class Range
Class DoubleRange
Package es::uco::kdis::datapro::exception
Class IllegalFormatSpecificationException
Class NoSuchCategoryException
Class NotAddedValueException
Appendix A: UML diagrams
Package es.uco.kdis.datapro.algorithm.base
Package es.uco.kdis.datapro.algorithm.preprocessing
Package es.uco.kdis.datapro.dataset columns
Package es.uco.kdis.datapro.dataset.Source
Appendix B: Extending the library
Project structure
Code documentation
Coding recommendations

 

 

 

 

 

 

 

 

 

 

 

 

 


 

 

TABLE OF FIGURES

 

 

Package es.uco.kdis.datapro.algorithm
Package es.uco.kdis.datapro.algorithm.base
Class DatasetStrategy
Package es.uco.kdis.datapro.algorithm.intruder
Package es.uco.kdis.datapro.algorithm.preprocessing
Package es.uco.kdis.datapro.algorithm.preprocessing.discretization
Class EqualFrequencyDiscretization
Class EqualWidthDiscretization
Class MDLPDiscretize
Package es.uco.kdis.datapro.algorithm.preprocessing.instance
Class RemoveDuplicates
Class RemovePercentage
Package es.uco.kdis.datapro.algorithm.validation
Package es.uco.kdis.datapro.dataset
Class Dataset
Class FileDataset
Class InstanceIterator
Interface IIterator
Package es.uco.kdis.datapro.dataset.Column
Abstract class ColumnAbstraction
Abstract class ColumnImpl
Enumeration ColumnType
Class BinaryColumn
Class BinaryColumnImpl
Class CategoricalColumn
Class CategoricalColumnImpl
Class DateColumn
Class DateColumnImpl
Class IntegerColumn
Class IntegerColumnImpl
Class NominalColumn
Class NominalColumnImpl
Class NumericalColumn
Class NumericalColumnImpl
Class RangeColumn
Class RangeColumnImpl
Package es.uco.kdis.datapro.dataset.Source
Class ArffDataset
Class CsvDataset
Class ExcelDataset
Class KeelDataset
Package es.uco.kdis.datapro.datatypes
Class InvalidValue
Class EmptyValue
Class MissingValue
Class NullValue
Class Range
Class DoubleRange
Package es.uco.kdis.datapro.exception
Class IllegalFormatSpecificationException
Class NoSuchCategoryException
Class NotAddedValueException
Class diagram: package overview
Class diagram: package es.uco.kdis.datapro.algorithm.base
Class diagram: Package es.uco.kdis.datapro.algorithm.preprocessing
Class diagram: Package es.uco.kdis.datapro.dataset.Column
Package es.uco.kdis.datapro.dataset.Source
Class diagram: Package es.uco.kdis.datapro.datatypes
Class diagram: Package es.uco.kdis.datapro.exception

 

 

 


 

Introduction

 

 

Purpose

 

This document provides the class, interface, and enumeration specifications for the datapro4j library. These specifications detail the types modeled within the system.

The datapro4j library is designed to fully support the efficient handling of datasets that come from different sources and declare different kinds of data types. This task is often time-consuming for Java programmers, especially in domains such as data analysis or data mining. Note that the library does not target a specific application domain; it serves any application that needs to handle structured data from diverse data sources in Java.

Therefore, datapro4j can be used in data mining to handle inputs or preprocess data, using either internal strategies (e.g. algorithms on instances, discretization, etc.) or external tools (e.g. Weka or any other application). It can also be used to handle outputs: for example, migrating data to other formats, rearranging results from external tools or algorithms, executing statistical tests, etc.

It is worth mentioning that datapro4j is designed to be extended with new algorithms, data formats, column types, etc. All these aspects are independent of each other, so algorithms can be executed regardless of the format in which the data is provided (stored in a NoSQL database, as an ARFF file, or any other).

 

Scope

 

This document is intended to define the class specification for the datapro4j library.

 

License

 

Copyright © 2012 The authors (University of Córdoba, Spain)

 

This software was developed by members of the Knowledge Discovery and Intelligent Systems research group at the University of Córdoba, Spain. For further information on the library and modifications, please visit the URL http://www.uco.es/grupos/kdis/datapro4j

 

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.

 

 

Redistribution and use of binary forms, with or without modification, are permitted if the following conditions are met:

·         Redistributions of source code must retain the above copyright notice, this list of conditions and the disclaimer above.

·         Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

·         All advertising materials or publications mentioning features or use of this software must display the following acknowledgement: “This product includes software developed by the KDIS Research Group at the University of Córdoba (Spain) and its contributors.” or cite the following reference:

 

J.R. Romero, J.M. Luna, S. Ventura (2012). datapro4j: the data processing library for Java. Dept. of Computer Science and Numerical Analysis, University of Córdoba (Spain). Available for download from http://www.uco.es/grupos/kdis/datapro4j

 

·         Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

·         Commercial use of this software or part of it is not allowed without specific prior written permission.

·         Licensing and conditions are subject to change without notice.

Note: At the moment, this software is provided in binary form as a Java library. Source code is not provided (we plan to release the Java source code in the near future).

 

Overview

This document lists all packages with a summary of each. Each package has a section that lists its classes, interfaces and enumeration types, with a summary of each. Each class and interface entry contains a description, summary tables, detailed member descriptions, and a relation table.

Private properties are omitted. Protected properties are shown when useful for external programmers.

 

To-do list

 

In the near future, this library will be updated with the following features (not necessarily in this order):

·         Listeners in strategies.

·         Graphical UI. (Some minor support is already provided).

·         Generation of synthetic datasets under precise constraints.

·         Multipart datasets: datasets that cannot be fully stored in memory, so they need to be split and partially retrieved.

·         Different data mining support.

·         Wrappers for different datasets and tools.

o    A wrapper for Weka is under development.

·         Access to different databases.

o    Access through JDBC to RDBMS engines (e.g. MySQL, Oracle) is under development.

o    Access to NoSQL engines (e.g. Cassandra) is under development.

·         More dataset formats:

o    Currently, the following formats are supported: ARFF, KEEL, CSV, Excel

o    The following formats are under development: XRFF


 

Package es::uco::kdis::datapro

 

 

The library base package. The software is mainly divided into three different components:

·         Dataset and columns. The logical abstract representation of a dataset and its attributes.

·         Dataset and sources. The physical representation of a dataset, serialized in files, stored in databases or any other device.

·         Dataset and strategies. Any algorithm running on a single dataset, set of datasets or column.

 

Name: datapro
Qualified Name: es::uco::kdis::datapro


 

Package es::uco::kdis::datapro::algorithm

 

 

Only public strategies are described here. Developers can easily provide their own strategies.

 

Figure 1. Package es.uco.kdis.datapro.algorithm

 

 

Name: algorithm
Qualified Name: es::uco::kdis::datapro::algorithm


 

Package es::uco::kdis::datapro::algorithm::base

 

 

Figure 2. Package es.uco.kdis.datapro.algorithm.base

 

 

Name: base
Qualified Name: es::uco::kdis::datapro::algorithm::base

 

 

Class DatasetStrategy

This class represents a generic strategy.

Strategy is a well-known design pattern in which algorithms are encapsulated into classes. Strategies can be executed either sequentially or step by step. In general, every strategy is executed according to the following sequence:

·         Creation: the strategy constructor should collect all the parameters that the algorithm requires to be initialized and executed for the first time. Build as many constructors as required.

·         Initialization: the method initialize() implements any preprocessing step that the algorithm requires before it can be executed. This preprocessing is not part of the algorithm itself, but it must be executed before the algorithm is invoked for the first time.

·         Execution: the method execute() runs the algorithm once, using the parameters passed to the constructor and prepared during initialization. If the algorithm has finished and cannot be invoked any more, the method setExecutable(false) should be called. Otherwise, execution is allowed until the stop criteria are fulfilled.

·         Stop criteria: the method isExecutable() returns true if the algorithm can be executed once more over the dataset, and false otherwise.

·         Post-execution: any post-processing step has to be implemented by the method postexec().

·         Result collection: final results are collected from the dataset, if changed, and returned by the method getResult().
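
The sequence above can be sketched in Java as follows. This is only an illustration under stated assumptions: SketchStrategy is a hypothetical stand-in that mirrors the documented operations of DatasetStrategy, a List<Integer> replaces the library's Dataset type so that the example compiles on its own, and DropNegatives and StrategyRunner are invented names that are not part of datapro4j.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DatasetStrategy; a List<Integer> plays the role of Dataset.
abstract class SketchStrategy {
    protected boolean bExecutable = true;   // execution flag, true by default
    protected List<Integer> oDataset;       // data the strategy works on

    protected SketchStrategy(List<Integer> data) { this.oDataset = data; }

    public abstract void initialize();
    public abstract void execute();
    public abstract void postexec();
    public abstract Object getResult();

    public boolean isExecutable() { return bExecutable; }
    protected void setExecutable(boolean bExecutable) { this.bExecutable = bExecutable; }
}

// Illustrative strategy: each execute() call removes one negative value.
class DropNegatives extends SketchStrategy {
    DropNegatives(List<Integer> data) { super(new ArrayList<>(data)); }

    @Override public void initialize() { updateFlag(); }

    @Override public void execute() {
        for (int i = 0; i < oDataset.size(); i++) {
            if (oDataset.get(i) < 0) {
                oDataset.remove(i);         // remove(int index): drops one element
                break;
            }
        }
        updateFlag();
    }

    @Override public void postexec() { /* no post-processing needed here */ }

    @Override public Object getResult() { return oDataset; }

    // Stop criterion: executable only while negative values remain.
    private void updateFlag() {
        setExecutable(oDataset.stream().anyMatch(v -> v < 0));
    }
}

class StrategyRunner {
    // Drives a strategy through the documented sequence.
    static Object run(SketchStrategy strategy) {
        strategy.initialize();
        while (strategy.isExecutable()) {
            strategy.execute();
        }
        strategy.postexec();
        return strategy.getResult();
    }

    public static void main(String[] args) {
        System.out.println(run(new DropNegatives(Arrays.asList(3, -1, 4, -5)))); // prints [3, 4]
    }
}
```

A step-by-step client would call execute() once per step and check isExecutable() between steps, instead of looping as run() does.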

 

Figure 3. Class DatasetStrategy

 

 

Name: DatasetStrategy
Qualified Name: es::uco::kdis::datapro::algorithm::base::DatasetStrategy
Visibility: public
Abstract: true
Base Classifier: -
Realized Interface: -

 

Attribute Detail

bExecutable

Execution flag. It is protected only for inheritance purposes and should never be modified directly.

Type: boolean
Default Value: true
Visibility: protected
Multiplicity: -

oDataset

Dataset used by the strategy.

Type: Dataset
Default Value: -
Visibility: protected
Multiplicity: -

 

 

Operation Detail

execute

This method is invoked to execute the strategy.

Type: void
Visibility: public
Is Abstract: true
Parameter: -

getDataset

Getter method for the dataset attribute.

Type: Dataset
Visibility: protected
Is Abstract: false
Parameter: -

getResult

This method returns an object holding the result of the process.

Type: Object
Visibility: public
Is Abstract: true
Parameter: -

initialize

This method runs the initialization process of the strategy.

Type: void
Visibility: public
Is Abstract: true
Parameter: -

isExecutable

This method returns true if the strategy is in an executable state.

Type: boolean
Visibility: public
Is Abstract: false
Parameter: -

postexec

This method should be invoked, if required, after the strategy execution.

Type: void
Visibility: public
Is Abstract: true
Parameter: -

setDataset

This method sets the dataset to be used by the strategy.

Type: void
Visibility: protected
Is Abstract: false
Parameter:
     inout data : Dataset

setExecutable

This method sets the current executable state of the strategy.

Type: void
Visibility: protected
Is Abstract: false
Parameter:
     in bExecutable : boolean

 

Relation Detail

Generalization

Related Elements (all relations are unnamed):
     EqualFrequencyDiscretization
     EqualWidthDiscretization
     MDLPDiscretize
     RemoveDuplicates
     IntruderAttack
     KFolds
     RemovePercentage
     DatasetStatistics


 

Package es::uco::kdis::datapro::algorithm::intruder

 

 

Figure 4. Package es.uco.kdis.datapro.algorithm.intruder

 

 

Name: intruder
Qualified Name: es::uco::kdis::datapro::algorithm::intruder

 

Class AverageAttack

This class implements the Average Attack. This attack strategy assigns the maximum value (push attack) or the minimum value (nuke attack) to the target item. The filler items are selected randomly, and their values are drawn at random from a normal distribution parameterized by the mean and standard deviation of each item.

For a further description see the following paper:

B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness. ACM Trans. Internet Technol. vol. 7, no. 4, pp. 23, 2007.
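
As a rough illustration of the description above, the following Java sketch builds a single attack profile. It is not the library's implementation: AverageAttackSketch and buildProfile are hypothetical names, plain arrays stand in for the dataset, NaN marks unrated items, and clamping the sampled values to the rating range is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of one Average Attack profile (not the datapro4j code).
class AverageAttackSketch {
    /**
     * @param mean    per-item mean rating
     * @param sd      per-item standard deviation
     * @param fillers indexes of the filler items
     * @param target  index of the target item
     * @param push    true for a push attack, false for a nuke attack
     * @param min     minimum rating value
     * @param max     maximum rating value
     * @return one profile; NaN marks items that are left unrated
     */
    static double[] buildProfile(double[] mean, double[] sd, int[] fillers, int target,
                                 boolean push, double min, double max, Random rnd) {
        double[] profile = new double[mean.length];
        Arrays.fill(profile, Double.NaN);
        for (int item : fillers) {
            if (item == target) {
                continue; // the target item is rated separately below
            }
            // Sample from N(mean, sd) of the item itself.
            double value = mean[item] + sd[item] * rnd.nextGaussian();
            // Assumption: keep the sampled value inside the valid rating range.
            profile[item] = Math.max(min, Math.min(max, value));
        }
        profile[target] = push ? max : min;
        return profile;
    }
}
```

Calling buildProfile once per attack instance, with a shared seeded Random, would yield a reproducible set of injected profiles.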

 

Name: AverageAttack
Qualified Name: es::uco::kdis::datapro::algorithm::intruder::AverageAttack
Visibility: public
Abstract: false
Base Classifier: IntruderAttack
Realized Interface: -

 

 

Operation Detail

AverageAttack

Parameterized constructor.

     oDataset The original dataset

     iNumAttacks The number of attack instances

     bPush The attack type (true, push; false, nuke)

     iTarget The target item (the column attribute/item index)

     iNumFillers The size of the filler item set: -1 for a random size, >0 for a fixed size

     dXRand The probability of choosing an item as a selected/filler item

     iSeed The random seed

Type: -
Visibility: public
Is Abstract: false
Parameter:
     in bPush : boolean
     in dXRand : double
     in iNumAttacks : int
     in iNumFillers : int
     in iSeed : int
     in iTarget : int
     inout oDataset : Dataset
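
One possible reading of how iNumFillers and dXRand interact is sketched below. This document does not specify the exact selection procedure, so the logic is an assumption: a positive iNumFillers draws that many distinct items at random, while -1 includes each item independently with probability dXRand. FillerSelectionSketch and chooseFillers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of filler item selection (assumed semantics, not the datapro4j code).
class FillerSelectionSketch {
    static List<Integer> chooseFillers(int numItems, int target, int numFillers,
                                       double xRand, long seed) {
        Random rnd = new Random(seed);

        // Every item except the target is a candidate filler.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < numItems; i++) {
            if (i != target) {
                candidates.add(i);
            }
        }

        if (numFillers > 0) {
            // Fixed size: take the first numFillers items of a seeded shuffle.
            Collections.shuffle(candidates, rnd);
            int size = Math.min(numFillers, candidates.size());
            return new ArrayList<>(candidates.subList(0, size));
        }

        // Random size (-1): keep each candidate with probability xRand.
        List<Integer> chosen = new ArrayList<>();
        for (int item : candidates) {
            if (rnd.nextDouble() < xRand) {
                chosen.add(item);
            }
        }
        return chosen;
    }
}
```

Because the generator is created from the seed, two calls with the same arguments return the same filler set, which matches the purpose of a seed parameter such as iSeed.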

 

chooseSelectedItems

The Average Attack does not use the selected item set.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -

initialize

Initialization method.

Type: void
Visibility: public
Is Abstract: false
Parameter: -

setFillerValues

In the Average Attack, the values of the filler items are randomly drawn from a normal distribution, using the mean and standard deviation of each item.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -

setSelectedValues

The Average Attack does not use the selected item set.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -

 

 

 

Relation Detail

Generalization

Related Elements (all relations are unnamed):
     IntruderAttack

 

 

Class BandwagonAttack

This class implements the Bandwagon Attack. This attack strategy assigns the maximum value (push attack) to the target item. Then a set of items, called selected items, is chosen among the most visible items.

Visible items are those with a high mean and a high evaluation density. For a further description, see the following paper:

B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness. ACM Trans. Internet Technol. vol. 7, no. 4, pp. 23, 2007.

 

Name: BandwagonAttack
Qualified Name: es::uco::kdis::datapro::algorithm::intruder::BandwagonAttack
Visibility: public
Abstract: false
Base Classifier: IntruderAttack
Realized Interface: -

 

Attribute Detail

dDensity

The density threshold, i.e. the minimum number of values in the column.

Type: double
Default Value: -
Visibility: protected
Multiplicity: -

dVisibility

The visibility threshold, i.e. the possibility of choosing an item to act as a selected item.

Type: double
Default Value: -
Visibility: protected
Multiplicity: -

rgdMeanSD

It stores the mean and standard deviation of the overall dataset.

Type: Double
Default Value: new ArrayList<Double>()
Visibility: protected
Multiplicity: 0..*

rgoVisibilityColumns

The array of columns whose visibility exceeds the thresholds dVisibility and dDensity.

Type: Integer
Default Value: new ArrayList<Integer>()
Visibility: package
Multiplicity: 0..*

rgoVisibilityMeans

The array of means of the columns whose visibility exceeds the thresholds dVisibility and dDensity.

Type: Double
Default Value: new ArrayList<Double>()
Visibility: package
Multiplicity: 0..*

 

 

Operation Detail

BandwagonAttack

Parameterized constructor:

     oDataset The original dataset

     iNumAttacks The number of attack instances

     iTarget The target item (the column attribute/item index)

     iNumFillers The size of the filler item set: -1 for a random size, >0 for a fixed size

     iNumSelected The size of the selected item set

     dVisibility The visibility threshold (absolute value of the column mean)

     dDensity The density threshold (absolute number of instances in the column, not counting null, empty or missing values)

     dXRand The probability of choosing an item as a filler item

     iSeed The random seed

Type: -
Visibility: public
Is Abstract: false
Parameter:
     in dDensity : double
     in dVisibility : double
     in dXRand : double
     in iNumAttacks : int
     in iNumFillers : int
     in iNumSelected : int
     in iSeed : int
     in iTarget : int
     inout oDataset : Dataset

 

chooseSelectedItems

Creates the set of selected items. Its size is fixed by the iNumSelected property.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -

initialize

Initialization method for the strategy.

Type: void
Visibility: public
Is Abstract: false
Parameter: -

orderArray

Orders the columns using their mean as the comparison metric. This method implements the QuickSort algorithm.

     iInit The initial position in the array

     iEnd The end position in the array

Type: void
Visibility: protected
Is Abstract: false
Parameter:
     in iEnd : int
     in iInit : int
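
A minimal sketch of a QuickSort over two parallel lists, one holding column indexes and one holding their means, is shown below. It only illustrates the idea behind orderArray: the class name OrderByMeanSketch, the ascending order and the middle-element pivot are assumptions, since this document does not state how the library orders or partitions the columns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: QuickSort over parallel lists, keyed by the column mean.
class OrderByMeanSketch {
    final List<Integer> columns; // column indexes
    final List<Double> means;    // mean of the column at the same position

    OrderByMeanSketch(List<Integer> columns, List<Double> means) {
        this.columns = new ArrayList<>(columns);
        this.means = new ArrayList<>(means);
    }

    /** Sorts positions iInit..iEnd (inclusive) by ascending mean. */
    void orderArray(int iInit, int iEnd) {
        if (iInit >= iEnd) {
            return;
        }
        double pivot = means.get(iInit + (iEnd - iInit) / 2);
        int i = iInit;
        int j = iEnd;
        while (i <= j) {
            while (means.get(i) < pivot) {
                i++;
            }
            while (means.get(j) > pivot) {
                j--;
            }
            if (i <= j) {
                // Swap both lists so each column stays paired with its mean.
                Collections.swap(means, i, j);
                Collections.swap(columns, i, j);
                i++;
                j--;
            }
        }
        orderArray(iInit, j);
        orderArray(i, iEnd);
    }
}
```

Sorting the whole structure means calling orderArray(0, columns.size() - 1); the most visible columns then sit at the end of the lists.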

 

setFillerValues

In the Bandwagon Attack, the values of the filler items are randomly drawn from a normal distribution, using the mean and standard deviation of the overall dataset.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -

setSelectedValues

Sets the values of the selected items. In the Bandwagon Attack, each selected item has the maximum value.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -

setVisibilityColumns

Selects the columns that exceed the visibility and density thresholds.

Type: void
Visibility: protected
Is Abstract: false
Parameter: -
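
The threshold test described for setVisibilityColumns can be sketched as follows. This is an illustration only: VisibilitySketch and visibleColumns are hypothetical names, a double[][] matrix replaces the dataset, NaN stands for null, empty or missing values, and "exceed" is read as a strict comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the visibility/density filter (not the datapro4j code).
class VisibilitySketch {
    /**
     * Returns the indexes of the columns whose mean exceeds the visibility
     * threshold and whose number of present values exceeds the density threshold.
     * Rows are instances, columns are items; NaN marks a missing value.
     */
    static List<Integer> visibleColumns(double[][] ratings, double visibility, double density) {
        List<Integer> visible = new ArrayList<>();
        if (ratings.length == 0) {
            return visible;
        }
        int numColumns = ratings[0].length;
        for (int c = 0; c < numColumns; c++) {
            double sum = 0.0;
            int count = 0;
            for (double[] row : ratings) {
                if (!Double.isNaN(row[c])) {
                    sum += row[c];
                    count++;
                }
            }
            // A column with no values at all can never be visible.
            if (count > 0 && sum / count > visibility && count > density) {
                visible.add(c);
            }
        }
        return visible;
    }
}
```

The resulting indexes correspond to what the class keeps in rgoVisibilityColumns, with the matching means in rgoVisibilityMeans.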

 

 

 

Relation Detail

Generalization

Related Elements (all relations are unnamed):
     ReverseBandwagonAttack
     IntruderAttack

 

 

Class DatasetStatistics

 

Name: DatasetStatistics
Qualified Name: es::uco::kdis::datapro::algorithm::intruder::DatasetStatistics
Visibility: public
Abstract: false
Base Classifier: DatasetStrategy
Realized Interface: -

 

Attribute Detail

 

All attributes are private.

 

Operation Detail

DatasetStatistics

Constructor. A parameter is required:

     data Dataset over which the statistical strategy will be executed.

 

Type

 

Visibility

public