datapro4j
The data processing library for Java
The programmer’s guide
Revision: 1
Please cite this document as:
J.R. Romero, J.M. Luna, S. Ventura (2012). datapro4j: the data processing library for Java. Dept. of Computer Science and Numerical Analysis, University of Córdoba (Spain). Available for download from http://www.uco.es/grupos/kdis/datapro4j
Knowledge Discovery and Intelligent Systems
University of Córdoba, Spain
http://www.uco.es/grupos/kdis
July 2012
CONTACT INFO
José Raúl Romero, PhD
Dept. Computer Science and Numerical Analysis
University of Córdoba, Spain
Email: jrromero@uco.es
Web: http://www.jrromero.net/en
PARTICIPANTS (IN ALPHABETICAL ORDER)
• de la Torre López, José. BSc. [JTL]
• Luna, José María, MSc. [JML]
• Orozco Borrego, Mario. BSc. [MOB]
• Ramírez Quesada, Aurora. MSc. [ARQ]
PROJECT HISTORY
Version | Date | Description | Participants
0.1 | July 2011 | Initial version. Intruder algorithms. | ARQ, JTL, JML, JRR
0.2 | September 2011 | Strategies and columns | MOB, JML, JRR
0.3 | April 2012 | Refactoring, performance improvements and testing | ARQ, JML, JRR
0.4 | Under development | Weka wrappers for preprocessing, association, clustering and classification | JRR
0.5 | Under development | New dataset sources from relational databases and NoSQL databases | JRR
DOCUMENT HISTORY
Revision | Date | Description | Author
1 | July 17, 2012 | Initial version of this document | JRR
Package es::uco::kdis::datapro
Package es::uco::kdis::datapro::algorithm
Package es::uco::kdis::datapro::algorithm::base
Package es::uco::kdis::datapro::algorithm::intruder
Package es::uco::kdis::datapro::algorithm::preprocessing
Package es::uco::kdis::datapro::algorithm::preprocessing::discretization
Class EqualFrequencyDiscretization
Class EqualWidthDiscretization
Package es::uco::kdis::datapro::algorithm::preprocessing::instance
Package es::uco::kdis::datapro::algorithm::validation
Package es::uco::kdis::datapro::dataset
Package es::uco::kdis::datapro::dataset::Column
Package es::uco::kdis::datapro::dataset::Source
Package es::uco::kdis::datapro::datatypes
Package es::uco::kdis::datapro::exception
Class IllegalFormatSpecificationException
Package es.uco.kdis.datapro.algorithm.base
Package es.uco.kdis.datapro.algorithm.preprocessing
Package es.uco.kdis.datapro.dataset columns
Package es.uco.kdis.datapro.dataset.Source
Appendix B: Extending the library
Package es.uco.kdis.datapro.algorithm
Package es.uco.kdis.datapro.algorithm.base
Package es.uco.kdis.datapro.algorithm.intruder
Package es.uco.kdis.datapro.algorithm.preprocessing
Package es.uco.kdis.datapro.algorithm.preprocessing.discretization
Class EqualFrequencyDiscretization
Class EqualWidthDiscretization
Package es.uco.kdis.datapro.algorithm.preprocessing.instance
Package es.uco.kdis.datapro.algorithm.validation
Package es.uco.kdis.datapro.dataset
Package es.uco.kdis.datapro.dataset.Column
Abstract class ColumnAbstraction
Package es.uco.kdis.datapro.dataset.Source
Package es.uco.kdis.datapro.datatypes
Package es.uco.kdis.datapro.exception
Class IllegalFormatSpecificationException
Class diagram: package overview
Class diagram: package es.uco.kdis.datapro.algorithm.base
Class diagram: Package es.uco.kdis.datapro.algorithm.preprocessing
Class diagram: Package es.uco.kdis.datapro.dataset.Column
Package es.uco.kdis.datapro.dataset.Source
Class diagram: Package es.uco.kdis.datapro.datatypes
Class diagram: Package es.uco.kdis.datapro.exception
This document provides the class, interface, and enumeration specifications for the datapro4j library. The specification details the types modeled within the system.
The datapro4j library is designed to fully support the efficient handling of data sets that come from different sources and declare different kinds of data types. This task is often time-consuming for Java programmers, especially in domains such as data analysis or data mining. Note that the library does not target one specific application domain; it serves any application that needs to handle structured data from diverse data sources in Java.
Therefore, datapro4j can be used in data mining to handle inputs or preprocess data, using either internal strategies (e.g. algorithms on instances, discretization, etc.) or external tools (e.g. Weka or any other application). It can also be used to handle outputs: for example, migrating data to other formats, rearranging results from external tools or algorithms, executing statistical tests, etc.
It is worth mentioning that datapro4j is designed to be extended with new algorithms, data formats, column types, etc. These aspects are independent of each other, so an algorithm can be executed regardless of the format in which its data is stored (a NoSQL database, an ARFF file, or any other).
This document is intended to define the class specification for the datapro4j library.
Copyright © 2012 The authors (University of Córdoba, Spain)
This software was developed by members of the Knowledge Discovery and Intelligent Systems research group at the University of Córdoba, Spain. For further information on the library and modifications, please visit the URL http://www.uco.es/grupos/kdis/datapro4j
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
Redistribution and use of binary forms, with or without modification, are permitted if the following conditions are met:
• Redistributions of source code must retain the above copyright notice, this list of conditions and the disclaimer above.
• Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
• All advertising materials or publication mentioning features or use of this software must display the following acknowledgement: “This product includes software developed by the KDIS Research Group at the University of Córdoba (Spain) and its contributors.” or cite the following reference:
J.R. Romero, J.M. Luna, S. Ventura (2012). datapro4j: the data processing library for Java. Dept. of Computer Science and Numerical Analysis, University of Córdoba (Spain). Available for download from http://www.uco.es/grupos/kdis/datapro4j
• Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
• Commercial use of this software or part of it is not allowed without specific prior written permission.
• Licensing and conditions are subject to change without notice.
Note: At the moment this software is provided in binary form as a Java library. Source code is not provided (we plan to release the Java source code in the near future).
This document provides a list of all packages with a summary of each. Each package has a section that lists its classes, interfaces and enumeration types, with a summary of each. Each class and interface section contains a description, summary tables, detailed member descriptions, and a relation table.
Private properties are omitted. Protected properties are shown when they are useful for external programmers.
In the near future, this library will be updated with the following features (not necessarily in this order):
• Listeners in strategies.
• Graphical UI. (Some minor support is already provided.)
• Generation of synthetic datasets under precise constraints.
• Multipart datasets: datasets that cannot be fully stored in memory, so they need to be split and partially retrieved.
• Different data mining support.
• Wrappers for different datasets and tools.
  o A wrapper for Weka is under development.
• Access to different databases.
  o Access through JDBC to RDBMS engines (e.g. MySQL, Oracle) is under development.
  o Access to NoSQL engines (e.g. Cassandra) is under development.
• More dataset formats:
  o Currently, the following formats are supported: ARFF, KEEL, CSV, Excel
  o The following formats are under development: XRFF
The library base package. The software is mainly divided into three different components:
• Dataset and columns. The logical abstract representation of a dataset and its attributes.
• Dataset and sources. The physical representation of a dataset, serialized in files, stored in databases or any other device.
• Dataset and strategies. Any algorithm running on a single dataset, set of datasets or column.
Name | datapro
Qualified Name | es::uco::kdis::datapro
Only public strategies are described here. Developers can easily provide their own strategies.
Figure 1. Package es.uco.kdis.datapro.algorithm
Name | algorithm
Qualified Name | es::uco::kdis::datapro::algorithm
Figure 2. Package es.uco.kdis.datapro.algorithm.base
Name | base
Qualified Name | es::uco::kdis::datapro::algorithm::base
This class represents a generic strategy.
Strategy is a well-known design pattern in which algorithms are encapsulated into classes. Strategies should be executed using either a sequential or a step-by-step process. In general, every strategy is executed according to the following sequence:
• Creation: the strategy constructor should collect all the parameters that the algorithm requires to be initialized and executed for the first time. Build as many constructors as required.
• Initialization: the method initialize() implements any preprocessing step that the algorithm requires before it can be executed. This preprocessing is not part of the algorithm itself, but it should be run the first time the algorithm is invoked.
• Execution: the method execution() runs the algorithm once, using the parameters passed to the constructor and set up during initialization. If the algorithm has finished and cannot be invoked any more, the method setExecutable(false) should be called. Otherwise, execution is allowed until the stop criteria are fulfilled.
• Stop criteria: the method isExecutable() returns true if the algorithm can be executed once more over the dataset; false otherwise.
• Post-execution: any post-processing step has to be implemented by the method postexec().
• Result collection: final results are collected from the dataset, if changed, and returned by the method getResult().
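The sequence above can be sketched in plain Java as follows. This is a minimal, hypothetical illustration and not the library's code: SketchStrategy, DropDuplicatesSketch, StrategyLifecycleDemo, the run() driver and the method name execute() are stand-ins invented for this example. Only the lifecycle itself (initialization, repeated execution while isExecutable() returns true, post-execution, result collection) is taken from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an abstract strategy; it only mirrors the
// lifecycle described in the text and is NOT the datapro4j class.
abstract class SketchStrategy<D, R> {
    protected boolean executable = true; // execution flag, true by default
    protected D dataset;                 // data the strategy works on

    protected SketchStrategy(D dataset) { this.dataset = dataset; }

    public abstract void initialize();   // one-off preprocessing
    public abstract void execute();      // one step of the algorithm
    public abstract void postexec();     // post-processing
    public abstract R getResult();       // result collection

    public boolean isExecutable() { return executable; }
    protected void setExecutable(boolean b) { executable = b; }

    // Assumed driver: runs the whole lifecycle sequentially.
    public final R run() {
        initialize();
        while (isExecutable()) {
            execute();
        }
        postexec();
        return getResult();
    }
}

// Toy strategy: visits one value per step and keeps first occurrences only.
class DropDuplicatesSketch extends SketchStrategy<List<Integer>, List<Integer>> {
    private List<Integer> seen;
    private int cursor;

    DropDuplicatesSketch(List<Integer> data) { super(new ArrayList<>(data)); }

    @Override public void initialize() { seen = new ArrayList<>(); cursor = 0; }

    @Override public void execute() {
        if (cursor < dataset.size()) {
            Integer v = dataset.get(cursor++);
            if (!seen.contains(v)) seen.add(v);
        }
        // Stop criterion: nothing left to visit.
        if (cursor >= dataset.size()) setExecutable(false);
    }

    @Override public void postexec() { dataset = seen; }

    @Override public List<Integer> getResult() { return dataset; }
}

class StrategyLifecycleDemo {
    public static void main(String[] args) {
        List<Integer> result =
            new DropDuplicatesSketch(Arrays.asList(3, 1, 3, 2, 1)).run();
        System.out.println(result); // prints [3, 1, 2]
    }
}
```

A step-by-step caller would invoke initialize() once and then call execute() itself while isExecutable() holds, instead of using the run() helper.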
Figure 3. Class DatasetStrategy
Name | DatasetStrategy
Qualified Name | es::uco::kdis::datapro::algorithm::base::DatasetStrategy
Visibility | public
Abstract | true
Base Classifier |
Realized Interface |
Execution flag. It is protected only for inheritance purposes and should never be modified directly.
Type | boolean
Default Value | true
Visibility | protected
Multiplicity |
Dataset used by the strategy.
Type | Dataset
Default Value |
Visibility | protected
Multiplicity |
This method is invoked to execute the strategy.
Type | void
Visibility | public
Is Abstract | true
Parameter |
Getter method for the dataset attribute.
Type | Dataset
Visibility | protected
Is Abstract | false
Parameter |
This method returns an object containing the result of the process.
Type | Object
Visibility | public
Is Abstract | true
Parameter |
This method runs the initialization process of the strategy.
Type | void
Visibility | public
Is Abstract | true
Parameter |
This method returns true if the strategy is in an executable state.
Type | boolean
Visibility | public
Is Abstract | false
Parameter |
This method should be invoked, if required, after the strategy execution.
Type | void
Visibility | public
Is Abstract | true
Parameter |
This method sets the dataset to be used by the strategy.
Type | void
Visibility | protected
Is Abstract | false
Parameter | • inout data : Dataset
This method sets the current executable state of the strategy.
Type | void
Visibility | protected
Is Abstract | false
Parameter | • in bExecutable : boolean
Name | Related Element
 | • EqualFrequencyDiscretization
 | • EqualWidthDiscretization
 | • MDLPDiscretize
 | • RemoveDuplicates
 | • IntruderAttack
 | • KFolds
 | • RemovePercentage
 | • DatasetStatistics
Figure 4. Package es.uco.kdis.datapro.algorithm.intruder
Name | intruder
Qualified Name | es::uco::kdis::datapro::algorithm::intruder
This class implements the Average Attack. This attack strategy assigns the maximum value (push attack) or the minimum value (nuke attack) to the target item. The filler items are selected randomly, and their values are drawn at random from a normal distribution, using the mean and standard deviation of each item.
For a further description see the following paper:
B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness. ACM Trans. Internet Technol. vol. 7, no. 4, pp. 23, 2007.
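As a rough illustration of the attack described above, the following self-contained sketch builds a single attack profile from a ratings matrix. It is not the library's AverageAttack class: the class and method names, the use of Double.NaN for missing ratings, the clamping of generated values to the rating scale, and the simplification of choosing fillers only by a probability (ignoring a fixed filler-set size) are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of one Average Attack profile; not datapro4j code.
class AverageAttackSketch {

    // ratings[i][j] is the rating of user i for item j; Double.NaN = missing.
    static double[] buildProfile(double[][] ratings, int target, boolean push,
                                 double min, double max,
                                 double fillerProb, long seed) {
        int items = ratings[0].length;
        Random rnd = new Random(seed);
        double[] profile = new double[items];
        Arrays.fill(profile, Double.NaN); // unrated by default

        for (int j = 0; j < items; j++) {
            if (j == target) {
                // Push attack: maximum rating; nuke attack: minimum rating.
                profile[j] = push ? max : min;
                continue;
            }
            if (rnd.nextDouble() >= fillerProb) continue; // not a filler item
            double[] ms = meanAndStd(ratings, j);
            if (Double.isNaN(ms[0])) continue;            // item has no ratings
            // Filler value drawn from N(mean_j, std_j).
            double v = ms[0] + ms[1] * rnd.nextGaussian();
            // Assumption: keep the value inside the rating scale.
            profile[j] = Math.max(min, Math.min(max, v));
        }
        return profile;
    }

    // Mean and (population) standard deviation of the non-missing values.
    static double[] meanAndStd(double[][] ratings, int col) {
        double sum = 0, sumSq = 0;
        int n = 0;
        for (double[] row : ratings) {
            double v = row[col];
            if (!Double.isNaN(v)) { sum += v; sumSq += v * v; n++; }
        }
        if (n == 0) return new double[] {Double.NaN, Double.NaN};
        double mean = sum / n;
        double var = Math.max(0.0, sumSq / n - mean * mean);
        return new double[] {mean, Math.sqrt(var)};
    }
}
```

Calling buildProfile once per attack instance, each time with a different seed, would yield the set of injected profiles.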
Name | AverageAttack
Qualified Name | es::uco::kdis::datapro::algorithm::intruder::AverageAttack
Visibility | public
Abstract | false
Base Classifier | • IntruderAttack
Realized Interface |
Parameterized Constructor.
• oDataset The original dataset
• iNumAttacks The number of attack instances
• bPush The attack type (true, push; false, nuke)
• iTarget The target item (the column attribute/item index)
• iNumFillers The size of the filler item set: -1 for a random size, >0 for a fixed size
• dXRand The possibility of choosing an item as a selected/filler item
• iSeed The random seed
Type |
Visibility | public
Is Abstract | false
Parameter | • in bPush : boolean • in dXRand : double • in iNumAttacks : int • in iNumFillers : int • in iSeed : int • in iTarget : int • inout oDataset : Dataset
The Average Attack does not use the selected item set.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
Initialization method.
Type | void
Visibility | public
Is Abstract | false
Parameter |
In the Average Attack, the values for filler items are randomly drawn from a normal distribution, using the mean and standard deviation of each item.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
The Average Attack does not use the selected item set.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
Name | Related Element
 | • IntruderAttack
This class implements the Bandwagon Attack. This attack strategy assigns the maximum value (push attack) to the target item. Then, a set of items, named selected items, is chosen among the most visible items.
Visible items are those with a high mean and a high evaluation density. For a further description see the following paper:
B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness. ACM Trans. Internet Technol. vol. 7, no. 4, pp. 23, 2007.
Name | BandwagonAttack
Qualified Name | es::uco::kdis::datapro::algorithm::intruder::BandwagonAttack
Visibility | public
Abstract | false
Base Classifier | • IntruderAttack
Realized Interface |
The density threshold, i.e. the minimum number of values in the column.
Type | double
Default Value |
Visibility | protected
Multiplicity |
The visibility threshold, i.e. the possibility of choosing an item to act as a selected item.
Type | double
Default Value |
Visibility | protected
Multiplicity |
It stores the mean and standard deviation of the overall dataset.
Type | Double
Default Value | new ArrayList<Double>()
Visibility | protected
Multiplicity | 0..*
The array of columns whose visibility exceeds the thresholds dXVisibility and dXDensity.
Type | Integer
Default Value | new ArrayList<Integer>()
Visibility | package
Multiplicity | 0..*
The array of means of the columns whose visibility exceeds the thresholds dXVisibility and dXDensity.
Type | Double
Default Value | new ArrayList<Double>()
Visibility | package
Multiplicity | 0..*
Parameterized Constructor:
• oDataset The original dataset
• iNumAttacks The number of attack instances
• iTarget The target item (the column attribute/item index)
• iNumFillers The size of the filler item set: -1 for a random size, >0 for a fixed size
• iNumSelected The size of the selected item set
• dVisibility The visibility threshold (absolute value of the column mean)
• dDensity The density threshold (absolute number of instances in the column, not counting null, empty or missing values)
• dXRand The possibility of choosing an item as a filler item
• iSeed The random seed
Type |
Visibility | public
Is Abstract | false
Parameter | • in dDensity : double • in dVisibility : double • in dXRand : double • in iNumAttacks : int • in iNumFillers : int • in iNumSelected : int • in iSeed : int • in iTarget : int • inout oDataset : Dataset
Creates the set of selected items. Its size is fixed by the iNumSelected property.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
Initialization method for the strategy.
Type | void
Visibility | public
Is Abstract | false
Parameter |
Sorts the columns using their mean as the comparison metric. This method implements the QuickSort algorithm.
• iInit The initial position in the array
• iEnd The end position in the array
Type | void
Visibility | protected
Is Abstract | false
Parameter | • in iEnd : int • in iInit : int
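The ordering step can be illustrated with a small in-place quicksort over two parallel arrays, one holding column indices and one holding the corresponding means. This is a sketch under stated assumptions rather than the library's implementation: the class and method names are hypothetical, the descending order (most visible column first) is a guess, and the means are assumed to contain no NaN values.

```java
// Hypothetical sketch: quicksort of column indices by their mean (descending).
class ColumnMeanQuickSortSketch {

    // Sorts positions init..end (inclusive) of both arrays in place.
    // means[k] is assumed to be the mean of column cols[k].
    static void sortByMean(int[] cols, double[] means, int init, int end) {
        if (init >= end) return;
        double pivot = means[init + (end - init) / 2];
        int i = init;
        int j = end;
        while (i <= j) {
            while (means[i] > pivot) i++; // already on the "large" side
            while (means[j] < pivot) j--; // already on the "small" side
            if (i <= j) {
                swap(cols, means, i, j);
                i++;
                j--;
            }
        }
        sortByMean(cols, means, init, j);
        sortByMean(cols, means, i, end);
    }

    // Swaps position a and b in both arrays so they stay aligned.
    private static void swap(int[] cols, double[] means, int a, int b) {
        int c = cols[a];
        cols[a] = cols[b];
        cols[b] = c;
        double m = means[a];
        means[a] = means[b];
        means[b] = m;
    }
}
```

For example, sorting columns {0, 1, 2, 3} with means {2.0, 4.5, 1.0, 3.0} over positions 0 to 3 leaves the columns in the order {1, 3, 0, 2}.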
In the Bandwagon Attack, the values for filler items are randomly drawn from a normal distribution, using the mean and standard deviation of the overall dataset.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
Sets the values of the selected items. In the Bandwagon Attack, each selected item has the maximum value.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
Selects the columns that exceed the visibility and density thresholds.
Type | void
Visibility | protected
Is Abstract | false
Parameter |
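The threshold-based selection of visible columns can be sketched as below. The code is a hypothetical illustration, not the BandwagonAttack source: the class and method names are invented, missing values are represented as Double.NaN, and "exceed" is read as a strict comparison of the column mean against the visibility threshold and of the number of non-missing values against the density threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pick the columns whose mean and number of
// non-missing values both exceed the given thresholds.
class VisibleColumnsSketch {

    // ratings[i][j] is the value of instance i in column j; NaN = missing.
    static List<Integer> select(double[][] ratings,
                                double visibility, double density) {
        List<Integer> selected = new ArrayList<>();
        if (ratings.length == 0) return selected;
        for (int j = 0; j < ratings[0].length; j++) {
            double sum = 0;
            int n = 0; // non-missing values in column j
            for (double[] row : ratings) {
                if (!Double.isNaN(row[j])) {
                    sum += row[j];
                    n++;
                }
            }
            // Density check first, so the mean is only computed when n > 0.
            if (n > 0 && n > density && sum / n > visibility) {
                selected.add(j);
            }
        }
        return selected;
    }
}
```

The selected item set would then be drawn from the returned columns, and each selected item would receive the maximum value.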
Name | Related Element
 | • ReverseBandwagonAttack
 | • IntruderAttack
Name | DatasetStatistics
Qualified Name | es::uco::kdis::datapro::algorithm::intruder::DatasetStatistics
Visibility | public
Abstract | false
Base Classifier | • DatasetStrategy
Realized Interface |
All attributes are private.
Constructor. A parameter is required:
• data Dataset over which the statistical strategy will be executed.
Type |
Visibility | public