- July 26, 2012. Project launched! Initial version of the library with the complete programmer's guide.
datapro4j is a Java library (JSE 6+) for processing and handling data from heterogeneous datasets, regardless of their nature and of the data source from which they are extracted. The base structure in datapro4j is the tabular dataset, where data instances are made of a number of values distributed into columns of different types: numerical/real, nominal, categorical, integer, date, range, etc. Using this tabular structure as a base, other more complex datasets are provided, such as multi-relational datasets.
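To make the idea of a tabular dataset with typed columns concrete, here is a minimal sketch in plain Java. It is not datapro4j's API: the names `TabularSketch`, `Column`, `Dataset`, `addInstance` and `getValue` are hypothetical and chosen only for illustration. The actual classes are described in the programmer's guide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a tabular dataset; NOT datapro4j's actual API. */
class TabularSketch {

    /** A named column whose values all share one type (e.g. Double, Integer, String). */
    static class Column<T> {
        private final String name;
        private final Class<T> type;
        private final List<T> values = new ArrayList<T>();

        Column(String name, Class<T> type) {
            this.name = name;
            this.type = type;
        }

        String getName() { return name; }
        int size() { return values.size(); }
        T get(int row) { return values.get(row); }

        /** True if the value is null (missing) or an instance of the column type. */
        boolean accepts(Object value) { return value == null || type.isInstance(value); }

        void add(Object value) { values.add(type.cast(value)); }
    }

    /** A dataset is an ordered list of columns; an instance is one row across them. */
    static class Dataset {
        private final List<Column<?>> columns = new ArrayList<Column<?>>();

        void addColumn(Column<?> column) { columns.add(column); }
        int numColumns() { return columns.size(); }
        int numInstances() { return columns.isEmpty() ? 0 : columns.get(0).size(); }

        /** Adds one instance, given one value per column in column order. */
        void addInstance(Object... row) {
            if (row.length != columns.size()) {
                throw new IllegalArgumentException("expected " + columns.size() + " values");
            }
            // Validate the whole row first so a bad value never leaves a partial instance.
            for (int i = 0; i < row.length; i++) {
                if (!columns.get(i).accepts(row[i])) {
                    throw new IllegalArgumentException(
                        "wrong type for column " + columns.get(i).getName());
                }
            }
            for (int i = 0; i < row.length; i++) {
                columns.get(i).add(row[i]);
            }
        }

        Object getValue(int row, int col) { return columns.get(col).get(row); }
    }
}
```

In this sketch, a dataset with a real-valued column and a nominal column would be built by adding `new TabularSketch.Column<Double>("length", Double.class)` and `new TabularSketch.Column<String>("species", String.class)`, then calling `addInstance(5.1, "setosa")`.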
Programmers in many application domains spend a long time implementing routine operations on data (e.g. reading and writing data, running common algorithms or processing values), and this usually demands considerable effort. datapro4j simplifies these tasks. For example, it eases migration between data formats, since the programmer only needs to read values from one format and write them to another: data could be read from a Cassandra database, processed, and finally written to an ARFF file (the Weka format) without worrying about the internal structure of either data source.
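The read-then-write migration described above can be illustrated with a small, self-contained sketch that converts CSV text into ARFF text by hand. It deliberately uses plain Java rather than datapro4j, so the class and method names (`CsvToArffSketch`, `convert`, `attributeType`) are hypothetical. It also assumes a very simple CSV: a header row, no quoted fields, no empty fields, and the same number of fields on every row.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of format migration; it does NOT use datapro4j's API. */
class CsvToArffSketch {

    /** Converts simple CSV text (header row, no quoted fields) into ARFF text. */
    static String convert(String relation, String csv) {
        String[] lines = csv.trim().split("\\r?\\n");
        String[] header = lines[0].split(",");
        List<String[]> rows = new ArrayList<String[]>();
        for (int i = 1; i < lines.length; i++) {
            rows.add(lines[i].split(","));
        }

        StringBuilder arff = new StringBuilder();
        arff.append("@relation ").append(relation).append("\n\n");
        for (int col = 0; col < header.length; col++) {
            arff.append("@attribute ").append(header[col].trim()).append(' ')
                .append(attributeType(rows, col)).append('\n');
        }
        arff.append("\n@data\n");
        for (String[] row : rows) {
            for (int col = 0; col < header.length; col++) {
                if (col > 0) arff.append(',');
                arff.append(row[col].trim());
            }
            arff.append('\n');
        }
        return arff.toString();
    }

    /** Returns "numeric" if every value parses as a number, else a nominal value set. */
    static String attributeType(List<String[]> rows, int col) {
        Set<String> distinct = new LinkedHashSet<String>();
        boolean numeric = true;
        for (String[] row : rows) {
            String value = row[col].trim();
            distinct.add(value);
            try {
                Double.parseDouble(value);
            } catch (NumberFormatException e) {
                numeric = false;
            }
        }
        if (numeric) return "numeric";
        StringBuilder nominal = new StringBuilder("{");
        for (String value : distinct) {
            if (nominal.length() > 1) nominal.append(',');
            nominal.append(value);
        }
        return nominal.append('}').toString();
    }
}
```

Even this toy version has to know details of both formats (the header row on one side, attribute declarations on the other), which is the kind of work a data processing library is meant to hide from the programmer.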
datapro4j has been constructed following five highly demanding design criteria:
- performance: operations on data are implemented with Java performance considerations in mind;
- interoperability: the library is conceived to be used with external data providers, services and tools;
- integrability: datapro4j saves much of the work usually required of programmers who need to access and process values from different data sources;
- scalability: it can be easily extended to read new data formats, support more operations or access other external interfaces and toolkits (in fact, we are constantly increasing its functionality); and
- maintainability: programmers can provide their own implementation of different processing parts, which supports adaptation to changing requirements.
At present, datapro4j provides support for the following data sources, both for reading and for writing:
- CSV (Comma-Separated Values) files.
- Excel files.
- ARFF files (for Weka).
- KEEL files.
Further, the following data formats are under development and will be released soon:
- XRFF files (for Weka).
- Cassandra databases (through Hector and JDBC).
- JDBC databases (MySQL, Oracle, etc.).
In addition, datapro4j allows programmers to easily develop their own algorithms on datasets. Some algorithms are already provided or under development, and their number is constantly increasing:
- Attack algorithms for rating-based collaborative recommender systems.
- Data preprocessing algorithms (e.g. discretization, instance elimination) and cross-validation.
- Accessing external libraries from datapro4j, e.g. JCLEC.
- Statistical analysis of data.
- Weka wrapping: datapro4j allows the programmer to access Weka's algorithms from code, independently of the dataset used.
No need to reinvent the wheel! If you are interested in using datapro4j in your code, this page contains documentation and examples that provide both an overall picture of the library and more detailed information on its features and advantages.
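As an illustration of the kind of preprocessing algorithm listed above, the following sketch shows equal-width discretization of a numeric column in plain Java. This is one common discretization technique, picked as an example. It is not taken from datapro4j, and the names `EqualWidthSketch` and `discretize` are hypothetical.

```java
/** Equal-width discretization sketch in plain Java; NOT datapro4j's implementation. */
class EqualWidthSketch {

    /** Maps each value to a bin index in [0, bins - 1] using equal-width intervals. */
    static int[] discretize(double[] values, int bins) {
        if (bins < 1) {
            throw new IllegalArgumentException("bins must be >= 1");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has zero width, so every value falls into bin 0.
            int bin = (width == 0.0) ? 0 : (int) ((values[i] - min) / width);
            // The maximum value would land one past the end, so clamp it to the last bin.
            result[i] = Math.min(bin, bins - 1);
        }
        return result;
    }
}
```

For example, with five bins over the range 0 to 10, each bin is 2 units wide, so the values 0, 1, 2, 3, 4 and 10 map to bins 0, 0, 1, 1, 2 and 4.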
This project will be externally hosted in the future (including source code). For now, you can download the latest version of the library here as a JAR file.
This library is built to be compiled under JSE 6+.
Further, consider the following external dependencies:
- Classes for handling Excel datasets [website|download]
- Classes for wrapping Weka 3.6's algorithms (coming soon)
- Classes for accessing JDBC datasets, if required (coming soon)
- Classes for accessing Cassandra databases through Hector, if required (coming soon)
Copyright (c) 2012 The authors (University of Cordoba, Spain)
This software was developed by members of the Knowledge Discovery and Intelligent Systems (KDIS) research group at the University of Córdoba, Spain.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
Redistribution and use of binary forms, with or without modification, are permitted if the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the disclaimer above.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- All advertising materials or publication mentioning features or use of this software must display the following acknowledgement: "This product includes datapro4j, a software developed by the KDIS Research Group at the University of Córdoba (Spain) and its contributors." or cite the following reference:
- J.R. Romero, J.M. Luna, S. Ventura (2012). datapro4j: the data processing library for Java. Dept. of Computer Science and Numerical Analysis, University of Córdoba (Spain). Available for download from http://www.uco.es/grupos/kdis/datapro4j
- Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
- Commercial use of this software or part of it is not allowed without specific prior written permission.
- Licensing and conditions are subject to change without notice.
At present, please cite the following reference if you make use of datapro4j:
J.R. Romero, J.M. Luna, S. Ventura (2012). datapro4j: the data processing library for Java. Dept. of Computer Science and Numerical Analysis, University of Córdoba (Spain). Available for download from http://www.uco.es/grupos/kdis/datapro4j
Note that we freely and gladly share our work with you, so please send us any extension or addition you make to the library or editor code. We would also appreciate it if you let us know about any tool built using this library. This will take you only a short time and will help us improve the library in the future. Thank you!
Note: At present datapro4j is only distributed as a binary JAR file until the official download site opens (Google Code, SourceForge, etc.). Very soon!
- July 26, 2012
- Initial version of datapro4j [download]
You may download the full programmer's guide of datapro4j in PDF format, or browse the online version.
Code samples and tutorials will be available soon.
- Java technologies:
  - Java SE downloads.
  - Apache POI.
- Data sources:
  - The KEEL project.
  - The Weka project.
  - The Comma-Separated Values (CSV) IETF specification.
If you have any questions, improvements or new developments based on the library to report, please contact us at jrromero -at- uco.es.