last update: 2014  

This page contains a summary of resources of interest for research. It is not intended to be an exhaustive list, only a set of references to start from. From this page you can also browse to further information on my software and research.

Specific data sets

Data set Owner Context Description
The SEER data [+] U.S. National Cancer Institute N/A Diagnosis
ICML 2009 dataset (2009) [+] UCSD Dept. of Computer Science and Engineering Web mining - Usage Detection of malicious URLs (spam, phishing, exploits, and so on)
The Public Terabyte Dataset Project (2010) [+] Bixo Labs / Amazon Web Mining - Content & Usage The data comes from a crawl of 50-200M pages from the 100K top (by traffic) English language domains.
The Internet Traffic Database (2008) [+] Lawrence Berkeley National Laboratory / ACM SIGCOMM Web mining / Usage HTTP requests on different servers
Web->KB Project [+] CMU World Wide Knowledge Base Web Mining - Content / MRDM It contains: (1) A data set consisting of classified Web pages. (2) A relational data set describing both pages and hyperlinks. (3) A subset of the 4 Universities dataset containing web pages and hyperlink data. (4) 20 newsgroups dataset (5) 7sectors dataset.
Syskill and Webert Web Page Ratings [+] UCI User Ratings To predict user ratings on web pages
KDD Cup 2005 [+] ACM SIGKDD N/A Query categorization (800,000 queries into 67 predefined categories)
KDD Cup 2007 [+] ACM SIGKDD User Ratings On predicting aspects of movie rating behavior.
MovieLens Data Sets [+] GroupLens Research User Ratings They currently have three datasets available: (1) 100,000 ratings for 1682 movies by 943 users (2) 1 million ratings for 3900 movies by 6040 users (3) 10 million ratings and 100,000 tags for 10681 movies by 71567 users
Anonymous Ratings from the Jester Online Joke Recommender System [+] Jester 4.0 User Ratings Dataset 1: Over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003. Dataset 2: Over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 63,974 users, collected between November 2006 and May 2009.
Book-Crossing Dataset [+] IIF – Institut für Informatik – Freiburg User Ratings Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
Last.fm - Music Recommendation Datasets for Research (2010) [+] Óscar Celma, UOC User Ratings This dataset contains [user, artist, plays] tuples (for ~360,000 users) collected from Last.fm API.
Reuters 21578 [+] David Lewis Text mining Reuters 21578
Web Spam Detection [+] Yahoo! Research Barcelona Spam Detection WEBSPAM-UK2007 and WEBSPAM-UK2006, and older
The Enron dataset [+] CMU.edu Real e-mail content It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.
CiteULike Available Datasets [+] CiteULike N/A The file constitutes an anonymous dump of who posted what and when the posting took place. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available.
Wikipedia Database Complete Dump [+] Wikipedia Text mining The latest complete dump of the English-language Wikipedia
The EUR-Lex datasets [+] TU Darmstadt Text mining The EUR-Lex text collection is a collection of documents about European Union law. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law.
The 4-Universities Dataset [+] CMU World Wide Knowledge Base Web Mining This data set contains WWW pages collected from computer science departments of various universities in January 1997 by the Web->KB project of the CMU text learning group. The 8,282 pages were manually classified into 7 categories.
The 4-Universities Dataset (Relational version) [+] CMU World Wide Knowledge Base Web Mining - ILP The data consists of relations suitable for providing to FOIL, as well as the complete text of all the web pages and also of anchors and the text surrounding anchors.

Data set repositories

Repository Owner
Datasets for Data Mining, Analytics and Knowledge Discovery [+] KDnuggets
UC Irvine Machine Learning Repository [+][+] UCI KDD Archive
AWS Developer Community [+] Amazon Web Services
Intl. Network for Social Network Analysis [+] INSNA
Datasets for training [+] UCLA Statistics datasets
Trust network datasets (social network datasets) [+] TrustLet.org
Data for Research (by categories) [+] Daniel Lemire
Information Retrieval Resources (Niraj) [+] Niraj Kumar
IR Multilingual Resources at UniNE (Stemming Dictionaries) [+] Université de Neuchâtel
Public Databases @ Bixo Labs [+] Bixo Labs
Frequent Itemset Mining Dataset Repository [+] FIMI workshops (2003/04)
The LUCS-KDD Discretised/Normalised (V2) ARM and CARM Data Library [+] Frans Coenen
Regression datasets [+] Luis Torgo (Univ. Porto)
PMML Sample Models [+] Data Mining Group
EDM datasets [+] PSLC DataShop
Data Mining and Exploration (for students) [+] The University of Edinburgh
SWEO Community Project: Linking Open Data on the Semantic Web [+] W3C
The Text REtrieval Conference (TREC) datasets [+] US National Institute of Standards and Technology
GoogleLabs Public Data Explorer [+] Google
The KEEL dataset repository [+] KEEL Spanish Research Project

Resources and tools

Resource Owner Description
WorkGenesis [+] J.R. Romero (Univ. Córdoba) Meta-tool for the quick construction of data intensive workflow management systems.
(In house) test set generator v3.2 [+] Frans Coenen (Univ. Liverpool) The data generator is intended to produce data sets for use in the testing of Association Rule Mining (ARM) algorithms, but may very well have other uses. Written in Java.
Software from KARYPIS Lab [+] University of Minnesota Software on partitioning, clustering, information retrieval, etc.
Orange [+] BioLab Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.
Knime [+] Knime.com AG KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is used by professionals in both industry and academia in over 60 countries.
RapidMiner [+] Rapid-I It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Data Integration, Analytical ETL, Data Analysis, and Reporting in one single suite. Powerful but intuitive graphical user interface for the design of analysis processes. Repositories for process, data and metadata handling.
KEEL [+] Several Spanish universities (Spanish National Projects TIC2002-04036-C05, TIN2005-08386-C05 and TIN2008-06681-C06) KEEL is an open source (GPLv3) Java software tool to assess evolutionary algorithms for Data Mining problems including regression, classification, clustering, pattern mining and so on. It contains a large collection of classical knowledge extraction algorithms, preprocessing techniques (training set selection, feature selection, discretization, imputation methods for missing values, etc.), Computational Intelligence based learning algorithms, including evolutionary rule learning algorithms based on different approaches (Pittsburgh, Michigan, IRL, ...), and hybrid models such as genetic fuzzy systems, evolutionary neural networks, etc. It allows a complete analysis of any learning model in comparison to existing ones, including a statistical test module for comparison.
Weka [+] University of Waikato Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Frequent Pattern Mining Implementations (C++) [+] Bart Goethals Implementation of several ARM algorithms: Apriori, NDI, Eclat, FP-Growth, DIC, etc.
Frequent Itemset Mining Implementations Repository [+] FIMI Implementation of several algorithms for Frequent Itemset Mining
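
As a rough illustration of what the frequent itemset mining implementations listed above compute, here is a minimal Apriori sketch in plain Python. It is not taken from any of the listed packages: the function name `apriori`, the `min_support` parameter (an absolute count, not a fraction) and the toy `baskets` data are assumptions made for this example only, and the code favours readability over the optimisations real implementations use.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: one pass over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy market-basket data, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The level-wise loop is the part that characterises Apriori: candidates of size k are built only from frequent itemsets of size k-1, so an itemset with an infrequent subset is never counted. The implementations listed above should be preferred for data sets of realistic size.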