Multi-Label Classification Dataset Repository

This website provides a large compilation of multi-label classification datasets, obtained from different sources. For further information, please contact Jose M. Moyano (jmoyano@uco.es).

Datasets

For each dataset we provide a short description as well as some characterization metrics: the number of instances (m), the number of attributes (d), the number of labels (q), cardinality (Card), density (Dens), diversity (Div), the average imbalance ratio per label (avgIR), the ratio of unconditionally dependent label pairs according to a chi-square test (rDep), and the complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is cardinality divided by the number of labels. Diversity is the number of distinct labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance over all labels; the greater the avgIR, the more imbalanced the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and of the partition methods used is given in the MLDA documentation. We also used MLDA for the characterization and partitioning of the datasets.
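As a reference, the following sketch (illustrative only, not the MLDA implementation) shows how most of these metrics can be computed from a binary label matrix; the number of possible labelsets used for the diversity is assumed here to be min(2^q, m), and rDep is omitted since it requires pairwise chi-square tests.

import java.util.HashSet;
import java.util.Set;

// Illustrative computation of Card, Dens, Div, avgIR and complexity
// from a binary label matrix Y (m instances x q labels).
public class MultiLabelMetrics {

    public static void main(String[] args) {
        int[][] Y = {            // toy label matrix: 4 instances, 3 labels
            {1, 0, 0},
            {1, 1, 0},
            {0, 0, 1},
            {1, 0, 0}
        };
        int d = 10;              // number of input attributes (not part of Y)
        int m = Y.length, q = Y[0].length;

        // Cardinality: average number of labels per instance.
        double card = 0;
        for (int[] row : Y) for (int y : row) card += y;
        card /= m;

        // Density: cardinality divided by the number of labels.
        double dens = card / q;

        // Diversity: distinct labelsets divided by the number of possible
        // labelsets (assumed to be min(2^q, m); check the MLDA documentation).
        Set<String> labelsets = new HashSet<>();
        for (int[] row : Y) {
            StringBuilder sb = new StringBuilder();
            for (int y : row) sb.append(y);
            labelsets.add(sb.toString());
        }
        double possible = (q >= 63) ? m : Math.min((double) (1L << q), m);
        double div = labelsets.size() / possible;

        // avgIR: mean imbalance ratio, IR(label) = maxLabelCount / labelCount,
        // assuming every label occurs at least once.
        int[] counts = new int[q];
        for (int[] row : Y) for (int j = 0; j < q; j++) counts[j] += row[j];
        int maxCount = 0;
        for (int c : counts) maxCount = Math.max(maxCount, c);
        double avgIR = 0;
        for (int c : counts) avgIR += (double) maxCount / c;
        avgIR /= q;

        // Complexity: m x q x d, as in [Read 2010].
        double complexity = (double) m * q * d;

        System.out.printf("Card=%.3f Dens=%.3f Div=%.3f avgIR=%.3f complexity=%.2e%n",
                card, dens, div, avgIR, complexity);
    }
}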

We include the following partitions for each dataset:

  • Original: dataset as originally provided by its authors.
  • Full dataset: the entire dataset in Mulan/Meka format.
  • Random train-test: the dataset was randomly partitioned into train and test files, using 67% of data for training and 33% for testing.
  • Stratified train-test: the partition in 67% train and 33% test was performed by following the Iterative Stratification method proposed by Sechidis et al. 2011.
  • Random 5-folds CV: a random partition into 5 folds was performed, and the folds were then joined into 5 different train-test partitions, where in each case 4 folds are used for training and the remaining one for testing. Thus, each train-test partition uses different data for testing (see the sketch below).
  • Stratified 5-folds CV: a stratified partition into 5 folds was performed following the Iterative Stratification method proposed by Sechidis et al. 2011, and the folds were then joined into 5 different train-test partitions, where in each case 4 folds are used for training and the remaining one for testing.
  • Stratified 10-folds: a stratified partition into 10 folds was performed following the method proposed by J. Motl. In this case, the 10 folds are provided directly, instead of the train-test partitions obtained as combinations of folds. For further information about the 10-fold partition, please contact jan.motl@fit.cvut.cz.

In all cases, with the exception of the original data, the datasets are provided in both Mulan and Meka formats.
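The following sketch (illustrative only; the repository partitions themselves were generated with MLDA) shows how the random 67/33 split and the random 5-fold train-test combinations described above can be built from instance indices:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative construction of a random 67/33 train-test split and of the
// 5 train-test partitions obtained by joining 4 folds for training and
// leaving the remaining one for testing.
public class RandomPartitions {

    public static void main(String[] args) {
        int m = 593;                              // e.g. instances in Emotions
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < m; i++) idx.add(i);
        Collections.shuffle(idx, new Random(42)); // fixed seed for repeatability

        // Random train-test: 67% of the data for training, 33% for testing.
        int cut = (int) Math.round(0.67 * m);
        List<Integer> train = idx.subList(0, cut);
        List<Integer> test = idx.subList(cut, m);
        System.out.println("train=" + train.size() + " test=" + test.size());

        // Random 5-fold CV: each fold is used once for testing while the
        // other 4 folds form the training set.
        int k = 5;
        for (int f = 0; f < k; f++) {
            List<Integer> testFold = new ArrayList<>();
            List<Integer> trainFolds = new ArrayList<>();
            for (int i = 0; i < m; i++) {
                if (i % k == f) testFold.add(idx.get(i));
                else trainFolds.add(idx.get(i));
            }
            System.out.println("fold " + f + ": train=" + trainFolds.size()
                    + " test=" + testFold.size());
        }
    }
}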

Dataset Domain  m    d    q    Card   Dens   Div   avgIR  rDep   m×q×d  
20NG Text 19300 1006 20 1.029 0.051 0.003 1.007 0.984 3.88E+08
3s-bbc1000 Text 352 1000 6 1.125 0.188 0.234 1.718 0.733 2.11E+06
3s-guardian1000 Text 302 1000 6 1.126 0.188 0.219 1.773 0.667 1.81E+06
3s-inter3000 Text 169 3000 6 1.142 0.190 0.172 1.766 0.400 3.04E+06
3s-reuters1000 Text 294 1000 6 1.126 0.188 0.219 1.789 0.667 1.76E+06
Bibtex Text 7395 1836 159 2.402 0.015 0.386 12.498 0.111 2.16E+09
Birds Audio 645 260 19 1.014 0.053 0.206 5.407 0.123 3.19E+06
Bookmarks Text 87860 2150 208 2.028 0.010 0.213 12.308 0.315 3.93E+10
CAL500 Music 502 68 174 26.044 0.150 1.000 20.578 0.192 5.94E+06
CHD_49 Medicine 555 49 6 2.580 0.430 0.531 5.766 0.267 1.63E+05
Corel16k001 Image 13770 500 153 2.859 0.019 0.349 34.155 0.142 1.05E+09
Corel16k002 Image 13760 500 164 2.882 0.018 0.354 37.678 0.128 1.13E+09
Corel16k003 Image 13760 500 154 2.829 0.018 0.350 37.058 0.137 1.06E+09
Corel16k004 Image 13840 500 162 2.842 0.018 0.351 35.899 0.126 1.12E+09
Corel16k005 Image 13850 500 160 2.858 0.018 0.364 34.936 0.133 1.11E+09
Corel16k006 Image 13860 500 162 2.885 0.018 0.361 33.398 0.128 1.12E+09
Corel16k007 Image 13920 500 174 2.886 0.017 0.371 37.715 0.120 1.21E+09
Corel16k008 Image 13860 500 168 2.883 0.017 0.357 36.200 0.121 1.16E+09
Corel16k009 Image 13880 500 173 2.930 0.017 0.373 36.446 0.119 1.20E+09
Corel16k010 Image 13620 500 144 2.815 0.020 0.345 32.998 0.147 9.81E+08
Corel5k Image 5000 499 374 3.522 0.009 0.635 189.568 0.030 9.33E+08
Delicious Text 16110 500 983 19.020 0.019 0.981 71.134 0.143 7.92E+09
Emotions Music 593 72 6 1.868 0.311 0.422 1.478 0.933 2.56E+05
Enron Text 1702 1001 53 3.378 0.064 0.442 73.953 0.141 9.03E+07
EukaryoteGO Biology 7766 12690 22 1.146 0.052 0.014 45.012 0.281 2.17E+09
EukaryotePseAAC Biology 7766 440 22 1.146 0.052 0.014 45.012 0.281 7.52E+07
Eurlex-dc Text 19350 5000 412 1.292 0.003 0.083 3.99E+10
Eurlex-ev Text 19350 5000 3993 5.310 0.001 0.851 3.86E+11
Eurlex-sm Text 19350 5000 201 2.213 0.011 0.129 1.94E+10
Flags Image 194 19 7 3.392 0.485 0.422 2.255 0.381 2.58E+04
Foodtruck Recommend. 407 21 12 2.290 0.191 0.285 7.095 0.409 1.03E+05
Genbase Biology 662 1186 27 1.252 0.046 0.048 37.315 0.157 2.12E+07
GnegativeGO Biology 1392 1717 8 1.046 0.131 0.074 18.448 0.536 1.91E+07
GnegativePseAAC Biology 1392 440 8 1.046 0.131 0.074 18.448 0.536 4.90E+06
GpositiveGO Biology 519 912 4 1.008 0.252 0.438 3.861 0.667 1.89E+06
GpositivePseAAC Biology 519 440 4 1.008 0.252 0.438 3.861 0.667 9.13E+05
HumanGO Biology 3106 9844 14 1.185 0.085 0.027 15.289 0.418 4.28E+08
HumanPseAAC Biology 3106 440 14 1.185 0.085 0.027 15.289 0.418 1.91E+07
Image Image 2000 294 5 1.236 0.247 0.625 1.193 0.900 2.94E+06
Imdb Text 120900 1001 28 2.000 0.071 0.037 25.124 0.868 3.39E+09
Langlog Text 1460 1004 75 1.180 0.016 0.208 39.267 0.035 1.10E+08
Mediamill Video 43910 120 101 4.376 0.043 0.149 256.405 0.342 5.32E+08
Medical Text 978 1449 45 1.245 0.028 0.096 89.501 0.039 6.38E+07
Ohsumed Text 13930 1002 23 1.663 0.072 0.082 7.869 0.526 3.21E+08
Nus-Wide BoW Image 269599 501 81 1.869 0.023 0.068 1.09E+10
Nus-Wide cVLADplus Image 269600 129 81 1.869 0.023 0.068 2.82E+09
PlantGO Biology 978 3091 12 1.079 0.090 0.033 6.690 0.318 3.63E+07
PlantPseAAC Biology 978 440 12 1.079 0.090 0.033 6.690 0.318 5.16E+06
rcv1subset1 Text 6000 47240 101 2.88 0.029 0.171 54.492 0.202 2.86E+10
rcv1subset2 Text 6000 47240 101 2.634 0.026 0.159 45.514 0.179 2.86E+10
rcv1subset3 Text 6000 47240 101 2.614 0.026 0.157 68.333 0.183 2.86E+10
rcv1subset4 Text 6000 47230 101 2.484 0.025 0.136 89.371 0.163 2.86E+10
rcv1subset5 Text 6000 47240 101 2.642 0.026 0.158 69.682 0.170 2.86E+10
Reuters-K500 Text 6000 500 103 1.462 0.014 0.135 54.081 0.080 3.09E+08
Scene Image 2407 294 6 1.074 0.179 0.234 1.254 0.933 4.25E+06
Slashdot Text 3782 1079 22 1.181 0.054 0.041 19.462 0.273 8.98E+07
Stackex_chemistry Text 6961 540 175 2.109 0.012 0.436 56.878 0.056 6.58E+08
Stackex_chess Text 1675 585 227 2.411 0.011 0.644 85.790 0.030 2.22E+08
Stackex_coffee Text 225 1763 123 1.987 0.016 0.773 27.241 0.017 4.88E+07
Stackex_cooking Text 10490 577 400 2.225 0.006 0.609 37.858 0.034 2.42E+09
Stackex_cs Text 9270 635 274 2.556 0.009 0.512 85.002 0.049 1.61E+09
Stackex_philosophy Text 3971 842 233 2.272 0.010 0.566 68.753 0.040 7.79E+08
tmc2007 Text 28600 49060 22 2.158 0.098 0.047 15.157 0.818 3.09E+10
tmc2007-500 Text 28600 500 22 2.22 0.101 0.041 17.134 0.818 3.15E+08
VirusGO Biology 207 749 6 1.217 0.203 0.266 4.041 0.400 9.30E+05
VirusPseAAC Biology 207 440 6 1.217 0.203 0.266 4.041 0.400 5.46E+05
Water-quality Chemistry 1060 16 14 5.073 0.362 0.778 1.767 0.473 2.37E+05
Yahoo_Arts Text 7484 23150 26 1.654 0.064 0.080 94.738 0.338 4.50E+09
Yahoo_Business Text 11210 21920 30 1.599 0.053 0.021 880.178 0.209 7.37E+09
Yahoo_Computers Text 12440 34100 33 1.507 0.046 0.034 176.695 0.364 1.40E+10
Yahoo_Education Text 12030 27530 33 1.463 0.044 0.042 168.114 0.199 1.09E+10
Yahoo_Entertainment Text 12730 32000 21 1.414 0.067 0.026 64.417 0.367 8.55E+09
Yahoo_Health Text 9205 30610 32 1.644 0.051 0.036 653.531 0.192 9.02E+09
Yahoo_Recreation Text 12830 30320 22 1.429 0.065 0.041 12.203 0.455 8.56E+09
Yahoo_Reference Text 8027 39680 33 1.174 0.036 0.034 461.863 0.169 1.05E+10
Yahoo_Science Text 6428 37190 40 1.45 0.036 0.071 52.632 0.196 9.56E+09
Yahoo_Social Text 12110 52350 39 1.279 0.033 0.030 257.704 0.189 2.47E+10
Yahoo_Society Text 14510 31800 27 1.67 0.062 0.073 302.068 0.382 1.25E+10
Yeast Biology 2417 103 14 4.237 0.303 0.082 7.197 0.670 3.49E+06
Yelp Text 10810 671 5 1.638 0.328 1.000 2.876 0.700 3.63E+07

Description of the datasets

20NG [Lang 2008]: It is a compilation of around 20000 posts to 20 newsgroups. Around 1000 posts are available for each group.

3sources [Greene et al. 2009]: These datasets include 948 news articles covering 416 distinct news stories from the period February–April 2009, collected from 3 sources: BBC, Reuters and The Guardian. Of these stories, 169 were reported by all three sources, 194 by two sources, and 53 appeared in a single news source. Each story was manually annotated with one or more of six topical labels: business, entertainment, health, politics, sport, technology. In this way, three datasets with the news from BBC, Reuters and The Guardian, respectively, were created. A feature selection method was applied to reduce the feature space and achieve better performance, keeping 1000 features for each dataset. In addition, a dataset with the intersection of these three datasets (3sources-inter3000, containing the news present in all three sources) was created using the union of the 1000 features of each of the three datasets. The 3sources-inter3000 dataset can also be considered a Multi-View Multi-Label (MVML) dataset, since it includes features from 3 distinct sources. The original data was downloaded from http://mlg.ucd.ie/datasets/3sources.html

Bibtex [Katakis et al. 2008]: This dataset is based on the data of the ECML/PKDD 2008 discovery challenge. It contains 7395 bibtex entries from the BibSonomy social bookmark and publication sharing system, annotated with a subset of the tags assigned by BibSonomy users.

Birds [Briggs et al. 2013]: It is a dataset to predict the set of bird species that are present, given a ten-second audio clip.

Bookmarks [Katakis et al. 2008]: It is based on the data of the ECML/PKDD 2008 discovery challenge and contains bookmark entries from the BibSonomy system.

CHD_49 [Shao et al. 2013]: This dataset has information on coronary heart disease (CHD) in traditional Chinese medicine (TCM). It has been filtered by specialists, removing irrelevant features and keeping only 49 of them.

CAL500 [Turnbull et al. 2008]: It is a music dataset composed of 502 songs. Each one was manually annotated by at least three human annotators using a vocabulary of 174 tags concerning semantic concepts. These tags span 6 semantic categories: instrumentation, vocal characteristics, genres, emotions, acoustic quality of the song, and usage terms.

Corel5k [Duygulu et al. 2002]: Corel5k is a popular benchmark for image classification and annotation methods. It is based on 5000 Corel images.

Corel16k [Barnard et al. 2003]: It is derived from the popular ECCV 2002 benchmark dataset by eliminating the less frequently appearing labels.

Delicious [Tsoumakas et al. 2008]: This dataset contains textual data of web pages along with their tags.

Emotions [Tsoumakas et al. 2008]: Also called Music in [Read 2010]. It is a small dataset for classifying music into the emotions it evokes, according to the Tellegen-Watson-Clark model of mood: amazed-suprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely and angry-aggresive. It consists of 593 songs with 6 classes.

Enron [Read et al. 2008]: The Enron dataset is a subset of the Enron email corpus, labelled with a set of categories. It is based on a collection of email messages that were categorized into 53 topic categories, such as company strategy, humour and legal advice.

Eukaryote [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 7766 sequences for Eukaryote species. Both the GO (Gene Ontology) features and the PseAAC features (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 22 subcellular locations (acrosome, cell membrane, cell wall, centrosome, chloroplast, cyanelle, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, extracell, golgi apparatus, hydrogenosome, lysosome, melanosome, microsome, mitochondrion, nucleus, peroxisome, spindle pole body, synapse and vacuole).

EUR-Lex [Loza and Fürnkranz 2008]: The EUR-Lex text collection is a collection of 19348 documents about European Union law. It contains many different types of documents, such as treaties, legislation, case-law and legislative proposals, which are indexed according to several orthogonal categorization schemes to allow for multiple search facilities. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law.

Flags [Gonçalves et al. 2013]: This dataset contains details of some countries and their flags, and the goal is to predict some of the features. The dataset was used for the first time for multi-label classification in [Gonçalves et al. 2013], and the original dataset can be found at the UCI repository.

Foodtruck [Rivolli et al. 2017]: The food truck dataset was created from the answers provided by the 407 survey participants. They were either approached at fast food festivals and popular events or anonymously received a request to fill out a questionnaire, in Portuguese, describing their personal information and their preferences when selecting food trucks.

Genbase [Diplaris et al. 2005]: It is a dataset for protein function classification. Each instance is a protein and each label is a protein class. This dataset is small compared with its large number of labels.

Gnegative [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 1392 sequences for Gram-negative bacterial (Gnegative) species. Both the GO (Gene Ontology) features and the PseAAC features (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 8 subcellular locations (cell inner membrane, cell outer membrane, cytoplasm, extracellular, fimbrium, flagellum, nucleoid and periplasm).

Gpositive [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 519 sequences for Gram-positive species. Both the GO (Gene Ontology) features and the PseAAC features (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 4 subcellular locations (cell membrane, cell wall, cytoplasm and extracell).

Human [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 3106 sequences for Human species. Both the GO (Gene Ontology) features and the PseAAC features (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 14 subcellular locations (centriole, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, extracell, golgi apparatus, lysosome, microsome, mitochondrion, nucleus, peroxisome, plasma membrane, and synapse).

Image [Zhang and Zhou 2007]: This dataset is composed of 2,000 images. Concretely, each color image is first converted to the CIE Luv space, a more perceptually uniform color space in which perceived color differences correspond closely to Euclidean distances. After that, the image is divided into 49 blocks using a 7×7 grid, and in each block the first and second moments (mean and variance) of each band are computed, corresponding to a low-resolution image and to computationally inexpensive texture features, respectively. Finally, each image is transformed into a 49×3×2 = 294-dimensional feature vector.
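The following sketch outlines this block-moment feature construction (illustrative only, not the original extraction code; the conversion from RGB to CIE Luv is assumed to have been done beforehand):

// Illustrative extraction of the 294-dimensional feature vector: mean and
// variance of each CIE Luv band over the blocks of a 7x7 grid.
public class BlockMoments {

    // luv: height x width x 3 image already converted to CIE Luv.
    public static double[] extract(double[][][] luv) {
        int h = luv.length, w = luv[0].length, grid = 7, bands = 3;
        double[] features = new double[grid * grid * bands * 2]; // 49*3*2 = 294
        int f = 0;
        for (int by = 0; by < grid; by++) {
            for (int bx = 0; bx < grid; bx++) {
                int y0 = by * h / grid, y1 = (by + 1) * h / grid;
                int x0 = bx * w / grid, x1 = (bx + 1) * w / grid;
                int n = (y1 - y0) * (x1 - x0);
                for (int b = 0; b < bands; b++) {
                    double mean = 0, var = 0;
                    for (int y = y0; y < y1; y++)
                        for (int x = x0; x < x1; x++)
                            mean += luv[y][x][b];
                    mean /= n;
                    for (int y = y0; y < y1; y++)
                        for (int x = x0; x < x1; x++)
                            var += (luv[y][x][b] - mean) * (luv[y][x][b] - mean);
                    var /= n;
                    features[f++] = mean;   // first moment of the band
                    features[f++] = var;    // second moment of the band
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        double[][][] toy = new double[70][70][3];   // dummy Luv image
        System.out.println(extract(toy).length);    // prints 294
    }
}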

IMDB [Read 2010]: It contains 120919 movie plot text summaries from the Internet Movie Database (www.imdb.com), labelled with one or more genres.

LangLog [Read 2010]: It was compiled from the Language Log Forum, which discusses various topics relating to language, and 75 topics represent the label space.

Mediamill [Snoek et al. 2006]: It is a multimedia dataset for generic video indexing, which was extracted from the TRECVID 2005/2006 benchmark. This dataset contains 85 hours of international broadcast news data categorized into 101 labels, and each video instance is represented as a 120-dimensional vector of numeric features.

Medical [Pestian et al. 2007]: The dataset is based on the data made available during the Computational Medicine Center's 2007 Medical Natural Language Processing Challenge. It consists of 978 clinical free-text reports labelled with one or more out of 45 disease codes.

Nus-Wide [Chua et al. 2009]: We provide two versions of the full NUS-WIDE dataset. In the first version, images are represented using 500-D bag of visual words features provided by the creators of the dataset [Chua et al. 2009]. In the second version, images are represented using 128-D cVLAD+ features described in [Spyromitros et al. 2014]. In both cases, the 1st attribute is the image id.

Ohsumed [Joachims 1998]: This collection includes medical abstracts from the MeSH categories of the year 1991. The specific task was to categorize the documents into the 23 cardiovascular disease categories.

Plant [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 978 sequences for Plant species. Both the GO (Gene Ontology) features and the PseAAC features (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 12 subcellular locations (cell membrane, cell wall, chloroplast, cytoplasm, endoplasmic reticulum, extracellular, golgi apparatus, mitochondrion, nucleus, peroxisome, plastid, and vacuole).

Reuters-RCV1 [Lewis et al. 2004]: This dataset is a well-known benchmark for text classification methods. It has 5 subsets, each one with 6000 articles assigned to one or more of 101 topics. The Reuters-K500 dataset was obtained by selecting 500 features using the method proposed in [Tsoumakas et al. 2007].

Scene [Boutell et al. 2004]: It is an image dataset that contains 2407 images, annotated with up to 6 classes: beach, sunset, fall foliage, field, mountain and urban. Each image is described with 294 visual numeric features corresponding to spatial colour moments in the LUV space.

Slashdot [Read 2010]: It consists of article blurbs with subject categories representing the label space, mined from http://slashdot.org.

Stackex [Charte et al. 2015]: It is a collection of six datasets generated from the text collected in a selection of Stack Exchange forums. It includes stackex_chess, stackex_chemistry, stackex_coffee, stackex_cooking, stackex_cs and stackex_philosophy.

TMC2007 [Srivastava et al. 2005]: It is a subset of the Aviation Safety Reporting System dataset. It contains 28596 aviation safety free-text reports that flight crews submit after each flight about events that took place during the flight. The goal is to label the documents with respect to the types of problems they describe. The dataset has 49060 discrete attributes corresponding to terms in the collection. The safety reports are provided with 22 labels, each of them representing a problem type that appears during a flight. The dataset TMC2007-500, obtained by selecting the top 500 features, is also included.

Virus [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 207 sequences for Virus species. Both the GO (Gene Ontology) features and the PseAAC features (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 6 subcellular locations (viral capsid, host cell membrane, host endoplasmic reticulum, host cytoplasm, host nucleus and secreted).

Water quality [Blockeel et al. 1999]: This dataset is used to predict the quality of the water of Slovenian rivers, knowing 16 characteristics such as the temperature, pH, hardness, NO2 or CO2.

Yahoo [Ueda and Saito 2002]: It is a dataset to categorize web pages, organized into 14 top-level categories, each of which is divided into a number of second-level categories. Focusing on the second-level categories, 11 of the 14 independent text categorization problems were used.

Yeast [Elisseeff and Weston 2001]: This dataset contains micro-array expressions and phylogenetic profiles for 2417 yeast genes. Each gene is annotated with a subset of 14 functional categories (e.g. metabolism, energy) from the top level of the functional catalogue.

Yelp [Sajnani et al. 2013]: This dataset has been obtained from users' reviews and ratings of businesses and services on Yelp. It is used to categorize whether the food, service, ambiance, deals and price of a business are good or not. It contains more than 10000 user reviews. This dataset has been downloaded from http://www.ics.uci.edu/~vpsaini/.

Description of the format of the datasets

All the datasets included in the repository are provided in Mulan [Tsoumakas et al. 2011] and Meka [Read et al. 2016] formats, both based on Weka's arff format [Hall et al. 2009].

Mulan dataset format

In Mulan, each dataset consists of two files: an xml file and an arff file.

  • XML file: since the arff file does not indicate which attributes are labels, the xml file lists the attributes that act as labels. It has a simple format, as shown in the following example:
    <?xml version="1.0" encoding="utf-8"?>
    <labels xmlns="http://mulan.sourceforge.net/labels">
      <label name="amazed-suprised"></label>
      <label name="happy-pleased"></label>
      <label name="relaxing-calm"></label>
      <label name="quiet-still"></label>
      <label name="sad-lonely"></label>
      <label name="angry-aggresive"></label>
    </labels>

  • ARFF file: this file contains the full set of attributes and labels (without distinguishing between them) together with the instances. The specific format of this file is as follows:
    • The relation name goes in the first line, following the statement @relation.
    • If the relation name has spaces or special characters, it must be between quotes.
    • Each attribute is defined in a different new line.
      • The attribute name must be between quotes if it has spaces or special characters.
      • Labels are always binary {0, 1}.
      • Attribute types can be one of the following:
        • numeric (integer and real are treated as numeric).
        • <nominal-values>: all the possible values, between braces and separated by commas.
        • string
        • date[<format>]
    • Instances must start with the statement @data, and each instance must be on its own line. Attribute values are separated by commas and must appear in the same order in which the attributes were declared.
      • There is also a shorter (sparse) way to write the instances, in which attributes with zero value are omitted. In this case, each instance is enclosed in braces and consists of comma-separated pairs of attribute index and value, e.g. {2 0.48, 5 1}.
    • Comments are inserted with the character %.

    A Mulan arff file is shown in the following example:

    @relation emotions_test

    @attribute att1 numeric
    @attribute att2 numeric
    @attribute att3 numeric
    ...
    @attribute att72 numeric
    @attribute amazed-suprised {0,1}
    @attribute happy-pleased {0,1}
    @attribute relaxing-calm {0,1}
    @attribute quiet-still {0,1}
    @attribute sad-lonely {0,1}
    @attribute angry-aggresive {0,1}

    %Starting with the data
    @data
    0.094829,0.204498,0.082824, ..., 0.335371,1,0,0,0,1,1
    0.065248,0.117975,0.08597, ..., 0.442898,0,0,0,1,0,0
    0.101287,0.23254,0.078028, ..., 1.183461,1,1,0,0,0,0
    ...
    0.172427,0.378696,0.081777, ..., 1.294949,1,1,1,0,0,0
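Once both files are available, a dataset in Mulan format can be loaded with the Mulan Java library, as in the following sketch (file names are placeholders; method names correspond to recent Mulan versions and should be checked against the version in use):

import mulan.data.MultiLabelInstances;

// Minimal sketch of loading a Mulan-format dataset (arff + xml).
public class LoadMulanDataset {

    public static void main(String[] args) throws Exception {
        MultiLabelInstances dataset =
                new MultiLabelInstances("emotions.arff", "emotions.xml");
        System.out.println("instances:   " + dataset.getNumInstances());
        System.out.println("labels:      " + dataset.getNumLabels());
        System.out.println("cardinality: " + dataset.getCardinality());
    }
}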

Meka dataset format

In Meka format, only an arff file is necessary to define the dataset; the separation between attributes and labels is indicated in the arff file itself.
Meka's arff format is similar to Mulan's, except for the relation name line, which indicates which attributes are labels. Following the relation name and separated by a colon, the option "-C" is given with a positive integer q if the first q attributes are labels, or with a negative integer if the labels are the last q attributes.

@relation "relationName: -C q"

In the following example a dataset in Meka format is shown:

@relation "emotions: -C -6"

@attribute att1 numeric
@attribute att2 numeric
@attribute att3 numeric
...
@attribute att72 numeric
@attribute amazed-suprised {0,1}
@attribute happy-pleased {0,1}
@attribute relaxing-calm {0,1}
@attribute quiet-still {0,1}
@attribute sad-lonely {0,1}
@attribute angry-aggresive {0,1}

@data
0.094829,0.204498,0.082824, ..., 0.335371,1,0,0,0,1,1
0.065248,0.117975,0.08597, ..., 0.442898,0,0,0,1,0,0
0.101287,0.23254,0.078028, ..., 1.183461,1,1,0,0,0,0
...
0.172427,0.378696,0.081777, ..., 1.294949,1,1,1,0,0,0
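As an illustration of this convention, the number and position of the labels can be recovered from the relation line with a small parser like the following (this is not part of Meka, just a sketch):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the "-C" option embedded in a Meka @relation line.
public class MekaRelationParser {

    // Returns the signed label count: a positive value means the first q
    // attributes are labels, a negative value means the last q attributes are.
    public static int parseLabelCount(String relationLine) {
        Matcher m = Pattern.compile("-C\\s+(-?\\d+)").matcher(relationLine);
        if (!m.find()) throw new IllegalArgumentException("no -C option found");
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        int c = parseLabelCount("@relation \"emotions: -C -6\"");
        System.out.println(c < 0
                ? "the last " + (-c) + " attributes are labels"
                : "the first " + c + " attributes are labels");
    }
}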

Software

We have developed MLDA [Moyano et al. 2017], a tool for the exploration and analysis of multi-label and multi-view multi-label datasets. It includes both a GUI tool and a Java API. It provides an easy-to-use tool for multi-label dataset analysis, including a wide set of characterization metrics, charts for measuring the imbalance and the relationships among labels, several methods for data preprocessing and transformation, and characterization of multi-view multi-label datasets, and it allows loading several datasets simultaneously. It is released under the GPLv3 license.

More information about MLDA is available at its GitHub repository. The latest release, version 1.2.4, is available here, and the documentation in PDF is available here.

References

[Barnard et al. 2003]: Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan. Matching Words and Pictures. Journal of Machine Learning Research. 2003. Vol 3, pp 1107-1135.
[Blockeel et al. 1999]: H. Blockeel, S. Džeroski, and J. Grbovic. Simultaneous prediction of multiple chemical parameters of river water quality with tilde. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1704:32–40, 1999.
[Boutell et al. 2004]: Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, September 2004.
[Briggs et al. 2013]: Forrest Briggs, Yonghong Huang, Raviv Raich, Konstantinos Eftaxias, Zhong Lei, William Cukierski, Sarah Frey Hadley, Adam Hadley, Matthew Betts, Xiaoli Z. Fern, Jed Irvine, Lawrence Neal, Anil Thomas, Gábor Fodor, Grigorios Tsoumakas, Hong Wei Ng, Thi Ngoc Tho Nguyen, Heikki Huttunen, Pekka Ruusuvuori, Tapio Manninen, Aleksandr Diment, Tuomas Virtanen, Julien Marzat, Joseph Defretin, Dave Callender, Chris Hurlburt, Ken Larrey, and Maxim Milakov. The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. In IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2013, Southampton, United Kingdom, September 22-25, 2013, pages 1–8, 2013.
[Charte et al. 2015]: Francisco Charte and David Charte. Working with multilabel datasets in R: The mldr package. The R Journal, 7(2):149–162, 2015.
[Chua et al. 2009]: Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. “NUS-WIDE: A Real-World Web Image Database from National University of Singapore”, ACM International Conference on Image and Video Retrieval. Greece. Jul. 8-10, 2009.
[Diplaris et al. 2005]: Sotiris Diplaris, Grigorios Tsoumakas, Pericles Mitkas, and Ioannis Vlahavas. Protein Classification with Multiple Algorithms. In Panhellenic Conference on Informatics, pages 448–456, 2005.
[Duygulu et al. 2002]: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary , 7th European Conference on Computer Vision, pp IV:97-112, 2002.
[Elisseeff and Weston 2001]: Andre Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, volume 14, pages 681–687, 2001.
[Gonçalves et al. 2013]: E.C. Goncalves, Alexandre Plastino, and Alex A. Freitas. A genetic algorithm for optimizing the label ordering in multi-label classifier chains. In IEEE 25th International Conference on Tools with Artificial Intelligence, pages 469–476. IEEE Computer Society Conference Publishing Services (CPS), 2013.
[Greene et al. 2009]: Derek Greene and Pádraig Cunningham. A matrix factorization approach for integrating multiple data views. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I, ECML PKDD ’09, pages 423–438, 2009.
[Hall et al. 2009]: Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann & Ian H Witten. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 11, 10-18.
[Joachims 1998]: Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Nédellec C., Rouveirol C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol 1398.
[Katakis et al. 2008]: Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel Text Classification for Automated Tag Suggestion. In Proceedings of the ECML/PKDD 2008 Discovery Challenge, 2008.
[Lang 2008]: K. Lang. 2008. The 20 newsgroups dataset. http://people.csail.mit.edu/jrennie/20Newsgroups/.
[Lewis et al. 2004]: David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[Loza and Fürnkranz 2008]: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient Pairwise Multilabel Classification for Large-Scale Problems in the Legal Domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50–65, 2008.
[Moyano et al. 2017]: Jose M. Moyano, Eva L. Gibaja, Sebastián Ventura, MLDA: A tool for analyzing multi-label datasets, Knowledge-Based Systems, Volume 121, 1 April 2017, Pages 1-3, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2017.01.018.
[Pestian et al. 2007]: John P. Pestian, Christopher Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wodzislaw Duch. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP ’07), pages 97–104, 2007.
[Read et al. 2008]: Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Multi-label Classification Using Ensembles of Pruned Sets. In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, volume 0, pages 995–1000, Washington, DC, USA, 2008. IEEE Computer Society.
[Read 2010]: Jesse Read. Scalable multi-label classification. PhD Thesis, University of Waikato, 2010.
[Read et al. 2016]: Jesse Read, Peter Reutemann, Bernhard Pfahringer & Geoff Holmes (2016). MEKA: A Multi-label/Multi-target Extension to Weka. Journal of Machine Learning Research, 17, 1-5.
[Rivolli et al. 2017]: Adriano Rivolli, Larissa C. Parker, and Andre C.P.L.F. de Carvalho. Food Truck Recommendation Using Multi-label Classification. In EPIA 2017: Progress in Artificial Intelligence, pages 585–596, 2017.
[Sajnani et al. 2013]: H. Sajnani, V. Saini, K. Kumar, E. Gabrielova, P. Choudary, C. Lopes. 2013. Classifying Yelp reviews into relevant categories. http://www.ics.uci.edu/~vpsaini/.
[Sechidis et al. 2011]: K. Sechidis, G. Tsoumakas, I. Vlahavas. 2011. On the Stratification of Multi-label Data. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD ’11, pages 145–158, 2011.
[Shao et al. 2013]: H. Shao, G.Z. Li, G.P. Liu, and Y.Q. Wang. Symptom selection for multi-label data of inquiry diagnosis in traditional chinese medicine. Science China Information Sciences, 56(5):1–13, 2013.
[Snoek et al. 2006]: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, A.W.M. Smeulders. 2006. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pages 421-430.
[Spyromitros et al. 2014]: E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, “A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval”, IEEE Transactions on Multimedia, 2014.
[Srivastava et al. 2005]: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005).
[Tsoumakas et al. 2007]: Grigorios Tsoumakas and Ioannis Vlahavas. Random k-Labelsets: An Ensemble Method for Multilabel Classification, pages 406–417. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[Tsoumakas et al. 2008]: G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008.
[Tsoumakas et al. 2011]: G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas (2011) “Mulan: A Java Library for Multi-Label Learning”, Journal of Machine Learning Research, 12, pp. 2411-2414.
[Turnbull et al. 2008]: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet. Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
[Ueda and Saito 2002]: N. Ueda, K. Saito: Parametric mixture models for multi-labeled text, In Neural Information Processing Systems 15 (NIPS 15), MIT Press, pp. 737-744, 2002.
[Xu et al. 2016]: Jianhua Xu, Jiali Liu, Jing Yin, and Chengyu Sun. A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously. Knowledge-Based Systems, 98:172–184, 2016.
[Zhang and Zhou 2007]: Min-Ling Zhang and Zhi-Hua Zhou. ML-kNN: a lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
