ChaLearn

Dataset description

The data consists of a set of 672 queries and 240 diversification system outputs and is structured as follows, according to the development - validation - testing procedure:

- devset (development data): contains two data sets, i.e. devset1 (with 346 queries and 39 system outputs) and devset2 (with 60 queries and 56 system outputs);

- validset (validation data): contains 139 queries with 60 system outputs;

- testset (testing data): contains two data sets, i.e. seenIR data (with 63 queries and 56 system outputs; it contains the same diversification system outputs as the devset2 data) and unseenIR data (with 64 queries and 29 system outputs; it contains unseen, novel diversification system outputs).

All the data consists of redistributable Creative Commons licensed information from Flickr and Wikipedia, as well as content descriptors which are provided on an "as is" basis and were computed according to algorithms from the literature.

Download the data

To download the data you can use either an FTP client of your choice (e.g., FileZilla, WinSCP, etc.) configured to connect to:

IP: 153.109.124.90

Encryption: No encryption

Host name: icprchallengero

Port number: 21

Password: !ch4LL3ng3-r0-icpr!

or the direct link with an HTTP client and the following credentials:

URL: http://fast.hevs.ch/icpr-challenge/

User: icpr-challenge

Pass: !ch4LL3ng3-r0-icpr!

Description of the provided data

Overall the provided data consists of the following information (which varies slightly depending on the dataset, as explained below):

images Flickr: the images retrieved from Flickr for each query (img folder);
images Wikipedia: representative images from Wikipedia for each query (imgwiki folder);
metadata: various metadata for each image (xml folder);
content descriptors: various types of content descriptors (text-visual-social) computed on the data (desc<type_of_descriptors> folders);
ground truth: relevance and diversity annotations for the images (gt folder with rGT for relevance and dGT for diversity);
diversification system outputs: outputs of various diversification systems (participant runs from previous years; sys folder);
baseline: a baseline system, i.e. the Flickr initial retrieval results that use the Flickr default relevance algorithm (<data_set_name>_baseline.txt), together with the evaluation script for computing the official metrics (div_eval.jar) and the list of queries in the data set (<data_set_name>_topics.xml).

Development data set (devset)

Devset1 contains single-topic, location related queries. For each location, the following information is provided:
- location name: the name of the location, which represents its unique identifier;
- location name query id: each location name has a unique query id code to be used for preparing the official runs;
- GPS coordinates: latitude and longitude in degrees;
- link to the Wikipedia webpage of the location;
- a representative photo retrieved from Wikipedia in jpeg format;
- a set of photos retrieved from Flickr in jpeg format (up to 150 photos per location; each photo is named according to its unique id from Flickr). Photos are stored in individual folders named after the location name;
- an xml file containing metadata from Flickr for all the retrieved photos;
- visual and textual descriptors;
- ground truth for both relevance and diversity.

Devset2 contains single-topic, location related queries. For each location, the following information is provided:
- location name: the name of the location, which represents its unique identifier;
- location name query id: each location name has a unique query id code to be used for preparing the official runs;
- GPS coordinates: latitude and longitude in degrees;
- link to the Wikipedia webpage of the location;
- up to 5 representative photos retrieved from Wikipedia in jpeg format;
- a set of photos retrieved from Flickr in jpeg format (up to 300 photos per location; each photo is named according to its unique id from Flickr). Photos are stored in individual folders named after the location name;
- an xml file containing metadata from Flickr for all the retrieved photos;
- visual, text and credibility descriptors;
- ground truth for both relevance and diversity.

Validation data set (validset)

Contains both single- and multi-topic queries related to locations and events. For each query, the following information is provided:
- query text formulation: the actual query formulation used on Flickr to retrieve all the data;
- query title: the unique query text identifier, i.e. the query text formulation from which spaces and special characters have been removed (please note that all the query related resources are indexed using this text identifier);
- query id: each query has a unique query id code to be used for preparing the official runs;
- GPS coordinates: latitude and longitude in degrees (only for single-topic location queries);
- link to the Wikipedia webpage of the query (only when available);
- up to 5 representative photos retrieved from Wikipedia in jpeg format (only for single-topic location queries);
- a set of photos retrieved from Flickr in jpeg format (up to 300 photos per query; each photo is named according to its unique id from Flickr). Photos are stored in individual folders named after the query title;
- an xml file containing metadata from Flickr for all the retrieved photos;
- visual, text and credibility descriptors;
- ground truth for both relevance and diversity.

Test data set (testset)

SeenIR data contains single-topic, location related queries. It contains the same diversification system outputs as in the devset2 data. For each location, the provided information is the same as for devset2. No ground truth is provided for this data.

UnseenIR data contains multi-topic event related and general purpose queries. It contains unseen, novel diversification system outputs. For each query, the following information is provided:
- query text formulation: the actual query formulation used on Flickr to retrieve all the data;
- query title: the unique query text identifier, i.e. the query text formulation from which spaces and special characters have been removed (please note that all the query related resources are indexed using this text identifier);
- query id: each query has a unique query id code to be used for preparing the official runs;
- a set of photos retrieved from Flickr in jpeg format (up to 300 photos per query; each photo is named according to its unique id from Flickr). Photos are stored in individual folders named after the query title;
- an xml file containing metadata from Flickr for all the retrieved photos;
- visual, text and credibility descriptors.
No ground truth is provided for this data.

XML metadata

Each query in devset, validset and testset is accompanied by an xml file (UTF-8 encoded) that contains all the retrieved metadata for all the photos. Each file is named after the query name, e.g. “igazu falls”. The information is structured as follows (please note that for devset1 there is no userid data):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

…

</photos>

The monument value is the unique query name identifier mentioned in the data set list. Each photo is then delimited by a <photo /> element. Among the photo information fields, please note in particular:
- description: a detailed textual description of the photo as provided by the author;
- id: the unique identifier of each photo from Flickr, which corresponds to the name of the jpeg file associated with this photo (e.g., id="6646500589");
- license: the Creative Commons license of this picture;
- nbComments: the number of comments posted on Flickr about this photo;
- rank: the position of the photo in the list retrieved from Flickr;
- tags: the tag keywords used for indexing purposes;
- title: a short textual description of the photo provided by the author;
- url_b: the url of the photo on Flickr (please note that by the time you use the dataset some of the photos may no longer be available at this address);
- username: the photo owner's name;
- views: the number of times the photo has been displayed on Flickr.
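As an illustration, the metadata could be read with Python's standard library roughly as follows. This is only a sketch: it assumes the fields are stored as attributes of each <photo /> element under a <photos> root carrying the monument attribute. The sample document is made up (only the id value is taken from the text above), and the helper name load_photos is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files contain more fields per photo.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<photos monument="example_query">
  <photo id="6646500589" rank="1" views="42" nbComments="3"
         title="Example title" tags="tag1 tag2" username="someone" />
  <photo id="1234567890" rank="2" views="7" nbComments="0"
         title="Another title" tags="tag3" username="someone_else" />
</photos>"""


def load_photos(xml_text):
    """Return the query identifier and one dict per <photo /> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    photos = []
    for photo in root.iter("photo"):
        record = dict(photo.attrib)
        # Assumption: these three fields hold integers.
        for field in ("rank", "views", "nbComments"):
            if field in record:
                record[field] = int(record[field])
        photos.append(record)
    # Restore the Flickr ranking order.
    photos.sort(key=lambda r: r.get("rank", 0))
    return root.get("monument"), photos


query, photos = load_photos(SAMPLE)
print(query, [p["id"] for p in photos])
```

For a real file, the same function can be applied to the decoded file contents, e.g. open(path, encoding="utf-8").read().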

General purpose visual descriptors

Apart from the unseenIR data from testset, all data sets include some general purpose visual descriptors, namely:

Global Color Naming Histogram (code CN - 11 values): maps colors to 11 universal color names: "black", "blue", "brown", "grey", "green", "orange", "pink", "purple", "red", "white", and "yellow";
Global Histogram of Oriented Gradients (code HOG - 81 values): represents the HoG feature computed on 3 by 3 image regions;
Global Color Moments on HSV Color Space (code CM - 9 values): represents the first three moments of an image color distribution: mean, standard deviation and skewness;
Global Local Binary Patterns on gray scale (code LBP - 16 values);
Global Color Structure Descriptor (code CSD - 64 values): represents the MPEG-7 Color Structure Descriptor computed on the HMMD color space;
Global Statistics on Gray Level Run Length Matrix (code GLRLM - 44 dimensions): represents 11 statistics computed on gray level run-length matrices for 4 directions: Short Run Emphasis (SRE), Long Run Emphasis (LRE), Gray-Level Non-uniformity (GLN), Run Length Non-uniformity (RLN), Run Percentage (RP), Low Gray-Level Run Emphasis (LGRE), High Gray-Level Run Emphasis (HGRE), Short Run Low Gray-Level Emphasis (SRLGE), Short Run High Gray-Level Emphasis (SRHGE), Long Run Low Gray-Level Emphasis (LRLGE), Long Run High Gray-Level Emphasis (LRHGE);
Spatial pyramid representation (code 3x3): each of the previous descriptors is computed also locally. The image is divided into 3 by 3 non-overlapping blocks and descriptors are computed on each patch. The global descriptor is obtained by the concatenation of all values.
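The 3 by 3 scheme can be sketched in plain Python, using color moments as the per-block descriptor. This is not the code used to compute the provided descriptors. It assumes an image given as rows of (h, s, v) tuples, a row-major block order, and skewness taken as the signed cube root of the third central moment. The function names are hypothetical.

```python
import math


def color_moments(pixels):
    """Mean, standard deviation and skewness for each of the 3 channels."""
    n = len(pixels)
    values = []
    for channel in range(3):
        data = [p[channel] for p in pixels]
        mean = sum(data) / n
        var = sum((x - mean) ** 2 for x in data) / n
        third = sum((x - mean) ** 3 for x in data) / n
        # Assumption: skewness is the signed cube root of the third central moment.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        values.extend([mean, math.sqrt(var), skew])
    return values


def spatial_3x3(image, descriptor):
    """Concatenate a descriptor computed on 3 by 3 non-overlapping blocks."""
    height, width = len(image), len(image[0])
    row_edges = [height * i // 3 for i in range(4)]
    col_edges = [width * j // 3 for j in range(4)]
    result = []
    for i in range(3):
        for j in range(3):
            block = [image[r][c]
                     for r in range(row_edges[i], row_edges[i + 1])
                     for c in range(col_edges[j], col_edges[j + 1])]
            result.extend(descriptor(block))
    return result


# Tiny synthetic 6x6 "HSV image" used only to check the output sizes.
image = [[(r / 6.0, c / 6.0, 0.5) for c in range(6)] for r in range(6)]
global_cm = color_moments([p for row in image for p in row])
local_cm = spatial_3x3(image, color_moments)
print(len(global_cm))  # 9
print(len(local_cm))   # 81
```

The sketch assumes images of at least 3 by 3 pixels, so that no block is empty.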

CNN descriptors

The CNN descriptors are available only for validset and unseenIR data from testset. For each query and photo, we also provide some convolutional neural network-based descriptors, namely:

CNN generic (code cnn_gen - 4,096 values): descriptor based on the reference convolutional neural network (CNN) model provided along with the Caffe framework. This model is learned with the 1,000 ImageNet classes used during the ImageNet challenge. The descriptors are extracted from the last fully connected layer of the network (named fc7);
CNN adapted (code cnn_ad - 4,096 values): descriptor based on a CNN model obtained with an identical architecture to that of the Caffe reference model. This model is learned with 1,000 tourist points of interest classes whose images were automatically collected from the Web. Similar to CNN generic, the descriptors are extracted from the last fully connected layer of the network (named fc7).

Text descriptors

For each data set, with small variations on how the data is organized, the following text descriptors are provided. Three types of textual models were created using tag and title words of the Flickr metadata. A list of English stop words was used to exclude them from the models. Single character words were also removed. We extracted the following models:

probabilistic model (code probabilistic): estimates the probability of association between a word and a given location by dividing the probability of occurrence of the word in the metadata associated to the location by the overall occurrences of that word;
TF-IDF weighting (code tfidf): term frequency inverse document frequency is a numerical statistic which reflects how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others;
Social TF-IDF weighting (code social-tfidf): is an adaptation of TF-IDF to the social space (documents with several identified contributors). It exploits the number of different users that tag with a given word instead of the term count at document level and the total number of users that contribute to a document's description. At the collection level, we exploit the total number of users that have used a document instead of the frequency of the word in the corpus. This measure aims at reducing the effect of bulk tagging (i.e., tagging a large number of photographs with the same words) and to put forward the social relevancy of a term through the use of the user counts.

All three models use the entire collection to derive term background information, such as the total number of occurrences for the probabilistic model, the inverse document frequency for tf-idf or the total number of users for social_tfidf.

Text descriptors are provided on a per-dataset basis. For each set, the text descriptors are computed on a per-image basis (file [dataset]_textTermsPerImage.txt), a per-location basis (file [dataset]_textTermsPerPOI.txt) and a per-user basis (file [dataset]_textTermsPerUser.txt).

File format. In each file, each line represents an entity with its associated terms and their weights. For instance, in the devset2 per image basis descriptor file (devset2_textTermsPerImage.txt) a line will look like:

9067739127 "acropoli" 2 299 0.006688963210702341 "athen" 3 304 0.009868421052631578 "entrance" 1 130 0.007692307692307693 "greece" 1 257 0.0038910505836575876 "view" 1 458 0.002183406113537118 ...

The first token is the id of the entity, in this case the unique Flickr id of the image. Following that is a list of 4-tuples ("term" TF DF TF-IDF), where "term" is a term which appeared anywhere in the description, tags or title of the image from the metadata, TF is the term frequency (the number of occurrences of the term in the entity's text fields), DF is the document frequency (the number of entities which have this term in their text fields) and TF-IDF is simply TF/DF (e.g., 2/299 ≈ 0.00669 for "acropoli" in the line above). The information from the location-based text descriptors is the same as in the image-based case, except that the entity here is the location query. Its textual description is taken to be the set of all texts of all of its images. In this case we additionally provide a set of files which include the location name as well as the location query (file [dataset]_textTermsPerPOI.wFolderNames.txt). Here is an example where "acropolis_athens" is the location name and "acropolis athens" is the location query:

acropolis_athens acropolis athens "0005" 1 3 0.3333333333333333 "0006" 1 3 0.3333333333333333 "0012" 1 2 0.5 ...

The information from the user-based text descriptors is also similar, except that the entity here is the photo user id from Flickr. Its textual description is taken to be the set of all texts of all of the user's images, regardless of the location.
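A line of these files could be parsed along the following lines. This is a minimal sketch, assuming that terms are always enclosed in double quotes and that everything before the first quoted term belongs to the entity. The function name is hypothetical.

```python
import re

# One 4-tuple: "term" TF DF TF-IDF
TUPLE_RE = re.compile(r'"([^"]*)"\s+(\d+)\s+(\d+)\s+(\S+)')


def parse_text_descriptor_line(line):
    """Split a line into the entity prefix and its (term, TF, DF, weight) tuples."""
    first_quote = line.find('"')
    entity = line if first_quote < 0 else line[:first_quote]
    terms = [(term, int(tf), int(df), float(weight))
             for term, tf, df, weight in TUPLE_RE.findall(line)]
    return entity.strip(), terms


line = ('9067739127 "acropoli" 2 299 0.006688963210702341 '
        '"athen" 3 304 0.009868421052631578')
entity, terms = parse_text_descriptor_line(line)
print(entity, terms[0])
```

For the wFolderNames files the entity prefix holds both the location name and the location query. In the example above the location name is the first whitespace-separated token of that prefix.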

User annotation credibility descriptors

The user annotation credibility descriptors are available for devset2, validset and testset (with some variations on the individual descriptors computed). We provide user tagging credibility descriptors that give an automatic estimation of the quality of tag-image content relationships. The aim of credibility descriptors is to give participants an indication about which users are most likely to share relevant images in Flickr (according to the underlying task scenario). These descriptors are extracted by visual or textual content mining:

visualScore: descriptor obtained through visual mining using over 17,000 ImageNet visual models obtained by learning a binary SVM per ImageNet concept. Visual models are built on top of OverFeat, a powerful convolutional neural network feature. At most 1,000 images are downloaded for each user in order to compute visualScores. For each Flickr tag which is identical to an ImageNet concept, a classification score is predicted and the visualScore of a user is obtained by averaging individual tag scores. The intuition here is that the better the predictions given by the classifiers are, the more relevant a user's images should be. Scores are normalized between 0 and 1, with higher scores corresponding to more credible users;
faceProportion: descriptor obtained using the same set of images as for visualScore. The default face detector from OpenCV is used here to detect faces. faceProportion is computed as the percentage of images with faces out of the total number of images tested for each user. The intuition here is that the lower faceProportion is, the better the average relevance of a user's photos is. faceProportion is normalized between 0 and 1, with 0 standing for no face images;
tagSpecificity: descriptor obtained by computing the average specificity of a user's tags. Tag specificity is calculated as the percentage of users having annotated with that tag in a large Flickr corpus (~100 million image metadata from 120,000 users);
locationSimilarity: descriptor obtained by computing the average similarity between a user's geotagged photos and a probabilistic model of the geotagged images in a surrounding cell of approximately 1 km². These models were created for the MediaEval 2013 Placing Task and reused as such here. The intuition here is that the higher the coherence between a user's tags and those provided by the community is, the more relevant the user's images are likely to be. locationSimilarity is not normalized and small values stand for the lowest similarity;
photoCount: descriptor which accounts for the total number of images a user shared on Flickr. This descriptor has a maximum value of 10,000;
uniqueTags: proportion of unique tags present in a user's vocabulary divided by the total number of tags of the user. uniqueTags ranges between 0 and 1;
uploadFrequency: average time between two consecutive uploads in Flickr. This descriptor is not normalized;
bulkProportion: the proportion of bulk taggings in a user's stream (i.e., of tag sets which appear identical for at least two distinct photos). The descriptor is normalized between 0 and 1;
meanPhotoViews: the mean value of the number of times a user's image has been seen by other members of the community. This descriptor is not normalized;
meanTitleWordCounts: the mean value of the number of words found in the titles associated with users' photos. This descriptor is not normalized;
meanTagsPerPhoto: the mean value of the number of tags users put for their images. This descriptor is not normalized;
meanTagRank: the mean rank of a user's tags in a list in which the tags are sorted in descending order according to the number of appearances in a large subsample of Flickr images. We eliminate bulk tagging and obtain a set of 20,737,794 unique tag lists out of which we extract the tag frequency statistics. To extract this descriptor we take into consideration only the tags that appear in the top 100,000 most frequent tags. This descriptor is not normalized;
meanImageTagClarity: this descriptor is based on an adaptation of the Image Tag Clarity score. The clarity score for a tag is the KL-divergence between the tag language model and the collection language model. We use the same collection of 20,737,794 unique tag lists to extract the language models. The collection language model is estimated by the relative tag frequency in the entire collection. For the individual tag language model, in contrast, we use a tf/idf language model, more along the lines of a classical language model. For a target tag, we consider a "document" the subset of tag lists that contain the target tag. For a tag, its clarity score is an indicator of the diversity of contexts in which the tag is used. A low clarity score suggests a tag is generally used together with the same tags. meanImageTagClarity is the mean value of the clarity score of a user's tags. To extract this descriptor, we take into consideration only the tags that appear in the top 100,000 most frequent tags. This descriptor is not normalized.
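To make two of the simpler definitions concrete, uniqueTags and bulkProportion could be computed from a user's per-photo tag lists roughly as follows. This sketch reflects one possible reading of the descriptions above and is not the code used to produce the provided values. In particular, it assumes that bulkProportion counts the share of photos whose tag set is identical to that of at least one other photo. The function names are hypothetical.

```python
from collections import Counter


def unique_tags(photo_tags):
    """Number of distinct tags divided by the total number of tags used."""
    all_tags = [tag for tags in photo_tags for tag in tags]
    return len(set(all_tags)) / len(all_tags) if all_tags else 0.0


def bulk_proportion(photo_tags):
    """Share of photos whose tag set is shared with at least one other photo."""
    tag_sets = [frozenset(tags) for tags in photo_tags]
    counts = Counter(tag_sets)
    bulk = sum(1 for tag_set in tag_sets if counts[tag_set] >= 2)
    return bulk / len(tag_sets) if tag_sets else 0.0


# Made-up tag lists for four photos of one user.
photo_tags = [["paris", "eiffel"], ["eiffel", "paris"], ["louvre"], ["paris", "night"]]
print(unique_tags(photo_tags))      # 4 distinct tags out of 7
print(bulk_proportion(photo_tags))  # 2 of the 4 photos share a tag set
```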

Ground truth

The ground truth data consists of relevance ground truth (code rGT) and diversity ground truth (code dGT). Ground truth was generated by a small group of expert annotators with advanced knowledge of the query characteristics. For more information on ground truth statistics, see the recommended bibliography on the source datasets (Div400, Div150Cred, Div150Multi and Div150Adhoc).

Relevance ground truth was annotated using a dedicated tool that provided the annotators with one photo at a time. A reference photo of the query could also be displayed during the process. Annotators were asked to classify the photos as relevant (score 1), non-relevant (score 0) or "don’t know" (score -1). The definition of relevance was available to the annotators in the interface during the entire process. The annotation process was not time restricted. Annotators were advised to consult any additional information about the characteristics of the location (e.g., from the Internet) in case they were unsure about the annotation. Ground truth was collected from several annotators and the final ground truth was determined by majority voting.

File format. Ground truth is provided to participants on a per query basis. We provide individual txt files for each query. Files are named according to the query name followed by the ground truth code, e.g., Abbey of Saint Gall rGT.txt refers to the relevance ground truth (rGT) for the location Abbey of Saint Gall. Each file contains photo ground truth on individual lines. The first value of each line is the unique photo id followed by the ground truth value separated by comma. Lines are separated by an end-of-line character (carriage return). An example is presented below:

3338743092,1

3338745530,0

3661394189,1

...
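A relevance file in this format could be loaded with a few lines of Python. The sketch below uses the sample values shown above; the function name is hypothetical, and lines without a comma are simply skipped.

```python
def parse_relevance_gt(text):
    """Map photo id -> relevance score (1, 0 or -1)."""
    relevance = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "," not in line:
            continue
        photo_id, score = line.split(",", 1)
        relevance[photo_id.strip()] = int(score)
    return relevance


sample = "3338743092,1\n3338745530,0\n3661394189,1\n"
relevance = parse_relevance_gt(sample)
relevant_ids = [pid for pid, score in relevance.items() if score == 1]
print(relevant_ids)
```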

Diversity ground truth was also annotated with a dedicated tool. The diversity is annotated only for the photos that were judged as relevant in the previous step. For each query, annotators were provided with a thumbnail list of all the relevant photos. The first step required annotators to get familiar with the photos by analysing them for about 5 minutes. Next, annotators were required to re-group the photos into similar visual appearance clusters. Full size versions of the photos were available by clicking on the photos. The definition of diversity was available to the annotators in the interface during the entire process. For each of the clusters, annotators provided some keyword tags reflecting their judgments in choosing these particular clusters. Similar to the relevance annotation, the diversity annotation process was not time restricted. In this particular case, ground truth was collected from several annotators that annotated distinct parts of the data set.

File format. Ground truth is provided to participants on a per query basis. We provide two individual txt files for each query: one file for the cluster ground truth and one file for the photo diversity ground truth. Files are named according to the query name followed by the ground truth code, e.g., Abbey of Saint Gall dclusterGT.txt and Abbey of Saint Gall dGT.txt refer to the cluster ground truth (dclusterGT) and photo diversity ground truth (dGT) for the location Abbey of Saint Gall. In the dclusterGT file each line corresponds to a cluster where the first value is the cluster id number followed by the cluster user tag separated by comma. Lines are separated by an end-of-line character (carriage return). An example is presented below:

1,outside statue

2,inside views

3,partial frontal view

4,archway

...

In the dGT file the first value on each line is the unique photo id followed by the cluster id number (that corresponds to the values in the dclusterGT file) separated by comma. Each line corresponds to the ground truth of one image and lines are separated by an end-of-line character (carriage return). An example is presented below:

3664415421,1

3665220244,1

...

3338745530,2

3661396665,2

...

3662193652,3

3338743092,3

3665213158,3

...
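The two diversity files can be combined to list the photos of each cluster. The following is a minimal sketch using the sample values above; the function names are hypothetical, and the cluster tag is taken to be everything after the first comma.

```python
from collections import defaultdict


def parse_cluster_names(text):
    """dclusterGT: map cluster id -> cluster tag."""
    names = {}
    for line in text.splitlines():
        if "," in line:
            cluster_id, tag = line.strip().split(",", 1)
            names[int(cluster_id)] = tag.strip()
    return names


def parse_diversity_gt(text):
    """dGT: map cluster id -> list of photo ids."""
    clusters = defaultdict(list)
    for line in text.splitlines():
        if "," in line:
            photo_id, cluster_id = line.strip().split(",", 1)
            clusters[int(cluster_id)].append(photo_id.strip())
    return dict(clusters)


cluster_sample = "1,outside statue\n2,inside views\n3,partial frontal view\n4,archway\n"
photo_sample = ("3664415421,1\n3665220244,1\n3338745530,2\n3661396665,2\n"
                "3662193652,3\n3338743092,3\n3665213158,3\n")
names = parse_cluster_names(cluster_sample)
clusters = parse_diversity_gt(photo_sample)
for cluster_id, photo_ids in clusters.items():
    print(cluster_id, names[cluster_id], len(photo_ids))
```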
