Dataset description

More than 14,000 gestures are drawn from a vocabulary of 20 Italian sign gesture categories. The emphasis of this third track is on multi-modal automatic learning of a set of 20 gestures performed by several different users, with the aim of performing user independent continuous gesture spotting.

For each sample, RGB, depth, user segmentation and skeleton information are provided:

[Figure: ChaLearn LAP 2014 Track 3 data modalities]

Data description

The data is organized as a set of samples, each one uniquely identified by a string SampleXXXX, where XXXX is a 4-digit number. Each sample is provided as a single ZIP file named with its identifier (e.g. SampleXXXX.zip).

Each sample ZIP file contains the following files:

  • SampleXXXX_color.mp4: Video with the RGB data.
  • SampleXXXX_depth.mp4: Video with the Depth data.
  • SampleXXXX_user.mp4: Video with the user segmentation mask.
  • SampleXXXX_data.csv: CSV file with general information about the video (number of frames, fps, and the maximum depth value).
  • SampleXXXX_skeleton.csv: CSV file with the skeleton information for each frame of the videos. Each line corresponds to one frame. Skeletons are encoded as a sequence of joints, providing 9 values per joint [Wx, Wy, Wz, Rx, Ry, Rz, Rw, Px, Py] (W are world coordinates, R rotation values, and P pixel coordinates). The order of the joints in the sequence is: 1.HipCenter, 2.Spine, 3.ShoulderCenter, 4.Head, 5.ShoulderLeft, 6.ElbowLeft, 7.WristLeft, 8.HandLeft, 9.ShoulderRight, 10.ElbowRight, 11.WristRight, 12.HandRight, 13.HipLeft, 14.KneeLeft, 15.AnkleLeft, 16.FootLeft, 17.HipRight, 18.KneeRight, 19.AnkleRight, and 20.FootRight.
  • SampleXXXX_labels.csv: CSV file with the ground truth for the sample (only for labelled data sets). Each line corresponds to a gesture. The information provided is the gesture ID, the initial frame, and the last frame. The gesture identifiers are the ones provided in the gesture table at the beginning of this page.
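As a minimal sketch of how the skeleton CSV could be read, the snippet below parses one frame (one line) into per-joint values, following the joint order and 9-value layout described above. The comma delimiter and the helper name are assumptions, not part of the official toolkit:

```python
# Joint order and per-joint field layout as documented for SampleXXXX_skeleton.csv.
JOINTS = [
    "HipCenter", "Spine", "ShoulderCenter", "Head", "ShoulderLeft",
    "ElbowLeft", "WristLeft", "HandLeft", "ShoulderRight", "ElbowRight",
    "WristRight", "HandRight", "HipLeft", "KneeLeft", "AnkleLeft",
    "FootLeft", "HipRight", "KneeRight", "AnkleRight", "FootRight",
]
FIELDS = ["Wx", "Wy", "Wz", "Rx", "Ry", "Rz", "Rw", "Px", "Py"]


def parse_skeleton_line(line):
    """Parse one CSV line (one frame) into {joint_name: {field: float}}.

    Assumes comma-separated values: 20 joints x 9 values = 180 floats per line.
    """
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != len(JOINTS) * len(FIELDS):
        raise ValueError("expected 180 values per frame, got %d" % len(values))
    frame = {}
    for i, joint in enumerate(JOINTS):
        chunk = values[i * 9:(i + 1) * 9]  # the 9 values for this joint
        frame[joint] = dict(zip(FIELDS, chunk))
    return frame
```

For example, `parse_skeleton_line(line)["HandRight"]["Px"]` would give the pixel x-coordinate of the right hand for that frame.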


The metrics for the ChaLearn LAP 2014 Track 2: Action Recognition on RGB challenge and Track 3: Multimodal Gesture Recognition will follow the trend in Track 1, evaluating the recognition performance using the Jaccard Index (for action/interaction and gesture spotting evaluation). In this sense, for each one of the n ≤ 11 action categories labelled for each RGB sequence s (Track 2) or the n ≤ 20 gesture categories labelled for each RGBD sequence s (Track 3), the Jaccard Index is defined as follows:

Js,n = |As,n ∩ Bs,n| / |As,n ∪ Bs,n|

where As,n is the ground truth of action/gesture n at sequence s, and Bs,n is the prediction for that action/gesture at sequence s. For both Tracks 2 and 3, As,n and Bs,n are binary vectors where 1-valued entries denote frames in which the n-th action/gesture is being performed.

In the case of false positives (e.g., predicting an action that is not in the ground truth), the Jaccard Index will automatically be 0 for that action prediction, and that action class will count in the mean Jaccard Index computation. In other words, n ranges over the union of action categories appearing in the ground truth and in the predictions.

Participants will be evaluated on the mean Jaccard Index over all action/gesture categories for all sequences, where all action/gesture categories are independent but not mutually exclusive (in a given frame, more than one action/gesture class can be active). In addition, when computing the mean Jaccard Index, all gesture categories have the same importance. Finally, the participant with the highest mean Jaccard Index will be the winner. An example of the calculation for a single sequence and two action/gesture categories is shown in Figure 2.
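The evaluation described above can be sketched as follows. This is a minimal illustration, not the official evaluation script: categories are keyed by arbitrary labels, and per-category binary frame vectors are assumed. A category predicted but absent from the ground truth (a false positive) scores 0 and still counts in the mean, as specified:

```python
def jaccard_index(a, b):
    """Jaccard Index J = |A intersect B| / |A union B| for binary frame vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def mean_jaccard(gt, pred, n_frames):
    """Mean Jaccard over the union of categories in ground truth and predictions.

    gt and pred map category -> binary per-frame vector. A category missing from
    one side is treated as an all-zero vector, so false positives score 0 but
    still count toward the mean (all categories weighted equally).
    """
    categories = set(gt) | set(pred)
    zeros = [0] * n_frames
    scores = [jaccard_index(gt.get(c, zeros), pred.get(c, zeros))
              for c in categories]
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, with a ground truth containing only "walk" and a prediction that half-covers it while also spuriously predicting "fight", the false positive drags the mean down even though the "walk" overlap is reasonable.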

Figure 2: Example of mean Jaccard Index calculation for different instances of gesture/action categories in a sequence (single red lines denote ground truth annotations and double red lines denote predictions). In the top part of the image we see the ground truth annotations for the actions/gestures walk and fight in a sequence s. In the center part of the image, a prediction for walk is evaluated, obtaining a Jaccard Index of 0.72. In the bottom part of the image, the same procedure is performed for the action fight, obtaining a Jaccard Index of 0.46. Finally, the mean Jaccard Index is computed, obtaining a value of 0.59.
