The data contain more than 235 action instances drawn from a vocabulary of 11 human actions/interactions. This second track focuses on automatically learning a set of 11 actions performed by different users, with the aim of user-independent continuous gesture spotting.
For each sequence, we provide an RGB video and a CSV file with the start and end frames of each action instance.
The data are organized as a set of sequences, each uniquely identified by a string SeqXX, where XX is a two-digit integer. Each sequence is provided as a single ZIP file named after its identifier (e.g., SeqXX.zip).
Each sample ZIP file contains the following files:
- SeqXX_color.mp4: Video with the RGB data.
- SeqXX_data.csv: CSV file with the number of frames of the video.
- SeqXX_labels: CSV file with the ground truth for the sample (labelled data sets only). Each line corresponds to one action instance and gives the actionID, the start frame, and the end frame of that instance. The action identifiers are those listed in the gesture table at the beginning of this page.
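A minimal sketch of turning a labels file into the per-frame binary vectors that the evaluation works on. It assumes each row is exactly `actionID,start,end` with no header, that frames are 1-based, and that the end frame is inclusive; none of this is stated above, so check it against the released files. The function name `load_labels` is hypothetical, and `num_frames` is meant to come from SeqXX_data.csv. Several instances of the same action are merged into one vector, since the metric is computed per category.

```python
import csv
from typing import Iterable


def load_labels(lines: Iterable[str], num_frames: int) -> dict[int, list[int]]:
    """Build one binary per-frame vector per actionID from labels-file lines.

    Hypothetical helper; assumes rows of the form "actionID,start,end".
    """
    vectors: dict[int, list[int]] = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        action_id, start, end = (int(v) for v in row[:3])
        # All instances of the same action share one vector.
        vec = vectors.setdefault(action_id, [0] * num_frames)
        # Assumption: frames are 1-based and the end frame is inclusive;
        # indices are clipped to the video length.
        for frame in range(max(start, 1), min(end, num_frames) + 1):
            vec[frame - 1] = 1
    return vectors
```

Because `csv.reader` accepts any iterable of strings, a file object opened with `newline=""` can be passed directly in place of `lines`.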
The metrics for Chalearn LAP 2014 Track 2 (Action Recognition on RGB) and Track 3 (Multimodal Gesture Recognition) follow those of Track 1: recognition performance is evaluated with the Jaccard Index, for both action/interaction spotting and gesture spotting. For each of the $n \le 11$ action categories labeled in each RGB sequence $s$ (Track 2), or each of the $n \le 20$ gesture categories labeled in each RGBD sequence $s$ (Track 3), the Jaccard Index is defined as follows:

$$J_{s,n} = \frac{|A_{s,n} \cap B_{s,n}|}{|A_{s,n} \cup B_{s,n}|}$$

where $A_{s,n}$ is the ground truth of action/gesture $n$ in sequence $s$, and $B_{s,n}$ is the prediction for that action/gesture in sequence $s$. For both Tracks 2 and 3, $A_{s,n}$ and $B_{s,n}$ are binary vectors whose 1-valued entries denote the frames in which the $n$-th action/gesture is being performed, so the intersection and union are counted frame by frame.
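For two binary frame vectors, the Jaccard Index described above reduces to the number of frames active in both vectors divided by the number of frames active in either. A minimal plain-Python sketch follows; the function name `jaccard` and the choice to return 0.0 when neither vector marks any frame are assumptions, since the text does not define that case.

```python
def jaccard(a: list[int], b: list[int]) -> float:
    """Jaccard Index |A ∩ B| / |A ∪ B| of two equal-length binary frame vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same number of frames")
    intersection = sum(1 for x, y in zip(a, b) if x and y)  # frames active in both
    union = sum(1 for x, y in zip(a, b) if x or y)  # frames active in either
    # Assumption: an empty union (no active frames at all) scores 0.0.
    return intersection / union if union else 0.0
```

For example, a ground truth active on frames 2 to 4 and a prediction active on frames 3 to 5 share 2 frames out of 4 covered frames, giving 0.5.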
In the case of false positives (e.g., predicting an action that does not appear in the ground truth), the Jaccard Index is automatically 0 for that action prediction, and that action class still counts in the mean Jaccard Index computation. In other words, the set of categories averaged over is the union of the action categories appearing in the ground truth and in the predictions.
Participants will be evaluated on the mean Jaccard Index over all action/gesture categories for all sequences. The categories are independent but not mutually exclusive: more than one action/gesture class can be active in a given frame. When computing the mean Jaccard Index, all action/gesture categories are weighted equally. The participant with the highest mean Jaccard Index will be the winner. An example of the calculation for a single sequence and two action/gesture categories is shown in Figure 2.
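A sketch of the averaging rules described above, representing each category as a set of active frame indices. Categories are collected from both the ground truth and the predictions, so a false-positive class scores 0, and a missed class (in the ground truth but never predicted) scores 0 as well. How scores are combined across sequences is not fully specified here; `overall_mean_jaccard` assumes each sequence's mean is weighted equally. Both function names are hypothetical.

```python
from typing import Iterable

Frames = dict[int, set[int]]  # actionID -> set of frame indices where it is active


def sequence_mean_jaccard(gt: Frames, pred: Frames) -> float:
    """Mean Jaccard Index over every category seen in gt or pred."""
    categories = set(gt) | set(pred)
    if not categories:
        return 0.0  # assumption: case not defined in the text
    total = 0.0
    for c in categories:
        a = gt.get(c, set())  # empty if the class is a false positive
        b = pred.get(c, set())  # empty if the class was missed
        union = a | b
        total += (len(a & b) / len(union)) if union else 0.0
    return total / len(categories)  # equal weight for every category


def overall_mean_jaccard(pairs: Iterable[tuple[Frames, Frames]]) -> float:
    """Assumption: average of per-sequence means, one weight per sequence."""
    scores = [sequence_mean_jaccard(gt, pred) for gt, pred in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

With ground truth `{1: {1, 2, 3, 4}, 2: {5, 6}}` and prediction `{1: {2, 3, 4, 5}, 3: {7}}`, class 1 scores 3/5, while the missed class 2 and the false-positive class 3 both score 0, so the sequence mean is 0.2.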
Figure 2: Example of the mean Jaccard Index calculation for several instances of action/gesture categories in a sequence (single red lines denote ground-truth annotations and double red lines denote predictions). The top part of the image shows the ground-truth annotations for the actions/gestures walk and fight in a sequence s. In the center part, a prediction for walk is evaluated, giving a Jaccard Index of 0.72. In the bottom part, the same procedure is applied to the action fight, giving a Jaccard Index of 0.46. Finally, the mean Jaccard Index is computed, giving a value of 0.59.
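The last step in Figure 2 is an unweighted mean of the per-category scores. Writing $N_s$ for the set of categories evaluated in sequence $s$ (here walk and fight), the arithmetic is:

$$\bar{J}_s = \frac{1}{|N_s|}\sum_{n \in N_s} J_{s,n} = \frac{0.72 + 0.46}{2} = 0.59$$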