Challenge data and
Final evaluation data
(Updated Jan 2013)
Translated, scaled, and occluded data
Data collection software + demo kit
(June 2012: new examples)
Color rendering of Kinect depth images
DOWNLOAD THE DATASET CGD2011 of 50,000 gestures from one of our data mirrors.
To cite the data please use:
"ChaLearn Gesture Dataset (CGD2011), ChaLearn, California, 2011"
Copyright (c) ChaLearn - 2011
We are portraying a single user in front of a fixed camera, interacting with a computer by performing gestures to
- play a game,
- remotely control appliances or robots, or
- learn to perform gestures from educational software.
Kinect™ has revolutionized the field of gesture recognition because it is an affordable device (a high-end webcam) providing both RGB and depth images. Depth images make image segmentation considerably easier. We have collected a large dataset of 50,000 gestures with Kinect™. We provide MATLAB™ code to browse through the data and process it to create a sample submission. The data can also be viewed with most video viewers; see the README file for details.
We provide both the RGB image and the depth image as in the example below. View more examples.
Example M_31 from valid04
Example K_31 from valid04
Grayscale rendering of depth images
The data are organized in batches:
final01 (final evaluation data for round 1)
final21 (final evaluation data for round 2, not published yet)
Each batch includes 100 recorded gestures grouped in sequences of 1 to 5 gestures performed by the same user. The gestures are drawn from a small vocabulary of 8 to 15 unique gestures, which we call a "lexicon" (see a few examples of the lexicons we used).
We selected lexicons from nine categories corresponding to various settings or application domains; they include (1) body language gestures (like scratching your head, crossing your arms), (2) gesticulations performed to accompany speech, (3) illustrators (like Italian gestures), (4) emblems (like Indian Mudras), (5) signs (from sign languages for the deaf), (6) signals (like referee signals, diving signals, or marshalling signals to guide machinery or vehicles), (7) actions (like drinking or writing), (8) pantomimes (gestures made to mimic actions), and (9) dance postures.
During the challenge, we do not disclose the identity of the lexicons and of the users. They will be revealed (after user anonymization) at the end of the challenge. Although the gesture classes are different from batch to batch, we represent the class label within each batch by a number between 1 and 15.
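The batch constraints described above (100 gestures per batch, sequences of 1 to 5 gestures, a lexicon of 8 to 15 classes, labels numbered 1 to 15) can be sketched as a small consistency check. This is only an illustration: the `Batch` class and `check_batch` function are hypothetical names, and the in-memory layout is an assumption, not the format of the distributed files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    """Hypothetical in-memory view of one batch (not the official file format)."""
    name: str                   # e.g. "valid04"
    sequences: List[List[int]]  # one list of class labels per recorded sequence


def check_batch(batch: Batch) -> List[str]:
    """Return a list of violations of the constraints stated in the text."""
    problems = []
    total = sum(len(seq) for seq in batch.sequences)
    if total != 100:
        problems.append(f"expected 100 gestures, found {total}")
    for i, seq in enumerate(batch.sequences):
        if not 1 <= len(seq) <= 5:
            problems.append(f"sequence {i} has {len(seq)} gestures (expected 1 to 5)")
        if any(not 1 <= label <= 15 for label in seq):
            problems.append(f"sequence {i} has a label outside 1..15")
    lexicon = {label for seq in batch.sequences for label in seq}
    if not 8 <= len(lexicon) <= 15:
        problems.append(f"lexicon has {len(lexicon)} classes (expected 8 to 15)")
    return problems


# Toy batch (made-up labels): 50 sequences of 2 gestures over a 10-gesture lexicon.
toy = Batch("toy01", [[(2 * i) % 10 + 1, (2 * i + 1) % 10 + 1] for i in range(50)])
print(check_batch(toy))  # an empty list means no constraint is violated
```

A check of this kind may be useful after parsing the label files into whatever structure your own code uses.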
Goal of the challenge: one-shot-learning
For the develXX batches, we provide all the labels. For the validXX and finalXX batches, we provide labels only for one example of each class. The goal is to predict the gesture class labels for the remaining gesture sequences.
During the development period, performance feedback will be provided online on the validXX batches. The final evaluation will be carried out on the finalXX batches, and the final results will be revealed only when the challenge is over.
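To make the one-shot setting concrete, here is a minimal sketch of one possible baseline: each class is represented by the feature vector of its single labeled example, and every unlabeled gesture receives the label of the closest template. This is not the organizers' method or the provided sample code. The feature vectors, the Euclidean distance, and the function names `nearest_template` and `predict_sequence` are all assumptions made for illustration, and the sketch presumes each sequence has already been cut into isolated gestures.

```python
import math
from typing import Dict, List, Sequence


def nearest_template(templates: Dict[int, Sequence[float]],
                     features: Sequence[float]) -> int:
    """Return the class whose single labeled example is closest (Euclidean)."""
    return min(templates, key=lambda label: math.dist(templates[label], features))


def predict_sequence(templates: Dict[int, Sequence[float]],
                     segments: List[Sequence[float]]) -> List[int]:
    """Label each isolated gesture in a sequence independently."""
    return [nearest_template(templates, seg) for seg in segments]


# Made-up 2-D features: one labeled example per class, as in validXX/finalXX.
templates = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 1.0]}
unlabeled = [[0.9, 0.1], [0.1, 0.8]]  # a sequence of two gestures
print(predict_sequence(templates, unlabeled))
```

With only one example per class there is nothing to fit, so any such approach depends entirely on the choice of features and distance.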
What is easy about the data:
- Fixed camera
- Availability of depth data
- Single user within a batch
- Homogeneous recording conditions within a batch
- Small vocabulary within a batch
- Gestures separated by returning to a resting position
- Gestures performed mostly by arms and hands
- Camera framing mostly the upper body (some exceptions)
What is hard about the data:
- Only one labeled example of each unique gesture
- Variations in recording conditions (various backgrounds, clothing, skin colors, lighting, temperature, resolution)
- Some parts of the body may be occluded
- Some users are less skilled than others
- Some users made errors or omissions in performing the gestures
We provide some data annotations, including temporal segmentation into isolated gestures and body part annotations (head, shoulders, elbows, and hands).
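Because gestures are separated by a return to a resting position, a rough temporal segmentation can be sketched by thresholding a per-frame motion score. This is an illustrative assumption, not how the provided annotations were produced: the `motion` values (for example, a distance between each frame and a resting frame), the threshold, and the function name `segment_gestures` are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_gestures(motion: Sequence[float],
                     threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, where motion > threshold."""
    segments = []
    start = None
    for i, value in enumerate(motion):
        if value > threshold and start is None:
            start = i                      # user leaves the resting position
        elif value <= threshold and start is not None:
            segments.append((start, i))    # user is back at rest
            start = None
    if start is not None:                  # sequence ended mid-gesture
        segments.append((start, len(motion)))
    return segments


# Made-up motion scores for a 10-frame clip containing two gestures.
motion = [0.0, 0.1, 0.9, 1.2, 0.8, 0.1, 0.0, 0.7, 1.1, 0.2]
print(segment_gestures(motion, threshold=0.5))
```

Real recordings are noisier than this toy signal, so a fixed threshold would likely need smoothing or a minimum segment length to be usable.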