Challenge description

ChaLearn LAP Challenge @FG2020 "Identity-preserving Human Detection (IPHD)"


Human detection in images and video is a challenging computer vision problem with applications in human-computer interaction [1,2,3,4], patient monitoring [5,6,7], surveillance [8,9,10], and autonomous driving [11,12,13], to mention just a few. In some applications, however, preserving people's privacy is a major concern for both users and the companies/institutions involved. Most notably, unintended identity revelation of subjects is perhaps the greatest risk. While video data from RGB cameras are massively available for training powerful detection models, the nature of these data also allows unauthorized third parties to try to identify the observed subjects. We argue that moving away from visual sensors that capture identity information in the first place is the safest bet. However, the scarcity of such privacy-safe data hampers the training of large deep-learning models, which in turn limits the popularity of these sensors.

In this competition, we offer a freshly-recorded multimodal image dataset consisting of over 100K spatiotemporally aligned depth-thermal frames of different people recorded in public and private spaces: street, university (cloister, hallways, and rooms), a research center, libraries, and private houses. In particular, we used a RealSense D435 for depth and a FLIR Lepton v3 for thermal. Given the noisy nature of such commercial depth cameras and the low thermal image resolution, the subjects are hardly identifiable. The dataset contains a mix of close-range in-the-wild pedestrian scenes and indoor ones with people performing scripted scenarios, thus covering a larger space of poses, clothing, illumination, background clutter, and occlusions. The scripted scenarios include basic actions such as sitting on the sofa, lying on the floor, interacting with kitchen appliances, cooking, eating, working on the computer, and talking on the phone. The camera position is not necessarily static; the camera is sometimes held by a person. The data were originally collected as videos of different durations (from seconds to hours), skipping frames where no movement was observed. The ordering of the frames has been removed to make it an image dataset (the only temporal information provided is the video ID).


Contest description

For the competition, we ask the participants to perform human detection in the visual modalities provided. There are three (3) tracks associated with this contest:

  1. Depth track. Given the provided depth frames (and bounding box groundtruth annotations), the participants will be asked to develop their depth-based human detection method. Depth cameras are cost-effective devices that provide geometric information about the scene at a resolution and frame acquisition speed comparable to RGB cameras. The downside is their noisiness at large distances. The participants' method will need to output, for each frame, a list of bounding boxes (along with their associated confidence scores), one per person present. The performance on depth image-based human detection will be evaluated.

Figure. IPHD dataset sample. First & third columns are depth frames. Second & fourth are their corresponding (aligned-to-depth) thermal frames.

  2. Thermal track. Given the provided thermal frames (and bounding box groundtruth annotations), the participants will be asked to develop their thermal-based human detection method. Thermal cameras provide temperature readings from the scene. They are less noisy than depth cameras, but at a comparable price they offer a much lower image resolution. The participants' method will need to output, for each frame, a list of bounding boxes (along with their associated confidence scores), one per person present. The performance on thermal image-based human detection will be evaluated.

  3. Depth-Thermal Fusion track. Given the provided aligned depth-thermal frames (and bounding box groundtruth annotations), the participants will be asked to develop their multimodal (depth and thermal) human detection method. The two modalities have been temporally and spatially aligned, so the participants can try to exploit their potential complementarity with a proper fusion strategy. The participants' method will need to output, for each frame, a list of bounding boxes (along with their associated confidence scores), one per person present. The performance on multimodal (depth-thermal) human detection will be evaluated.
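As an illustration of what a simple fusion strategy for the third track might look like, the sketch below stacks the aligned depth and thermal frames into a single 2-channel input (early fusion). The function name, frame resolutions, and raw value ranges are our own assumptions for illustration; they are not part of the challenge kit.

```python
import numpy as np

def fuse_early(depth, thermal):
    """Min-max normalize each modality to [0, 1] and stack them into a
    2-channel array usable as input to a detector (early fusion)."""
    d = depth.astype(np.float32)
    t = thermal.astype(np.float32)
    d = (d - d.min()) / max(float(np.ptp(d)), 1e-6)
    t = (t - t.min()) / max(float(np.ptp(t)), 1e-6)
    return np.stack([d, t], axis=-1)  # shape (H, W, 2)

# Hypothetical resolutions and raw value ranges, for illustration only.
rng = np.random.default_rng(0)
depth = rng.integers(0, 2**16, size=(480, 640)).astype(np.uint16)    # e.g. depth in mm
thermal = rng.integers(0, 2**14, size=(480, 640)).astype(np.uint16)  # raw thermal counts
x = fuse_early(depth, thermal)  # (480, 640, 2), values in [0, 1]
```

Many other fusion strategies are possible (late fusion of per-modality detectors, mid-level feature fusion, etc.); this channel-stacking variant is only the simplest starting point.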

The evaluation metric will be average precision at IoU = 0.50, i.e. AP@0.50, computed from precision-recall curves. For more details on the metric, refer to the IPHD dataset. Scripts for computing this metric will be provided to the participants in a “starting kit”.
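To make the metric concrete, here is a minimal sketch of how AP@0.50 can be computed from per-frame detections. The function names and data layout are our own assumptions; the official starting-kit scripts are authoritative.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def ap_at_050(detections, groundtruth):
    """AP@0.50 over a set of frames.

    detections: list of (frame_id, confidence, box);
    groundtruth: dict mapping frame_id -> list of boxes.
    """
    detections = sorted(detections, key=lambda d: -d[1])  # by descending confidence
    matched = {f: [False] * len(boxes) for f, boxes in groundtruth.items()}
    n_gt = sum(len(b) for b in groundtruth.values())
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for i, (frame, _conf, box) in enumerate(detections):
        gt_boxes = groundtruth.get(frame, [])
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            v = iou(box, gt)
            if v > best_iou:
                best_iou, best_j = v, j
        # a detection is a true positive if it overlaps an unmatched
        # groundtruth box with IoU >= 0.50; otherwise a false positive
        if best_iou >= 0.5 and not matched[frame][best_j]:
            tp[i], matched[frame][best_j] = 1, True
        else:
            fp[i] = 1
    recall = np.cumsum(tp) / max(n_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    # all-point interpolation: area under the precision-recall curve,
    # with precision made monotonically non-increasing from the right
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

For example, a single correct detection on a frame with one groundtruth box yields AP = 1.0, while a second spurious box with lower confidence does not lower the score, since precision is interpolated from the highest-confidence predictions.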


How to enter the competition and how submissions are evaluated

The competition will be run on the CodaLab platform. Participants will register through the platform, where they will be able to access the data and evaluation scripts, submit their predictions on the validation data to enter the leaderboard, etc.

These are the links to the CodaLab tracks:

  1. ChaLearn LAP IPHD Track 1: Depth
  2. ChaLearn LAP IPHD Track 2: Thermal
  3. ChaLearn LAP IPHD Track 3: Depth-Thermal Fusion

And the dataset:


Data will be made available to participants in different stages, as follows:

  1. Development (training) and Validation data (without groundtruth) will be made available at the beginning of the competition. Participants will be able to submit predictions to the CodaLab platform and receive immediate feedback on their performance in the validation data set (there will be a validation leaderboard in the platform).

  2. Validation groundtruth will be released.

  3. Final evaluation (test) data will be made available to participants one week before the end of the challenge. Participants will have to submit their predictions in these data to be considered for the final evaluation (no groundtruth will be released at this point).


Verification of results

Once the competition has ended, the organization team will need to verify the results. For this, we will ask the participants to submit, for each track they participated in, a link to a compressed file containing the following:

  • Code
    • Separate learning & evaluation program/scripts and dependencies.
    • Detailed instructions* on how to run learning & evaluation.
  • Material (required to run the code)
    • Trained models
      • Final evaluation model that is input to the evaluation program/script.
      • Pre-trained models used in the learning program/script.
    • Data (only if it has been modified) or, preferably, a modification program/script that can process the original data.
    • Groundtruth (only if it has been modified).
  • Fact sheet

(*) Apart from the instructions, we strongly encourage the participants to share their code in a Docker image including the required libraries and material. The original data and groundtruth can be mapped as a volume within the Docker container. Hopefully, this will reduce the amount of back-and-forth communication/disruption.

We also kindly ask the participants to be available during the verification period (from Feb 8th to Feb 11th) for any assistance that might be required, via the communication e-mail specified in the Fact sheet.


Dissemination of results

The participants who obtain the best results in the challenge will be invited to submit a paper to the associated FG 2020 workshop via CMT. Submitted papers will need to follow the IEEE FG 2020 guidelines.

Accepted workshop papers will appear at IEEE FG 2020 proceedings.



[1] Sun, Li, et al. "3DOF pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.

[2] Che, Yuhang, et al. "Facilitating Human-Mobile Robot Communication via Haptic Feedback and Gesture Teleoperation." ACM Transactions on Human-Robot Interaction (THRI) 7.3 (2018): 20.

[3] Kim, Kwan Suk, Travis Llado, and Luis Sentis. "Full-body collision detection and reaction with omnidirectional mobile platforms: a step towards safe human–robot interaction." Autonomous Robots 40.2 (2016): 325-341.

[4] Dondrup, Christian, et al. "Real-time multisensor people tracking for human-robot spatial interaction." (2015).

[5] Chen, Guang, et al. "Multi-cue event information fusion for pedestrian detection with neuromorphic vision sensors." Frontiers in Neurorobotics 13 (2019): 10.

[6] Jalal, Ahmad, Shaharyar Kamal, and Daijin Kim. "A Depth Video-based Human Detection and Activity Recognition using Multi-features and Embedded Hidden Markov Models for Health Care Monitoring Systems." International Journal of Interactive Multimedia & Artificial Intelligence 4.4 (2017).

[7] Alper, Mehmet Akif, Daniel Morris, and Luan Tran. "Remote detection and measurement of limb tremors." 2018 5th International Conference on Electrical and Electronic Engineering (ICEEE). IEEE, 2018.

[8] Brown, Matt, et al. "Multi-Modal Detection Fusion on a Mobile UGV for Wide-Area, Long-Range Surveillance." 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.

[9] Aguilar, Wilbert G., et al. "Cascade classifiers and saliency maps based people detection." International Conference on Augmented Reality, Virtual Reality and Computer Graphics. Springer, Cham, 2017.

[10] Hattori, Hironori, et al. "Synthesizing a scene-specific pedestrian detector and pose estimator for static video surveillance." International Journal of Computer Vision 126.9 (2018): 1027-1044.

[11] Li, Jianan, et al. "Scale-aware fast R-CNN for pedestrian detection." IEEE Transactions on Multimedia 20.4 (2017): 985-996.

[12] Mao, Jiayuan, et al. "What can help pedestrian detection?." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[13] Du, Xianzhi, et al. "Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection." 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017.

