For this competition track, participants are asked to perform human detection by combining the depth and thermal modalities. Depth cameras are cost-effective devices that provide geometric information about the scene at a resolution and frame rate comparable to those of RGB cameras. Their main drawback is that measurement noise increases at large distances. Thermal cameras provide temperature readings of the scene. They are less noisy than depth cameras but, at a comparable price, offer a much lower image resolution.
Given the provided spatiotemporally aligned depth and thermal frames (and ground-truth bounding box annotations), participants will be asked to exploit the combination of the two modalities in an automatic human detection method. For each frame, the method must output a list of bounding boxes, one per person in the frame, each with an associated confidence score. The performance of the image-based human detection methods will be evaluated in terms of average precision.
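To make the expected output and the metric more concrete, the following is a minimal sketch of how per-frame detections could be scored with average precision. It is not the official evaluation code. Several details are assumptions made only for illustration: the box format `(x_min, y_min, x_max, y_max)`, the intersection-over-union (IoU) matching rule with a threshold of 0.5, the greedy one-to-one matching, and the use of non-interpolated average precision. The function names `iou` and `average_precision` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed box format (not specified by the track): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]
# A detection is (frame_id, confidence_score, box).
Detection = Tuple[int, float, Box]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Detection],
    ground_truth: Dict[int, List[Box]],
    iou_threshold: float = 0.5,  # assumed threshold, not stated by the track
) -> float:
    """Non-interpolated average precision over all frames.

    Detections are visited in order of decreasing confidence. Each one is
    matched to the highest-IoU ground-truth box in its frame that is still
    unmatched; unmatched detections count as false positives.
    """
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0

    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    true_pos = 0
    false_pos = 0
    ap = 0.0

    for frame_id, _score, box in sorted(
        detections, key=lambda d: d[1], reverse=True
    ):
        gt_boxes = ground_truth.get(frame_id, [])
        best_idx, best_iou = -1, iou_threshold
        for idx, gt_box in enumerate(gt_boxes):
            if matched[frame_id][idx]:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best_idx, best_iou = idx, overlap

        if best_idx >= 0:
            matched[frame_id][best_idx] = True
            true_pos += 1
            # Each true positive adds its precision, weighted by 1 / total_gt.
            ap += (true_pos / (true_pos + false_pos)) / total_gt
        else:
            false_pos += 1

    return ap


if __name__ == "__main__":
    # Two frames with one person each; three detections, one of them spurious.
    gt = {0: [(0, 0, 10, 10)], 1: [(0, 0, 10, 10)]}
    dets = [
        (0, 0.9, (0, 0, 10, 10)),
        (0, 0.8, (50, 50, 60, 60)),
        (1, 0.7, (1, 1, 10, 10)),
    ]
    print(average_precision(dets, gt))
```

Because detections are ranked by confidence before matching, a method that assigns low scores to its false positives is rewarded even if it outputs many boxes, which is why the confidence scores are part of the required output.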