ChaLearn Challenges on Image and Video Inpainting @WCCI18
Inpainting, the topic of our proposed challenge, refers to replacing missing parts of an image (or video). The problem of dealing with missing or incomplete data in machine learning arises in many applications. Recent strategies make use of generative models to impute missing or corrupted data. Advances in computer vision using deep generative models have found applications in image/video processing, such as denoising [1], restoration [2], super-resolution [3], and inpainting [4,5]. We focus on image and video inpainting tasks, which may benefit from novel methods such as Generative Adversarial Networks (GANs) [6,7] or residual connections [8,9]. Solutions to the inpainting problem may be useful in a wide variety of computer vision tasks. We chose two examples: human pose estimation and video de-captioning. Regarding the first: performing human pose recognition in images containing occlusions is challenging; since human pose recognition is a prerequisite for human behaviour analysis in many applications, replacing occluded parts may help the whole processing chain. Regarding the second: in news media and video entertainment, programs broadcast in various languages, such as news, series or documentaries, frequently carry text captions, burned-in commercials or subtitles, which reduce visual attention, occlude parts of the frames, and may decrease the performance of automatic understanding systems. Despite recent advances in machine learning, fast (real-time) and accurate automatic text removal in video sequences remains challenging.
We propose a challenge that aims to compile the latest efforts and research advances of the computational intelligence community in creating fast and accurate inpainting algorithms. The methods will be evaluated on two large, newly collected and annotated datasets corresponding to two realistic scenarios in visual inpainting. The challenge will be organized in two tracks:
- Image inpainting to recover missing parts of the human body (track 1),
- Video inpainting to remove overlaid text in video clips (track 2).
In both cases, the main goal is to generate the visually best possible set of pixels to obtain a complete image or video clip. Some samples are shown in Fig. 1 and Fig. 2 for both tracks.
Figure 1. Track 1: inpainting missing body parts, and the estimated human pose before/after reconstruction.
Figure 2. Track 2: inpainting to remove overlaid text.
The competition will be run on the CodaLab platform (http://competitions.codalab.org), an open-source platform co-developed by the organizers of this challenge together with Stanford and Microsoft Research (https://github.com/codalab/codalab-competitions/wiki/Project_About_CodaLab). Participants will register through the platform, where they will be able to get all the data necessary to start the competition.
Both tracks will be evaluated using standard visual inpainting metrics: MSE, PSNR and the Structural Dissimilarity Index (DSSIM) [12], computed on the reconstructed missing parts for the still-image task (track 1) and on full images for the video task (track 2). These metrics estimate the distance from a generated sample to the original. We will also consider track-specific metrics, reported in Table 1. To encourage fast methods, we will impose a maximum execution time for prediction on the final test data. The number of processed samples will be used to produce a secondary ranking in track 2. We will produce several separate rankings:
- Primary ranking (both tracks): MSE, PSNR, and DSSIM will be computed for both tracks. Each participant will receive a separate rank for each metric, and the mean of the ranks across all metrics will define the final ranking.
- Secondary ranking (track 1): the performance of a current state-of-the-art deep-learning-based pose estimation method on the reconstructed images, in terms of distances between predicted and ground-truth joint locations. The concrete method will not be disclosed to participants, so that they cannot tune the reconstruction of missing data to the posterior human pose estimation performance.
- Secondary ranking (track 2): the speed of video clip generation will be measured to define a secondary ranking. After computing the standard inpainting metrics from participants' predictions, we will run the submitted algorithms for a limited amount of time. The number of processed video clips will be used to generate this ranking.
The secondary rankings define task-specific rankings that will be averaged with the reconstruction metric ranks to determine the final winners of the two tracks.
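For concreteness, the primary-ranking metrics can be sketched as follows. This is a minimal NumPy illustration, not the official scoring code; in particular, `dssim` here uses a single global window as a simplification of the sliding Gaussian window of the original SSIM formulation [12].

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two images."""
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    m = mse(x, y)
    return float('inf') if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

def dssim(x, y, max_val=255.0):
    """Structural dissimilarity, (1 - SSIM) / 2, computed over one global
    window as a simplification of the sliding-window SSIM."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return (1.0 - ssim) / 2.0
```

Lower is better for MSE and DSSIM, higher for PSNR; identical images score 0, infinity and 0 respectively.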
The participants will develop solutions for predicting missing parts in images and videos. In track 1, we provide a set of images with multiple blocks of black pixels occluding the original image, together with a mask indicating the position and size of these blocks. The goal is to restore the masked parts of the image so that the result resembles the original content and looks plausible to a human. In track 2, we provide a set of 5-second video clips that contain overlaid text (with varying font size, color, position and background), and the corresponding target video clips without text overlays. The goal is to remove the text overlays (and inpaint their content) as fast as possible. Table 1 summarizes the specifications of the two proposed tracks.
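To illustrate the track 1 input format (a masked image plus a binary mask), the sketch below fills masked pixels with the mean of the unmasked pixels. This is a deliberately naive reference, and `naive_fill` is our own illustrative name, not a provided function.

```python
import numpy as np

def naive_fill(img, mask):
    """Replace masked pixels with the mean of the unmasked pixels.
    img: H x W (or H x W x C) array; mask: H x W boolean, True = missing."""
    out = img.astype(np.float64).copy()
    if out.ndim == 3:
        out[mask] = out[~mask].mean(axis=0)  # per-channel mean
    else:
        out[mask] = out[~mask].mean()
    return out
```

Real submissions are of course expected to use the surrounding context far more cleverly than a global mean.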
Table 1. Data specifications in both tracks.
Figure 1 shows a visual example of the track 1 task and dataset. Applying a state-of-the-art human pose estimation method to a naively reconstructed image already yields a more accurate human pose than the pose estimated on the image with missing parts. We expect participants to recover the missing parts of still images containing humans so that posterior human pose estimation matches, as closely as possible, the pose estimated on the original image. Figure 2 illustrates a few samples showing the diversity of the video track dataset in style and resolution, and the various font transformations and subtitle overlays.
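The secondary track 1 metric compares predicted and ground-truth joint locations. A minimal sketch of such a distance, assuming joints are given as arrays of (x, y) pixel coordinates (the exact distance and pose estimator used for ranking are not disclosed):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints.
    pred, gt: (num_joints, 2) arrays of (x, y) pixel coordinates."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())
```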
Starting kit and baseline methods
To simplify participation in the competition, we provide a Docker image that contains the datasets and code to train and evaluate all the baselines and methods prepared by the organizers.
We implemented recent state-of-the-art methods for image inpainting tasks. In particular, the considered baselines are adaptations of [7,10,11], all of which involve deep convolutional architectures. As can be seen in figures 3 and 4, reconstruction of missing parts works quite well for small regions, but visual quality must be improved when inferring larger regions. Recent GAN methods improve the "authenticity" of the generated parts, but their training cost is high. Image generation can also be costly, which makes scaling impractical in realistic scenarios.
Figure 3: Qualitative and quantitative samples of chosen baselines for track 1 (resized). DSSIM to the original has been added below the reconstructions.
Figure 4 shows one frame from three generated video clips for track 2. The first baseline is obtained by training a large convolutional autoencoder in which missing data is recovered using the context of complete frames. The second baseline processes frames in patches of 32x32 pixels, which reduces the number of parameters. A patch autoencoder is learned jointly with a patch classifier that detects the presence of text. At inference time, a frame is first split into non-overlapping patches; the classifier indicates the presence of text, and if text is detected, the patch is passed through the autoencoder. Patches classified as text-free are simply copied to the output frame. This approach is faster to train and efficient at removing text with no background overlay. However, since it ignores the context, it cannot perform well on patches containing only missing values (i.e., opaque overlays). These two baselines do not consider temporal structure; the reported metrics are obtained by averaging the MSE over the 125 full frames of each video clip. The processing time for generating one video clip on a machine equipped with an NVIDIA K80 GPU is between 4 and 8 seconds, depending on the baseline method and the size of the text overlays.
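The inference loop of the patch-based baseline can be sketched as follows; `has_text` and `reconstruct` are hypothetical stand-ins for the learned patch classifier and patch autoencoder, not code from the starting kit.

```python
import numpy as np

def inpaint_by_patches(frame, has_text, reconstruct, p=32):
    """Split a grayscale frame into non-overlapping p x p patches; pass
    patches flagged by the classifier through the autoencoder, and copy
    the remaining patches unchanged to the output frame."""
    h, w = frame.shape
    assert h % p == 0 and w % p == 0, "frame must tile exactly into patches"
    out = frame.copy()
    for i in range(0, h, p):
        for j in range(0, w, p):
            patch = frame[i:i + p, j:j + p]
            if has_text(patch):
                out[i:i + p, j:j + p] = reconstruct(patch)
    return out
```

Skipping text-free patches is what makes this baseline fast, at the price of ignoring context outside each 32x32 window.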
Figure 4: Track 2 qualitative and quantitative results on two baselines based on adaptation of [7,10,11].
The simplest baseline will be provided with the starting kit, and the more advanced baselines will be published on the leaderboard.
The participants may compete in either or both tracks. Each track will have two phases: a feedback phase and a final evaluation phase.
Data will be made available to participants in different stages as follows:
- Development (training) data, with ground truth for all of the images and video clips, is part of the Docker image distributed as the starting kit.
- Validation data without missing parts will also be provided to participants at the beginning of the competition. During the feedback phase, participants will be able to submit predictions to the CodaLab platform and receive immediate feedback on their performance on the validation set (there will be a validation leaderboard on the platform).
- Final evaluation (test) data will be made available to participants one week before the end of the competition (see schedule), for the final evaluation phase.
Dataset sizes are given as the total number of collected/pre-processed images and videos, as indicated in Table 1. We split the datasets in each track 60%-20%-20% into train/validation/test. For the feedback phase, participants will submit prediction results through the CodaLab platform. For the test phase, we will ask for both prediction results and code (which will be run in the provided Docker image).
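A 60%-20%-20% split as described above can be sketched as follows; this is a simple illustration, not the organizers' exact splitting procedure.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle and split a list of samples into 60% train,
    20% validation and 20% test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.6 * len(items))
    n_val = int(0.2 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```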
[1] V. Jain and S. Seung, "Natural image denoising with convolutional networks," in Advances in Neural Information Processing Systems, 2009, pp. 769–776.
[2] L. Xu, J. S. Ren, C. Liu, and J. Jia, "Deep convolutional neural network for image deconvolution," in Advances in Neural Information Processing Systems 27, 2014, pp. 1790–1798.
[3] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[4] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 341–349.
[5] A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez, "Video inpainting of complex scenes," SIAM Journal on Imaging Sciences, vol. 7, no. 4, pp. 1993–2019, 2014.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[7] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros, "Context encoders: Feature learning by inpainting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] X.-J. Mao, C. Shen, and Y.-B. Yang, "Image restoration using convolutional auto-encoders with symmetric skip connections," arXiv e-prints, 2016.
[10] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, "High-resolution image inpainting using multi-scale neural patch synthesis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, "Semantic image inpainting with deep generative models," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.