The focus of this track is to predict the future 2D facial landmarks, hand, and upper body pose of a target individual in a dyadic interaction over 2 seconds (50 frames), given an observed time window of at least 4 seconds of both interlocutors, captured from two individual views. Participants are expected to exploit context-aware information that may affect how individuals behave. The labels used for this track will be automatically generated, i.e., they will be treated as soft labels, obtained using state-of-the-art methods for 3D facial landmark, hand, and upper body pose estimation. We assume the training data may contain some noisy labels due to occasional failures of these estimation methods. However, we will manually clean and correct wrong automatically obtained annotations in the validation and test sets in order to provide a fair evaluation. Challenge participants can also use context information and utterance-level transcriptions for this track.
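To make the forecasting setup concrete, the sketch below implements a trivial zero-velocity baseline for this kind of task: repeat the target person's last observed pose for the 50 future frames. The frame rate, keypoint counts, and array shapes are illustrative assumptions, not the official data format.

```python
import numpy as np

# Hypothetical constants for illustration only (not the official data format):
FPS = 25                    # assumed frame rate, so 50 frames ~= 2 seconds
T_OBS = 4 * FPS             # at least 4 seconds of observed frames
T_PRED = 2 * FPS            # 2 seconds (50 frames) to predict
D = 2 * (68 + 2 * 21 + 8)   # e.g. 68 face + 2x21 hand + 8 body keypoints, (x, y) each

def zero_velocity_baseline(observed: np.ndarray, t_pred: int = T_PRED) -> np.ndarray:
    """Repeat the last observed frame of the target person for t_pred frames.

    observed: (T_obs, D) array of the target's past 2D keypoints.
    returns:  (t_pred, D) array of predicted future keypoints.
    """
    last_frame = observed[-1]
    return np.tile(last_frame, (t_pred, 1))

# Example usage: forecast 50 future frames from a 100-frame observation.
obs = np.random.rand(T_OBS, D)
pred = zero_velocity_baseline(obs)
print(pred.shape)  # (50, 236) under the assumed keypoint counts
```

Context-aware models would additionally condition on the other interlocutor's observed behavior and the utterance-level transcriptions, but such baselines are a common sanity check for short-horizon motion forecasting.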