The first impressions data set comprises 10,000 clips (average duration 15 s) extracted from more than 3,000 different YouTube high-definition (HD) videos of people facing a camera and speaking in English. The videos are split into training, validation and test sets with a 3:1:1 ratio. The people in the videos vary in gender, age, nationality, and ethnicity.
Videos are labeled with personality trait variables. Amazon Mechanical Turk (AMT) was used to generate the labels, and a principled procedure was adopted to guarantee their reliability. The personality traits considered are those of the Five Factor Model (also known as the Big Five), which is the dominant paradigm in personality research. It models human personality along five dimensions: Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness. Thus each clip has ground truth labels for these five traits, each represented by a value within the range [0, 1]. More details about the dataset are available here.
The first work to study the Big Five traits from audio-visual cues in thin slices of YouTube personal videos, using MTurk annotations and automatic inference of personality traits, was presented in:
J.-I. Biel, O. Aran, and D. Gatica-Perez, You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in YouTube, in Proc. AAAI Int. Conf. on Weblogs and Social Media (ICWSM), Barcelona, Jul. 2011
J.-I. Biel and D. Gatica-Perez, The YouTube Lens: Crowdsourced Personality Impressions and Audiovisual Analysis of Vlogs, IEEE Trans. on Multimedia, Vol. 15, No. 1, pp. 41-55, Jan. 2013
Here, we also introduce an extension of this dataset. Specifically, we supplement the dataset with new language data (transcriptions), which complement the existing sensory data (videos), and with a new job-interview variable (interview annotations), which complements the existing personality trait variables (trait annotations).
Transcriptions. All words in the video clips were transcribed by the professional transcription service Rev. In total, 435984 words were transcribed (183861 non-stopwords), which corresponds to 43 words per video on average (18 non-stopwords). Among these words, 14535 were unique (14386 non-stopwords).
Interview annotations. In addition to labeling the apparent personality traits, AMT workers labeled each video with a variable indicating whether or not the person should be invited to a job interview (the "job-interview variable"). This variable is also represented by a value within the range [0, 1].
Groundtruth file format
The annotations and transcriptions are stored in pickled dictionaries. There should be one file for annotations and one file for transcriptions per phase.
Each video has one transcription (if there was nothing to transcribe in a video, its transcription is an empty string). Each transcription is a unicode object. The transcription file is a single dictionary whose keys are the names of the videos and whose values are the corresponding transcriptions. For example:
transcription['a_video_name'] would give the transcription of the video called 'a_video_name'.
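The lookup above can be sketched in Python as follows. This is a minimal illustration rather than official loading code: the helper name load_transcriptions, the file name transcription_example.pkl, and the example contents are hypothetical, and passing encoding="latin1" is an assumption that may be needed when a pickle written by Python 2 is read from Python 3.

```python
import os
import pickle
import tempfile


def load_transcriptions(path):
    """Load a transcription dictionary: video name -> transcription text."""
    with open(path, "rb") as f:
        # Assumption: encoding="latin1" helps Python 3 read pickles that
        # were written by Python 2; it is harmless for Python 3 pickles.
        return pickle.load(f, encoding="latin1")


# Hypothetical stand-in for a real transcription file, so the sketch runs
# without the dataset. An empty string marks a video with nothing to transcribe.
example = {"a_video_name": "hello everyone", "a_silent_video": ""}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transcription_example.pkl")
    with open(path, "wb") as f:
        pickle.dump(example, f)
    transcription = load_transcriptions(path)

# Look up one transcription by video name.
print(transcription["a_video_name"])

# List the videos whose transcription is empty.
empty = [name for name, text in transcription.items() if text == ""]
print(empty)
```

Checking for empty strings explicitly, as in the last step, may be useful before computing text features, since such videos carry no language data.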
Each video also has six annotations (five traits and one interview). Each annotation is a value between zero and one. The annotation file is a dictionary of dictionaries. The keys of the outer dictionary are the names of the annotations, and its values are the inner dictionaries. The keys of each inner dictionary are the names of the videos, and its values are the annotations of those videos for the corresponding outer key. For example:
annotation['interview']['a_video_name'] would give the value for the interview annotation of the video called 'a_video_name'.
annotation['openness']['another_video_name'] would give the value for the openness annotation of the video called 'another_video_name'.
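Because the outer keys are annotation names and the inner keys are video names, it can be convenient to regroup the values per video. The following is a minimal sketch under stated assumptions: the function annotations_by_video and the example values are hypothetical, and only the keys 'interview' and 'openness' are taken from the examples above.

```python
def annotations_by_video(annotation):
    """Regroup {annotation name: {video: value}} as {video: {annotation name: value}}."""
    per_video = {}
    for name, values in annotation.items():
        for video, value in values.items():
            # Every annotation is expected to lie in the range [0, 1].
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} of {video} is outside [0, 1]: {value}")
            per_video.setdefault(video, {})[name] = value
    return per_video


# Hypothetical annotation dictionary with two of the six annotation names.
annotation = {
    "interview": {"a_video_name": 0.5, "another_video_name": 0.75},
    "openness": {"a_video_name": 0.25, "another_video_name": 1.0},
}

per_video = annotations_by_video(annotation)

# Same value as annotation['openness']['another_video_name'].
print(per_video["another_video_name"]["openness"])
```

The range check is a design choice for catching malformed files early; it can be dropped if the annotations are already trusted.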
A sample prediction file for the test phase (quantitative) can be found here.
A sample prediction file for the second phase (qualitative) can be found here.
The encryption key for the validation groundtruth and the test set (without groundtruth) is "zeAzLQN7DnSIexQukc9W".
The encryption key for files test80_01.zip to test80_25.zip is ".chalearnLAPFirstImpressionsSECONDRoundICPRWorkshop2016.".
Gender and ethnicity annotations
We are making available gender and ethnicity annotations for the first impressions data set. The labels were made available by Heysem Kaya and Albert Ali Salah.
Please cite the following paper when referring to these annotations:
- Escalante, H. J.; Kaya, H.; Salah, A. A.; Escalera, S.; Gucluturk, Y.; Guclu, U.; Baró, X.; Guyon, I.; Jacques Junior, J. C. S.; Madadi, M.; Ayache, S.; Viegas, E.; Gurpinar, F.; Wicaksana, A.S.; Liem, C.C.S.; van Gerven, M. A. J.; van Lier, R. "Modeling, Recognizing, and Explaining Apparent Personality from Videos," IEEE Transactions on Affective Computing (TAC), 2020.
The labels are as follows:
Gender: Male=1, Female=2
Group ID: Age Range
The original raw pairwise annotations of the FI dataset have been released. You can find them here.
Predicted attributes (soft-labels)