Performance is assessed for both diversity and relevance. We compute three metrics: Cluster Recall at X (CR@X), which measures how many of the different ground-truth clusters are represented among the top X results (only relevant images are considered); Precision at X (P@X), which measures the proportion of relevant photos among the top X results; and F1-measure at X (F1@X), the harmonic mean of the previous two. Several cutoff points are considered, e.g., X = 5, 10, 20, 30, 40, 50.
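As an illustration of these definitions, the following Python sketch computes P@X, CR@X and F1@X for a single query. It is not the official scoring tool. The function name metrics_at and the ground-truth layout (a dictionary cluster_of mapping each relevant image ID to its cluster ID, with non-relevant images simply absent) are assumptions made for this example. The sketch also assumes that CR@X is normalized by the total number of ground-truth clusters for the query, that P@X is divided by X even when fewer than X results are returned, and that the ranked list contains no duplicate images.

```python
from typing import Dict, List, Tuple


def metrics_at(ranked: List[str], cluster_of: Dict[str, int], x: int) -> Tuple[float, float, float]:
    """Return (P@X, CR@X, F1@X) for one query.

    ranked     -- image IDs in ranked order (hypothetical input format)
    cluster_of -- relevant image ID -> ground-truth cluster ID (assumed layout);
                  images missing from this mapping count as non-relevant
    x          -- cutoff point, e.g. 20
    """
    top = ranked[:x]
    # Only relevant images contribute to either metric.
    relevant = [img for img in top if img in cluster_of]

    # P@X: fraction of the top X results that are relevant (assumed divisor: X).
    precision = len(relevant) / x

    # CR@X: distinct clusters found among the relevant top-X images,
    # divided by the number of ground-truth clusters (assumed normalization).
    total_clusters = len(set(cluster_of.values()))
    found = {cluster_of[img] for img in relevant}
    cluster_recall = len(found) / total_clusters if total_clusters else 0.0

    # F1@X: harmonic mean of P@X and CR@X, defined as 0 when both are 0.
    denom = precision + cluster_recall
    f1 = 2 * precision * cluster_recall / denom if denom else 0.0
    return precision, cluster_recall, f1


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e"]
    cluster_of = {"a": 1, "b": 1, "d": 2, "z": 3}
    print(metrics_at(ranked, cluster_of, 5))
```

In this toy query, three of the five returned images are relevant and they cover two of the three ground-truth clusters, so P@5 is 3/5 and CR@5 is 2/3. Returning two images from the same cluster raises precision but does not raise cluster recall.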
The official ranking metric will be CR@20. A cutoff of 20 images simulates the content of a single page of a typical web image search engine and reflects user behavior, i.e., users inspect the first page of results first.
Metrics are computed individually on each test data set, i.e., the seenIR data and the unseenIR data. The final ranking will be based on the overall mean values of CR@20, followed by P@20 and then F1@20.
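One possible reading of this ranking rule is sketched below in Python: each run is summarized by its mean CR@20, P@20 and F1@20 over all queries, and runs are ordered by CR@20 first, with P@20 and then F1@20 used only to break ties. Treating "followed by" as tie-breaking is an assumption, as are the function name rank_runs, the run names and the input layout. The scores in the usage example are made up for illustration and are not results.

```python
from statistics import mean
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (CR@20, P@20, F1@20) for one query


def rank_runs(per_query: Dict[str, List[Scores]]) -> List[str]:
    """Order runs by mean CR@20, then mean P@20, then mean F1@20 (assumed tie-breaks).

    per_query -- run name -> list of per-query score tuples (hypothetical layout)
    """
    # Average each metric column over the queries of a run.
    summary = {
        run: tuple(mean(column) for column in zip(*scores))
        for run, scores in per_query.items()
    }
    # Tuples compare element by element, which gives the CR, P, F1 priority order.
    return sorted(summary, key=lambda run: summary[run], reverse=True)


if __name__ == "__main__":
    # Illustrative, made-up scores for three hypothetical runs.
    runs = {
        "run_a": [(0.5, 0.8, 0.6), (0.5, 0.6, 0.5)],
        "run_b": [(0.5, 0.9, 0.6), (0.5, 0.7, 0.6)],
        "run_c": [(0.6, 0.1, 0.2), (0.6, 0.1, 0.2)],
    }
    print(rank_runs(runs))
```

Under this reading, a run with higher mean CR@20 always ranks above one with lower mean CR@20, regardless of precision. The same function can be applied separately to the seenIR and unseenIR scores, since the metrics are computed on each data set individually.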