Content of review 1, reviewed on May 31, 2018

Summary

This article is a very well-written and informative paper. It reports on a study of the effects of users' internal relevance models and order effects on human relevance assessments. The authors also investigate the concept of "need for cognition" as a possible influence on assessments. I have not come across this concept in this context before, so the study seems both timely and highly relevant to research on relevance behaviour and information searching behaviour.

Strengths

The authors investigate order effects in relevance assessments, which contributes substantially to the research field, as there has not been much research on this topic. The background section covers related research thoroughly, referencing the key studies as well as sufficiently recent work.

Weaknesses

The study used a robust methodology that is well described. However, there are some concerns, which I address in the comments below. Overall, my concerns stem from a more general critique of TREC-style assessments.

Major points

  1. There has been much research on the concept and meaning of relevance. Despite acknowledging that relevance has been seen as having classes or manifestations, as cited by Saracevic, the authors focus on topical relevance. They argue that "this is the type of relevance that is modelled by the relevance assessments that accompany most test collections" (p.624). Although the authors admit topical relevance to be an intellectual assessment, and although they are "also concerned with cognitive relevance", these types of relevance do not consider relevance judged from a subjective point of view. This point is also reflected in the use of a 4-point scale (see comment below), which does not allow for graded relevance assessments.
  2. The application of a 4-point ordinal scale (0-3) may be a step towards a user-oriented and subjective view. However, the scale introduced by Sormunen consists of both a qualitative and a quantitative description for each level, in accordance with the TREC context (which is reasonable, since the tasks participants performed were taken from TREC topics). Nonetheless, this seems to neglect subjective relevance judgments somewhat. The authors should add some information on why they chose this instrument and why they used topics, tasks and documents from TREC collections.

Minor points

  1. In the results section, the groups for high and low need for cognition are split at the median. This split seems to be a rather artificial boundary, which makes me wonder: Is there no evidence in other research on the concept that would suggest more precisely what counts as a high or low value? If not, it might be more appropriate to look at the distribution of all need for cognition values, cluster them into three groups of low, middle, and high values, and then remove the middle group, leaving both ends of the distribution for comparison.
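
To make the suggestion above concrete, the following Python sketch contrasts a median split with the extreme-groups alternative. It is only an illustration under stated assumptions: the scores are invented placeholders rather than data from the study, the function names are hypothetical, and tertile cut points are just one possible way to define the low, middle, and high groups.

```python
from statistics import median, quantiles


def median_split(scores):
    """Split scores into low (<= median) and high (> median) groups."""
    m = median(scores)
    low = [s for s in scores if s <= m]
    high = [s for s in scores if s > m]
    return low, high


def extreme_groups(scores):
    """Keep only the lower and upper thirds; drop the middle third.

    The cut points are the tertiles of the observed distribution
    (an assumption made for this sketch, not the authors' method).
    """
    lower_cut, upper_cut = quantiles(scores, n=3)
    low = [s for s in scores if s <= lower_cut]
    high = [s for s in scores if s >= upper_cut]
    return low, high


# Invented need-for-cognition scores, for illustration only.
nfc_scores = [41, 45, 48, 52, 55, 58, 60, 61, 63, 67, 72, 78]

median_low, median_high = median_split(nfc_scores)
extreme_low, extreme_high = extreme_groups(nfc_scores)
```

With the median split, participants just below and just above the cut point land in different groups although their scores are nearly identical. The extreme-groups variant avoids this, at the cost of discarding roughly a third of the participants and the statistical power that goes with them.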

Source

    © 2018 the Reviewer.

References

    Scholer, F., Kelly, D., Wu, W.-C., Lee, H. S., Webber, W. 2013. The Effect of Threshold Priming and Need for Cognition on Relevance Calibration and Assessment. International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR.