Content of review 1, reviewed on December 23, 2013
Very interesting paper - well written and thought out. It applies theory on communication and behavioural observation to structure assessments of patient-doctor communications. It investigates the reliability of applying a rating scale to examine simulated patient-doctor interactions. It generally finds the scale to work.
I think the paper is good, but needs some work in terms of developing the rationale underlying the research questions, and also consideration of the type of reliability being assessed.
The introduction sets up the paper well, and makes a good case for developing more reliable assessments of patient-doctor interactions. Some more thought could be given to how exactly structured feedback might be used to improve skills. The key problem with the introduction is that the five questions posed, whilst relevant, do not really emerge from the literature review. Why is each question being asked - i.e. what does it contribute? For example, for question "a" (relating to fixed differences) a clear logic is not developed (i.e. the need to answer this question should be established in the introduction). Or, for question "e" on the order of consultations - why should order influence performance, or be important? I think the introduction needs to link better to the questions being posed, and to describe why these factors are important in validating a tool. Some comment on the work that has influenced these questions (i.e. are you emulating reliability testing conducted elsewhere?) might also be useful.
We need a bit more detail on what a typical consultation scenario looks like (i.e. what the procedure was), and also on how differences in the assertiveness of the patient (the actor) are expected to shape the interactions (and ratings) - this isn't considered in the research questions yet appears important (e.g. are some scenarios more challenging than others?).
Some comment is needed on the 3-point scale used to assess the behaviours of doctors. It seems quite narrow - does it borrow from an existing scale? What exactly do "adequate" and "good" look like? The lack of definition for these may underlie some of the inconsistencies found between raters. This is quite a common problem in behavioural assessment (the behavioural anchors used to rate behaviour), and there is no single solution or perfect method; however, some justification for the method used here would be good.
I must say, I got a little lost in the 'fixed difference' analysis - although I can see you have put a lot of effort into explaining it. This links back to point 1 (explaining the importance of the fixed difference - which, as I understand it, is in effect trying to establish individual norms for rating). You might be better served by giving an example of some sort.
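To show the kind of example I mean, here is a small numeric sketch of a fixed difference between raters (all scores invented for illustration, not taken from the paper): one rater sits a constant offset above the other, and centring each rater on their own mean recovers identical relative judgements.

```python
# Illustrative only: invented scores for five consultations from two raters.
rater_a = [2.5, 3.0, 2.0, 3.5, 3.0]
rater_b = [2.0, 2.5, 1.5, 3.0, 2.5]

# A "fixed difference" is a constant offset between raters.
diffs = [a - b for a, b in zip(rater_a, rater_b)]
print(diffs)  # every pairwise difference is 0.5

# Centring each rater on their own mean (their individual norm) removes it:
mean_a = sum(rater_a) / len(rater_a)
mean_b = sum(rater_b) / len(rater_b)
centred_a = [round(a - mean_a, 2) for a in rater_a]
centred_b = [round(b - mean_b, 2) for b in rater_b]
print(centred_a == centred_b)  # True: identical relative judgements
```

An illustration along these lines, in prose or as a small table, would make the analysis much easier to follow.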
I think it would be very useful for you to provide a bar graph of some sort showing the breakdown of responses for each item on the scale, i.e. the ratings each item received (e.g. what proportion of ratings were good or poor). You have two raters for each scenario, so you would have to figure out how to capture this. Nonetheless, it would be useful to see, for each question, what the average score is (are some always good, others always poor?). You could include this as an appendix if it is too large to go in the article body.
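The data behind such a graph is simple to assemble - something along these lines, where the item names and ratings are entirely hypothetical placeholders for the tool's actual items:

```python
from collections import Counter

# Hypothetical data: each item mapped to the pooled ratings it received
# across consultations and raters, on the paper's 3-point scale.
ratings = {
    "opens consultation": ["good", "good", "adequate", "good", "poor"],
    "listens actively":   ["adequate", "adequate", "poor", "adequate", "good"],
}

for item, scores in ratings.items():
    counts = Counter(scores)
    total = len(scores)
    # Proportion of each rating label per item - the heights of the bars.
    breakdown = {label: counts[label] / total
                 for label in ("poor", "adequate", "good")}
    print(item, breakdown)
```

A stacked or grouped bar per item built from these proportions would show at a glance which items cluster at "good" and which attract disagreement.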
As I understood it, the statistical data appeared fine and the tool ratings are generally reliable (albeit with some variation between individual raters). One does wonder whether the reliability analysis should be run at the individual item level, although I accept that the tool provides a single score. The reliability analysis is in effect testing the reliability of raters in assessing several items and generating a score. My concern would be that the raters, whilst appearing to give consistent scores, actually score items differently - yet when you add them all up, the scores appear similar (and tend towards the mean). This is quite an important critique (unless I have misunderstood your reliability assessment), as you are not really assessing the reliability of observing behaviour, but of overall scorings of patient-doctor consultations.
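A deliberately extreme toy example (invented numbers, not your data) shows how summing can mask item-level disagreement:

```python
# Illustrative only: two raters scoring the same consultation on four
# invented items (1 = poor, 2 = adequate, 3 = good).
rater_a = [3, 1, 3, 1]
rater_b = [1, 3, 1, 3]

total_a, total_b = sum(rater_a), sum(rater_b)
print(total_a == total_b)  # True: the totals agree perfectly (8 vs 8)

# Yet the raters never gave the same rating on any single item:
item_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(item_agreement)  # 0.0
```

Reporting item-level agreement alongside the total-score reliability would rule this pattern out.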
I would like the discussion to focus a bit more on how the tool should be used to influence assessment and training. Some of the explanations for the findings require unpacking, e.g. the order effects - more explanation of why these occurred is required (and, as I mention above, discussion of why they were expected).
Good paper - I recommend that some extra work be done on it; however, fundamentally it is a valuable piece of work.
© 2013 the Reviewer.