Content of review 1, reviewed on July 21, 2015
The HEFCE (Higher Education Funding Council for England) report on "The Metric Tide" is a UK-centric review of the role of metrics in research assessment. In other words, it looks at the ways in which quantitative measures (citations, downloads, tweets, etc) can be used to evaluate the quality and significance of research.
One of the main takeaways seems to be that metrics have a place in informing the evaluation of a research output (i.e., a journal article) but that direct peer review should have the final word. This suits me fine; I have advocated for some time that peer review is the best tool we have for evaluating the quality of research (see Publon #1).
A large part of the report focuses on the Research Excellence Framework (REF), which is the UK government’s system for assessing the quality of research in UK higher education institutions. The REF is very important to institutions as public funding is allocated based on its results. In this sense it is similar to the ERA in Australia, the PBRF in New Zealand, and schemes in other Commonwealth and European countries.
My understanding of the process (as described here) is that universities are asked to aggregate and submit outputs produced by each of their researchers. These outputs are assigned to one of 36 different subject areas (Units of Assessment, or UOAs). For each UOA, a panel of experts is asked to assess the quality of each output. Each output is placed into one of five quality buckets, either unclassified or one of the following:
- 1* (recognised nationally),
- 2* (recognised internationally),
- 3* (internationally excellent), or
- 4* (world-leading).
Review of the correlation between REF scores and metrics
Many reactions to the full Metric Tide report are now available [Altmetric, LSE]. However, the report also includes two supplementary reports. The second of these is a statistical analysis of how well scores from the 2014 REF correlate with a range of different metrics and is the subject of this review.
Be warned that I am not an expert on the details of the REF and certainly not a statistician.
Overall, the final conclusion -- that there is no single metric, or bucket of metrics, that fully predicts REF scores -- seems solid if obvious.
Interestingly, the SJR (which aims to measure the scientific strength of journals) has the strongest correlation of any metric. As much as we may hate it, journal of publication does a better job of predicting REF scores than anything else. (Of course, the REF evaluators could be biased by the journal of publication, affecting REF scores.)
The study aims to measure correlations between 15 different article-level metrics and their associated REF scores. There were about 191k different outputs returned in the 2014 REF. The authors found metric data for 150k of these.
To allow for the statistical analysis, the REF scores (unclassified, 1*, 2*, 3*, and 4*) were coded as numeric values from 0 to 4. The authors then measure how well each metric predicts the REF score. Predicting REF scores is basically a classification problem, so I wonder what results a discrete machine learning approach would have returned.
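To make the coding step concrete, here is a minimal sketch of how a rank correlation between a metric and the 0 to 4 coded REF scores could be computed. It is not the authors' code: the function names and the toy data are my own, hypothetical and made up for illustration, and I am assuming a Spearman correlation with tied values given their average rank.

```python
from statistics import mean

# Ordinal coding described above: unclassified -> 0, 1* -> 1, ..., 4* -> 4.
REF_CODES = {"unclassified": 0, "1*": 1, "2*": 2, "3*": 3, "4*": 4}


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up (metric value, REF score) pairs, for illustration only.
toy_outputs = [(3, "2*"), (12, "3*"), (40, "4*"), (0, "1*"), (41, "unclassified"), (7, "3*")]
metric = [m for m, _ in toy_outputs]
score = [REF_CODES[s] for _, s in toy_outputs]
rho = spearman(metric, score)
print(f"Spearman rho on toy data: {rho:.2f}")
```

Note that a single high-metric unclassified output (coded as 0) drags the correlation down in this toy example, which is one reason the treatment of the unclassified category matters.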
All in all, it seems strange to me to try to classify the REF scores. The headline stats for the 2014 REF were that:
- 4* => 30% of all outputs
- 3* => 46% of all outputs
- 2* => 20% of all outputs
- 1* => 3% of all outputs.
In other words, 76% of all outputs were either 3* or 4*. That seems like a very uninteresting result to try to predict. I couldn’t find what the REF coverage is, but it would perhaps be more interesting to include all research outputs from UK institutions and try to use metrics to predict which would be submitted to the REF.
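To put a number on why this is an uninteresting target, here is a rough sketch of the accuracy a classifier would get by ignoring the metrics entirely and always guessing the most common class. It uses only the headline shares quoted above; the variable names are my own.

```python
# Headline shares of 2014 REF outputs quoted above, in percent.
shares = {"4*": 30, "3*": 46, "2*": 20, "1*": 3}

# Baseline 1: always guess the single most common class.
majority_class = max(shares, key=shares.get)
majority_accuracy = shares[majority_class]

# Baseline 2: in a two-way split, always guess "3* or 4*".
top_two_accuracy = shares["3*"] + shares["4*"]

print(f"Always predict {majority_class}: {majority_accuracy}% correct")
print(f"Always predict '3* or 4*': {top_two_accuracy}% correct")
```

Any metric-based predictor would need to beat these no-information baselines before it could be said to add anything.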
For many metrics, the unclassified REF outputs have better scores than most other outputs.
The authors note (footnote of page 7) that an output can be unclassified for a number of reasons that have nothing to do with the quality of research but decide to leave them in the study anyway because “the total number of outputs that received and unclassified score overall is otherwise quite small.” I don't understand this decision. If the unclassified category is small enough to not matter then there’s no point in including it in the study.
If you dig into the analysis of each individual metric you will quite often find that the unclassified outputs perform better than 1*, 2*, and 3* outputs. For example, in the summary of the citation count metric you find that the unclassified results have a mean citation count of 40.8, while the mean for 3* results is 26.8 (1* and 2* are lower still). In fact the unclassified category performs better than 2* for the 10 continuous metrics in the report (excluding author and country count), and better than 3* for 9 of them.
The authors claim (on page 10) that “This demonstrates that an output with a high metrics score could relate to a poor quality or ineligible output, which could be attributed to the fact that citations can be negatively or positively worded.”
In fact it does nothing of the sort. It is true that the metrics in this study don’t distinguish between negative or positive citations (something that I expect will change as we develop new metrics) but relying on unclassified outputs to demonstrate this is flawed for three reasons:
- First, we don't know why an output received an unclassified status and the authors have said as much. It could be that some of the unclassified outputs are low quality but I doubt it's a substantial fraction -- why would institutions include poor quality outputs in their REF submissions?
- Second, could it simply be that the REF methodology is not perfect in the way it labels unclassified outputs? Perhaps some of these unclassified outputs are decent research but the REF doesn't recognise them for one reason or another. That would be my guess, based on the metrics.
- Finally, the authors have already pointed out that the unclassified outputs are a small fraction of all outputs -- so why are they relying on them to demonstrate that the metrics they study are flawed?
A much better example can be taken from Publon #3937, which has the highest Altmetric score for any reviewed paper on Publons (2,820) but where the reviewer has determined that the article is not very good at all. A cursory examination of the article and review suggests that the reviewer's evaluation is much closer to being right, and this does actually demonstrate that an output with a high metrics score can relate to a poor quality output.
Difference between Twitter and Mendeley correlations
I charted the Spearman correlations for each metric for each year:
One particular oddity that stuck out for me was that the Mendeley correlation is surprisingly stable over the 5 years while Twitter rapidly increases (without yet reaching the level of Mendeley data).
The authors note this in saying that “The exception [to significant decreases in correlation for more recent outputs] was number of tweets, where weak correlation increased for recent publications, but this is likely to be related to the relatively recent increase in use of Twitter by the academic community.”
I think it’s worth noting that Mendeley also grew its user base rapidly over the 5 years of the study but shows relatively stable correlations over that time. This, to me, indicates a difference in behavior -- Mendeley bookmarks will include all the research you’re reading, while tweets tend to only include what is new. Score one for Mendeley.
My understanding is that REF scores are not made public so it is impossible to independently reproduce the results of the study or extend them. (I had initially hoped to compare REF scores with some of our internal data.)
The authors state that they produced an anonymised dataset but this has not been made available so far as I can tell. Releasing these data would make it possible to address the potential methodological issues I discuss above. One would expect a funder to set a better example on reproducibility!
The report uses data sourced from Elsevier (e.g., citation data from SCOPUS, download data from Mendeley) and publicly available Google Scholar citation data. There is no mention of why the authors chose to use Elsevier’s SCOPUS over Thomson Reuters’ Web of Science but this is alluded to in the separately published literature review [10.13140/RG.2.1.5066.3520] where the authors point out that “WoS can almost be considered a perfect subset of Scopus.”
Personally, I would have liked to see the inclusion of the overall Altmetric.com score. It is largely based on the constituent data of the study (tweets, downloads, etc) but would have made for a user-friendly correlation. It may have been that Altmetric declined to participate in the study but it is a shame that they are not included given that Elsevier and Google both seem to be heavily involved.
There appears to be an over-reliance on Elsevier data. This stood out particularly in the usage of ScienceDirect download counts. Less than 1/3 of outputs had associated ScienceDirect download data. This is not surprising as ScienceDirect contains only 2.5k (largely Elsevier-published) journals. Regardless of how you count journals, this amounts to less than 20% coverage. I understand that download data are hard to come by, and I think the authors are probably right to persevere with them, but they should have spoken to the thought process behind the decision to include such a limited dataset and to any issues that might arise from a metric that targets mostly Elsevier journals.
Our current metrics don't cover enough research
Many metrics lack coverage in another crucial way. The authors note that Mendeley bookmarks, WIPO patent citations, and tweets returned a zero score for more than 85% of all outputs, so fewer than 15% of outputs have any signal at all. Given that 30% of all outputs are 4*, the absolute best any of these metrics could do would be to predict half of all 4* outputs. That’s a built-in sensitivity limit of 50%.
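The arithmetic behind that limit can be sketched as follows. This assumes, as I do above, that a metric can only flag an output as 4* when its value is non-zero; the function name is my own and hypothetical.

```python
def max_sensitivity(nonzero_share, positive_share):
    """Upper bound on the fraction of positives a metric can possibly flag.

    nonzero_share:  share of all outputs with a non-zero metric value
    positive_share: share of all outputs in the class we want to detect
    Both must be given in the same units (e.g. percent).
    """
    # Best case: every non-zero output is a positive. The bound cannot exceed 1.
    return min(1.0, nonzero_share / positive_share)


# At most 15% of outputs have a non-zero score; 30% of outputs are 4*.
bound = max_sensitivity(15, 30)
print(f"Best-case share of 4* outputs detected: {bound:.0%}")
```

This is a best-case bound: it is only reached if every single output with a non-zero score happens to be 4*, which the reported correlations suggest is far from true.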
This is likely why journal-based metrics perform better than others. There is a need for metrics that more completely span all research outputs. The authors are well placed to address this, and we would have benefited from their analysis in this regard.
Do we believe the metrics or not?
At the end of the day this report is interpreted to show that metrics are not good enough to reproduce REF scores. I mention in my discussion of the treatment of unclassified data that the first half of this study is built on the assumption that REF scores are perfect.
However, in the second half of the study, the authors note that there “was evidence to indicate higher REF scores for male authors and non-ECRs after holding metric scores constant for a small number of UOAs, potentially indicating issues for women or ECRs in these disciplines.”
This is certainly interesting -- if you believe that there could be any flaws in the REF. So which is it? Do the REF evaluators make mistakes or not? Does the journal of publication impact the resulting REF classification? As it stands, using metrics to support arguments in one area (inequality) while blaming them in others (attributing high metric scores for unclassified REF outputs to poor metrics) seems a little like cherry-picking.
The authors are perfectly placed to opine on when and how metrics could be used to help the evaluators to make better (and more objective) evaluations. I would have liked to see more nuanced discussion on this point.
© 2015 the Reviewer (CC BY 4.0).