Content of review 1, reviewed on July 14, 2018

Significance

The paper entitled “Combining citizen science and deep learning to amplify expertise in neuroimaging” by Keshavan A., Yeatman J. and Rokem A. (https://www.biorxiv.org/content/early/2018/07/06/363382) presents braindr, a web application to crowdsource image quality ratings of 2D MRI scans. The manuscript describes how, with adequate and creative solutions, costly problems that require intensive and extensive human intervention can be translated into small, independent “micro-tasks” performed by “citizen scientists” without degrading accuracy or reliability. In particular, the problem of quality assessment of MR images is demonstrated. The authors collected over 80,000 ratings from 261 citizen scientists on 2D slices extracted from 722 3D MRI scans, which is astonishing. The citizen scientists' ratings are weighted using the feature-ranking properties of XGBoost, selecting those raters who best predicted the ratings of four experts who rated the images under a more “traditional” protocol (MindControl). Finally, the authors set up a transfer-learning framework that uses the pre-trained VGG-16 as the base model. They also compare to MRIQC (for which, full disclosure, I am lead developer). They demonstrate how their modified CNN (based on VGG-16) makes accurate predictions of quality. In a simple volume analysis, the authors replicate previous findings only after removing subpar images, illustrating the relevance of the problem and demonstrating how they successfully addressed it.
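
For readers less familiar with this kind of transfer-learning setup, a minimal sketch of the general idea follows; it is my own illustration, and the input size, pooling head, and training configuration are assumptions rather than the authors' exact architecture:

    # Minimal transfer-learning sketch: a pre-trained VGG-16 base with a small
    # binary pass/fail head on top. Details are illustrative, not the paper's code.
    import tensorflow as tf
    from tensorflow.keras.applications import VGG16

    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False               # freeze the pre-trained convolutional features

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # predicted quality score in [0, 1]
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    # model.fit(slices, labels, ...)     # labels: the XGBoost-weighted citizen-science ratings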

The paper reads very well, the data and analyses support all the findings, and I have no doubt about the enormous impact of this work. This work covers the full workflow, from quality-annotation collection with braindr to objective data exclusion in a paradigmatic application of structural brain MRI. Despite this breadth, all the elements in that workflow are thoroughly, transparently and accurately reported. I am impressed by the workload this paper reflects.

The quality of the paper is outstanding and, in my opinion, it could be accepted as is by most of the journals the authors may pick. The paper has an online version with an impressive interactive extension of Figure 4. With so little room for improvement, I will try to bring several minor (even minimal) aspects to the attention of the authors (see comments below).

Comments to author

  • Page 2: “Pathological brains, or those of children or the elderly may violate the assumptions of these algorithms, and their outputs often still require manual expert editing.” - The authors may want to consider the BRATS challenge (doi:10.1109/TMI.2014.2377694) as a reference here. In the latest editions, deep learning approaches are leading the table.

  • Page 2: “There are several automated methods to quantify image quality, based on MRI physics and the statistical properties of images, and these methods have been collected under one umbrella in an algorithm called MRIQC (Esteban et al., 2017)” - I’m not sure this is fair to other efforts in QC of MRI. I presume this is referring to the fact that most of the quality metrics in MRIQC come from the PCP QAP. Otherwise, if the intent here is to show how quality metrics may be used in prediction, then the authors may want to modulate the message and say that MRIQC is just an example that has attracted some attention lately.

  • Page 3: “Each researcher has their own standards as to which images pass or fail on inspection, and this variability may have problematic effects on downstream analyses, especially statistical inference.” - Based on our MRIQC paper, even intra-rater variability is quite high and heavily influenced by the labeling protocol. I think a very nice feature of braindr is that it minimizes this particular source of intra-rater variability without a cost in bias (changing the way experts screen images may blind them to specific types of artifacts, even though intra-rater variability may be substantially reduced). I would suggest showing this aspect explicitly, since it is currently missing.

  • Page 7: “we estimate an effect of approximately -4.3 cm3 per year - a decrease in gray matter volume over the ages measured (See Figure 2 in the original manuscript; we estimate the high point to be 710 cm3 and the low point to be 580 cm3 with a range of ages of approximately 5 years to 35 years and hence: (710-580)/(5-35) = -4.3 cm3/year).” - This sentence is too complicated. I would suggest moving the description of how the effect size was calculated to the appropriate section under Methods, leaving here just the effect size itself and a cross-reference to where the calculation is described: “we estimate an effect of approximately -4.3 cm3/yr”.

  • Figure 4: please use cm3 to maintain consistency with the text and improve readability.

  • Page 9: “We have demonstrated that while an individual citizen scientists may not” - I think it should read “an individual citizen scientist”.

  • Page 10: “interpretability to speed” -> “interpretability-to-speed”

  • I would appreciate further details on how model selection for the XGBoost configurations was done, e.g., whether nested cross-validation was used (see the sketch at the end of these comments for the kind of procedure I have in mind). Similarly, I would welcome further details on the CNN training and performance figures (particularly, what is the inference time for a new 2D slice? Would that allow streamlining the network into the scanner workflow?). However, I think all of this could be provided as supplemental information without extending the paper, since these topics are already introduced with enough depth within the manuscript.

  • Code and Data Availability: even though most of the research materials and software are publicly available, I would really appreciate finding here some additional resources and how to access them:

    • How to obtain the MindControl ratings, both for the four top raters used in this study and, in general, for all raters.
    • The IQMs extracted with MRIQC for the HBN dataset: were these contributed to the MRIQC Web-API?
    • The “frozen” model of the CNN, to allow researchers to run the network trained for this paper.
  • References:

    • Some references have URL fields that merely repeat the DOI link; it would be great if the DOIs were already rendered as links (and the URL field removed). I have the impression the document is generated with LaTeX; importing the doi package would solve half of this problem.
    • Some references have missing fields (e.g. Yeatman 2012, Marx 2013, Gorgolewski 2017ab, Glasser 2016, Chollet 2015, and probably some that look like conference papers).
    • Make consistent use of initials or full names
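
As an illustration of the nested cross-validation mentioned in the model-selection comment above, a minimal sketch follows. The parameter grid, scoring choice and data variables are placeholders of my own, not the authors' actual configuration:

    # Minimal nested cross-validation sketch for an XGBoost classifier.
    # The inner loop selects hyperparameters; the outer loop estimates
    # generalization performance on folds unseen by the inner search.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from xgboost import XGBClassifier

    X = np.random.rand(200, 10)          # placeholder features (e.g., individual ratings)
    y = np.random.randint(0, 2, 200)     # placeholder pass/fail labels

    param_grid = {                       # hypothetical search space
        "max_depth": [2, 4, 6],
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1],
    }

    inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # model selection
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # performance estimate

    search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=inner_cv)
    scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
    print(f"Nested CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")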

Source

    © 2018 the Reviewer (CC BY 4.0).

References

    Keshavan, A., Yeatman, J. D., Rokem, A. 2019. Combining Citizen Science and Deep Learning to Amplify Expertise in Neuroimaging. Frontiers in Neuroinformatics.