Abstract

Supervised machine learning (ML), in which models are automatically derived from labeled training data, is only as good as the quality of that training data. This study builds on prior work that investigated to what extent 'best practices' around labeling training data, particularly labeling done by humans, were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand on that work by studying publications that apply supervised ML in a far broader spectrum of fields across the biological and medical sciences, physical and environmental sciences, and social sciences and humanities. We report to what extent a random sample of ML application papers gives specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces greater diversity of annotation methods. Because much of machine learning research and education focuses only on what is done once a "gold standard" of training data is available, it is especially relevant to discuss issues around the equally important aspect of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as annotation can range from a task requiring little-to-no background knowledge to one that must be performed by someone with career expertise.


Authors

R. Stuart Geiger;  Dominique Cope;  Jamie Ip;  Marsha Lotosh;  Aayush Shah;  Jenny Weng;  Rebekah Tang

  • pre-publication peer review (FINAL ROUND)
    Decision Letter
    2021/06/12

    12-Jun-2021

    Dear Dr. Geiger:

    Thank you for the careful revision of your manuscript entitled ""Garbage In, Garbage Out" Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data?". The comments made by the reviewers have been addressed in a satisfactory way. I am therefore happy to let you know that your manuscript has been accepted for publication in Quantitative Science Studies.

    I would like to request you to prepare the final version of your manuscript using the checklist available at https://tinyurl.com/qsschecklist. Please also sign the publication agreement, which can be downloaded from https://tinyurl.com/qssagreement. The final version of your manuscript, along with the completed checklist and the signed publication agreement, can be returned to qss@issi-society.org.

    Thank you for your contribution. On behalf of the Editors of Quantitative Science Studies, I look forward to your continued contributions to the journal.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Author Response
    2021/06/08

    Thank you for your helpful comments. We have prepared a revision that addresses these concerns and suggestions.

    Public data and analysis code: Per R1’s request, we have released our datasets, both for all labels by all labelers and for the final labels with calculated scores. We have also released our analysis code for IRR calculation, data cleaning, statistics, and visualizations. These are available on GitHub and Zenodo as Jupyter notebooks and can be run interactively via the mybinder.org link in the GitHub repository or the paper.

    Inter-rater reliability: Based on R1 and R2’s comments, we have revised and expanded our discussion of our IRR metrics. We removed the mean % mean agreement metric, because we believe our other two custom metrics provide sufficient description. We have added a better explanation of how these metrics work, with a sample table (Table 2). We have also added Krippendorff’s alpha for comparison, even though the skewed distributions of our responses make it a problematic metric to use and lead to lower values than if the distributions were not skewed. We have also expanded our discussion of the role of IRR metrics, which only capture the level of agreement before our discussion-based reconciliation process. Finally, we have discussed why certain questions likely had lower agreement rates than others.
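    As a rough illustration of the point about skewed distributions (a generic Python sketch, not our released analysis code), the snippet below computes nominal Krippendorff's alpha from scratch and shows that two labelers who agree on 9 of 10 toy items can receive an alpha near 0.8 when the labels are balanced, but an alpha of 0 when one label dominates, because chance agreement is much higher under a skewed distribution:

        import numpy as np
        from collections import Counter

        def nominal_alpha(ratings):
            """Krippendorff's alpha (nominal level) for complete data.

            ratings: array-like of shape (n_raters, n_units) of hashable
            labels, with no missing values.
            """
            ratings = np.asarray(ratings, dtype=object)
            n_raters, n_units = ratings.shape

            # Coincidence counts of ordered label pairs within each unit,
            # each pair weighted by 1 / (n_raters - 1).
            pairs = Counter()
            for u in range(n_units):
                labels = ratings[:, u]
                for i in range(n_raters):
                    for j in range(n_raters):
                        if i != j:
                            pairs[(labels[i], labels[j])] += 1.0 / (n_raters - 1)

            totals = Counter()
            for (c, _k), v in pairs.items():
                totals[c] += v
            n = sum(totals.values())

            observed = sum(v for (c, k), v in pairs.items() if c != k)
            expected = sum(totals[c] * totals[k]
                           for c in totals for k in totals if c != k) / (n - 1)
            return 1.0 - observed / expected

        # Both pairs of labelers agree on 9 out of 10 items (90% raw agreement).
        skewed = [["yes"] * 9 + ["no"],                  # labeler A
                  ["yes"] * 10]                          # labeler B
        balanced = [["yes"] * 5 + ["no"] * 5,            # labeler A
                    ["yes"] * 5 + ["no"] * 4 + ["yes"]]  # labeler B

        print(nominal_alpha(skewed))    # 0.0  -- skew drives alpha down
        print(nominal_alpha(balanced))  # ~0.81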

    Presentation of results: R1 raised various questions about how we presented our results, such as why we analyze only 103 out of 141 papers for some questions, or the role of "unsure" answers. In response, we have expanded all results tables to show totals of applicable and non-applicable papers, with a note about the applicability criteria. We also present a new results bar chart (Figure 1) that makes our staged process more intuitive. Finally, we have revised table labels to be more descriptive.

    Unclear question names: R1 raised issues with how we named some questions, such as “used original human annotation.” We have renamed some questions to make them more descriptive of what our instructions actually were. “Used original human annotation” is now “original or external human-labeled data” and is presented in a far more intuitive manner. We have also consistently used “labeling” instead of “annotation” throughout the paper, and we renamed “labels from human annotation” to “labels from human judgement.”

    Expanded analysis of scores by various categories: Based on R1’s requests, we have added more analyses and visualizations of scores by area/domain, specifically boxplots that show the different distributions. We have also added analyses of info scores by conference paper versus journal article, which show a statistically significant difference. Finally, because info scores were only calculated for the 45 papers that used original human labeling, we also produced several analyses using the label source reporting rate, which was applicable to all 141 papers.
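    For concreteness, the following is a minimal sketch of this kind of comparison (illustrative Python only, not our released notebooks; the filename and the columns info_score, venue_type, and domain are hypothetical, and a Mann-Whitney U test is shown as one reasonable choice of test for bounded, skewed score distributions):

        import pandas as pd
        from scipy import stats
        import matplotlib.pyplot as plt

        # Hypothetical file and column names, for illustration only.
        papers = pd.read_csv("final_labels_with_scores.csv")

        conference = papers.loc[papers["venue_type"] == "conference", "info_score"]
        journal = papers.loc[papers["venue_type"] == "journal", "info_score"]

        # Compare score distributions between venue types.
        u_stat, p_value = stats.mannwhitneyu(conference, journal,
                                             alternative="two-sided")
        print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")

        # Boxplots of score distributions by area/domain.
        papers.boxplot(column="info_score", by="domain", rot=45)
        plt.tight_layout()
        plt.show()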

    Expanded discussion: Per R2’s suggestion, we have expanded our discussion of institutional solutions to issues of methodological standards. We discuss our findings in the context of various domain-independent and domain-specific methodological reporting guides, and we specifically reference recent meta-research on the impact of journals adopting PRISMA guidelines for meta-reviews. We have also removed the uncited assertion that R2 identified and have rephrased it as a suggestion.

    Typos: we have fixed the minor issues and errors identified.



  • pre-publication peer review (ROUND 1)
    Decision Letter
    2021/03/29

    29-Mar-2021

    Dear Dr. Geiger:

    Your manuscript QSS-2021-0012 entitled ""Garbage In, Garbage Out" Revisited: What Do Machine Learning Application Papers Report About Their Training Data?", which you submitted to Quantitative Science Studies, has been reviewed. The comments of the reviewers are included at the bottom of this letter.

    Both reviewers are positive about your work and have only minor suggestions for improvements. Based on the comments of the reviewers as well as my own reading of your manuscript, my editorial decision is to invite you to prepare a minor revision of your manuscript.

    To revise your manuscript, log into https://mc.manuscriptcentral.com/qss and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.

    You may also click the below link to start the revision process (or continue the process if you have already started your revision) for your manuscript. If you use the below link you will not be required to login to ScholarOne Manuscripts.

    PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm.

    https://mc.manuscriptcentral.com/qss?URL_MASK=0bf88698410e48c383ba03717f0f4792

    You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, revise your manuscript using a word processing program and save it on your computer. Please also highlight the changes to your manuscript within the document by using the track changes mode in MS Word or by using bold or colored text.

    Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

    When submitting your revised manuscript, you will be able to respond to the comments made by the reviewers in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the reviewers.

    IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

    If possible, please try to submit your revised manuscript by 28-May-2021. Let me know if you need more time to revise your work.

    Once again, thank you for submitting your manuscript to Quantitative Science Studies and I look forward to receiving your revision.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author
    In this paper, the authors provide an interesting study in which they analyse the process of generating gold standards in a sample of machine learning papers. They also investigate how this process is described, which is of real importance because it has implications for reproducibility.

    I believe this paper will shed even more light on this problem; as a community, we do not put enough effort into it.
    The big question is whether 141 papers are representative enough of the phenomenon. However, at this stage, considering the level of effort put into this experiment, I consider it a substantial contribution.

    I found this paper comprehensive and very interesting to read. At this stage, I do not have strong issues to highlight. However, I do have some minor ones.

    In Table 2, the mean total agreement, mean % mean agreement, and mean % correct, although described later, are still hard to understand. Specifically, I am struggling to understand the mean % mean agreement. I would encourage the authors to clarify its definition; I am not sure whether giving the actual formula would help.

    In "Human annotation for training versus evaluation" section, you analyse only 103 out of 141. In the text, one can read these papers come ‘yes’ or ‘implicit’ which add to 102. I am assuming you added also the ‘unsure’. Not clear this.

    In the section "Used original human annotation", I do not understand what Tables 7 and 8 really show: the caption of Table 7 says “Table 7. Used original human annotation”, but in this table I can see external contributions, and vice versa for Table 8.

    In the section “link to dataset available”, I would argue that, depending on the field, authors might not be fully aware of the importance of sharing data; it might not yet be embedded in their culture. I would also argue that they may not have the expertise to share data properly, which includes choosing the format, the annotations, the documentation, and finally where to store the data.

    One final and general question: is it possible for you to show how these metrics are distributed across the different fields shown in Table 4? This would allow readers and the community to assess whether the problem is shared in equal proportion across the whole of science or whether some research fields are more active in describing their datasets. Such information could indeed drive more targeted policies.

    I really enjoyed reading and reviewing your paper. Thanks a lot for your work.

    P.S. It would be nice if your dataset were open for scrutiny; this might also help with that 11.11% in Table 19.

    Reviewer: 2

    Comments to the Author
    The authors present a study examining the practices of labeling machine learning (ML) training datasets in 200 studies from the social sciences & humanities, life & health sciences, and physical & environmental sciences, expanding on an earlier study that reported similar findings. The study itself relies on labeling tasks similar to those in the identified studies, but from the perspective of the details provided about ML labeling in the publications under investigation. The research posits that if quality datasets for training purposes represent a gold standard, then published studies should ensure that the data labeled and reported also reflect this. The authors conclude that practices for documenting the details that went into human labeling of datasets varied widely across the studies investigated. They argue that the implications of the findings are not limited to ML studies but could provide insight into any studies where human-labeled training data are used and then re-used in future studies and assumed to be correct.

    The study addresses a fundamental aspect of ML research by investigating the quality of ML training data through the detailed reporting of the human-annotated training data. I don’t have many suggestions for the authors. They have largely practiced what they promote in terms of methodological transparency in their own labeling method, so my comments relate more to the reporting of the findings and conclusions drawn.

    The average for the mean total agreement reported in Table 2 (57.65%) was not great, and it was particularly low for certain questions (labels from human annotations, original human annotation source, synthesis of annotator overlap, reported inter-annotator agreement). Can the authors provide some explanation of why these were much lower than for some of the other questions?

    In the discussion, the authors write “We call on the institutions of science—publications, funders, disciplinary societies, and educators—to play a major role in working out solutions to these issues. Research publications are limited by length restrictions, which can leave little space for such details.” Do the authors have some recommendations for handling this issue? For example, should more detailed methodology coverage be required within submitted manuscripts, or could some of these details be considered supplemental material to be handled in a similar way as open research data?

    Also, the authors indicate “Peer reviewers and editors play a major role in deciding what details are considered extraneous, with methodological details often removed to give more space to findings and discussions.” Can you cite a source to back this up? This is a broad statement to make without cited evidence.

    Minor things:
    Instead of “axises” use “axes”
    Instead of “corpuses” use “corpora”
    p. 6 line 19, change “an” to “a”
    p. 6 line 31, revisit the first sentence of the paragraph. It’s awkwardly stated and repeats “each week” three times.

All peer review content displayed here is covered by a Creative Commons CC BY 4.0 license.