Abstract

Controlling for confounding factors is one of the central aspects of quantitative research. Although methods such as linear regression models are common, their results can be misleading under certain conditions. We demonstrate how statistical matching can be utilized as an alternative that enables the inspection of post-matching covariate balance. This contribution serves as an empirical demonstration of matching in bibliometrics and discusses its advantages and potential pitfalls. We propose matching as an easy-to-use approach in bibliometrics to estimate effects and remove bias. To exemplify matching, we use data about papers published in Physical Review E and a selection classified as milestone papers. We analyze whether milestone papers score higher than nonmilestone papers on a proposed class of indicators for measuring disruptiveness. We consider the disruption indicators DI1, DI5, DI1n, DI5n, and DEP and test which of these indicators performs best, based on the assumption that milestone papers should have higher disruption indicator values than nonmilestone papers. Four matching algorithms (propensity score matching (PSM), coarsened exact matching (CEM), entropy balancing (EB), and inverse probability of treatment weighting (IPTW)) are compared. We find that CEM and EB perform best with regard to covariate balance and that DI5 and DEP perform well in evaluating the disruptiveness of published papers.
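For orientation, the original disruption indicator DI1 mentioned in the abstract is commonly defined from the citation network around a focal paper. A sketch of the usual formula (the notation below is ours, not taken from the paper):

\[
\mathrm{DI}_1 = \frac{N_F - N_B}{N_F + N_B + N_R}
\]

where N_F is the number of papers citing the focal paper but none of its cited references, N_B the number citing both the focal paper and at least one of its cited references, and N_R the number citing at least one of the cited references but not the focal paper itself. Roughly speaking, DI5 tightens the N_B criterion to citing papers that cite at least five of the focal paper's cited references; DI1n, DI5n, and DEP are further variants and related indicators considered in the paper.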


Authors

Bittmann, Felix;  Tekles, Alexander;  Bornmann, Lutz


  • pre-publication peer review (FINAL ROUND)
    Decision Letter
    2021/09/05

    05-Sep-2021

    Dear Mr. Bittmann:

    It is a pleasure to accept your manuscript entitled "Applied Usage and Performance of Statistical Matching in Bibliometrics: The Comparison of Milestone and Regular Papers With Multiple Measurements of Disruptiveness as an Empirical Example" for publication in Quantitative Science Studies. Some final comments by reviewer 1 can be found at the bottom of this message. Reviewer 2 was not available to review the second revised version of your manuscript. Based on the comments of reviewer 1 and my own reading of your revised manuscript, I consider your work to be suitable for publication in Quantitative Science Studies.

    I would like to request you to prepare the final version of your manuscript using the checklist available at https://tinyurl.com/qsschecklist. Please fix the missing figure numbers on p. 25 and 27. Also, in the supplementary material, change Table/Figure A1, A2, etc. into Table/Figure S1, S2, etc. Please also sign the publication agreement, which can be downloaded from https://tinyurl.com/qssagreement. The final version of your manuscript, along with the completed checklist and the signed publication agreement, can be returned to qss@issi-society.org.

    Thank you for your contribution. On behalf of the Editors of Quantitative Science Studies, I look forward to your continued contributions to the journal.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author
I appreciate the renewed effort in revising the paper and clarifying open questions. The authors have addressed all my concerns. After reading the manuscript again, I think it is ready for publication. The use case lacks a clear treatment and is maybe not the best one to introduce the method, but I think the paper is still valuable for the community in its current form.

    Author Response
    2021/06/21

    Please see our response letter attached.



  • pre-publication peer review (ROUND 2)
    Decision Letter
    2021/04/25

    25-Apr-2021

    Dear Mr. Bittmann:

    Your manuscript QSS-2020-0075.R1 entitled "Applied Usage and Performance of Statistical Matching in Bibliometrics: The Comparison of Milestone and Regular Papers With Multiple Measurements of Disruptiveness as an Empirical Example", which you submitted to Quantitative Science Studies, has been reviewed. The comments of the reviewers are included at the bottom of this letter.

    Based on the comments of the reviewers as well as my own reading of your manuscript, my editorial decision is to invite you to prepare a major revision of your manuscript. I need to emphasize that revising your work does not guarantee that your work will eventually be accepted for publication in the journal.

    In your revision, it is essential to carefully address the issues identified by the two reviewers regarding the causal interpretation of your statistical analysis.

    Regarding the issue of the methodological details of the matching procedures (raised by reviewer 2), my request is to provide more detailed information in a separate document that can be made available as supplementary material.

    Regarding the issue of Stata vs. R (also raised by reviewer 2), I would like to ask you to make available to the reader both results obtained using Stata and results obtained using R. For instance, the Stata results could be presented in your manuscript, while the R results could be reported in the supplementary material.

    To revise your manuscript, log into https://mc.manuscriptcentral.com/qss and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.

    You may also click the below link to start the revision process (or continue the process if you have already started your revision) for your manuscript. If you use the below link you will not be required to login to ScholarOne Manuscripts.

    PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm.

    https://mc.manuscriptcentral.com/qss?URL_MASK=abd6d76fad804bd3951fd13f40887355

    You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, revise your manuscript using a word processing program and save it on your computer. Please also highlight the changes to your manuscript within the document by using the track changes mode in MS Word or by using bold or colored text.

    Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

    When submitting your revised manuscript, you will be able to respond to the comments made by the reviewers in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the reviewers.

    IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

    If possible, please try to submit your revised manuscript by 23-Aug-2021. Let me know if you need more time to revise your work.

    Once again, thank you for submitting your manuscript to Quantitative Science Studies and I look forward to receiving your revision.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author
Review of Manuscript QSS-2020-0075.R1
"Applied Usage and Performance of Statistical Matching in Bibliometrics"

    Overall evaluation
The already good first draft of the paper has further gained from the first round of revisions, especially the inclusion of a separate section on matching methods. The authors took my comments seriously and addressed most of them adequately. Nonetheless, some points remain which should be fixed before publication.

    Major Comments
- P. 19: Thanks for pointing the reader to the ATE, ATT and ATC distinction. I think this is helpful, but the results section is not the right place to do so. Why not add it to 3.1 as part of the counterfactual framework? Clarifying the differences between those three estimands and making researchers think about the meaning of the estimated effect is also one advantage of the matching approach (the formal definitions are sketched after this list). Related to that, I believe the following sentence does not adequately describe the ATT and ATC: "ATT quantifies the disruptiveness (or citation impact) effect of milestone papers, ATC the disruptiveness (or citation impact) effect of papers in the control group." ATT (ATC) gives the difference in disruptiveness if papers in the treated (control) group had instead been assigned to the other group.
- Following up on this, I also began to wonder about your analytical causal claim and a potentially more far-reaching flaw of the analysis: Do you really expect that the treatment "milestone paper" makes the paper more disruptive? This is the current setup of the analysis: T "milestone", Y "disruptiveness". Thinking about it, I would suspect the idea is that papers are selected as milestone papers because they are more disruptive. This would mean the causal arrow runs in the opposite direction, but that also raises the question of whether you need to switch X and Y in your analysis. I read the related comment of the other reviewer, and your response to it does not completely solve this problem. At minimum, you need to be much more explicit in the introduction and the conclusion that this is just for illustrative purposes. However, depending on the research question, it is worth considering what reviewer 2 suggested: measuring disruptiveness before and after the treatment and estimating the effect of T on disruptiveness after the treatment.
    - The paper still contains validity claims at various locations which I believe is problematic. Validity of an inference cannot be seen from covariate balance. Imbalanced groups can still suffice to derive valid estimates, while balance for covariates does not ensure balance for other potential confounders. This also makes it a bit difficult to evaluate the different matching approaches as better or worse.
- There remains some sort of imbalance in the paper. In parts of the paper, especially the results section, effects are interpreted as causal effects, but in the conclusion section it is actually acknowledged that the authors themselves do not really believe in such a causal interpretation. I share these reservations about a causal interpretation and recommend rereading the results section with this conclusion in mind.
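For reference, the three estimands discussed above are commonly defined in potential-outcomes notation, with Y(1) and Y(0) the potential outcomes under treatment and control and T the treatment indicator (a sketch, not the manuscript's own notation):

\[
\mathrm{ATE} = E[Y(1) - Y(0)], \qquad
\mathrm{ATT} = E[Y(1) - Y(0) \mid T = 1], \qquad
\mathrm{ATC} = E[Y(1) - Y(0) \mid T = 0].
\]

In this application, the ATT would be the expected difference in disruptiveness for milestone papers relative to the counterfactual case in which the same papers were not milestone papers.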

    Further comments
- General: While it is common in the literature to use OLS as a placeholder for linear regression, this use is actually not correct. You can estimate linear regressions with OLS, but also with ML and other techniques. At the same time, OLS can be used to estimate other types of models. Please check the complete manuscript for "OLS" and "ordinary least squares" and change these to "linear".
- Abstract: "their results can be misleading when balancing fails". It is not necessarily the lack of balancing that causes the problem, but a lack of common support combined with misspecification of the model. I recommend using "their results can be misleading under certain conditions" instead.
    - P.4 „unbalanced“ (two times): Might be a bit misleading since unbalanced data has a different meaning in panel analysis.
    - P. 14 „Usually, balancing the observed covariates also balances most unobserved covariates“. This claim is too strong, I would say: “Balancing for observed covariates can also help to balance for unobserved covariates which are correlated with observed covariates.“
- P. 8: As highlighted by the other reviewer, Pearl 2009 is really not the correct reference for the counterfactual framework. Why not cite Rubin, Rubin/Rosenbaum, Rubin/Imbens...? Or, if you prefer textbook versions, Angrist/Pischke or Morgan/Winship.
- P. 23 "The balancing in Figure indicates": the figure number is missing.
- Delete the following sentence: "The derived outcomes should be of high quality" (p. 23). This is a strong claim without real proof.
- P. 23: "conclusions are more correct". The wording is strange because something can be either correct or wrong (ignoring philosophical discussions on truth at this point).
- P. 30 "are also often more suitable". Again too strong, I believe. I would say "can sometimes be more suitable/adequate for".
- P. 31: When reporting the robustness checks using linear regression, I would add "(see Table XY in the appendix)" and put the regression results there for the sake of transparency.
- P. 31 "We would like to encourage researchers to check": This can be misleading and might be read as hidden critique of the studies cited previously. What you want to say, I guess, is: "We would like to encourage researchers to follow these examples and check".
    - „Pure causal effects“: I have some problems with the wording „pure“. More common in statistics is „unbiased“.
    - P. 27: „models are balanced“. Not the models are balanced, but balance is achieved between treatment and control group.
- P. 28, last bullet: matching and regression can also be combined, leading to a "doubly robust estimator".
- Not mandatory, but worth considering: Since IPTW is conceptually so close to PSM, why not introduce it straight after PSM (and do the same in the results section)? A minimal sketch below illustrates that both start from the same propensity model.
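To illustrate the conceptual closeness noted in the last comment, a minimal R sketch (with a hypothetical data frame papers containing a binary milestone indicator and illustrative covariates, not the authors' actual code): both PSM and IPTW start from the same estimated propensity score.

    # Estimate propensity scores with a logistic regression
    ps_model <- glm(milestone ~ n_refs + pub_year + n_authors,
                    data = papers, family = binomial())
    papers$ps <- fitted(ps_model)

    # PSM: 1:1 nearest-neighbour matching on the propensity score (MatchIt package)
    library(MatchIt)
    m_out <- matchit(milestone ~ n_refs + pub_year + n_authors,
                     data = papers, method = "nearest",
                     distance = "glm", estimand = "ATT")
    summary(m_out)  # covariate balance before and after matching

    # IPTW: keep treated units at weight 1 and reweight controls by ps/(1 - ps)
    # to target the ATT
    papers$w_att <- ifelse(papers$milestone == 1, 1, papers$ps / (1 - papers$ps))

Whether matching or weighting is preferable then depends mainly on how the estimated propensity scores are used, not on the propensity model itself.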

    Reviewer: 2

    Comments to the Author
The manuscript deals with statistical matching in bibliometrics. Statistical matching is proposed as an easy-to-use tool to estimate effects and remove bias in bibliometric studies comparing different groups. Data about papers published in Physical Review E and a selection classified as milestones are used to exemplify matching. It was tested whether papers classified as milestone papers are more disruptive than papers not classified as such. Different matching procedures such as CEM, EB and IPTW were applied.

I very much appreciate the changes in the revision that were made to improve the manuscript. Many thanks! However, I cannot recommend the manuscript for publication in QSS and recommend submitting it elsewhere (e.g., the Stata Journal). The methodological claim of the manuscript to set a standard for the application of matching in bibliometrics can unfortunately not be fulfilled by the manuscript, for the following reasons:
- Data: The data are very unbalanced: 21 milestone papers are compared to 21,164 (minus 21) papers that are not milestone papers. In the end, there are actually only 21 propensity scores. A total of 4 covariates were used to perform matching. Overall, the data set is far too small in terms of milestone papers and number of variables to perform adequate statistical matching.
- Causal concepts: Two different causal concepts are confused: the concept of focal papers in the disruption index and the causal concept of milestone papers as "treatment".
- Statistical concept: The interesting thing about statistical matching is the interplay between the concept of causal inference from Rosenbaum and Rubin and the various matching procedures. Without sufficient discussion of the theoretical assumptions, e.g., the "strong ignorability condition" (sketched after this list), "randomization", and "endogeneity", the analyses are worthless and it is not clear why certain methodological steps are important at all. As reviewer 1 points out, the difference between matching and regression analysis is blurred. The same interplay can be found in missing data imputation, also developed by Don Rubin.
- KMATCH: The study is strongly related to the Stata module kmatch. Unfortunately, so far there are only reports and no relevant journal articles in which this module is presented or at least reviewed. To refer to the supposed reputation of the developer (B. Jann) beyond any seminal contributions in this field is not very convincing. The content of the manuscript may, therefore, more or less reflect the specific state of discussion within a particular scientific community.
- R: Since the manuscript aims to show how propensity score matching can be profitably applied in bibliometrics in the future, I wonder why Stata is used instead of the free software R. There are now several packages in R that allow propensity score matching and are state of the art. A good introduction is provided by Leite (Leite, W. (2016). Practical propensity score methods using R. Sage). Why should one use the EB method in the Stata module, for instance, when the EB method already exists in R (package "ebalance"), developed by Hainmueller himself? I think Stata is in this respect a little bit old-fashioned.
- Matching procedures: Unfortunately, the study is not very methodologically oriented. No statement can be made from the empirical results as to which method is adequate. Such statements can best be derived from simulation studies, which investigate the behavior of the procedures under different sample conditions. Furthermore, some important steps are missing for a complete presentation of statistical matching, e.g., the graphical representation of the overlap of the propensity score distributions among groups and a sensitivity analysis (a small R sketch of such an overlap plot, together with entropy balancing, follows the reference below). Olmos and Govindasamy (2015), for example, have published a nice and simple introduction to propensity score matching that meets all the requirements for carrying out this method in R.
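For completeness, the "strong ignorability" condition referred to above is usually stated as follows, with X denoting the observed covariates (a sketch in standard potential-outcomes notation, not taken from the manuscript):

\[
\bigl(Y(0), Y(1)\bigr) \;\perp\!\!\!\perp\; T \mid X
\qquad \text{and} \qquad
0 < \Pr(T = 1 \mid X) < 1.
\]

The first part says that, conditional on the covariates, treatment assignment is as good as random; the second (common support) says that every unit has a positive probability of being treated and of being untreated.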

Reference
Olmos, A., & Govindasamy, P. (2015). Propensity scores: A practical introduction using R. Journal of MultiDisciplinary Evaluation, 11(25), 68-88.
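As a minimal R sketch of the two points raised above: entropy balancing is available through the ebalance() function (in Hainmueller's ebal package), and the overlap of the propensity score distributions can be inspected with a simple density plot. The data frame and variable names are hypothetical, not the authors' code.

    library(ebal)
    library(ggplot2)

    # Entropy balancing: reweight control units so that covariate moments
    # match those of the treated group
    eb <- ebalance(Treatment = papers$milestone,
                   X = papers[, c("n_refs", "pub_year", "n_authors")])
    # eb$w holds the resulting weights for the control units

    # Overlap of propensity score distributions between the two groups
    papers$ps <- fitted(glm(milestone ~ n_refs + pub_year + n_authors,
                            data = papers, family = binomial()))
    ggplot(papers, aes(x = ps, fill = factor(milestone))) +
      geom_density(alpha = 0.4) +
      labs(x = "Propensity score", fill = "Milestone paper")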

    Author Response
    2021/02/08

We thank all reviewers and editors for their support and comments! Please find all specific comments in the response letter, attached as a DOCX file.



  • pre-publication peer review (ROUND 1)
    Decision Letter
    2020/11/01

    01-Nov-2020

    Dear Mr. Bittmann:

    Your manuscript QSS-2020-0075 entitled "Applied Usage and Performance of Statistical Matching in Bibliometrics: The Comparison of Milestone and Regular Papers With Multiple Measurements of Disruptiveness", which you submitted to Quantitative Science Studies, has been reviewed. There are two reviewers. The comments of reviewer 1 can be found in the attached PDF file. The comments of reviewer 2 are provided at the bottom of this letter.

    Based on the comments of the reviewers as well as my own reading of your manuscript, my editorial decision is to invite you to prepare a major revision of your manuscript. In the revision, the comments of the reviewers need to be carefully taken into consideration. Two comments require special attention. Reviewer 1 provides suggestions for improving the structure of your manuscript. I support these suggestions. Please consider whether these suggestions can be implemented in your revised manuscript. Reviewer 2 is critical about your discussion of causal inference. I agree with the reviewer that your manuscript does not provide a proper discussion of causal inference. As suggested by the reviewer, it might be better to weaken your statements about inferring causality. Discussing both statistical matching and causal inference in a single manuscript might be too ambitious.

    To revise your manuscript, log into https://mc.manuscriptcentral.com/qss and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.

    You may also click the below link to start the revision process (or continue the process if you have already started your revision) for your manuscript. If you use the below link you will not be required to login to ScholarOne Manuscripts.

    PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm.

    https://mc.manuscriptcentral.com/qss?URL_MASK=6dcd9f198a5a451383f80e0993fd2926

    You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, revise your manuscript using a word processing program and save it on your computer. Please also highlight the changes to your manuscript within the document by using the track changes mode in MS Word or by using bold or colored text.

    Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

    When submitting your revised manuscript, you will be able to respond to the comments made by the reviewers in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the reviewers.

    IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

    If possible, please try to submit your revised manuscript by 01-Mar-2021. Let me know if you need more time to revise your work.

    Once again, thank you for submitting your manuscript to Quantitative Science Studies and I look forward to receiving your revision.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author
    Please see attached file

    Reviewer: 2

    Comments to the Author
The manuscript is about statistical matching and its application to the question of whether papers designated as "milestone studies" differ from regular papers in measurements of disruptiveness. Different kinds of statistical matching procedures, e.g., propensity score matching and exact matching, were used to try to reveal causal effects of "milestone studies". The manuscript claims to "serve as research guide to matching and discuss advantages and potential pitfalls" and to propose matching as an "easy-to-use and powerful approach".

The manuscript is quite interesting and relevant for bibliometric research. Different matching procedures were empirically compared. However, I cannot recommend it for publication in QSS in this version. A major revision or new submission is necessary, which should address the following concerns:

• Treatment: As far as I understand the manuscript correctly, the treatment is awarding a paper as a milestone paper in 2015 or not. But the focal papers were published a long time before the awarding (1993-2004), and the disruption might actually have occurred before the treatment could take effect. I wonder whether the disruption index must be calculated separately with citations until 2015 and after 2015 for both milestone papers and regular papers, in order to show whether awarding a paper as a milestone paper has an effect on disruption. With respect to the causality framework, it is not clear whether awarding a paper as a milestone paper is a cause of disruption or an impact of disruption. As the authors remarked, the selection of milestone papers might depend on the citations they obtain. The revision should give some good arguments regarding this problem.

• Data (p. 4): Actually, the data analysis is based on 21 milestone papers and 21,164 regular papers. This is a rather low sample size for milestone papers, especially for estimating the very low propensity scores (rare events), i.e., the probability of treatment assignment (p < .001). For example, in his simulation for the entropy balancing method, Hainmueller (2012) used a minimal sample size of 300 with a 1:5 treated-to-control ratio. Furthermore, the set of covariates is quite small to match papers. I recommend mentioning this at least as a limitation of the study.

    • Disruption index (p.4-5): The disruption index is a very interesting indicator. I wonder whether a certain normalization of the disruption index itself is desirable both with respect to the number of cited references and to the citations (citation window) to make valid comparisons beyond any matching procedures. A higher number of references of a focal paper gives more chance for citations (N1_j, N_k) than a paper with a lower number of cited references. With this kind of standardization a covariate could be eliminated.

• Citation: On page 7 it was mentioned "that the citation distribution of milestone papers and the non-milestone papers… are very different, it is not possible to include the citation impact itself in the matching procedure". I do not really understand this argument, because the other covariates might also differ in their distributions among groups. I suspect that total citations and the disruption index are strongly correlated, because the citations take part in the definition of the disruption index. If one controls for citations, the differences among the groups in the disruption index vanish (Figure 2). I hope the revision may clarify this issue. The revision should give some further argument why citations were not considered.

• Assumptions: The manuscript gives an easy-to-understand introduction to statistical matching, but in my view the core of causal inference in statistics is not accurately explained. Pearl, for example, is mentioned at the beginning of section 3.2, but Pearl has suggested a quite different approach to causality, the concept of directed acyclic graphs (DAGs), motivated by models in computer science. Essential assumptions are not introduced (e.g., strong ignorability, average treatment effect, randomization), and certain tests are not mentioned (e.g., propensity score balance checks, the amount of variance reduction by matching). For example, the central idea of the propensity score is only mentioned later in the manuscript (row 16, p. 18): for cases with the same propensity, "the cases are similar with regard to all control variables", or balanced. Overall, the claim of the manuscript to serve "as a research guide to matching and discuss advantages and potential pitfalls" does not seem to be fulfilled. I recommend waiving the claims to provide a "research guide to matching" and "to estimate causal effects" and, in return, not going much into the statistical details of causal inference but focusing more on statistical matching as a method to make groups more comparable in the sense of "applied usage".

• Propensity score matching procedures: There are numerous kinds of propensity score matching procedures; the question is which fits the data best. The groundbreaking paper with respect to this study was published by Rosenbaum and Rubin (1985) under the heading of "matched sampling": the idea of constructing multiple control samples with similar distributions of the propensities and of the same sample size as the treatment group (ATT). I wonder why this method was not applied (a minimal sketch of such matched sampling is given after the references below). It is rather simple and does not need sophisticated statistical procedures. A sample of 21,164 papers is very large and too heterogeneous for comparison purposes, but it serves as a huge source for matching, comparable to CEM. Another idea would be to standardize the disruption index and to categorize the other covariates each into three categories. Then a simple matching on one covariate formed as a combination of the three categorized covariates can be applied (Cochran procedure).

• Criticism of propensity scores: The manuscript claims to "serve as a research guide to matching and discuss advantages and potential pitfalls". I wonder why the paper of King and Nielsen (2019) was not discussed, which was published under the title "Why Propensity Scores Should Not Be Used for Matching".
      Minor:

• OLS: In the abstract, "OLS" is mentioned. "OLS" is only an estimation procedure (ordinary least squares), like ML, not a statistical model. I suppose "OLS regression" is meant. In abstracts, abbreviations should be avoided.

    • ATT: It seems that in the case of nearest-neighbour-matching not the average treatment effect (ATE), but the average treatment effect of the treated (ATT) will be estimated.

• Jann (2017, 2019): These manuscripts seem not to have been published yet in a methodologically oriented statistics journal. Given the huge literature in statistics and medical statistics on this topic, there is no need to refer to manuscripts that do not seem to have been peer reviewed yet. The fact that ordinary t-tests or standard errors are biased after matching is not an insight of Jann (2019); it is common knowledge in propensity score matching (dependency of measurements).

References
Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33-38.
King, G., & Nielsen, R. (2019). Why propensity scores should not be used for matching. Political Analysis, 27(4). Copy at https://j.mp/2ovYGsW
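The matched-sampling idea raised under "Propensity score matching procedures" (constructing control samples with propensity score distributions similar to the treated group) can be approximated with standard tooling. A minimal R sketch with five controls per treated unit, again using hypothetical variable names rather than the authors' code:

    library(MatchIt)
    # 1:5 nearest-neighbour matching on the propensity score, targeting the ATT;
    # the large pool of non-milestone papers serves as the reservoir of controls
    m_nn <- matchit(milestone ~ n_refs + pub_year + n_authors,
                    data = papers, method = "nearest",
                    distance = "glm", ratio = 5, estimand = "ATT")
    summary(m_nn)                        # balance diagnostics
    matched_sample <- match.data(m_nn)   # matched data set for the outcome analysis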
