Abstract

The use of co-occurrence data is common in various domains. Co-occurrence data often needs to be normalised to correct for the size-effect. To this end, Van Eck & Waltman (2009) recommend a probabilistic measure known as the association strength. However, this formula, based on combinations with repetition, implicitly assumes that observations from the same entity can co-occur, even though in the intended usage of the measure these self-co-occurrences are non-existent. A more accurate measure inspired by combinations without repetition is introduced here and compared to the original formula in mathematical derivations, simulations, and patent data. The comparison shows that the original formula overestimates the relation between a pair and that some pairs are more overestimated than others. The new measure is available in the EconGeo package for R maintained by Balland (2016).


Authors

Mathieu P.A. Steijn

Contributors on Publons
  • 1 reviewer
  • pre-publication peer review (FINAL ROUND)
    Decision Letter
    2021/01/23

    23-Jan-2021

    Dear Mr. Steijn:

    It is a pleasure to accept your manuscript entitled "Improvement on the association strength: implementing a probabilistic measure inspired on combinations without repetition." for publication in Quantitative Science Studies. The comments of the reviewers who reviewed your manuscript are included at the foot of this letter.

    I would like to request you to prepare the final version of your manuscript using the checklist available at https://bit.ly/2QW3uV5. Please also sign the publication agreement, which can be downloaded from https://bit.ly/2QYuW4w. The final version of your manuscript, along with the completed checklist and the signed publication agreement, can be returned to qss@issi-society.org.

    Thank you for your contribution. On behalf of the Editors of Quantitative Science Studies, I look forward to your continued contributions to the journal.

    Best wishes,
    Dr. Staša Milojević
    Editor, Quantitative Science Studies
    smilojev@indiana.edu

    Editor Comments to Author:

    Reviewers' Comments to Author:

    Author Response
    2021/01/22

    Dear Staša,

    Thank you for sending me the annotated version and for your kind words. As there were only a few small but helpful comments, I could resolve matters rather swiftly. Please find the revised version here.

    With kind regards,

    Mathieu

  • pre-publication peer review (ROUND 2)
    Decision Letter
    2021/01/20

    20-Jan-2021

    Dear Mr. Steijn:

    Manuscript QSS-2020-0071.R1 entitled "Improvement on the association strength: implementing a probabilistic measure based on combinations without repetition.", which you submitted to Quantitative Science Studies, has been reviewed. The comments of the reviewers are included at the bottom of this letter.

    Based on the comments of the reviewers as well as my own reading of your manuscript, my editorial decision is to invite you to prepare a minor revision of your manuscript.

    To revise your manuscript, log into https://mc.manuscriptcentral.com/qss and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.

    You may also click the below link to start the revision process (or continue the process if you have already started your revision) for your manuscript. If you use the below link you will not be required to login to ScholarOne Manuscripts.

    PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm.

    https://mc.manuscriptcentral.com/qss?URL_MASK=47042a67b4f7499d99d8cbafcb8c5d86

    You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, revise your manuscript using a word processing program and save it on your computer. Please also highlight the changes to your manuscript within the document by using the track changes mode in MS Word or by using bold or colored text.

    Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

    When submitting your revised manuscript, you will be able to respond to the comments made by the reviewers in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the reviewers.

    IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

    If possible, please try to submit your revised manuscript by 21-Mar-2021. Let me know if you need more time to revise your work.

    Once again, thank you for submitting your manuscript to Quantitative Science Studies and I look forward to receiving your revision.

    Best wishes,
    Dr. Staša Milojević
    Editor, Quantitative Science Studies
    smilojev@indiana.edu

    Editor Comments to Author:

    Both reviewers recommended the paper to be accepted for publication and I agree. However, one of the reviewers made a number of suggestions you may wish to implement, therefore the minor revision decision.

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author
    The author has addressed all my comments.

    Reviewer: 2

    Comments to the Author
    I thank the author for his detailed answers to the comments and the modifications and corrections brought to the paper. I spotted some additional typos (see the annotated version and the detailed comments below). In the annotated version, I also point to two last issues:
    - First, the distinction that could be made between 1) the total count of occurrences that could be counted in the diagonal (which we can name “self co-occurrence”) and 2) the number of “solo” participations in events or places (“self occurrence”) – in the examples given in the paper, this is the case of Patent 1, which only involves one class. I guess the author could have been clearer regarding this issue, but the current version is already better in this regard. My idea is that taking into account the number of solo participations could be relevant for some researchers and some questions, and therefore the idea of disregarding these participations in the normalisation process is not completely obvious (see more detailed comments regarding this issue below)
    - The choice made by the author to refer to the mathematical notion of “combinations without repetition” is disputable since this terminology implies a type of notation and calculation that is not used in this paper

    To finish, I want to thank the author for his valuable contribution to the journal and I look forward to the publication being online.

    Detailed comments:

    Line 15, p.2: replace “don't co-author papers” by “do not co-author papers”
    Line 40, p.2: replace “entitites” by “entities”
    Line 42, p.2: replace “at leaste” by “at least”

    Footnote 5, p.2: “In combinations without repetition one cannot draw an observation in the second draw if it has been drawn in the first draw. In this setting none of the observations belonging to the same entity as the first observation can be drawn in the second draw.” --> I think this clarification is very important and deserves to appear in the full text. Indeed, it might be useful to justify the title of the article (if you decide to keep referring to “combinations without repetition” in the title despite the fact that it is not exactly a measure based on combinations without repetition…)

    Line 5, p.4: “All in all, it is advisable to use the improved formula when working with co-occurrence data, where self co-occurrences are non-existent or irrelevant.” --> Actually, what you mean here by self co-occurrence implies both counting the total number of occurrence of i and taking into account self occurrence per se i.e. papers or sites where only i is appearing in the occurrence matrix (such as Patent 1 in your example).

    Lines 38-39, p.5: “Although this leads to the same results it is advisable to use zeros as missing values often results in errors when using statistical software.” --> this sentence is grammatically incorrect

    Lines 17-28, p.6: “In many applications of co-occurrence data, such as the concept of relatedness, the raw numbers of co-occurrences between entities cannot straightforwardly be interpreted as giving the strength of the relation between each pair of entities. There is a so-called size-effect, as some classes co-occur more often with others for the simple reason that these classes have more occurrences in the first place. Like in our example, where d has more co-occurrences with c than with a or b but c also has more occurrences in total and therefore is more likely to co-occur with any class.” -->

    Note that this could be a reason why it could be interesting to also take into account the patents with only one class (since it could give an idea of which are the more "popular" classes). I mean it could be an argument to consider the value of self-occurrence, which does not mean that I disagree with your approach. It is just that I could understand the choice of a different approach, e.g. in ecological networks, where I think researchers do normalise their data by taking into account self-occurrences of certain species in certain sites.

    • The same argument applies to footnotes 16 and 17, p.9. --> The fact that patent 1 is related to class C implies that class C is the more popular class in your example so that it would not be unjustified to take this information into account to infer an "expected number of co-occurrence" – again I understand your approach but I would understand this other approach as well

    Lines 36-37, p.7: “The challenge therefore is to correctly estimate the number of expected co-occurrences per combination.” --> As mentioned above, I am not sure about your use of the term "combination" since it implies a specific way of counting (and a specific notation) in probabilities that you do not use eventually

    Lines 39-40, p.8: “To calculate Si one can use the row sum or the column sum of row i” --> what do you mean by the “column sum of row i”?

    Main answers to the authors’ letter (see the annotated version for more comments):

    Regarding “Matthijs J. Warrens review of similarity measures in 2019, or on the same topic Choi & Cha, 2010.” --> it was just an example, I was concerned about the fact that the literature review was quite restricted in your paper. I thought you could enrich the state of art by referring to more recent and contemporary work on a close topic.

    Regarding the story behind the paper “I’m not entirely sure how to integrate this in the paper but I’m open for suggestions.” --> thank you for clarifying this, no need to add this in the paper of course

    Thanks for all the detailed answers and all the changes you brought to the publication.

    Reviewer report
    2021/01/13

    Reviewer report
    2020/12/14

    The author has addressed all my comments.

    Author Response
    2020/12/04

    Dear Staša,

    Thank you very much for reading my work and organizing the review process. I'm very happy with the comments I received, in particular those of the second referee. I adopted many of the suggestions and detail below how I addressed them. I also added these comments as a supplemental file under file upload in case some of the pictures do not show properly. In the .pdf I indicate changes in bold.

    With kind regards,

    Mathieu

    Response to reviewer 1
    I would like to thank you for your time reading my paper, your kind words and useful comments. I listed them in italic below and my answers to them underneath each. In the main text I show changes in bold.

    1. The author should explicitly state if the newly proposed measure is indeed new. It seems it is already incorporated into an R package in 2016 by another researcher. So it is unclear to me who proposed the new measure.

    This was indeed unclear. I developed the measure but the R package is maintained by another researcher. I changed the wording to make this more clear.

    2. on p. 5, this sentence reads odd to me: "To correct the absolute number of co-occurrences for the size-effect data is normalised". Do you mean "is normalization"?

    I clarified the phrasing.

    3. on p. 13, there is a LaTeX formula reference error displayed as ??.

    Thank you. It should refer to formula 11.

    Response to reviewer 2
    I would like to thank you for the visibly large amount of time you took to review and comment on my paper. I find the literature suggestions particularly useful as I was unaware of the existence of these relevant pieces. The addition of this literature and of the suggested terminology has allowed me to position this paper more clearly within the domain of scientometrics. Furthermore, I am grateful for your verification of my formulas and your eye for detail in finding typos. I listed your comments in italic below and my answers to them underneath each. In the main text I show changes in bold.

    The article is interesting and proposes to refine a normalisation measure proposed by a previous article by van Eck and Waltman.
    In doing so, the article is part of an important tradition of discussion of similarity measures in the Scientometrics literature.
    However, it makes little reference to these debates and to this abundant literature. Instead, it mainly relies on the contribution of van Eck and Waltman. Given the very good state of the art done by van Eck and Waltman, we could consider that it is sufficient to refer to it.
    However, it seems that at least two or three additional articles in scientometrics could be considered for the author's purpose.

    I agree and thank you for your suggestions.

    First, I think of articles that specifically raise the question of the content of the diagonal in co-occurrence matrices.
    I am thinking in particular of the article by Ahlgren et al. 2003 included in the review of Mêgnigbêto 2013, which could also be read and mentioned.

    Thank you for the suggestion. I wasn’t aware of this discussion and I think it is very relevant for the paper. I included the discussion on the diagonal on the basis of Ahlgren et al. (2003) in section 2 and come back to it in section 3. I prepare the reader for this discussion by already mentioning it in a footnote in the introduction.

    Second, the article by Leydesdorff and Vaughan 2006 is, in my opinion, interesting in pointing out the existence of a difference in epistemological tradition between the approach of scientometricians in the treatment of co-occurrence data and that of network analysis specialists. The two traditions seem to have come closer together later on in the 2000s.

    Thank you for the suggestion. I wasn’t aware of this point either. I respond to this comment in more detail below with the text references.

    It is notably interesting to observe that the author does not use the term “bipartite network” whereas the issue dealt with in this paper is often discussed in papers dealing with bipartite networks (e.g. Neal 2014 or earlier, Latapy et al. 2008).

    Thank you for the suggestion. I chose to stick as closely as possible to the context of Van Eck & Waltman (2009) and others, but by mentioning terminology from this domain the paper may appeal to a larger audience. I introduced the term and referred to Latapy in Section 2.

    Finally, more recent works could be referred to such as for instance Matthijs J. Warrens review of similarity measures in 2019, or on the same topic Choi & Cha, 2010.

    Thank you for the suggestions of these works. I read them but see that both refer to 2 x 2 tables, which is a different context and therefore uses different formulas that are not that straightforward to compare to the one introduced in this paper. I therefore decided not to include these articles but I am open for suggestions on how to do so.

    It might therefore be interesting for the author to try to integrate its contribution more in the continuity of pre-existing and contemporary work and possibly to succeed in explaining that the proposed measure was not envisaged earlier in this disciplinary tradition and broadly. Indeed, when reading this contribution I could not stop thinking: it makes sense but why didn't anyone propose this earlier in the field then?

    My impression is that many authors don’t use the probabilistic similarity measures but set-theoretic similarity measures, as Van Eck & Waltman (2009) call them, such as the cosine, inclusion or Jaccard index. When one agrees with the point by Van Eck & Waltman (2009) that these latter measures are less suitable than the probabilistic similarity measure then the reasoning is intuitive. It is only when I ran the formula on very small toy examples, as the one given in Matrix 3, that I realised that the formula was giving unreasonable answers. That is when I started to investigate the formula and found out the problem laid out in this paper. I’m not entirely sure how to integrate this in the paper but I’m open for suggestions.

    Perhaps, it is possible that some researchers are already using a formula based on combinations without repetition to compute the “association strength” without formalising it. Besides, one wonders whether some of the articles cited by the author as having used the “association strength” measure have not done that, but the author remains vague on the formula used in these articles (by Hidalgo and Balland) as we point out in the commented version of the document.

    Thank you for your comment. I realise this wasn’t sufficiently clear and have improved this in the paper. Hidalgo uses another probabilistic measure based on conditional probabilities, which I detail in the renewed version of this paper. Balland et al. use the association strength as defined by Van Eck & Waltman (2009).

    Regarding the arguments that are given to use the improved formula, the author insists on the fact that self-co-occurrences should not be accounted for but actually, the issue with the formula of the association strength seems more related to the fact that the possibility of co-occurrence of i with j is counted twice in the calculation of T. I advise the author to clarify that, in particular in the abstract of the publication.

    The problem is in fact with the self-co-occurrences as stated. When one allows for self-co-occurrences the diagonal is not zero. For example if technology classes can co-occur with themselves on patents then Matrix 1 would result in the matrix below:

    Classes a, b, c and d each co-occur with themselves in patent 3, c and d in patent 2, and c in patent 1. In this case combinations with repetition, as used by Van Eck & Waltman, would work, because after first drawing an observation of class a we can still draw a again in the second draw.

    However, the reality is that in this line of research (see the other works cited in this paper and by Van Eck & Waltman) the co-occurrence of a & a, or of an author with him/herself, is nonsensical.

    I tried to clarify the explanation to prevent misunderstandings, see text, but I am open to suggestions for improvement. It is indeed true that not setting the diagonal to zero will lead to a miscalculation of T, but a co-occurrence is not counted twice in that case.
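
    The contrast between drawing with and without repetition can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the formula of the manuscript: the toy counts follow the three-patent example above (patent 1: c; patent 2: c and d; patent 3: a, b, c and d) with the diagonal set to zero, T is taken as the sum of all S_i, T/2 is taken as the number of co-occurring pairs, and the function names are hypothetical. In the "without repetition" variant, once an observation of entity i has been drawn, none of the S_i observations of i can be drawn second.

```python
from itertools import combinations

classes = ["a", "b", "c", "d"]

# Hypothetical co-occurrence counts with a zero diagonal
# (patent 1: c; patent 2: c, d; patent 3: a, b, c, d).
C = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 2],
    [1, 1, 2, 0],
]

S = [sum(row) for row in C]  # co-occurrences per class (row sums)
T = sum(S)                   # assumed definition: total over all classes
PAIRS = T / 2                # each co-occurring pair is counted once per class


def expected_with_repetition(i, j):
    """Expected count if the second draw may hit any observation, including class i."""
    p = (S[i] / T) * (S[j] / T) + (S[j] / T) * (S[i] / T)
    return PAIRS * p


def expected_without_repetition(i, j):
    """Expected count if observations of the first-drawn class are excluded from the second draw."""
    p = (S[i] / T) * (S[j] / (T - S[i])) + (S[j] / T) * (S[i] / (T - S[j]))
    return PAIRS * p


for i, j in combinations(range(len(classes)), 2):
    with_rep = expected_with_repetition(i, j)
    without_rep = expected_without_repetition(i, j)
    print(
        f"{classes[i]}-{classes[j]}: observed={C[i][j]} "
        f"ratio_with={C[i][j] / with_rep:.3f} ratio_without={C[i][j] / without_rep:.3f}"
    )

total_with = sum(expected_with_repetition(i, j) for i, j in combinations(range(4), 2))
total_without = sum(expected_without_repetition(i, j) for i, j in combinations(range(4), 2))
print(f"expected pairs: with repetition {total_with:.3f}, without {total_without:.3f}, observed {PAIRS}")
```

    Under these assumptions the observed-to-expected ratio for the pair c and d is about 1.75 with repetition and 1.25 without, in line with the claim that the original formula overestimates the relation. The expected counts without repetition add up to the observed number of pairs, whereas the counts with repetition fall short because part of the probability mass is assigned to self-co-occurrences.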

    In addition, the author could mention the fact that the issue of the normalisation of co-occurrence data arises in different ways depending on the size effect to be controlled by researchers.
    For instance, it is interesting to note that the size effect that needs to be controlled in most work on ecological networks presupposes working on the bipartite presence-absence matrix (e.g. the species-site matrix) before projecting it.
    The means proposed by ecologists to standardise these data are very numerous and in this branch of the literature, there is much debate on the issue of normalisation on the one hand and on the choice of a similarity indicator on the other. Besides, the two issues are dealt with somehow separately in this literature.
    In addition to encouraging the author to better situate his argument in the field of scientometrics and thus better explain and justify the interest of the proposal he is making with this new formula, several more formal modifications are requested which are detailed in the annotated version of the document.
    In particular, we draw the author's attention to several calculation errors that we have identified: formula 5 and 10 contain mistakes.
    I also encourage the author to reformulate certain sentences and correct typos (see the detailed comments).
    Finally, I reckon that much important information is present in the footnotes of the article and I think some of it would deserve to be present in the body of the text.

    Thank you for these comments. I agree that the work could be better situated within the field of scientometrics and I thank you for the literature suggestions. I will respond to these comments by listing my changes and comments in the text below. For the footnotes, I chose to put only the essential information in the main text to allow for readability for the reader that is less interested in all the details but still put in this information for the more informed reader. If any particular footnotes are needed to be placed in the main text I’d gladly consider this.

    Abstract:
    "this formula is based on combinations with repetition, even though in most uses self-co-occurrences are non-existent or irrelevant" --> It seems to me that the issues of self-co-occurrences and that of justifying the use of a formula based on combinations without repetition are not entirely equivalent. Yet, in this sentence, the absence of self-co-occurrences seem to be the only argument for using a formula based on combinations without repetition.

    See earlier comment.

    Page 2, line 8: "The use co-occurrence data is" --> "The use of"

    Page 2, line 16: Missing parenthesis

    Page 2, line 18: "Its use is widespread and in close relation with the popularity of network analysis across disciplines." --> We could object that the use of co-occurrence data does not necessarily require the use of network analysis (the distinction is made by Leydesdorff & Vaughan in Leydesdorff & Vaughan 2006. You could of course disagree with them but their paper suggests that the link between co-occurrence data and network analysis might not be as straightforward as you are suggesting here).

    Thank you for the literature suggestion. I agree that Leydesdorff & Vaughan (2006) show that the perspective on what constitutes links and nodes differs between information scientists and social network analysts. However, I don’t think this justifies the claim that the analysis of co-occurrence data in the tradition of information science, as described by Leydesdorff & Vaughan (2006), is not also a form of network analysis. The authors describe different definitions of links and nodes in information science and in social network analysis, but in my eyes both are forms of network analysis, just with different definitions of what the links and nodes reflect. It is my impression that Leydesdorff & Vaughan (2006) do not claim that network analysis is solely the domain of social network analysts, and therefore network analysis per se does not need to conform to the particularities of the way social network analysts perform this type of analysis.

    That network analysis does not refer only to the practices of social network analysts is confirmed, in my opinion, by the words of Leydesdorff & Vaughan (2006, p. 1625-1626): “First, there has recently been an increasing effort to elaborate algorithms in network analysis more generally under the pressure to understand the operations of the Internet, but also of other networks in biological and physical systems (Da F. Costa, Rodrigues, Travieso, & Villas Boas, 2005). Social network analysts can profit from these developments, which are theoretically informed by graph theory.” It is also confirmed by the works cited in the first paragraph of the introduction of this paper. I hope you agree with me that Leydesdorff & Vaughan (2006) describe the difference between co-occurrence data in information science and social network analysis, and not network analysis per se. If not, I’m open to new arguments and to learning more about this debate. I did incorporate some of the information on Leydesdorff & Vaughan (2006) in another part of the paper.

    Page 2, Line 27: "However, the total number of co-occurrences between a pair of entities cannot be used straightforwardly to reflect the relatedness between them because entities with more observations are more likely to co-occur than entities with fewer observations." --> Note that different size effects could be distinguished, e.g for a co-authorship network:
    - a size effect related to the total number of observations/documents of the corpus co-signed by author i
    - a size effect related to the total number of authors per publication co-signed by author i (it is the size effect that Newman, 2001 or Maisonobe et al., 2016 - at the level of spatial units - intend to control by fractionating the links weight. Regarding this issue, see also Leydesdorff & Park, 2017 responding to Perianes-Rodriguez et al., 2016)
    - a size effect related to the total nb of unique co-authors of author i among a given number of documents
    I think you are offering a solution that can mainly deal with the last one.

    Your comment is right to detail that size effects are differently defined across studies and that the current phrasing at said location passes over these differences. I therefore added a comment to footnote 1. However, the new formula of the paper deals with a different type of size effect than the mentioned “size effect related to the total nb of unique co-authors of author i among a given number of documents.” Thanks to your comment I realised that my explanation of the size effect is inadequate. I therefore added a piece of text with more details to footnote 13.

    Page 3, Line 33: "It is shown that: firstly, the original formula overestimates the relatedness between a pair, when these co-occur at least once." --> what does "these" refer to in this sentence? it is not clear

    Page 5, Line 21: "a binary occurrence matrix" --> Note that in some other branches of the literature (e.g. Neal, 2014), the same object is called a "bipartite matrix"

    Page 6, Line 8: "The diagonal is set to zero as the reference to a certain class does not entail a co-occurrence between that class and itself." --> Okay, but actually the result of multiplying the transpose of O by O is a square matrix with non-zero diagonal values. Diagonal values in the outcome of this operation correspond to the degree of each entity.

    That is correct, but keeping these diagonal values would lead to calculation errors in Si, Sj, and T further on. Therefore, the choice is made to set the diagonal values to zero. I’ve improved the discussion on the diagonal, see earlier comments.
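
    The exchange above about the diagonal can be illustrated with a minimal sketch. The occurrence matrix O below is made up, the helper names are hypothetical, and the code only assumes that Si is the sum of column i of C; it is not taken from the manuscript or from the EconGeo package.

```python
# Hypothetical binary occurrence matrix O: rows are documents (e.g. patents),
# columns are classes a, b, c. The values are invented for illustration only.
O = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# C = O^T O: off-diagonal cells count co-occurrences between two classes,
# diagonal cells count how often each class occurs at all (its degree).
C = matmul(transpose(O), O)
degrees = [sum(col) for col in zip(*O)]
diagonal = [C[i][i] for i in range(len(C))]

# Column sums with the diagonal kept mix degrees into the totals.
S_with_diagonal = [sum(col) for col in zip(*C)]

# Column sums after zeroing the diagonal count only co-occurrences
# with other classes (assumed here to be the S_i of the manuscript).
C_zero = [[0 if i == j else C[i][j] for j in range(len(C))] for i in range(len(C))]
S = [sum(col) for col in zip(*C_zero)]

print("diagonal of C:", diagonal, "degrees:", degrees)
print("column sums with diagonal:", S_with_diagonal, "without:", S)
```

    In this toy case the diagonal of C equals the degree of each class, as the reviewer notes, and leaving it in place inflates every column sum by that degree.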

    Page 6, Line 51: "Neffke et al. (2011) who look at the co-occurrence of products in the production process of the same plant also correct for the probability of the respective products." --> The normalised maturity index defined in Neffke et al. does not explicitly take into account the co-occurrence of products. I am not entirely convinced by the relation between this publication and your approach. If you believe this is a relevant publication to cite for your purpose, please explain more clearly why.

    I cite Neffke et al. (2011) as an example of a paper in which co-occurrences are normalised for more than the size-effect. The co-occurrence of products within plants is not used to determine maturity but the relatedness between products. My citation and comment are meant to suggest to the reader that size-effects may not be the only thing that requires normalisation for his/her purposes and that one has to think carefully about which procedure is fit. I gladly accept feedback on how to make this clearer if it was not clear from my original writing.

    Page 7, Line 11: "Hidalgo et al. (2007) developed an in influential network analysis tool to derive the what they call relatedness between entities on the basis of co occurrences." --> is the "the" before "what they call" really necessary in this sentence?

    Page 7, Line 15: "Although they use a different probabilistic direct similarity measure than the ones covered by van Eck and Waltman (2009), other authors (e.g. Balland et al., 2015) building on the framework of Hidalgo et al. (2007) do opt for the association strength." --> Are they also giving a different formula than van Eck and Waltman to compute the association strength? What is the difference with van Eck and Waltman's formula? Do these other formulas have the same shortcoming?

    See earlier comment.

    Page 8, Line 40: "column i of the C when the diagonal is set to zero." --> "of matrix C"

    Page 9, Line 9: You can also write: n(n-1)/2

    Page 10, Line 32: "This is because the formula observes 2 occurrences for each class and 3 possible partners to co-occur with even though there are only 2 possible partners. Class a can co-occur with class b and class c but not with itself." --> This assertion could be subject to discussion: see the article by Ahlgren, Jarneving & Rousseau 2003 and the review of Mêgnigbêto 2013 which show that the content of diagonals is a subject of debate among scientometricians.

    See earlier comment.

    Page 12, Formula 5: There is a mistake here: it's (SiT + SjT - 2SiSj) instead of (SiT + SjT - SiSj)

    Page 12, Line 41 and 42: replace "SiSj" by "2SiSj"

    Page 13, Line 36: Missing parenthesis

    Page 14, Formula 10: I ran the math again and I found -2SiSjL instead of -3SiSjL

    I also redid the calculations for formulas 5 and 10 and see that you are correct. There was even a second error in formula 10. Thank you very much for taking the time to verify the calculations. This is much appreciated.

    Page 14, Line 35: replace "??" by "formula 11"

    Page 16, Line 23: Missing comma between Sj and Si

    Page 17, Line 45: Missing comma before "potential"

    Page 17, Line 51: replace "Graph 1" by "Figure 1"

    Page 18, Line 43: Missing comma before "L"

    Page 25, Line 24: "In this line of research self-co-occurrences are non-existent or irrelevant, whereas the probability formula assumes that an observation from an entity can be drawn again after been picked in the first draw." --> I would be more careful than you are since it is a subject of debate among scientometricians (see my previous comment on this issue)

    See earlier comment.

    Page 25, Line 28: "This paper introduces a formula that is based on, but not equal to, combinations without repetition" --> What do you mean by "not equal to"

    Page 25, Line 33: replace "the the probability" by "the probability" (repetition)

    Page 26, Line 1: "it is evident that" --> I would prefer a less assertive formulation like "we have shown that"

    I agree. Thank you for the suggestion.

    References, Line 25: 2 typos

    Thank you for finding these typos.

  • pre-publication peer review (ROUND 1)
    Decision Letter
    2020/11/19

    19-Nov-2020

    Dear Mr. Steijn:

    Your manuscript QSS-2020-0071 entitled "Improvement on the association strength: implementing a probabilistic measure based on combinations without repetition.", which you submitted to Quantitative Science Studies, has been reviewed. The comments of the reviewers are included at the bottom of this letter.

    Based on the comments of the reviewers as well as my own reading of your manuscript, my editorial decision is to invite you to prepare a minor revision of your manuscript.

    To revise your manuscript, log into https://mc.manuscriptcentral.com/qss and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.

    You may also click the below link to start the revision process (or continue the process if you have already started your revision) for your manuscript. If you use the below link you will not be required to login to ScholarOne Manuscripts.

    PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm.

    https://mc.manuscriptcentral.com/qss?URL_MASK=3d29ec367e1c460dac967a1c210a952c

    You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, revise your manuscript using a word processing program and save it on your computer. Please also highlight the changes to your manuscript within the document by using the track changes mode in MS Word or by using bold or colored text.

    Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

    When submitting your revised manuscript, you will be able to respond to the comments made by the reviewers in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the reviewers.

    IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

    If possible, please try to submit your revised manuscript by 18-Jan-2021. Let me know if you need more time to revise your work.

    Once again, thank you for submitting your manuscript to Quantitative Science Studies and I look forward to receiving your revision.

    Best wishes,
    Dr. Staša Milojević
    Editor, Quantitative Science Studies
    smilojev@indiana.edu

    Editor Comments to Author:

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author
    This is a very good submission. The author made a convincing argument of why a popular existing co-occurrence measure overestimates the similarity of entities in a matrix. The author proposed a new similarity measure. The mathematical narrative for this measure seems logical. The patent data further validate its utility. I only have a few minor comments:
    1. The author should explicitly state if the newly proposed measure is indeed new. It seems it is already incorporated into an R package in 2016 by another researcher. So it is unclear to me who proposed the new measure.
    2. on p. 5, this sentence reads odd to me: "To correct the absolute number of co-occurrences for the size-effect data is normalised". Do you mean "is normalization"?
    3. on p. 13, there is a latex formula reference error displayed as ??.

    Reviewer: 2

    Comments to the Author
    The article is interesting and proposes to refine a normalisation measure proposed by a previous article by van Eck and Waltman.
    In doing so, the article is part of an important tradition of discussion of similarity measures in the Scientometrics literature.
    However, it makes little reference to these debates and to this abundant literature. Instead, it mainly relies on the contribution of van Eck and Waltman. Given the very good state of the art done by van Eck and Waltman, we could consider that it is sufficient to refer to it.
    However, it seems that at least two or three additional articles in scientometrics could be considered for the author's purpose.
    First, I think of articles that specifically raise the question of the content of the diagonal in co-occurrence matrices.
    I am thinking in particular of the article by Ahlgren et al. 2003 included in the review of Mêgnigbêto 2013, which could also be read and mentioned.
    Second, the article by Leydesdorff and Vaughan 2006 is, in my opinion, interesting in pointing out the existence of a difference in epistemological tradition between the approach of scientometricians in the treatment of co-occurrence data and that of network analysis specialists. The two traditions seem to have come closer together later on in the 2000s.
    It is notably interesting to observe that the author does not use the term “bipartite network” whereas the issue dealt with in this paper is often discussed in papers dealing with bipartite networks (e.g. Neal 2014 or earlier, Latapy et al. 2008).
    Finally, more recent works could be referred to, such as Matthijs J. Warrens' review of similarity measures in 2019, or on the same topic Choi & Cha, 2010.
    It might therefore be interesting for the author to try to integrate his contribution more in the continuity of pre-existing and contemporary work and possibly to succeed in explaining why the proposed measure was not envisaged earlier in this disciplinary tradition and more broadly. Indeed, when reading this contribution I could not stop thinking: it makes sense but why didn't anyone propose this earlier in the field then?
    Perhaps, it is possible that some researchers are already using a formula based on combinations without repetition to compute the “association strength” without formalising it. Besides, one wonders whether some of the articles cited by the author as having used the “association strength” measure have not done that, but the author remains vague on the formula used in these articles (by Hidalgo and Balland) as we point out in the commented version of the document.
    Regarding the arguments that are given to use the improved formula, the author insists on the fact that self-co-occurrences should not be accounted for but actually, the issue with the formula of the association strength seems more related to the fact that the possibility of co-occurrence of i with j is counted twice in the calculation of T. I advise the author to clarify that, in particular in the abstract of the publication.
    In addition, the author could mention the fact that the issue of the normalisation of co-occurrence data arises in different ways depending on the size effect to be controlled by researchers.
    For instance, it is interesting to note that the size effect that needs to be controlled in most work on ecological networks presupposes working on the bipartite presence-absence matrix (e.g. the species-site matrix) before projecting it.
    The means proposed by ecologists to standardise these data are very numerous and in this branch of the literature, there is much debate on the issue of normalisation on the one hand and on the choice of a similarity indicator on the other. Besides, the two issues are dealt with somehow separately in this literature.
    In addition to encouraging the author to better situate his argument in the field of scientometrics and thus better explain and justify the interest of the proposal he is making with this new formula, several more formal modifications are requested which are detailed in the annotated version of the document.
    In particular, we draw the author's attention to several calculation errors that we have identified: formulas 5 and 10 contain mistakes.
    I also encourage the author to reformulate certain sentences and correct typos (see the detailed comments).
    Finally, I reckon that much important information is present in the footnotes of the article and I think some of it would deserve to be present in the body of the text.

    Detailed comments (also present in the annotated version of the text):

    Abstract:
    "this formula is based on combinations with repetition, even though in most uses self-co-occurrences are non-existent or irrelevant" --> It seems to me that the issues of self-co-occurrences and that of justifying the use of a formula based on combinations without repetition are not entirely equivalent. Yet, in this sentence, the absence of self-co-occurrences seems to be the only argument for using a formula based on combinations without repetition.

    Page 2, line 8: "The use co-occurrence data is" --> "The use of"

    Page 2, line 16: Missing parenthesis

    Page 2, line 18: "Its use is widespread and in close relation with the popularity of network analysis across disciplines." --> We could object that the use of co-occurrence data does not necessarily require the use of network analysis (the distinction is made by Leydesdorff & Vaughan in Leydesdorff & Vaughan 2006. You could of course disagree with them but their paper suggests that the link between co-occurrence data and network analysis might not be as straightforward as you are suggesting here).

    Page 2, Line 27: "However, the total number of co-occurrences between a pair of entities cannot be used straightforwardly to reflect the relatedness between them because entities with more observations are more likely to co-occur than entities with fewer observations." --> Note that different size effects could be distinguished, e.g. for a co-authorship network:
    - a size effect related to the total number of observations/documents of the corpus co-signed by author i
    - a size effect related to the total number of authors per publication co-signed by author i (it is the size effect that Newman, 2001 or Maisonobe et al., 2016 - at the level of spatial units - intend to control by fractionating the links weight. Regarding this issue, see also Leydesdorff & Park, 2017 responding to Perianes-Rodriguez et al., 2016)
    - a size effect related to the total nb of unique co-authors of author i among a given number of documents
    I think you are offering a solution that can mainly deal with the last one.

    Page 3, Line 33: "It is shown that: firstly, the original formula overestimates the relatedness between a pair, when these co-occur at least once." --> what does "these" refer to in this sentence? it is not clear

    Page 5, Line 21: "a binary occurrence matrix" --> Note that in some other branches of the literature (e.g. Neal, 2014), the same object is called a "bipartite matrix"

    Page 6, Line 8: "The diagonal is set to zero as the reference to a certain class does not entail a co-occurrence between that class and itself." --> Okay, but actually the result of multiplying the transpose of O by O is a square matrix with non-zero diagonal values. Diagonal values in the outcome of this operation correspond to the degree of each entity.

    Page 6, Line 51: "Neffke et al. (2011) who look at the co-occurrence of products in the production process of the same plant also correct for the probability of the respective products." --> The normalised maturity index defined in Neffke et al. does not explicitly take into account the co-occurrence of products. I am not entirely convinced by the relation between this publication and your approach. If you believe this is a relevant publication to cite for your purpose, please explain more clearly why.

    Page 7, Line 11: "Hidalgo et al. (2007) developed an in influential network analysis tool to derive the what they call relatedness between entities on the basis of co occurrences." --> is the "the" before "what they call" really necessary in this sentence?

    Page 7, Line 15: "Although they use a different probabilistic direct similarity measure than the ones covered by van Eck and Waltman (2009), other authors (e.g. Balland et al., 2015) building on the framework of Hidalgo et al. (2007) do opt for the association strength." --> Are they also giving a different formula than van Eck and Waltman to compute the association strength? What is the difference with van Eck and Waltman's formula? Do these other formulas have the same shortcoming?

    Page 8, Line 40: "column i of the C when the diagonal is set to zero." --> "of matrix C"

    Page 9, Line 9: You can also write: n(n-1)/2
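
    The shorthand suggested here can be written out as follows. This is a sketch that assumes the expression on page 9 counts unordered pairs among n entities; the manuscript's own notation is not reproduced in this record.

```latex
% Unordered pairs of n entities without repetition:
\binom{n}{2} = \frac{n!}{2!\,(n-2)!} = \frac{n(n-1)}{2}
% Unordered pairs with repetition allowed (the n self-pairs are added):
\binom{n+1}{2} = \frac{n(n+1)}{2} = \frac{n(n-1)}{2} + n
```

    The two counts differ by exactly n, the number of self-pairs.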

    Page 10, Line 32: "This is because the formula observes 2 occurrences for each class and 3 possible partners to co-occur with even though there are only 2 possible partners. Class a can co-occur with class b and class c but not with itself." --> This assertion could be subject to discussion: see the article by Ahlgren, Jarneving & Rousseau 2003 and the review of Mêgnigbêto 2013 which show that the content of diagonals is a subject of debate among scientometricians.
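
    The counting argument quoted in this comment (3 possible partners versus 2) can be made concrete with a small sketch. It only enumerates pairs for three hypothetical classes a, b and c; it does not implement the association strength or the new measure, and it takes no side in the debate on the diagonal.

```python
from itertools import combinations, combinations_with_replacement

# Three hypothetical classes, as in the quoted example.
classes = ["a", "b", "c"]

# Unordered pairs when a class cannot be paired with itself.
pairs_without_repetition = list(combinations(classes, 2))

# Unordered pairs when self-pairs such as ("a", "a") are allowed.
pairs_with_repetition = list(combinations_with_replacement(classes, 2))
self_pairs = [p for p in pairs_with_repetition if p[0] == p[1]]

# Possible partners of each class under both views.
partners_with_self = {c: list(classes) for c in classes}
partners_without_self = {c: [d for d in classes if d != c] for c in classes}

print("pairs without repetition:", pairs_without_repetition)
print("pairs with repetition:", pairs_with_repetition)
print("partners of a (self allowed):", partners_with_self["a"])
print("partners of a (self excluded):", partners_without_self["a"])
```

    With repetition each class has 3 candidate partners, itself included; without repetition it has 2, which is the difference the quoted sentence describes.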

    Page 12, Formula 5: There is a mistake here: it's (SiT + SjT - 2SiSj) instead of (SiT + SjT - SiSj)

    Page 12, Line 41 and 42: replace "SiSj" by "2SiSj"

    Page 13, Line 36: Missing parenthesis

    Page 14, Formula 10: I ran the math again and I found -2SiSjL instead of -3SiSjL

    Page 14, Line 35: replace "??" by "formula 11"

    Page 16, Line 23: Missing comma between Sj and Si

    Page 17, Line 45: Missing comma before "potential"

    Page 17, Line 51: replace "Graph 1" by "Figure 1"

    Page 18, Line 43: Missing comma before "L"

    Page 25, Line 24: "In this line of research self-co-occurrences are non-existent or irrelevant, whereas the probability formula assumes that an observation from an entity can be drawn again after been picked in the first draw." --> I would be more careful than you are since it is a subject of debate among scientometricians (see my previous comment on this issue)

    Page 25, Line 28: "This paper introduces a formula that is based on, but not equal to, combinations without repetition" --> What do you mean by "not equal to"

    Page 25, Line 33: replace "the the probability" by "the probability" (repetition)

    Page 26, Line 1: "it is evident that" --> I would prefer a less assertive formulation like "we have shown that"

    References, Line 25: 2 typos

    Reviewer report
    2020/11/19

    Reviewer report
    2020/11/13

All peer review content displayed here is covered by a Creative Commons CC BY 4.0 license.