Abstract

Scientometric research often relies on large-scale bibliometric databases of academic journal articles. Long-term and longitudinal research can be affected if the composition of a database varies over time, and text processing research can be affected if the percentage of articles with abstracts changes. This article therefore assesses changes in the magnitude of the coverage of a major citation index, Scopus, over the 121 years from 1900. The results show sustained exponential growth from 1900, except for dips during both world wars, and increased growth after 2004. Over the same period, the percentage of articles with 500+ character abstracts increased from 1% to 95%. The number of different journals in Scopus also increased exponentially, but slowed from 2010, with the number of articles per journal being approximately constant until 1980, then tripling due to megajournals and online-only publishing. The breadth of Scopus, in terms of the number of narrow fields with substantial numbers of articles, simultaneously increased from one narrow field with 1,000+ articles in 1945 to 308 such fields in 2020. Scopus’s international character also changed radically, from 68% of first authors being from Germany and the USA in 1900 to just 17% in 2020, with China dominating (25%).


Authors

Mike Thelwall; Pardeep Sud


Contributors on Publons
  • 1 reviewer
  • pre-publication peer review (FINAL ROUND)
    Decision Letter
    2022/01/04

    04-Jan-2022

    Dear Dr. Thelwall:

    It is a pleasure to accept your manuscript entitled "Scopus 1900-2020: Growth in articles, abstracts, countries, fields, and journals" for publication in Quantitative Science Studies.

    I have a few minor suggestions myself, which can be found at the bottom of this email. You may want to take these suggestions into consideration in the preparation of the final version of your manuscript, but feel free to disregard them at your own discretion.

    I would like to request you to prepare the final version of your manuscript using the checklist available at https://tinyurl.com/qsschecklist. Please also sign the publication agreement, which can be downloaded from https://tinyurl.com/qssagreement. The final version of your manuscript, along with the completed checklist and the signed publication agreement, can be returned to qss@issi-society.org.

    Thank you for your contribution. On behalf of the Editors of Quantitative Science Studies, I look forward to your continued contributions to the journal.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Editor Comments to Author:

    "Lower coverage than Google Scholar and Microsoft Academic is a logical outcome of the standards that journals must meet to be indexed by Scopus": It might be worth adding an acknowledgment that the content selection performed by Scopus (as well as Web of Science) seems to result in an underrepresentation of journals from non-Western countries and journals that are not in English. In this context, a reference to the following paper might be of relevance: https://doi.org/10.1007/s11192-015-1765-5.

    I like the analysis of the availability of abstracts. This analysis yields valuable insights. I think it might be useful to very briefly mention other bibliographic data sources that provide access to abstracts, including their limitations. In particular, Crossref, PubMed, and Microsoft Academic may be of special interest, since they are open. Crossref has the limitation of having lots of records for which the abstract has not been made available by the publisher. PubMed has the limitation of being restricted to biomedical research. Microsoft Academic, however, seems to be a rich data source for abstracts.

    "reflects the greater of Scopus": It seems that something is missing here.

    Author Response
    2021/12/14

    *Thank you very much to the reviewers for evaluating the paper and for these comments. Please see below for the changes made in response, marked with *

    Reviewer: 1

    Comments to the Author

    Disclaimer: at the time of writing this review, I am employed by Elsevier, the company that also publishes the database that is the topic of this article (Scopus). My review should be seen as based on my knowledge of this database and overall experience in database evaluation in similar settings.

    Due to this potential conflict of interest, I will refrain from a [negative] recommendation about the publication and recommend accept (with minor revisions), leaving it to the editor and authors to decide which of my peer review comments to include.

    Page 2

    line 8: "One source of difference between WoS and Scopus is that WoS aims to generate a balanced set of journals to support the quality of citation data used for impact evaluations (Birkle et al., 2020)"
    This doesn't explain the difference: it refers to the WoS definition of its index profile, but how does that cause a difference? It may be worth spending a sentence on the different indexing policies and governance.
    ****The following has been added to explain the difference. “Whilst a larger set of journals would be better for information retrieval, a more balanced set would help citation data that is field normalised or norm-referenced within its field (e.g., adding many rarely cited journals to a single field would push existing journals into higher journal impact factor quartiles and increase the field normalised citation score of cited articles in the existing journals).”

    Page 3

    line 55: "The differences between download years should only influence the citation data, unless Scopus has substantially changed its 1996-2013 coverage after 2018, which seems unlikely."
    Validating the degree to which things have changed is possible. Scopus continuously addresses issues such as deduplication and implements other quality enhancements, and 3 years seems like a lot of time for quality enhancements to get implemented.
    While I agree that it is not very likely things have changed more than what warrants this study, a validation shouldn't be too problematic: for instance, by comparing volume counts by year between the extraction time stamp and the current data through the existing API; even more powerful would be through ICSR Lab.
    **** The following has been added to mention these possibilities: “There may be minor changes due to new de-duplication algorithms or other improvements, however.” We don’t have a straightforward way to compare our data in the way described, so we hope that keeping the limitation statement is acceptable; it seems extremely unlikely that any changes would influence the results.
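    For readers who want to run the volume-count comparison the reviewer describes, a minimal sketch is below, assuming a valid Elsevier API key and archived per-year totals (both placeholders here). The endpoint, header, and response field names follow Elsevier's public Scopus Search API documentation; the query uses standard Scopus advanced-search fields.

    ```python
    # Sketch of the reviewer's suggested validation: compare archived per-year
    # article counts against the totals the Scopus Search API reports today.
    # API_KEY and archived_counts are hypothetical placeholders.
    import requests

    API_KEY = "YOUR-ELSEVIER-API-KEY"  # placeholder; requires Scopus API access
    archived_counts = {2005: 1_234_567, 2010: 1_876_543}  # illustrative numbers only

    for year, old_total in archived_counts.items():
        resp = requests.get(
            "https://api.elsevier.com/content/search/scopus",
            headers={"X-ELS-APIKey": API_KEY},
            params={"query": f"PUBYEAR = {year} AND DOCTYPE(ar)", "count": 1},
        )
        resp.raise_for_status()
        new_total = int(resp.json()["search-results"]["opensearch:totalResults"])
        drift = (new_total - old_total) / old_total
        print(f"{year}: archived {old_total:,} vs current {new_total:,} ({drift:+.2%})")
    ```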

    line 58: "The data was checked for consistency by generating time series for the number of articles per year, per narrow field. Some gaps were identified due to software errors and these were filled by re-downloading the missing data for the narrow field and year."
    It is not clear to me if this may have introduced inconsistencies by re-running the queries at a later time. What was the time span between the iterations?
    **** The following has been added to give this information, “within two months of the original download date”

    Page 4

    line 14 "As in previous papers by the same author"
    The paper will be more inclusive for people less familiar with the field if this can be specified (which of the authors, some example papers). I appreciate this may be to limit the amount of self-citation, in which case perhaps a single example would serve. If the author really is the only one, such niche citations should be justified. Otherwise it could help to add not just one of the authors' papers, but also one from other authors with similar approaches, referred to in the context of this statement.
    ****A citation has been added, as follows. “As in previous papers from the authors’ research group (e.g., Fairclough & Thelwall, 2021)”. We don’t see others using the same threshold so we can’t cite an out-of-team paper.

    line 28 " For example, the skewness is enormous at 107 for the 2004 citation counts and even larger for recent years (387 in 2020), whereas the skewness of the normal distribution is 0"
    I don't know what is meant by "skewness" and why 107 is a big number. Can this be elaborated a little bit?
    **** The sentence has been expanded as follows, with extra citations, “The citation count data is not symmetrical (e.g., equally distributed on either side of the mean) but is highly skewed: whilst most articles have 0 or few citations, so their citation counts are slightly less than the mean, the citation counts of a small number of highly cited articles are far greater than the mean (de Solla Price, 1976; Seglen, 1992).”
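    To make the added explanation concrete, the following illustration (synthetic data, not the paper's) contrasts the skewness of a heavy-tailed, citation-like distribution with that of a normal distribution:

    ```python
    # Synthetic illustration of citation-count skewness; values are made up.
    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(42)
    citation_like = rng.lognormal(mean=0.5, sigma=1.8, size=100_000)  # heavy right tail
    normal_like = rng.normal(loc=10, scale=2, size=100_000)

    print(f"citation-like skewness: {skew(citation_like):.0f}")  # large positive value
    print(f"normal skewness:        {skew(normal_like):.2f}")    # approximately 0
    print(f"share below the mean:   {(citation_like < citation_like.mean()).mean():.0%}")
    ```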

    line 43 "The extent to which the trend reflects the technical limitations of Scopus and its indexing policy rather than the amount of scholarly publishing is unclear because not journals qualify for indexing"
    sentence seems incomplete.
    **** “all” has been added, “The extent to which the trend reflects the technical limitations of Scopus and its indexing policy rather than the amount of scholarly publishing is unclear because not all journals qualify for indexing (e.g., Mabe & Amin, 2001).”

    line 50: "WoS does not have a kink in 2004, suggesting that this is a Scopus phenomenon (WoS has a similar exponentially increasing shape, with sudden increases in 1996, 2015 and 2019)."
    May be worth showing this visually
    ****This graph has been made and added to the online supplement, and the text “see the online supplement for a graph” has been added to the manuscript.

    Page 5

    Fig 1.
    Suggestion on formatting: I have a little trouble comparing the axes and different lines. A quick and easy suggestion: group content by colour, and separate the axes visually, e.g. by using a solid line for the left axis and a dashed line for the right axis, so that you can immediately see which two lines describe the same data point. It will also help colour-blind readers, since there are currently two types of blue that are hard to differentiate. It's a minor comment.
    **** The graph has been remade to be colour-coordinated and to use dashed lines.

    line 39: " ... from 1% in 1990 ...."
    should this be 1900?
    ****Yes, thank you!

    line 43: "In contrast, long abstracts with at least 2000 characters and about 320 words are still rare common, accounting for only 10% of articles in 2020."
    not sure, is "rare common" an English expression (not a native speaker myself) or should it be either rare or "uncommon"
    ****This mistake has been fixed (changed to “rare”).

    line 56 "The trend found here may also partly reflect Scopus ingesting early sources that omitted abstracts, although no evidence was found for this as a cause"
    The use of abstracts slowly started at the beginning of the 1900s. See also https://en.wikipedia.org/wiki/Abstract_(summary)
    **** The Wikipedia 1919 origins of abstracts claim is clearly wrong from our data and the references used to support one of the associated claims also seems to be inappropriate [Bazerman, C. (1988). Shaping written knowledge: The genre and activity of the experimental article in science. Madison: University of Wisconsin Press.] so we are still stuck for references to this issue.

    Page 6

    Section 3.3
    The narrow field definitions focus only on raw counts at present. It may be worth looking at this from a proportional angle too, as overall database output has been much lower in the early years.
    ****This would be interesting but we prefer to just focus on raw counts for the current paper to present a clear message.
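    For anyone wishing to take the reviewer's proportional angle with the published FigShare data, a minimal sketch is below; the table layout and numbers are hypothetical stand-ins for the real per-field, per-year counts.

    ```python
    # Proportional view: each narrow field's share of that year's total articles,
    # rather than its raw count. Data below is illustrative only.
    import pandas as pd

    field_year_counts = pd.DataFrame({
        "field":    ["Oncology", "Oncology", "History", "History"],
        "year":     [1945, 2020, 1945, 2020],
        "articles": [800, 150_000, 200, 20_000],
    })

    totals = field_year_counts.groupby("year")["articles"].transform("sum")
    field_year_counts["share"] = field_year_counts["articles"] / totals
    print(field_year_counts)
    ```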

    Page 7

    line 37 "Surprisingly, the growth in the number of journals slowed and then stopped by 2020, perhaps due to the increasing number of general or somewhat general megajournals (Siler et al., 2020) adequately filling spaces that new niche journals might previously have occupied."
    The evaluation in this paper also stops at 2020, and 2020 is the most recent year, i.e. the year closest to the point of measurement: additional content may still flow in during 2021. The count for 2020 is therefore likely an undercount, so statements about the trend should be careful.
    I'm also not entirely sure how this has been assessed: what has been counted as a journal? How have they been grouped? Has any filtering been applied?
    If I look at the public records of journals in Scopus on:
    https://www.elsevier.com/solutions/scopus/how-scopus-works/content
    where there is a link to:
    https://www.elsevier.com/__data/assets/excel_doc/0015/91122/extlistSeptember2021.xlsx
    and counting by grouping by the first coverage year, I see an ever-increasing count of around 1,000 journals each year since 2004, with only a slight drop in 2020, which again is caused by the list not being fully populated for recently added journals; for example, QSS is not in the list even though its articles since 2020 are now covered, but it will be in the list in the next update in March 2022.
    ****The points about megajournals are speculation based on common knowledge about them. The following has been added to acknowledge possible future changes, “The journal count for 2020 may also increase as back issues of new journals are added in 2021 and afterwards.” We can’t rely on the Elsevier list, unfortunately, since not all of its journals contain articles (e.g., trade journals, review journals).
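    The reviewer's check against the public source list can be reproduced roughly as follows; this is a sketch, and the column name "Coverage" and a "YYYY-ongoing" value format are assumptions about the spreadsheet's layout rather than confirmed details.

    ```python
    # Count how many titles in the public Scopus source list were first covered
    # in each year. Column name and value format are assumed, not verified.
    import pandas as pd

    sources = pd.read_excel("extlistSeptember2021.xlsx")  # the list linked above
    first_year = sources["Coverage"].str.extract(r"(\d{4})")[0].astype(float)
    added_per_year = first_year.value_counts().sort_index()
    print(added_per_year.tail(20))  # reviewer reports ~1,000 new titles/year since 2004
    ```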

    Page 8

    line 30 ". The apparent accelerated growth after 2010 is presumably due to increases in the number and size of online-only megajournals, starting in 2006 with PLOS One (Domnina, 2016)"
    This could be validated / made more clear by exploring the median next to the mean, or exclude mega journals. Or by focusing solely on mega journals.
    ****Defining megajournals is tricky and opens up many definitional problems, which we prefer to avoid for this point since it is not central to the paper. Instead, we have added extra evidence, “The ten largest journals in Scopus in 2020 were all arguably megajournals (Scientific Reports, IEEE Access, Plos One, Sustainability, International Journal of Environmental Research and Public Health, Applied Sciences, International Journal of Molecular Sciences, Science of the Total Environment, Sensors, Energies), with only Science of the Total Environment existing before Plos One.”

    Page 9

    fig 7. I've tried to reproduce these numbers. Some spot checks caught my eye: for 1946 the spreadsheet states 13,910 papers for the USA, out of 84,030 in total. Scopus.com currently reports different numbers for that year: USA: 8,339 (across all authorships, because we cannot filter on first author in the UI, but first authors will be a subset of this; in ICSR Lab 1 August 2021 data: 7,947); total world papers for 1946 (ar/j): 61,473. The overall trend lines still look similar when plotted (nothing shocking changes in the order of things), but some spot checks are worth considering here.
    I have the impression this may be due to the "/W duplicates" count that exists, as that seems to align with the 1946 total used. I'm not sure what kind of duplicates are counted in the process here. Is that across subjects? If so, why are items with more than one assigned ASJC category counted multiple times in this chart?
    If duplicate counting is the cause of this difference, I would suggest redoing the chart without duplicates.
    ****Thank you for taking the care to spot this mistake and for identifying the cause. The figures and FigShare data have been replaced with corrected data, eliminating duplicates (including for the 500 character abstract figures in the online spreadsheets, which had the same issue).
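    The counting problem the reviewer identified, and the fix applied, can be summarised in a short sketch: an article assigned several ASJC narrow fields appears once per field in a per-field download, so totals must be deduplicated by article identifier before aggregation. The records below are invented; the "2-s2.0-" prefix mimics real Scopus EIDs.

    ```python
    # Deduplicating per-field downloads before counting articles per country/year.
    # Example records are invented.
    import pandas as pd

    records = pd.DataFrame({
        "eid":     ["2-s2.0-001", "2-s2.0-001", "2-s2.0-002"],  # one article, two fields
        "field":   ["Oncology", "Genetics", "History"],
        "year":    [1946, 1946, 1946],
        "country": ["USA", "USA", "UK"],
    })

    with_duplicates = records.groupby(["year", "country"]).size()
    deduplicated = records.drop_duplicates("eid").groupby(["year", "country"]).size()
    print(with_duplicates)  # the USA article is counted twice in 1946
    print(deduplicated)     # counted once after dropping duplicate EIDs
    ```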

    Page 10

    Fig 8. perhaps good to clarify that the only change from fig7 is the denominator, it took me a while to figure that out.
    ****The text “(i.e., changing only the denominator from the previous graph)” has been added to the second figure caption.

    line 36 "Although articles accrue citations over time, so older articles have longer to be cited and should have more citations than newer articles, other factors being equal, this pattern is only partly evident in Scopus"
    Not entirely clear what this means; perhaps intended as a double sentence?
    ****This has been split into two sentences, “Articles accrue citations over time, so older articles have longer to be cited and should have more citations than newer articles, other factors being equal. This pattern is only partly evident in Scopus, however (Figure 9).”

    line 41: I would suggest ordering these by (likely) impact. For what it is worth, I do not suspect the first, "(a) greater technical difficulty in matching citations to articles in older journals", to be of (major) impact, and that claim could use some evidence to support it; for instance, by inspecting reference lists and pointing out that reference links are more often broken for older articles than for newer ones.
    The most important factor I think is missing: increase in coverage with stronger growth after 2000 combined with the average/median/mode age of papers in reference lists explains the fact that papers at a distance of average reference age from today have a higher likelihood of being cited than papers before that date.
    Consider summing a series of normal distributions, each next normal distribution slightly to the right of the previous, but also increasing in volume: the resulting chart would look somewhat like the figure displayed here. The distribution of cited reference age is not perfectly normal of course (skewed); for 2020, the mode is 2, the average is around 10 and the median is somewhere between 7 and 8 years.
    ****The missing factor has been added and the list reorganised in decreasing order of likelihood but we hesitate to make this order explicit because we lack your first hand knowledge of the indexing system. “The relatively few citations for articles published before 2000 could be due to a combination of factors, but the most likely seem to be (a) shorter reference lists in older papers, (b) a tendency to cite newer research in the digital age due to electronic searching, online first, and preprint archives, (c) fewer references in older papers mentioning journal articles, and (d) greater technical difficulty in matching citations to articles in older journals.”
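    The reviewer's "sum of shifted, growing distributions" model is easy to simulate. The sketch below uses invented parameters (3% annual output growth, a roughly 10-year mean reference age) purely to show that such a sum rises to a peak some years before the end of the series and then declines, qualitatively matching the shape of Figure 9.

    ```python
    # Toy model: citations received per publication year as a sum of reference-age
    # distributions, one per citing year, shifted right and growing in volume.
    # All parameters are illustrative, not fitted to Scopus data.
    import numpy as np
    from scipy.stats import norm

    pub_years = np.arange(1950, 2021)
    cites_received = np.zeros(pub_years.shape)

    for citing_year in range(1950, 2021):
        volume = 1.03 ** (citing_year - 1950)                         # ~3% annual growth
        age_pdf = norm.pdf(pub_years, loc=citing_year - 10, scale=6)  # mean ref age ~10
        age_pdf[pub_years > citing_year] = 0                          # no citing the future
        cites_received += volume * age_pdf

    print(f"modelled peak at publication year {pub_years[np.argmax(cites_received)]}")
    # The curve rises, peaks roughly a decade before the final year, then declines.
    ```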

    line 47: "there are also fewer articles to be cited, so the two factors largely cancel out under a scenario of constant growth"
    But what this argument doesn't account for is reference age: the "citation chance" for a given article is not equal across citing articles of different publication years. Also, the number of references keeps increasing over time (from around 25 in 1996 to over 40 in 2020), which accounts for another amplifying effect on the increase.
    ****The following has been added, to acknowledge this, “Scholars have increasingly quick access to published articles now (e.g., preprints, online first), which is likely to have shortened average reference ages, disadvantaging older articles.” Reference list changes are already mentioned above this point in (b).

    line 49: "It is also possible that Scopus indexed a smaller fraction of the early scientific literature, therefore losing more old citations than contemporary citations. If true, this may again partly cancel out with Scopus presumably tending to preferentially index the most prestigious journals, therefore increasing the average number of citations per indexed article"
    this sounds speculative, and is very difficult to assess: how should the total body of available content be quantified? There were far fewer journals being published in earlier years and there is exponential growth of journals in more recent years, also outside what is indexed in Scopus, so the fraction of indexed publications may actually be lower in recent years. But this is also speculation.
    ****We want to include the speculation to give context to the results and hope that it is flagged clearly as such with the phrase “It is also possible”.

    Page 11

    line 45 "... 2004 (start of more rapid expansion)" Scopus adds journals _usually_ in a forward flow (only new content since adding to Scopus), and hence this is reflected in steady a coverage increase since 2004 (launch of Scopus). There are cases where journals are added in a backfill too. But also the fact that Scopus was launched in 2004 and from then on there has been an active strategy to increase coverage by selecting more journals (and more publishers suggesting journals to Scopus as it gained relevance) explains the rapid expansion after 2004.
    ****This has been changed to “(start of more rapid expansion and Scopus launch year)” to flag this important date. And the sentence, “More specifically, the initial release in 2004 and subsequent backfilling projects were surpassed by subsequent expansions of additional journals.” has been added above.

    page 12

    line 17 "If using citation counts from before 2004, acknowledge that long term trends will be influenced by lower average citations for earlier years, whether using a fixed citation window or counting citations to date."
    In general, caution is warranted when comparing citations over time; this holds for all years. Citation patterns change over time, both because of index changes and because the scientific process itself evolves (with access to data changing as one element), as does the speed of the publishing process.
    ****The following has been added to acknowledge this, “Lower-level biases may also influence other years, however, as the publishing process evolves (e.g., speed, indexing).”

    Reviewer: 2
    Comments to the Author
    Manuscript QSS-2021-0063
    Scopus 1900-2020: Growth in articles, abstracts, countries, fields, and journals

    CONTENT OF THE PAPER
    • The paper offers a descriptive analysis of the coverage of Scopus from 1900 to 2020, filling a gap in the existing literature, which has focused on Web of Science.

    • The paper has a clear structure, it is based on a descriptive analysis and the results are illustrated with several informative figures.

    • The main limitations of the study are reported in a specific paragraph and the implications of the analysis for the users of Scopus are reported as practical suggestions in the concluding section.

    • The data used in the figures of the paper are available for the users in the supplementary materials.

    COMMENTS/SUGGESTIONS FOR THE AUTHORS

    Overall, we think the paper may be of interest for the readers of Quantitative Science Studies and we have only two minor comments/suggestions for the authors.

    1. The keywords of the paper should be changed because they are too general and do not fit the content of the paper.
      ****The keywords have been changed to, “Scopus; scholarly databases; academic publishing; academic publishing trends”

    2. Figure 9 is only marginally mentioned in the text: please explain it in detail, including why you report both the arithmetic and the geometric mean of the citations.
      ****The start of this paragraph has been reformulated as follows, “Articles accrue citations over time, so older articles have longer to be cited and should have more citations than newer articles, other factors being equal. This pattern is only partly evident in Scopus, however, since there is a peak in the year 2000 (Figure 9). This peak remains if the geometric mean is used (Fairclough & Thelwall, 2015), so it is not due to a few highly cited articles.”
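    The value of reporting the geometric mean alongside the arithmetic mean is easy to see with a toy example (invented numbers); the log1p/expm1 form below is the usual ln(1 + citations) offset for zero counts:

    ```python
    # Toy example: one highly cited article inflates the arithmetic mean far more
    # than the geometric mean (ln(1 + x) offset handles zero citation counts).
    import numpy as np

    citations = np.array([0, 1, 1, 2, 3, 5, 2000])
    arithmetic = citations.mean()
    geometric = np.expm1(np.log1p(citations).mean())
    print(f"arithmetic mean: {arithmetic:.1f}")  # ~287: dominated by the outlier
    print(f"geometric mean:  {geometric:.1f}")   # ~5.7: barely moved by it
    ```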



  • pre-publication peer review (ROUND 1)
    Decision Letter
    2021/12/12

    12-Dec-2021

    Dear Dr. Thelwall:

    Your manuscript QSS-2021-0063 entitled "Scopus 1900-2020: Growth in articles, abstracts, countries, fields, and journals", which you submitted to Quantitative Science Studies, has been reviewed. The comments of the reviewers are included at the bottom of this letter. Reviewer 2 is positive about your work and has only some very small comments. Reviewer 1 is an Elsevier employee with a detailed knowledge of the Scopus database. This reviewer has provided more detailed comments on your manuscript. Reviewer 1 is also positive about your work, but the reviewer also has a number of suggestions for improvements.

    Based on the comments of the reviewers, my editorial decision is to invite you to prepare a revision of your manuscript.

    To revise your manuscript, log into https://mc.manuscriptcentral.com/qss and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.

    You may also click the below link to start the revision process (or continue the process if you have already started your revision) for your manuscript. If you use the below link you will not be required to login to ScholarOne Manuscripts.

    PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm.

    https://mc.manuscriptcentral.com/qss?URL_MASK=7184764062ab4e07a977cd48a3fa6b44

    You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, revise your manuscript using a word processing program and save it on your computer. Please also highlight the changes to your manuscript within the document by using the track changes mode in MS Word or by using bold or colored text.

    Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

    When submitting your revised manuscript, you will be able to respond to the comments made by the reviewers in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the reviewers.

    IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

    If possible, please try to submit your revised manuscript by 12-Jun-2022. Let me know if you need more time to revise your work.

    Once again, thank you for submitting your manuscript to Quantitative Science Studies and I look forward to receiving your revision.

    Best wishes,
    Dr. Ludo Waltman
    Editor, Quantitative Science Studies
    qss@issi-society.org

    Reviewers' Comments to Author:

    Reviewer: 1

    Comments to the Author

    Disclaimer: at the time of writing this review, I am employed by Elsevier, the company that also publishes the database that is the topic of this article (Scopus). My review should be seen as based on my knowledge of this database and overall experience in database evaluation in similar settings.

    Due to this potential conflict of interest, I will refrain from a [negative] recommendation about the publication and recommend accept (with minor revisions), leaving it to the editor and authors to decide which of my peer review comments to include.

    Page 2

    line 8: "One source of difference between WoS and Scopus is that WoS aims to generate a balanced set of journals to support the quality of citation data used for impact evaluations (Birkle et al., 2020)"
    This doesn't explain the difference: it refers to the WoS definition of its index profile, but how does that cause a difference? It may be worth spending a sentence on the different indexing policies and governance.

    Page 3

    line 55: "The differences between download years should only influence the citation data, unless Scopus has substantially changed its 1996-2013 coverage after 2018, which seems unlikely."
    Validating the degree to which things have changed is possible. Scopus continuously addresses issues such as deduplication and implements other quality enhancements, and 3 years seems like a lot of time for quality enhancements to get implemented.
    While I agree that it is not very likely things have changed more than what warrants this study, a validation shouldn't be too problematic: for instance, by comparing volume counts by year between the extraction time stamp and the current data through the existing API; even more powerful would be through ICSR Lab.

    line 58: "The data was checked for consistency by generating time series for the number of articles per year, per narrow field. Some gaps were identified due to software errors and these were filled by re-downloading the missing data for the narrow field and year."
    It is not clear to me if this may have introduced inconsistencies by re-running the queries at a later time. What was the time span between the iterations?

    Page 4

    line 14 "As in previous papers by the same author"
    The paper will be more inclusive for people less familiar with the field if this can be specified (which of the authors, some example papers). I appreciate this may be to limit the amount of self-citation, in which case perhaps a single example would serve. If the author really is the only one, such niche citations should be justified. Otherwise it could help to add not just one of the authors' papers, but also one from other authors with similar approaches, referred to in the context of this statement.

    line 28 " For example, the skewness is enormous at 107 for the 2004 citation counts and even larger for recent years (387 in 2020), whereas the skewness of the normal distribution is 0"
    I don't know what is meant by "skewness" and why 107 is a big number. Can this be elaborated a little bit?

    line 43 "The extent to which the trend reflects the technical limitations of Scopus and its indexing policy rather than the amount of scholarly publishing is unclear because not journals qualify for indexing"
    sentence seems incomplete.

    line 50: "WoS does not have a kink in 2004, suggesting that this is a Scopus phenomenon (WoS has a similar exponentially increasing shape, with sudden increases in 1996, 2015 and 2019)."
    May be worth showing this visually

    Page 5

    Fig 1.
    Suggestion on formatting: I have a little trouble comparing the axes and different lines. A quick and easy suggestion: group content by colour, and separate the axes visually, e.g. by using a solid line for the left axis and a dashed line for the right axis, so that you can immediately see which two lines describe the same data point. It will also help colour-blind readers, since there are currently two types of blue that are hard to differentiate. It's a minor comment.

    line 39: " ... from 1% in 1990 ...."
    should this be 1900?

    line 43: "In contrast, long abstracts with at least 2000 characters and about 320 words are still rare common, accounting for only 10% of articles in 2020."
    not sure, is "rare common" an English expression (not a native speaker myself) or should it be either rare or "uncommon"

    line 56 "The trend found here may also partly reflect Scopus ingesting early sources that omitted abstracts, although no evidence was found for this as a cause"
    The use of abstracts slowly started at the beginning of the 1900s. See also https://en.wikipedia.org/wiki/Abstract_(summary)

    Page 6

    Section 3.3
    The narrow field definitions focus only on raw counts at present. It may be worth looking at this from a proportional angle too, as overall database output has been much lower in the early years.

    Page 7

    line 37 "Surprisingly, the growth in the number of journals slowed and then stopped by 2020, perhaps due to the increasing number of general or somewhat general megajournals (Siler et al., 2020) adequately filling spaces that new niche journals might previously have occupied."
    The evaluation in this paper also stops at 2020, and 2020 is the most recent year, i.e. the year closest to the point of measurement: additional content may still flow in during 2021. The count for 2020 is therefore likely an undercount, so statements about the trend should be careful.
    I'm also not entirely sure how this has been assessed: what has been counted as a journal? How have they been grouped? Has any filtering been applied?
    If I look at the public records of journals in Scopus on:
    https://www.elsevier.com/solutions/scopus/how-scopus-works/content
    where there is a link to:
    https://www.elsevier.com/__data/assets/excel_doc/0015/91122/extlistSeptember2021.xlsx
    and counting by grouping by the first coverage year, I see an ever-increasing count of around 1,000 journals each year since 2004, with only a slight drop in 2020, which again is caused by the list not being fully populated for recently added journals; for example, QSS is not in the list even though its articles since 2020 are now covered, but it will be in the list in the next update in March 2022.

    Page 8

    line 30 ". The apparent accelerated growth after 2010 is presumably due to increases in the number and size of online-only megajournals, starting in 2006 with PLOS One (Domnina, 2016)"
    This could be validated / made more clear by exploring the median next to the mean, or exclude mega journals. Or by focusing solely on mega journals.

    Page 9

    fig 7. I've tried to reproduce these numbers. Some spot checks caught my eye: for 1946 the spreadsheet states 13,910 papers for the USA, out of 84,030 in total. Scopus.com currently reports different numbers for that year: USA: 8,339 (across all authorships, because we cannot filter on first author in the UI, but first authors will be a subset of this; in ICSR Lab 1 August 2021 data: 7,947); total world papers for 1946 (ar/j): 61,473. The overall trend lines still look similar when plotted (nothing shocking changes in the order of things), but some spot checks are worth considering here.
    I have the impression this may be due to the "/W duplicates" count that exists, as that seems to align with the 1946 total used. I'm not sure what kind of duplicates are counted in the process here. Is that across subjects? If so, why are items with more than one assigned ASJC category counted multiple times in this chart?
    If duplicate counting is the cause of this difference, I would suggest redoing the chart without duplicates.

    Page 10

    Fig 8. perhaps good to clarify that the only change from fig7 is the denominator, it took me a while to figure that out.

    line 36 "Although articles accrue citations over time, so older articles have longer to be cited and should have more citations than newer articles, other factors being equal, this pattern is only partly evident in Scopus"
    Not entirely clear what this means; perhaps intended as a double sentence?

    line 41: I would suggest ordering these by (likely) impact. For what it is worth, I do not suspect the first, "(a) greater technical difficulty in matching citations to articles in older journals", to be of (major) impact, and that claim could use some evidence to support it; for instance, by inspecting reference lists and pointing out that reference links are more often broken for older articles than for newer ones.
    The most important factor I think is missing: increase in coverage with stronger growth after 2000 combined with the average/median/mode age of papers in reference lists explains the fact that papers at a distance of average reference age from today have a higher likelihood of being cited than papers before that date.
    Consider summing a series of normal distributions, each next normal distribution slightly to the right of the previous, but also increasing in volume: the resulting chart would look somewhat like the figure displayed here. The distribution of cited reference age is not perfectly normal of course (skewed); for 2020, the mode is 2, the average is around 10 and the median is somewhere between 7 and 8 years.
    line 47: "there are also fewer articles to be cited, so the two factors largely cancel out under a scenario of constant growth"
    But what this argument doesn't account for is reference age: the "citation chance" for a given article is not equal across citing articles of different publication years. Also, the number of references keeps increasing over time (from around 25 in 1996 to over 40 in 2020), which accounts for another amplifying effect on the increase.

    line 49: "It is also possible that Scopus indexed a smaller fraction of the early scientific literature, therefore losing more old citations than contemporary citations. If true, this may again partly cancel out with Scopus presumably tending to preferentially index the most prestigious journals, therefore increasing the average number of citations per indexed article"
    this sounds speculative, and is very difficult to assess: how should the total body of available content be quantified? There were far fewer journals being published in earlier years and there is exponential growth of journals in more recent years, also outside what is indexed in Scopus, so the fraction of indexed publications may actually be lower in recent years. But this is also speculation.

    Page 11

    line 45 "... 2004 (start of more rapid expansion)" Scopus adds journals _usually_ in a forward flow (only new content since adding to Scopus), and hence this is reflected in steady a coverage increase since 2004 (launch of Scopus). There are cases where journals are added in a backfill too. But also the fact that Scopus was launched in 2004 and from then on there has been an active strategy to increase coverage by selecting more journals (and more publishers suggesting journals to Scopus as it gained relevance) explains the rapid expansion after 2004.

    page 12

    line 17 "If using citation counts from before 2004, acknowledge that long term trends will be influenced by lower average citations for earlier years, whether using a fixed citation window or counting citations to date."
    In general, caution is warranted when comparing citations over time; this holds for all years. Citation patterns change over time, both because of index changes and because the scientific process itself evolves (with access to data changing as one element), as does the speed of the publishing process.

    Reviewer: 2

    Comments to the Author
    Manuscript QSS-2021-0063
    Scopus 1900-2020: Growth in articles, abstracts, countries, fields, and journals

    CONTENT OF THE PAPER

    • The paper offers a descriptive analysis of the coverage of Scopus from 1900 to 2020, filling a gap in the existing literature, which has focused on Web of Science.

    • The paper has a clear structure, it is based on a descriptive analysis and the results are illustrated with several informative figures.

    • The main limitations of the study are reported in a specific paragraph and the implications of the analysis for the users of Scopus are reported as practical suggestions in the concluding section.

    • The data used in the figures of the paper are available for the users in the supplementary materials.

    COMMENTS/SUGGESTIONS FOR THE AUTHORS

    Overall, we think the paper may be of interest for the readers of Quantitative Science Studies and we have only two minor comments/suggestions for the authors.

    1. The keywords of the paper should be changed because they are too general and do not fit the content of the paper.

    2. Figure 9 is only marginally mentioned in the text: please explain it in detail, including why you report both the arithmetic and the geometric mean of the citations.

    Reviewer report
    2021/12/11

    (Identical to Reviewer 2's comments included in the Round 1 decision letter above.)

    Reviewer report
    2021/11/05

    (Identical to Reviewer 1's comments included in the Round 1 decision letter above.)

All peer review content displayed here is covered by a Creative Commons CC BY 4.0 license.