Content of review 1, reviewed on February 10, 2021

General
I think this is an important and interesting contribution presenting a pipeline including several methods that could prove useful for conservation biologists and ecologists.
If this is indeed a proof of concept as stated in line 81, I think it would be beneficial to add a bit more detail on where this pipeline as a whole, or its different components, could be used to answer other questions, and what that would entail.

Methods
Lines 151-156 – You mention here that you lowered the cosine similarity threshold from 1 to 0.95 heuristically. Did you later confirm (on a sample greater than 50) that there were indeed no duplicate articles at lower threshold values?
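To illustrate the kind of check I have in mind, here is a minimal sketch of a threshold sweep. It is not the authors' implementation: the bag-of-words vectors, the function names and the toy articles are all assumptions made for illustration, and the manuscript's actual text representation may differ.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    sq_a = sum(v * v for v in a.values())
    sq_b = sum(v * v for v in b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def duplicate_pairs(texts, threshold):
    """Return index pairs whose cosine similarity is at or above the threshold."""
    # Assumption: simple lower-cased whitespace tokens stand in for the real features.
    vectors = [Counter(t.lower().split()) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Hypothetical toy articles, for illustration only.
articles = [
    "tiger poaching rises in the reserve",
    "tiger poaching rises in the reserve",        # exact duplicate
    "tiger poaching rises in the reserve again",  # near duplicate
    "new coral species found near the reef",
]

# Sweep thresholds and list the pairs flagged at each one.
for threshold in (0.95, 0.90, 0.80):
    print(threshold, duplicate_pairs(articles, threshold))
```

In this toy case the near duplicate is only flagged once the threshold drops below 0.95. Manually inspecting the pairs that appear at each lower threshold in the real corpus would show whether 0.95 misses genuine duplicates.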
Could you also give more details on the consequences of choosing prediction decision boundaries other than 0.4?
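A short sensitivity table would answer this. The sketch below shows one way to produce it, assuming a binary relevant/irrelevant classifier that outputs a probability per article; the probabilities, labels and function name are hypothetical and are not taken from the manuscript.

```python
def precision_recall(probs, labels, boundary):
    """Precision and recall when articles with prob >= boundary are predicted relevant."""
    predicted = [p >= boundary for p in probs]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical validation-set scores and true labels (1 = relevant article).
probs = [0.95, 0.80, 0.55, 0.45, 0.42, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 1, 0, 0]

# Report how precision and recall move as the decision boundary changes.
for boundary in (0.3, 0.4, 0.5, 0.6):
    p, r = precision_recall(probs, labels, boundary)
    print(f"boundary={boundary:.1f} precision={p:.2f} recall={r:.2f}")
```

Reporting these values for the real validation set would let readers judge the trade-off behind the choice of 0.4.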
I would include the topic modelling information in your main text – this is another important tool that I suspect many readers may find useful.

Language
In your methods section you need more consistency in verb tense; in some cases you describe what you did in the present tense.
You also have an unconventional use of italics for common species names and higher taxa.

Results
The section on taxonomic representation and Figure 7 could do with more clarification. Furthermore, it would be useful to compare these taxonomic representations with the taxonomic representation in the CITES Appendix I list itself.

Figure captions could use more detail.

Minor comments
Lines 4, 310 – double brackets around the reference
Line 101 – probably missing a reference to these being ‘threat’ categories.
Lines 176-177 – I think this sentence needs some amendment.
Line 180 – an ‘and’ is missing.
Line 277 – I am not sure that the results mentioned here regarding the Coelacanths and Bivalvia are that surprising considering the species numbers and charisma of these groups.
Line 295 – space missing after ‘MONEY’
Lines 298-299 – the number of species mentioned here is 569 for 3 entities and 568 for another – this is a bit suspicious. Is it possible that the NER procedure did not work for 16 of the species in your dataset?
Line 310 – why double brackets around the reference?
Line 342 – I guess the use of the word ‘thrived’ here is a mistake.
Line 346 – a space is missing after the period.

Fig. 1 – you mention two Twitter handles here, but explored the outputs of three…
Fig. 8 – perhaps consider removing the species names from these word clouds to highlight the other terms related to them.

Source

    © 2021 the Reviewer.

References

    Kulkarni, R., Di Minin, E. 2021. Automated retrieval of information on threatened species from online sources using machine learning. Methods in Ecology and Evolution.