Content of review 1, reviewed on December 19, 2020

“Automated retrieval of information on threatened species from online sources using machine learning” is a very interesting study that demonstrates the potential of natural language processing and machine learning for gathering relevant online information on threatened species in relation to wildlife trade and conservation. Wildlife trade is currently one of the most discussed threats to wildlife, and Conservation Culturomics is an expanding field; therefore, developments in automated data collection are timely and of interest to a variety of researchers for different purposes in conservation. The pipeline developed and presented here by the authors is a great advance in operationalizing the collection of truly relevant data, in particular the neural network developed for data cleaning. This is an important topic worth publishing on. Below, the authors will find some suggestions intended to improve the current manuscript.

The filtering of relevant data is one of the major current barriers to applying automated data collection, and the authors present here a successful implementation of automated data cleaning tools for wildlife trade data. The manuscript would benefit from placing more emphasis in the introduction on the high costs, in terms of time and perhaps money, of cleaning the amount of data collected by scrapers and other methods. There are some estimates for fields other than conservation and wildlife trade.
L 24: There are further examples of studies that have already used machine learning to collect data on wildlife trade, which can be used to better support this statement.

Roberts, D., Mun, K., & Milner-Gulland, E. J. (2020). A systematic survey of online trade: trade in Saiga Antelope horn on Russian-speaking websites. Oryx.

Hernandez-Castro, J., & Roberts, D. L. (2015). Automatic detection of potentially illegal online sales of elephant ivory via data mining. PeerJ Computer Science, 1, e10.

L 30 “Appendix I includes most threatened species by the trade”: The statement leads the reader to believe that Appendix I includes only the species most threatened by trade. Appendix I includes species threatened with extinction (taking into account several sources of threat, not only trade) in order to establish restrictions on commercial trade. At the same time, some species considerably threatened by trade are not included in Appendix I.

L34 “automated methods for extracting this information are still missing in conservation science”: Automated methods are becoming a tool for conservation science (especially APIs). Although they are still far less used than manual data collection, and their use will certainly increase over the next years, they are not ‘still missing’. I would suggest that the authors reformulate the statement. Below I list some studies that have applied automated methods, restricted to wildlife trade. Those studies can also be used to improve your discussion of what has been done so far and how your manuscript goes further in solving common issues:

Roberts, D., Mun, K., & Milner-Gulland, E. J. (2020). A systematic survey of online trade: trade in Saiga Antelope horn on Russian-speaking websites. Oryx.

Hernandez-Castro, J., & Roberts, D. L. (2015). Automatic detection of potentially illegal online sales of elephant ivory via data mining. PeerJ Computer Science, 1, e10.

Marshall, B. M., Strine, C., & Hughes, A. C. (2020). Thousands of reptile species threatened by under-regulated global trade. Nature Communications, 11(1), 1-12.

Lamba A, Cassey P, Segaran RR, Koh LP. 2019. Deep learning for environmental conservation. Current Biology 29:R977–R982.
Singrodia V, Mitra A, Paul S. 2019. A Review on Web Scrapping and its Applications. Pages 1–6 2019 International Conference on Computer Communication and Informatics (ICCCI).

L 35 “studies manually collected information on certain species”: Here I suggest a recently published study on jaguar trade, widely covered by the media, that manually collected hundreds of seizure reports available through Google.

Morcatty, T., Macedo, J. C. B., Nekaris, K. A. I., Ni, Q., Durigan, C., Svensson, M. S., & Nijman, V. (2020). Illegal trade in wild cats and its link to Chinese‐led development in Central and South America. Conservation Biology. 34(6): 1525-1535.

L 36 “However”: “Therefore” seems more appropriate.

L 41: As I mentioned before, the major challenge lies in the massive effort required to filter the relevant results, and I believe this can be further explored in the introduction.

L 81 “As a proof of concept, online news articles are collected using two channels”: This should read “online news articles were collected”. Please check the use of the past tense throughout the manuscript.

L 134: There are several cases where different sources cover the same event and the texts may therefore not be similar enough to be picked up by the code as duplicates, for example because of the presence or absence of details such as precise location, detailed numbers per taxon involved and names of people or institutions. I can assure the authors that this is not rare, especially for high-profile species. How can these cases be dealt with so as to avoid duplication? Please consider including this ‘limitation’, and/or a suggestion for dealing with it, in the discussion.
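
To illustrate the kind of approach I have in mind (a minimal sketch only, not the authors' method): articles could be compared on the entities already extracted by the pipeline rather than on their full text. The record structure, field names, thresholds and the function likely_same_event below are all hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical record of entities extracted from one news article."""
    species: frozenset    # normalized species names
    locations: frozenset  # normalized place names
    published: str        # ISO date string, e.g. "2020-03-14"


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_same_event(x, y, threshold=0.5, max_days=3):
    """Flag two articles as probable reports of the same event.

    Assumed rule (illustrative): similar species sets, at least one
    shared location, and publication dates within max_days of each other.
    """
    days_apart = abs(
        (date.fromisoformat(x.published) - date.fromisoformat(y.published)).days
    )
    return (
        jaccard(x.species, y.species) >= threshold
        and bool(x.locations & y.locations)
        and days_apart <= max_days
    )


# Two reports of one seizure that differ in the level of location detail
a = Article(frozenset({"panthera onca"}), frozenset({"bolivia", "santa cruz"}), "2020-03-14")
b = Article(frozenset({"panthera onca"}), frozenset({"bolivia"}), "2020-03-15")
print(likely_same_event(a, b))
```

Pairs flagged in this way could then be passed to a human for confirmation rather than removed automatically, since distinct events involving the same species and place within a few days would also be flagged.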

L 253: Are there any Appendix-listed species that were never covered by online articles?

L 263: Missing parenthesis.

Figure 9: The authors did not make clear what kind of additional information comes with the entities. That is, when extracted, does the entity come with its context (the whole statement, as shown in Figure 9) or only as the relevant information in bold (e.g. only $100)? I am wondering how to guarantee precisely what the entity refers to, e.g. in an applied study estimating the volumes of traded specimens, given that a report may contain data (such as quantities) for more than one species or include additional data not related to the instance of trade covered.

L 325 “There is a heavy bias for animal species compared to plants and within animals, a bias towards mammals followed by birds and reptiles”: As it is, this statement leads the reader to understand that this is a result of a methodological problem. Besides, the normalized data showed that there is an actual bias towards Coelacanthi, Actinopteri and mammals, but not birds or reptiles.

L 328-330 “The popularity of charismatic species is reflected in the high number of articles dedicated to them”: I do not think this is a fair message. The highest bias was found for two groups of fish that are not necessarily classified as charismatic species (Coelacanthi, Actinopteri). This may be true for mammals, but it is not the main reason behind the overall pattern found.

L 335 “same methods can be applied to many more digital platforms”: How hard would it be to adapt the pipeline from Twitter to other social media platforms, considering the differences in structure among them? Rather than only suggesting this use in passing, it would be interesting to see what is needed, and what the challenges are, when adapting the routine developed here to other sources of online data. How would it work on platforms that require an account for access or that provide content with privacy restrictions (e.g. forums and private Facebook groups)?

L 341-343: Please include some examples of errors that can potentially create noise, and explain how future researchers using a similar pipeline could work to reduce these errors even further.

Based on your findings, at what level or in which situations is there still a need for human data cleaning or labelling? Please discuss this, especially with regard to precisely quantifying the number of traded individuals or precisely defining prices or locations for instances of trade, as suggested in the abstract and results.

I also believe that it is worth mentioning the ethical considerations of using automated data collection, which currently falls within a legal grey area. Zamora (2019) and Zimmer (2010) provide an important discussion on accessing “public” information through automated data collection.
Zamora A. 2019. Making Room for Big Data: Web Scraping and an Affirmative Right to Access Publicly Available Information Online. Journal of Business, Entrepreneurship and the Law 12:203–228.
Zimmer M. 2010. “But the data is already public”: on the ethics of research on Facebook. Ethics and Information Technology 12:313–325.

Thais Morcatty

Source

    © 2020 the Reviewer.

References

    Ritwik, K., Enrico, D. M. 2021. Automated retrieval of information on threatened species from online sources using machine learning. Methods in Ecology and Evolution.