Review of A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics

Content of review 1, reviewed on October 03, 2023

Hi, I am Stian Soiland-Reyes https://orcid.org/0000-0001-9842-9718 and have pledged the Open Peer Review Oath:

Principle 1: I will sign my name to my review
Principle 2: I will review with integrity
Principle 3: I will treat the review as a discourse with you; in particular, I will provide constructive criticism
Principle 4: I will be an ambassador for the practice of open science

This review is licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/
and is also available at the (for now) secret URL https://gist.github.com/stain/ea97ea58004e644eafb40ad73ac93869

Note that I am not an ecologist, but a computer scientist with expertise in workflows and reproducibility, so that is my primary focus in this review. Although I have some experience with a similar image segmentation workflow for digitizing natural history specimens, I will not review the suitability of the particular tools and algorithms applied.

This well-written article presents a computational workflow for image analysis for purposes such as identification of species and their traits.
The article proposes a modular approach and a generic machine learning-based image analysis workflow ("imageomics") that can then be specialized for particular scientific questions in ecology. This is exemplified by a well-explained use case of identifying fish traits.

The authors have gone to great lengths to follow best practices for designing computational workflows, and the article explains well the reasoning behind their modular design, their use of containers and their choice of the workflow management system Snakemake.

Reusing

It is perhaps less clear how the conceptual workflow can be reused for use cases other than the fish traits, so this is something the authors could expand on, even if only theoretically. When I explored the code, I saw that the use case truly extends the separately maintained conceptual workflow, so this should be emphasized. At the same time, other users are likely to need development steps similar to those of the authors, e.g. containerizing tools.

The authors mention that the original developers were involved in containerizing each of the tools used by the workflow. This should be flagged as a potential barrier for others extending the workflow who may want to use third-party tools and keep their workflow reproducible.

Maintenance effect of modularisation

One downside of the modularisation is that understanding the details of the workflow becomes a bit more convoluted. For instance, I encountered a bug (see below) in the _segmentation_ module, but its source code was kept in a separate GitHub repository. The main workflow, however, had an explicit declaration of that repository, so it was easy to track. The article touches on this and on the advantages of using GitHub Actions to automate updates, but the authors could add for balance that maintaining a modular workflow can be more challenging than maintaining a more traditional monolithic code base.

I would agree on a gradual approach; for instance starting with a single repository prototype with embedded scripts and tools (and correspondingly large list of dependencies), which once working can be modularised and versioned into separate reusable containers, to match the provided modules. I believe this was also the approach of the authors, so perhaps the article can highlight better the advantages of this thinking, so readers are not led to believe they need to go "all in" 100% modular FAIR workflows right away.

FAIR workflows

As the authors' approach is in effect following best practice for FAIR workflows as developed in fields like genomics and computational biology, I think more citations for this could be incorporated, for instance -- in addition to recommendations for preparing tools for use by workflows and containers.

Retrieving the workflow

The authors have dutifully provided the base and use case workflows and all their modules as open source. These are linked from the Supplement and have been archived with Zenodo DOIs, as is best practice. For the final edit I would promote data and software to be actual citations in the main article (see ) so they are directly accessible from the published article and become part of its Crossref metadata (e.g. for attribution, as the authors point out). The printed URLs in the supplement PDF also suffer from line breaks and are not clickable.

I managed to retrieve the workflow, however I ran into a couple of issues. One issue is the use of datasets from Fish-AIR, which require an API key. As the key is requested by personal email to the Fish-AIR maintainer, I do not consider this to be FAIR Open Data. The authors have kindly provided a Dryad dump of the input data -- this should be tidied up to have a DOI and also to reflect the Fish-AIR CC-BY-NC license (as this restricts commercial use). The README of the workflow could then link to this DOI in its instructions after "the Fish-AIR input files from Dryad".

Fish-AIR also claims to use ARK identifiers, however identifiers like nx850s9w that I found in the CSV files from Dryad are not complete ARK identifiers as documented by , e.g. do not resolve. It seems an ARK NAAN prefix is missing to make these identifiers globally unique.
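The missing prefix can be illustrated with a minimal sketch. It assumes the common ARK forms `ark:/NAAN/Name` and `ark:NAAN/Name` with a numeric NAAN of five or more digits; the pattern is a simplification rather than a full validator, and the NAAN `12345` below is a placeholder for illustration, not Fish-AIR's actual NAAN.

```python
import re

# Simplified pattern for a complete ARK: the "ark:" label, an optional "/",
# a numeric NAAN (Name Assigning Authority Number), then "/" and the name.
# Assumption: this approximates the ARK syntax; it is not a full validator.
ARK_PATTERN = re.compile(r"^ark:/?(?P<naan>\d{5,})/(?P<name>\S+)$", re.IGNORECASE)


def is_complete_ark(identifier: str) -> bool:
    """Return True if the identifier carries both the ark: label and a NAAN."""
    return ARK_PATTERN.match(identifier.strip()) is not None


def to_complete_ark(name: str, naan: str) -> str:
    """Prefix a bare name with the ark: label and the given NAAN."""
    return f"ark:/{naan}/{name}"


if __name__ == "__main__":
    bare = "nx850s9w"  # identifier as found in the Dryad CSV files
    print(bare, is_complete_ark(bare))
    # "12345" is a hypothetical NAAN used only to show the complete form.
    full = to_complete_ark(bare, "12345")
    print(full, is_complete_ark(full))
```

A bare name such as nx850s9w is only unique within one naming authority, which is why the NAAN is needed before a resolver can locate the record.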

There are several URIs for semantic terms in Fish-AIR like however these all give 404 Not Found errors, and should be removed from the manuscript as they can mislead the reader. Alternatively, if the authors have contact with the Fish-AIR maintainer, they may request the vocabulary to be published with redirects to or similar best practices for publishing vocabularies.

I would like the authors to consider the longevity of a workflow that relies on a service-based dataset like Fish-AIR, particularly as its data does not seem to be available in long-term archives like Zenodo or with persistent identifiers. This can be a warning sign for workflow decay, on which we have a fair bit of literature (genomics workflows were often Web Service based rather than command-line based around 2010). I think, however, that this service dependency is specific to this particular use case, so the authors can clarify that it does not affect the conceptual base workflow.

It is unclear what machine requirements this pipeline has, given that the README proposes executing the workflow on an HPC cluster, and I am not sure whether I should be able to run it on my workstation (64 GB RAM, 16 cores). The hardware and software requirements could be more detailed in the README, as well as in the article.

Checking reproducibility

Given that the authors have done their utmost to provide the workflow for full reproducibility, I had to try to run it from my Ubuntu 22.04 workstation, which has Docker 24.0.5 and Conda 4.14.0.

Although adopting a workflow system can reduce or automate the burden of installing the dependencies of a pipeline, it means the user now has to ensure they have correctly installed and configured the corresponding container and workflow systems. In this case the instructions even suggest installing Conda for the purpose of installing the Snakemake workflow system. Personally I am comfortable with all these technologies, however this may not be true for an average ecologist wanting to take up imageomics using the authors' workflow as an example. So more specific instructions and deep links may be needed to assist such readers.

After retrieving the Fish-AIR input from Dryad (as hinted by the README), I was able to start running the workflow with snakemake --jobs 8 --cores 12 -- this seemed to take a very long time to set up initially, as it was compiling a series of R CRAN modules with multiple Fortran warnings. I did not see any Docker containers being retrieved, so I believe in this mode the workflow was not using the containers described by the authors, but was rebuilding the dependencies from scratch. This can be much more fragile as time goes on and transitive dependencies change. Another hint that containers were not being used is the failure of seg_generate_metadata to find gen_metadata.py -- on inspection this is meant to come from the container "drexel_metadata", but it failed as no equivalent Conda-compatible source for the script is provided.

The Minnow_Segmented_Traits README should be updated to reflect that only container-based execution is available, and give more complete instructions on how to start the workflow on a non-Slurm system. It is unclear if Singularity is required, or just Docker, but even after I installed Singularity 3.11.4 and used the snakemake "--use-singularity" option I got an error relating to my home directory mounts on BTRFS -- while I could debug this, requiring a more advanced container system like Singularity is unfortunately a barrier to entry compared to Docker.
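As an illustration of the kind of pre-flight check such instructions could include, here is a minimal sketch that reports which executables are missing from PATH. The tool list is an assumption based on what this review needed (snakemake and singularity); it is not taken from the workflow's documentation, and it does not check versions or configuration.

```python
import shutil

# Assumption: the executables needed for container-based execution,
# as encountered in this review. Add "docker" if that route is supported.
REQUIRED_TOOLS = ("snakemake", "singularity")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter is only there so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.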

Considering this, I tried following the BGNN_Core_Workflow README's proposal to run "docker run --privileged" with a nested Singularity, but this failed because R could not be found.

By moving to a fresh Linux user I was able to avoid the Singularity home directory error and executed the workflow using snakemake --jobs 8 --cores 14 --use-singularity which created results including a heatmap.

The heatmap I reproduced did not, however, correspond to the ones linked from Dryad, as it only included _Notropis volucellus_ (2) and _Cyprinella spiloptera_ (3), while the Dryad data is much larger, including 13 species and ~200 specimens. I understand this is due to using a limited test dataset (which did execute in reasonable time on my workstation), and the authors explain this as a part of the workflow configuration, which only selects 20 images by default. The authors should be commended for providing test data; again, this follows best practice for workflows.

As I mentioned, I have some questions on the Fish-AIR dataset availability, and the Dryad dump still only contains the CSV files that point to Fish-AIR URLs for the images (retrieving these conveniently does not require API keys). Perhaps for documenting a workflow run, a "full" Zenodo archive could be made that includes all these individual images without requiring Fish-AIR downloads.
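As a rough sketch of how such an archive could be assembled, the snippet below collects the distinct image URLs from a CSV file and downloads them into a local folder. The column name `image_url` is hypothetical (the actual Fish-AIR column names are not stated in this review), and the positional file naming is a placeholder; a real archive would keep the specimen identifiers and honour the CC-BY-NC license.

```python
import csv
import io
import urllib.parse
import urllib.request
from pathlib import Path


def image_urls(csv_text: str, column: str = "image_url") -> list[str]:
    """Return distinct, non-empty URLs from one CSV column, in file order.

    Assumption: `column` is a hypothetical name for the image URL column.
    """
    urls: list[str] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(column) or "").strip()
        if url and url not in urls:
            urls.append(url)
    return urls


def download_all(urls: list[str], target_dir: Path) -> None:
    """Download each URL into target_dir, naming files by position."""
    target_dir.mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(urls):
        # Keep the original file extension, if the URL path has one.
        suffix = Path(urllib.parse.urlparse(url).path).suffix
        urllib.request.urlretrieve(url, target_dir / f"image_{index:05d}{suffix}")
```

The resulting folder, together with the CSV files, could then be deposited as a single archive so that a workflow run does not depend on the service being online.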

I've provided some debug messages from my execution attempts at

Overall I recommend rewriting the README of the workflow to have runnable instructions out of the box for new users, including data access and setting up the workflow.
This means moving out the institution-specific cluster details more relevant to the authors, and adding the general-purpose use, such as the snakemake commands.

Other suggested edits

https://doi.org/10.5072/FK2/CGWDW4 does not resolve, this seems to be a test DOI
https://doi.10.57967/hf/0904 -> https://doi.org/10.57967/hf/0904
Avoid line breaks in the supplement PDF as they break the URLs; also make them hyperlinks. Ideally, lift all supplement data/software DOIs and citations to the main article text so they get counted as software citations
p4. Explain how "unique identifier for reproducibility" is formed or how it helps reproducibility.
p5. Environment: Expand on the extent to which these R dependencies are compiled locally, or why they are not part of the container images
p8: "gereate_metadata" -> "generate_metadata"
p11. Workflow Manager: Add that Snakemake has capabilities for using containers, which simplifies use of modules
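
The two DOI problems in the list above (a malformed resolver host and a test prefix) are the kind of thing that can be caught mechanically. A minimal sketch, assuming DOI links take the form https://doi.org/10.NNNN/suffix and treating 10.5072 as a test prefix, as the failed resolution suggests; it checks only the shape of the URL and does not contact the resolver.

```python
import re
from urllib.parse import urlparse

# Assumption: prefixes treated as test-only; 10.5072 is the one flagged above.
TEST_PREFIXES = {"10.5072"}

# A DOI path: "/" + prefix "10.<digits>" + "/" + a non-empty suffix.
DOI_PATH = re.compile(r"^/(10\.\d{4,9})/\S+$")


def check_doi_url(url: str) -> str:
    """Classify a DOI URL as 'ok', 'test-prefix', or 'malformed'."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "doi.org":
        return "malformed"
    match = DOI_PATH.match(parsed.path)
    if match is None:
        return "malformed"
    if match.group(1) in TEST_PREFIXES:
        return "test-prefix"
    return "ok"


if __name__ == "__main__":
    for link in (
        "https://doi.org/10.5072/FK2/CGWDW4",
        "https://doi.10.57967/hf/0904",
        "https://doi.org/10.57967/hf/0904",
    ):
        print(link, "->", check_doi_url(link))
```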

Summary

Overall I am very pleased with this article and would like to recommend it for publication. I believe the ecology community can see great benefits from using workflow technologies in the way presented by the authors.

I have proposed some minor corrections and recommendations, in addition to ensuring the workflow is also reproducible "out of the box" from its own documentation. For longevity of this article I also ask for the input data to be made available in a Zenodo deposit similar to the software.

Content of review 2, reviewed on March 11, 2024

The authors have addressed all the comments and suggestions from the peer review process in a diligent and detailed manner. This has significantly improved the manuscript, in particular with regard to the FAIR principles, and in explaining how these principles can be applied to computational analysis within the field of ecology.

I would like to thank the authors; my recommendation is to accept this article for publication.

References

A., B. M., John, B., M., M., Bahadir, A., Yasin, B., L., B. J. H., David, B., R., F. C., Jane, G., Anuj, K., Kevin, K., Paula, M., Joel, P., Dom, J., Thibault, T., Xiaojun, W., Hilmar, L. 2024. A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics. Methods in Ecology and Evolution.
