Review of metaXplor: an interactive viral and microbial metagenomic data manager

Content of review 1, reviewed on August 11, 2020

Sempéré et al. present metaXplor, an interactive database for microbiome and viral data. The authors claim that currently available platforms fail to meet the needs of multiple study management, data querying, and meta-analysis. metaXplor is built on a MongoDB backend, which uses an Oracle database to facilitate BLAST searches and computationally intensive tasks. A Docker container is available, but it is unclear whether the Docker container requires an internet connection, i.e. does metaXplor require this Oracle backend to function? For users, the database accepts sample information, an already computed table, a FASTA file of feature sequences, and taxonomy. The results produced include a Krona plot and a map view of the results.

One of the major issues that remains unaddressed in the manuscript is how metaXplor compares to the Qiita database and associated infrastructure, a comprehensive database built for storing and analyzing studies.

I would also like to try the interface, which I assume is what most end users will use. Is it possible for the authors to provide an anonymous reviewer login to explore/test the system?

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

REVIEWER #1

REMARK: In general the webapp requires more documentation and online assistance, e.g. descriptions of terms used. For example, within the admin page to upload a new dataset the text box titles are very brief and have no additional explanation as to what is expected in each box. What exactly does "Samples available? *" mean? Is it asking about the BioSample accessions for the sequence data, or does it mean the physical sample vouchers, or something else? REPLY: We added tooltips wherever possible, including in the data import form. We also significantly extended the online documentation by adding four sections that will help users better understand how to use the system.
REMARK: An opportunity seems to have been overlooked with respect to the sample metadata. While it is admirable that the system will accept any user-specified attribute name, this does make it more difficult to search and filter on specific terms. Perhaps there is scope to strongly encourage the use of attribute names that are recognised by INSDC or other authoritative bodies such as the Genomic Standards Consortium (GSC; full disclosure, I am on the GSC board). By way of example, the three mandatory columns in the sample.tsv file (sample, gps_position, date_collect) are not consistent with terms of the same meaning in the BioSamples database or GSC recommendations: sample alias, collection_date, geographic location (latitude and longitude) or just "lat_lon". REPLY: We modified these three field names to match the ones you pointed to. We also added a highlighted notice in the import form and the documentation page to advise users to use such standard names as much as possible.
REMARK: The assignments.tsv is a mandatory input file, but its format appears to be unique to this tool. Or perhaps it's a variant of BLAST output I've not seen before. Is there an existing method to parse BLAST output to this format? Would it be possible to include a parser for users to import blastx tabular output directly? REPLY: We consider that the supported input format is very open, and do not think it would be feasible to support the file formats generated by the many existing processing pipelines. However, we understand that BLAST is a very common tool for performing the assignment step. The compromise we opted for was thus to write a bash script that translates BLAST format 7 outputs ("Hit Table (text)" in the NCBI interface) into the expected input format. This script is linked from the documentation page and may be used as an example for different kinds of pipeline outputs.
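To illustrate the kind of translation described in the reply above, here is a minimal Python sketch. It is not the authors' bash script. It assumes the default 12-column layout of BLAST tabular output with comment lines (format 7), keeps only the highest-scoring hit per query, and writes a tab-separated table. The output column names (qseqid, sseqid, evalue, bitscore) and the function names are hypothetical and chosen for illustration only; the actual assignments.tsv layout should be taken from the metaXplor documentation.

```python
import csv
import io
import sys


def best_hits(handle):
    """Return {query: (subject, evalue, bitscore)} keeping the top bit score per query.

    Assumes the default 12-column BLAST tabular layout, where column 1 is the
    query, column 2 the subject, column 11 the e-value and column 12 the bit score.
    """
    best = {}
    for line in handle:
        # Format 7 prefixes metadata lines (program, query, fields, hit count) with '#'.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject, evalue, bitscore = cols[0], cols[1], cols[10], cols[11]
        if query not in best or float(bitscore) > float(best[query][2]):
            best[query] = (subject, evalue, bitscore)
    return best


def write_assignments(hits, out):
    """Write the hits as TSV. The header names are hypothetical, not metaXplor's."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["qseqid", "sseqid", "evalue", "bitscore"])
    for query, (subject, evalue, bitscore) in hits.items():
        writer.writerow([query, subject, evalue, bitscore])


if __name__ == "__main__":
    # Made-up example input; accessions and scores are placeholders.
    demo = io.StringIO(
        "# Query: read1\n"
        "read1\tAB000001.1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t120\n"
        "read1\tAB000002.1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
    )
    write_assignments(best_hits(demo), sys.stdout)
```

Reading columns by position keeps the sketch independent of the wording of the "# Fields:" comment, which differs between BLAST versions, at the cost of breaking if a custom column selection was requested.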
REMARK: I am intrigued to know how the system would cope with the large datasets being generated, for example the Mouse gut gene catalog (GigaDB dataset DOI:10.5524/100114) with 2.6 million genes from 184 samples. Currently this dataset has not been subjected to comparison to the NCBI database to determine the best-hit NCBI accessions required by metaXplor; instead the authors of that dataset used other methods to determine taxonomic content and functionality. REPLY: We considered importing this dataset into metaXplor; however, it raises two main difficulties: (1) the originally submitted version of metaXplor could not be fed directly with taxon IDs but inevitably required NCBI accession names; (2) metaXplor expects raw counts to express the sample-to-sequence relationship, whereas the mentioned dataset does not provide them. We were happy to address the first limitation by amending the data model and import procedure to accept a taxonomy_id value when sseqid is not provided (now mentioned in the documentation). Unfortunately we did not succeed in refactoring the Mouse gut data to revert to raw counts and therefore could not import it into our system. However, we are confident that our server would be able to handle this dataset because, although our largest database (BGPI_MicroQuar) contains fewer assigned sequences (622,266 when including a private project you may not see), its total number of assignments is 3,550,635. Also, the number of involved samples is <200 for the Mouse gut dataset and close to 1,500 in BGPI_MicroQuar.
REMARK: There is a minor point that is worth noting somewhere in the manuscript: the table of results displays all matches to the filters regardless of "Assignment method" unless you specifically filter on that facet. However, the other views (phylogenetic tree or Krona) both automatically filter on assignment method and cannot display all methods together. I understand why they do this (different methods give multiple results for some sequences, which makes numerical interpretation impossible), but the fact that the tabular data displays all by default makes the transition between the table and other views a little confusing, i.e. the numbers are immediately different and you have to work out why. A simple solution might be to flag the assignment method as a default filter in the tabular view, with the option to display all; this way users are immediately alerted to this feature and it is then expected when moving to the other views. REPLY: We proceeded as suggested and (1) made the assignment method (and best hit where applicable) filters active by default in the exploration interface when several methods were used to generate the selected project(s); (2) clarified this point in the documentation and the manuscript.
REMARK: This feature utilises BLAST; it appears to work in the webtool provided and is a desirable feature for some. I would question how well this part of the webapp will scale with increased numbers of large datasets. It may be necessary to enable admin users to restrict its use on certain datasets? REPLY: The idea behind using a job manager like SGE is that the administrator can set a maximum number of concurrent jobs for metaXplor to launch, this number being meant to take hardware capacity into account. Jobs are put in a queue and only run when a slot is available for them. However, in order to make this feature run more smoothly, we additionally implemented Diamond searches as a significantly faster alternative to BLAST.
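The queueing behaviour described in the reply above (a fixed number of slots, with further jobs waiting until one frees up) can be sketched in a few lines of Python. This is only an in-process analogy using a thread pool, not how SGE or metaXplor is implemented; the function name, the MAX_CONCURRENT_JOBS constant and its value are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical administrator-defined limit, chosen to fit the available hardware.
MAX_CONCURRENT_JOBS = 2


def run_jobs(jobs, max_concurrent=MAX_CONCURRENT_JOBS):
    """Run callables with at most `max_concurrent` executing at any time.

    Returns the results in submission order together with the peak number of
    jobs observed running simultaneously.
    """
    running = 0
    peak = 0
    lock = threading.Lock()

    def wrapped(job):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return job()
        finally:
            with lock:
                running -= 1

    # The pool holds submitted jobs in an internal queue; a queued job starts
    # only when one of the `max_concurrent` worker slots becomes free.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(wrapped, jobs))
    return results, peak


if __name__ == "__main__":
    def make_job(i):
        def job():
            time.sleep(0.05)  # stand-in for a long-running search
            return i
        return job

    results, peak = run_jobs([make_job(i) for i in range(5)])
    print(results, peak)
```

The point of the design is that submitting more searches never increases the load beyond the configured limit; it only lengthens the queue, which is why a faster search tool shortens waiting times without changing the limit.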
REMARK: The phylogenetic assignment tool appears to be a stand-alone tool that is not linked to the datasets available in the webapp, i.e. just clicking that button takes you to a page where you have to upload your own FASTA file of sequences and select an appropriate reference tree. REPLY: We added some text on the feature’s form submission page, and in the documentation, explaining how to assign sequences found in the system.
REMARK: Perhaps offering the user (or admin users only) the ability to run on any of the uploaded datasets to enrich them with phylogenetic assignments would be a nice addition, along with more comprehensive documentation on how to use the tool for user-defined datasets. REPLY: The phylogenetic assignment tools allow a more thorough investigation of sequence relationships. Although phylogenetic classification is indeed run as an additional step, its results can be written to the assignment table within the main project database. Unfortunately we do not think it would be feasible to let users launch an online phylogenetic placement on an entire project’s data. The reason is that this assignment pipeline starts with a mafft multiple alignment step, which can be quite time- and resource-consuming, especially if we were to use a global refpkg as the reference dataset.

REVIEWER #2

REMARK: Generally, there is a lack of citation for the opening assertions and a rather arbitrary citation list for landmark works. In particular, the Background does not stipulate the domain-specific problem that this tool proposes to solve or improve. REPLY: We added citations for the opening assertions and tried to clarify our purpose in developing metaXplor.
REMARK: A review of similar and/or competing resources, or the lack thereof, would help to better characterize the contribution of the present work and help to avoid potential criticism of a derivative connection to Gigwa v2. REPLY: Five citations were added to the article’s Background section, including one mentioning the differences with the Qiita database. We do not understand how metaXplor could be considered to derive from Gigwa. Although it is based on the same technology, it addresses totally different problems by storing types of data that have little in common.
REMARK: I am impressed by the authors' decision to use an industry standard software development paradigm. I feel that it would further strengthen this work to briefly remark on the durability, maintainability, and extendibility of their implementation, as a consequence of having elected to use freely available and well-accepted standards, such as the Spring Framework and Apache Tomcat. In my opinion, this approach lends significant credibility to the application architecture and is a refreshing change in a research domain presently inundated by ad hoc python scripting. REPLY: Many thanks for this positive remark. We added a paragraph highlighting these facts to the manuscript’s "Application architecture outline" section.
REMARK: I can appreciate the decision to use NoSQL and the Data Model is well illustrated in Figure 5. However, given the prevalence of relational database management systems, it is necessary to briefly explain and justify the choice to use NoSQL in the present work, which is assumed to be motivated by the need for a schemaless model. REPLY: The newly written paragraph mentioned in the previous reply also states this.
REMARK: A more explicit discussion of how this tool relates to FAIR standards, or even just interoperability in general, would benefit this work. REPLY: In order to enhance metaXplor’s usefulness, we implemented an additional feature that provides a means to push exported data into external online tools such as Galaxy, which may then be used for further online analyses. This feature is mentioned in the new manuscript version, and the fact that metaXplor can contribute to making data FAIR is now mentioned in the conclusion.

REVIEWER #3

REMARK: it is unclear whether the docker container requires an internet connection - i.e. does metaXplor require this oracle backend to function REPLY: We would like to point out that Oracle/Sun Grid Engine is not database software but HPC job scheduling software. The Docker containers can communicate with one another without an internet connection, as long as they are on the same local network. They can even run on the same server. However, your remark is relevant because the system queries NCBI services to feed its accession cache and find the accessions' relationships with taxonomy. Therefore we specified in the manuscript’s Requirements section that an internet connection is required.
REMARK: One of the major issues that remains unaddressed in the manuscript is how this compares to the qiita database and associated infrastructure - a comprehensive database built for storing and analyzing studies. REPLY: We referred to Qiita in the article’s Background section and mentioned the major differences that distinguish both systems.
REMARK: I would also like to try the interface, which I assume is what most end users will use. Is it possible for the authors to provide an anonymous reviewer login to explore/test the system? REPLY: We are surprised that you could not find the CIRAD online instance (https://metaxplor.cirad.fr/) highlighted on the project homepage (https://github.com/SouthGreenPlatform/metaXplor), which is itself mentioned in the original manuscript’s "Availability and requirements" section. Reviewer #1 contacted us directly and was immediately provided with a user account and some sample data that allowed him to even test data imports. Please let us know if you would also like to do so.

Thanks again for your feedback and suggestions. Best regards,

Guilhem Sempéré


Content of review 2, reviewed on November 26, 2020

Thank you to the authors for the changes. However, they have not fully addressed my concerns: citing Qiita is not sufficient to explain how this database is functionally different. (Different analyses, yes, but how do I get biological insight from a largely descriptive Krona plot and a map? What does pplacer buy me over export into something like iTOL?) I'm still unclear how this tool specifically adds value over well-described existing tools. I think this is mostly an issue of writing and possibly of considering what features will be useful going forward.

At this point in the process, it feels like there is still potential value in single-blind peer review, or at least in keeping the review single-blind while it is in progress. Is it possible to get a login through the journal so I don't have to reveal my identity during review?

I declare that I have no competing interests.

Authors' response to reviews:

Reviewer #1:

REMARK: On https://metaxplor.cirad.fr/metaXplor/main.jsp I clicked "Explore data", then "search" without changing any of the values/dropdowns, just as it appears by default. Then I clicked the Taxonomy tree view icon. When I selected "BlastX" from the dropdown menu, a dialog box popped onto the screen saying just "metaxplor.cirad.fr says" with nothing else, and no content on the page. I'm guessing that there are no blastx results to be displayed, but the dialog box doesn't say anything?! When I change the dropdown back to BlastN it all works fine again. REPLY: Thanks for identifying that bug. It has been fixed on the live instance and the fix has been applied to the latest Docker images.

Reviewer #3:

REMARK: citing qiita is not sufficient to explain how this database is functionally different REPLY: Thanks for the comment. After presenting alternative tools (Qiita and Metavir), we now clearly specify at the end of the Findings section's Background paragraph that these tools "are not specifically designed for the identification of distant homologies or tracking newly discovered sequence/gene families". Conversely, these functionalities are at the root of metaXplor's design, making it "sequence-centric" as opposed to the more "project-centric" functionalities of Qiita and Metavir.

REMARK: Different analyses, yes, but how do I get biological insight from a largely descriptive krona plot and a map? What does the pplacer buy me over export into something like iTOL? REPLY: The interest in running pplacer from within metaXplor is that it saves exporting to and reimporting from another tool, and that it allows the database contents to be enriched with newly found assignments (both facts are now mentioned in the manuscript).

Hoping that these modifications will satisfy your requests, we remain available for further instructions.

Best regards,

Guilhem Sempéré


References

Guilhem, S., Adrien, P., Magsen, A., Pierre, L., Philippe, R., Frederic, M., Gael, B., Denis, F. metaXplor: an interactive viral and microbial metagenomic data manager. GigaScience.
