Review of VirAmp: a galaxy-based viral genome assembly pipeline

Content of review 1, reviewed on August 07, 2014

In this manuscript, the authors provide a Galaxy-based pipeline to assemble viral genomes from high-throughput sequencing datasets and to analyze variations between these newly assembled genomes and reference genomes. Existing packages were wrapped using Python scripts and Galaxy XML files and are provided as standalone Galaxy tools. In addition, the packages were linked together by another Python script to generate the VirAmp analysis pipeline, which is provided as a separate “integrated” Galaxy tool. The development infrastructure is clean and transparent: a GitHub repository can be browsed and/or cloned; “Read the Docs” documentation is in place, which provides the ground for advanced, versioned documentation; and an Amazon image has been implemented that allows users with specific needs to deploy a customized VirAmp Galaxy instance. The code is clean and the pipeline works as described by the authors. Importantly, the manuscript is well written and provides sufficient detail to understand in depth the approach and methodology used by the authors. In summary, I believe that this work is likely suitable for publication and that it will be of high interest to virologists, and perhaps to biologists working with larger genomes. However, I think that a small amount of work would greatly improve the Accessibility and the Visibility (these two terms are taken on purpose from the Galaxy-related literature) of VirAmp for biologists without advanced expertise in bioinformatics.

• Major Compulsory Revisions

1/ My first main criticism is that VirAmp is not available in the form of a Galaxy Workflow. This is hard to understand, since almost all, if not all, of the individual steps of the pipeline have been wrapped as standalone Galaxy tools in the VirAmp Galaxy instance. A workflow is important for Transparency: it would ensure that users have full control over the parameters and the versions of the individual packages run by the VirAmp pipeline. To illustrate my point, I will give a simple example. In the VirAmp pipeline script (quick_assemble.py), one can see that the paired reads are trimmed by quality [quality_trim(args.l, READ_1) & quality_trim(args.e, READ_2)], then merged [merge_pair(READ_1, READ_2, MERGE_FILE)] BEFORE being subjected to diginorm. Thus there is a non-trivial treatment before “diginormalization” that is only accessible to people who are able to dig into the code. Most biologists without specific programming skills will rush to the standalone diginorm tool with the two paired-end FASTQ files… and encounter an error, as diginorm needs a merged FASTQ file. If VirAmp were available as a workflow, they would in contrast immediately see the three preliminary steps mentioned above. Frankly, I think that VirAmp would be much more compliant with the Galaxy concepts if it were available as a workflow rather than as an integrating tool. Along the same lines (but not compulsory at this stage), the standalone tools that are linked in VirAmp could (easily) be provided as tools (or a tool suite) in the Galaxy Tool Shed. Administrators could then quickly integrate VirAmp into their local Galaxy instances.
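To make concrete why the merge step matters before diginorm, here is a minimal sketch of interleaving two paired-end FASTQ streams into one. It is only an illustration: the function names fastq_records and interleave_pairs are hypothetical and are not taken from quick_assemble.py, and the sketch assumes that both files list the mates in the same order with four lines per record.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave_pairs(read_1: Iterable[str], read_2: Iterable[str]) -> Iterator[str]:
    """Merge two paired-end FASTQ streams into one interleaved stream.

    Hypothetical helper: assumes mates appear in the same order in both inputs.
    """
    mates = fastq_records(read_2)
    for rec_1 in fastq_records(read_1):
        rec_2 = next(mates, None)
        if rec_2 is None:
            raise ValueError("read_2 has fewer records than read_1")
        yield from rec_1  # forward mate first
        yield from rec_2  # then its reverse mate
    if next(mates, None) is not None:
        raise ValueError("read_2 has more records than read_1")
```

In a workflow, this step would appear as an explicit node between quality trimming and diginorm, so a user would see that the two files must be merged first.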

2/ My second major point is related to the transparency of the sequencing datasets that have been used to test VirAmp and to illustrate its capabilities. Details on the experiments that were performed to generate the sequences are lacking (cell types, infection protocols, approaches used for library preparation and for getting rid of the host genome sequences, etc.). I believe that transparency here will broaden the audience of the manuscript: some virologists may not see that they could benefit from the VirAmp pipeline until they realize that the preliminary steps before entering VirAmp are experiments that they are already doing or would be able to do.

• Minor Essential Revisions

Page 5: “3) initial genome assembly”. Why not “de novo genome assembly”?
Page 5: “9) alterative deployments of the VirAmp pipeline”. The statement is rather cryptic at this stage of the manuscript.
Page 6: “diginorm [8] approach into our pipeline.” Reference 8 is corrupted in the reference list.
Page 6: “reference guided multiple sequence alignment algorithms”. Perhaps “multiple” is not required in this sentence.
Page 7: c) VICUNA. It is not clear that this approach is not reference-guided.
Page 9: The N50 is not well explained; please rephrase.
Page 9: The same applies to Circos (“Circos draw…” is not crystal clear).
Page 11: After “Galaxy [3]”, a subject is missing in the sentence.
Page 11: The customized Amazon disk image is missing, or I could not find it. I am not a guru of Amazon EC2, but I am probably far beyond the vast majority of my biologist colleagues… I would like to add a personal note on that matter: I would bet that most people able to launch an Amazon instance in EC2 will rush to the code to implement it in their own local Galaxy instance. The AMI is a nice professional option, but having a transparent Galaxy workflow for VirAmp would be much better in the end.
Page 11: “HSV-1 is one of the most common human pathogens in terms” sounds odd. It is hard to figure out what the “common” quality is.
Page 12: The three datasets should be provided somewhere. See also my second major point above.
Page 12: “it is the most appropriate parameter than N50”. Grammar.
Page 13: “These data demonstrates that reference-guided assembly does not influence the assembly;”. Please explain better. What would "influence" be? Moreover, the previous sentence states two alternative options; it is not clear which one was proven.
Page 26: “The height of the line starts at”. Grammar? (It is not clear to me whether a height can start somewhere.)
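Regarding the Page 9 comment on N50: the statistic is commonly defined as the contig length L such that contigs of length L or longer together cover at least half of the total assembly length. The short sketch below illustrates that common definition only; the function name n50 is hypothetical and this is not the code used by VirAmp.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (common definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For contigs of lengths 80, 70, 50, 40, 30 and 20 (total 290), the two longest contigs already cover 150 bases, which exceeds half of the total (145), so the N50 is 70.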

There are some typos in the figures. I think they are due to errors in the PDF conversion.

Level of interest: An article of outstanding merit and interest in its field
Quality of written English: Needs some language corrections before being published
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

Authors' response: (http://www.gigasciencejournal.com/imedia/9832112691462111_comment.pdf)

Content of review 2, reviewed on October 20, 2014

All my concerns have been addressed. I understand the issue of a clean implementation of the tools in the Galaxy Tool Shed, to be discussed with BG. Still, I am looking forward to this future implementation.

Level of interest: An article of outstanding merit and interest in its field
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

References

Yinan, W., W., R. D., Istvan, A., L., S. M. 2015. VirAmp: a galaxy-based viral genome assembly pipeline. GigaScience.
