Journal

GigaScience

About

GigaScience aims to revolutionize data dissemination, organization, understanding, and use. An online open-access open-data journal, we publish 'big-data' studies from the entire spectrum of life and biomedical sciences. To achieve our goals, the journal has a novel publication format: one that links standard manuscript publication with an extensive database (GigaDB) hosting all associated data, and that provides data analysis tools through our GigaGalaxy server. To further promote transparency in the review process, we have open review as standard for all our peer-reviewed papers.

Our scope covers not just 'omic' type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.

Review policy on Publons
  • Allows reviews to be publicly displayed
  • Allows reviewers to display the title of the article they reviewed
Reviews

253


Reviews

  • In this manuscript, the authors analysed meta-barcoding and shotgun metagenomic data from spontaneous wine fermentations, showing high correlations in abundance measurements. Furthermore, the comparison between the meta-barcoding and shotgun metagenomic data showed a strong bias in the meta-barcoding data for the genus Metschnikowia. In general, the manuscript is well written, with an appropriate structure and a comprehensive description of the methods and results.

    Major Comments:

    1. About Figure 1:
      1) The description of Figure 1 needs improvement. For example, on line 537, "both plots": which two plots specifically? (Presumably A and B, but this should be clearer.) Also, plots (B) and (C) seem to be mislabeled, so the description does not match the labels in the figure. The last sentence, "Abundance values are presented as in Fig 1A", also seems inaccurate: the legend for the values is located at the bottom of the figure, not in subplot A.
      2) In subplot C, how were the nodes for the top 30 genera positioned in the plot? How was the distance between nodes calculated?
    2. Line 215: the reference genomes used for alignment were "assembled from existing genomic resources for fungal and bacterial genera that were known, or suspected of being wine-associated". What exactly are the genomic resources? On lines 408-413 in the Methods section, "whole genome sequences were collected, when possible": is "collected" here the same as "assembled"? If the genomes were assembled, which assembler was used? This is not very clear and may need further clarification.
    3. About Figure 2A: "only the abundance measures for species within the Hanseniaspora genus" are depicted. Why was this genus picked? On lines 258-260, it is mentioned that the identity values are significantly lower for "Mucor circinelloides, Pseudomonas syringae and Hanseniaspora valbyensis"; why not present these genera in 2A?
    4. Table 2:
      1) Line 276, "two cases": what are the two cases?
      2) In Table 2, what does the gray color represent? Also, the boxes in the "Total OTUs" column are gray and not gray respectively. Should this be more consistent?
      3) Should the "Control mix 1" for "AWRI1498" be "1x10**6" as for "AWRI796"?

    Minor Comments:

    1. Line 49: "To the map" -> "To map".
    2. Line 281: "D1 T1 and T2", a comma is missing between "D1" and "T1".
    3. Line 293: "were not within a five-fold range", but on line 290 it is "two-fold". Why "five-fold"?
    4. Line 560: "to the total the abundance of", is there an extra "the" in the sentence?
    5. Line 565: "an abundance of S. cerevisiae of 1 million reads per million", what does "per million" mean here?

  • The manuscript describes the dataset and bioinformatics pipeline enabling the construction of a haplotype map based on sequences of ~1,200 maize accessions. A haplotype map developed from thousands of accessions is a key information resource for maize genetics and breeding. Beyond the importance of this work for the maize community, the pipeline described here will be very useful for other crops with issues similar to those of maize, including a large genome with a high number of duplicated regions, gene copies and repetitive elements. Overall, the manuscript is well written, with an adequate level of information provided. The work described here is scientifically sound and explained logically. Although it is the third generation of the maize haplotype map, the manuscript includes enough novelty to justify publication, with the improvement of the pipeline for minor allele calls and heterozygous calls. The use of flags in the VCF files indicating the characteristics of the variant sites will also help readers to use the data appropriately. The data support the conclusion on p 15, which is an important message for the readers.

    I have some minor corrections and a couple of points that need clarification:

    - p 3 line 16: add the acronym CAU for China Agricultural University, as used later in the text.
    - Add a space between numbers and the bp unit throughout the text.
    - Tables 1 and 2 are difficult to read. It would be easier to separate the results in Table 1 under "Coverage per taxon" into different columns instead of separating them by a comma. Please separate 3.1.1 from 3.2.1unimp and 3.2.1imp in different columns in Tables 1 and 2.
    - There are a few typos, such as CIMYYT on page 5 line 46.
    - p 5: the acronyms should be described in the text: base quality (BQ) in line 51, mapping quality (MAPQ) in line 53.
    - p 5 line 53: why did the authors choose 30 as a threshold for MAPQ?
    - p 5 lines 58-60: the sentence is unclear to me. What is the null hypothesis here? "Sites of high probability" of what?
    - p 7 line 25: delete the dot point at the beginning of the sentence.
    - p 7 line 57: why wasn't a local realignment done?
    - p 8 line 51: what is the purpose of calculating the inbreeding coefficient? The paragraph should start by explaining this.
    - p 8 line 54: is the lower threshold q1? If so, please spell it out.
    - p 9 lines 7-9: this sentence explains the aim of Figure 4 and should be at the beginning of the paragraph.
    - p 10 line: please spell out the acronym DUP. Since M&M is at the end, it's best to assume that the reader hasn't read it yet.
    - p 11 line 41: spelling mistakes (failed the LD filter?)
    - p 13 lines 26 and 29: replace the comma by a semi-colon to clearly separate the % results.
    - p 14 line 14: "to capture real signal related to phenotypic expression". I find this expression a bit odd. Do you mean "to capture a true association with phenotype"?
    - p 15 line 31: replace the comma by a colon in (teosinte lines: 17 Z. mays…)
    - p 15 line 51: spell out the acronym (North Central Regional Plant Introduction Station)
    - p 16 lines 4-21: this paragraph doesn't match the writing quality of the rest of the paper, as if it wasn't written by the same author or hasn't been reviewed. The writing is unnecessarily complicated and a bit confusing for the reader.
    - p 16 line 32: replace 113.702 billions by 113.7 billions for ease of reading; add "were obtained on 1,218 taxa".
    - p 16 line 44: replace "better" by "higher".
    - p 17 line 31: I find this sentence unclear. By "reads with non-zero mapping quality", do you mean reads with a correct location?
    - p 21 lines 20-39: I find this paragraph difficult to understand.
    - p 22 line 46: why did you choose the number of 70 sites in best LD?
    - p 22 line 24: replace the comma by a semi-colon.
    - p 22 line 58: "taxa with less than 50% non-missing genotypes". It would be simpler to say taxa with more than 50% missing genotypes.
    - p 23 line 11: delete "the" in "this information is the used to compute".
    - p 25 lines 7-12: some acronyms are missing: AGP, MAPQ, BQ, LDKNN, NI5, LLD, NO, DUP, VCF.
    - There is a lot of jargon and many acronyms in the figures, which makes them difficult to read. As most people read the figures first, I suggest you add information in the titles (acronyms and purpose or conclusion).

  • I'll take this in two sections: comments about the web app and comments about the manuscript.

    Web App

    This is a nice web app that attempts a difficult job. It is great that it is presented in multiple forms, including standalone and VMs.

    Often people do not like the idea that their email will be taken, even if it does mean that they can retrieve results later. A less intrusive mechanism is to provide a job id on submission which can be used to get at results, either through a form or a URL.

    The help mouse overs on the main page do not work.

    Manuscript

    The language describing what a reciprocal BLAST can achieve is inconsistent. Although it is clear later in the manuscript that you are aware that a reciprocal BLAST can only find putative orthologues, this should be clearly stated throughout. At some point you might want to clarify how orthologue detection requires strong phylogeny.

    The manuscript as written kind of makes a trap for itself. It tries to claim that RBH and this tool will help large scale analysis where only small scale analysis could be done before, but in doing so it invites us to consider RBH in a new (ish) mode - that of big data tool. And if we are to use it there, well, we need to know how well it does in that domain. In this case then this manuscript must do that large scale comparison and sadly, the manuscript is a bit lacking.

    Considering RBH as a method for big data, then, I don't think the utility of the method and the tool is sufficiently well tested. A glaring omission is that there is no attempt to describe the error of the method, that is, the number of false putative orthologues, or any attempt to develop a metric for the believability of the whole set of data. A tool whose main selling point is that it can find broad patterns in large datasets really needs some measure of how many of the putative orthologues it classifies are right or wrong. The experiment carried out is simply a run of the tool. As a minimum, I'd hope that you'd compile a list of known orthologues, curated carefully and manually (there are lots of databases with these: orthologene.org, orthoMCL, compara @ EMBL), then run your tool and assess how many of these you'd found. Until a true benchmark experiment is done, the manuscript lacks a sufficient demonstration of the tool's utility.

    If we aren't supposed to consider this as a new big data tool and just a useful implementation of RBH, then the language and claims about being able to compare gross patterns across large phylogenetic distances need to be toned down in the manuscript.
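
The benchmark requested above can be sketched in a few lines. This is a minimal illustration, assuming best-hit tables are already available as dictionaries; the gene identifiers, the toy curated set, and the function names are hypothetical and do not reflect the tool's actual interface:

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def reciprocal_best_hits(best_a_to_b: Dict[str, str],
                         best_b_to_a: Dict[str, str]) -> Set[Pair]:
    """Keep (a, b) only when b is a's best hit AND a is b's best hit."""
    return {(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


def benchmark(predicted: Set[Pair], curated: Set[Pair]) -> Dict[str, float]:
    """Score predicted putative orthologue pairs against a curated set.

    Caveat: a prediction absent from the curated set is counted as a false
    positive here, although it may simply be an uncurated true orthologue.
    """
    tp = len(predicted & curated)   # predicted and curated
    fp = len(predicted - curated)   # predicted but not curated
    fn = len(curated - predicted)   # curated but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if curated else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy best-hit tables (hypothetical identifiers).
best_a_to_b = {"geneA1": "geneB1", "geneA2": "geneB2", "geneA3": "geneB2"}
best_b_to_a = {"geneB1": "geneA1", "geneB2": "geneA3", "geneB3": "geneA2"}

predicted = reciprocal_best_hits(best_a_to_b, best_b_to_a)
curated = {("geneA1", "geneB1"), ("geneA2", "geneB2")}
scores = benchmark(predicted, curated)
print(predicted, scores)
```

Reporting precision and recall against a curated set would give readers the error estimate that is currently missing, with the caveat noted in the code that curated sets are themselves incomplete.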

  • This manuscript analyses the fungal community at different stages of spontaneous fermentation through two approaches, ITS phylotyping and shotgun sequencing. I found the manuscript very interesting. It is the first study to my knowledge that uses a shotgun approach in the wine environment, so I do believe it is very innovative. However, my main criticism is that after all the work done (and money invested), the "application" section and the "abstract" just highlight that shotgun sequencing was able to uncover an amplicon bias towards Metschnikowia spp. So I do believe the discussion/conclusion (and abstract) needs to be expanded, and several interesting results that could be drawn from the study should be included. I would have expected to see an extended discussion/conclusion about how this methodology was able to reach a higher taxonomic depth for certain taxa, even detecting some strains, and some bacteria too. The fact that around 85% of the reads mapped to a wine-related microorganism reference genome is definitely something to be highlighted as a conclusion. There is no in-depth discussion of the community dissimilarities found among samples with the two approaches (e.g. comparing the sample clustering obtained in figure 1c and figure 2c). Another drawback is that the work does not include any metadata regarding the must/vineyard characteristics, and thus it lacks a discussion of why the first stages of musts and ferments coming from different wineries are so different.

    I am a bit confused about certain sections of the paper that I think should be further explained. Also, certain sections of the discussion should go into the materials section. For example, I think that the authors should explain how the "Control populations" were made, and this should be included in the methods section and clearly indicated with a header (e.g. in line 323).

    The analysis and statistical approaches are sound overall. However, I found the conclusions drawn from some figures difficult to follow (in particular, the conclusions drawn from figure 2c are misleading to me, lines 278-283).

    Comments/Corrections

    - Line 30 and line 49: change "To the map" to "To map".
    - Line 91: what is "with several amplicon based methods" referring to? That they use different primers, different sequencers…? I would say "several amplicon based studies have been conducted".
    - Line 118: "D0 (at inoculation)"?? But this is spontaneous fermentation!
    - Line 119: "As seen with the selective plating…": here it is my understanding that you are talking about previous culture-dependent works? This sentence needs a reference.
    - Lines 119-122: which data are you referring to for this statement? Is it figure 1? You need to add it.
    - Suggestion: move lines 122-124 to line 118 (as they refer to the fermentation characteristics; then talk about the microbes).
    - Lines 129-132: I would suggest this be part of the methods, not this section.
    - Line 139: change "all the of otus" to "all the otus".
    - Line 139: why 78 samples? There are 66.
    - Line 171: "figure (1B)" should be "figure (1C)".
    - Line 192: I would move these lines up, after line 174, to link them with the PCOA.
    - Line 194: there is no vintage information in table 1.
    - Line 211: I am curious to know how many of the reads aligned to Vitis and were discarded. You could include the values in table 1.
    - Line 216: having only 15% of reads unable to align is a really good result; it seems like a "too" good result. Does it mean that the wine environment microbes are very well characterized (we have the genome available for most of the wine environment microbes…)? Or is it that when you mapped your reads to the reference genomes you were not very restrictive on the alignment (only q10)? I would say this is something that might be interesting to discuss in the paper.
    - Line 236: the authors have identified several prokaryotes. I suggest you compare the taxa identified by shotgun with what is already known from other studies that have used 16S rRNA phylotyping. Does it fit with the most abundant bacterial community found in other ferment samples?
    - Line 248: why not take these reads and, for example, BLAST them to see whether they are from mitochondria?
    - Lines 253-260: this paragraph seems to refer to Fig 2A, but in line 260 the authors talk about microorganisms other than the ones in Fig 2A. And the results for the microbes specified in Fig 2A are not discussed.
    - Lines 278-280: Figure 2C. "D1 samples (T1,t2 and Y3) largely differentiated …" I can see T1 and T2 as a differentiated group, but Y3 does not seem to me that different from Y1d1. I guess I am not interpreting the ordination plot correctly. (Anyway, it would have been nice to do a PCOA with ITS data and another PCOA with shotgun data with just the samples that are in common, in order to see if the clustering of the groupings is similar with both methods (as it seems to be), maybe as a supplementary table?)
    - Line 287: 23 taxonomic identifiers. Are the authors comparing the shotgun data that were mapped to the reference genomes (or also the ones with metaphlan)? Are there only 23 taxa in common between both methods?
    - Line 323: I understand this refers to the control samples. I am confused about this one, as it is included under the header "ferment samples"?
    - Line 393: how much of the total abundance do these 30 OTUs make up? Why do it with just these 30?
    - Line 396: change "and two winery samples" to "from two winery samples".
    - Line 437: nothing is said in the methods about how you compared the ITS and shotgun sequencing results.

    Figures and tables

    - Fig 1c: despite the colors, I would suggest the authors write the whole name (e.g. T1D0, T1D1…) to refer to the samples rather than just referring to them by color and fermentation stage (same for figure 2c).
    - Fig 1b: include species names in the figure.
    - Table S3: some of the species do not have accession numbers; they have a blank cell.
    - Table 3: in the legend, the authors need to explain that "alignment" means aligning to a reference genome.
    - Table 4: I am confused about the spacer. What is the dot referring to: any nucleotide (in that case better to specify it as "n") or the lack of a nucleotide?

  • RecBlast provides a tool to find orthologs across genomes. I think that, in general, the manuscript misses information on benchmarking and application. How does this method compare in time and accuracy to other methods that attempt to find orthologs? I tried to install the standalone software and noticed that the "mygene" package should be listed as an additional dependency. "seaborn" is also needed for the script to run and is not a standard Python package. It doesn't seem like the script is able to take in a list of genes in FASTA format and find orthologs. This seems to be a major limitation of the script. I tried to download the results from the web page to see what the output looks like, but I was denied access, so it is unclear to me how useful the results may be. One way to improve the utility of the script would be to include output that contains concatenated proteins from orthologs that could be used in phylogenetics. In that way, trees could be created from orthologs from distantly related taxa.

    Specific comments:

    - L39: define what you mean by "the evolutionary tree".
    - L4-L29: The description of the algorithm is confusing to me. Sometimes the authors use "organism", "target organism", "target species", "original organism". I would change the text to clarify what you are comparing.
    - L14: Clarify what you mean by "matches a protein". What criteria is this based on?
    - L19: I would change "cater for" to "cater to".
    - L34: I'm not clear what you mean by "any computational background".
    - L45: I would list the GitHub address here. I see that it's in the supplemental info as well.

    Figure 2a. Does this mean the number of human orthologs between these different species? Do these results make sense? Figure 2b. Does this clustering make sense from a biological perspective?

  • In the article "fastBMA: Scalable Network Inference and Transitive Reduction", the authors present an improved tool, fastBMA, as an extension of their prior work on the inference of genetic networks. Most comments below are related to the article text and writing style, rather than any major concerns related to the scientific results. However, there should be significant adjustment to the text to improve the scientific clarity of the findings (ie, not all figures in the article are referenced in the text, article does not follow style guidelines, etc).

    1. For technical notes, the article sections should include only "Findings" and "Methods", which can then be broken down into subsections. While this has been done for a portion of the article, the article flow could be improved significantly to increase the clarity of the scientific content. Conclusions should also be moved into a subheading of "Findings", instead of falling after "Methods". Results should also be integrated into the "Findings" section.
    2. Would recommend placing "Related Work" in the background and integrating "Our Contributions", rather than including this as a separate section.
    3. Several references to the speed of fastBMA are made in the Background/Our Contributions/Related Work sections, without any supporting evidence or figures in those sections.
      • Second paragraph of "Our Contributions" in 2 locations
      • In "Estimating model posterior probabilities" and others, should indicate/explain what is meant by "faster C++ code" for fastBMA -- do the other applications use a different language? Less performant algorithms?
    4. The implementation methods of fastBMA are also described in the "Our Contributions" section, prior to "Related Work"
    5. Methods are written more like results (ie, "Algorithmic outline..." discusses the performance enhancements rather than just the approach) and discussion sections instead of being used as an explanation of implementation details and data sets
      • "Replacing the hash table" has similar issues, and also discusses "crashing a 56 GB machine" with minimal explanation (possibly out of memory? unclear how large of a dataset for this to occur).
      • Most of the "Replacing the hash table" section appears to reference ScanBMA rather than fastBMA -- would focus on methods of fastBMA and how this improves on the prior work in the findings, instead of going into in-depth explanations in the methods
      • The end of this section states that fastBMA is much faster than using a full hash table, but no supporting data are provided (only a description of the approach)
    6. Figure 3 is never referenced in the text
    7. The text in the section "Transitive reduction to eliminate redundant edges" is not entirely clear. While the purpose is in the title, the text does not necessarily support the title, nor offer any evidence (figures, data) to support the conclusions in the section
    8. While the fastBMA results in Fig 4B cannot all be compared to ScanBMA since runs with equivalent data were not possible, the statement that all fastBMA lines are to the left of ScanBMA should be better explained in the text, as the larger fastBMA data set (without priors) takes as long as or longer than ScanBMA (agreed that these cannot be compared, but the text does not explain this as currently written). This may be clarified by splitting references to Fig 4A and 4B in the text, rather than only referencing "Figure 4". May also want to explain why running with priors takes substantially less time than running without priors on fastBMA.
    9. More background on what informative priors were used from external data sets may be of benefit
    10. For the 32 core cluster, was this multiple machines totaling 32 cores? Or a single 32 core node?
    11. Some discussion as to why the AUC is better in Fig 4A for fastBMA 8 core compared to fastBMA 1 core would be warranted
    12. The OR parameter used for fastBMA in Figure 5 should be stated, to better compare results from the AUC and Precision-Recall curves
    13. Can reduce the number of times links to the software are referenced in the article (i.e., the Docker images are noted in the abstract, contributions, and conclusion)
    14. For DREAM4 data set, both 10-gene and 100-gene data are referenced in the "Datasets" section, but not indicated which was used in the results/figures
    15. A prior ScanBMA article appears to have used all 3556 variables in the Yeast data set (http://bmcsystbiol.biomedcentral.com/articles/10.1186/1752-0509-8-47) -- any reason that ScanBMA was only run with 100 variables+prior here, instead of including the 3556 without prior?
    16. Explanation of the software environment setup and its impact on performance/run time should be included -- were all tools installed on a single virtual machine? Running the same OS? Were they run within Docker containers? Any potential performance changes due to the use of shared/virtual hardware? Were the applications run a single time, or were they run multiple times to determine if there was any variability between runs based on potential storage/network capacity within the shared environment? Were data sets stored locally within the instance?
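
One way to address the run-to-run variability raised in comment 16 is to repeat each benchmark several times and report the spread alongside the mean. A minimal sketch, assuming a hypothetical `dummy_workload` callable stands in for launching the tool under test (the real invocation and data sets are not specified here):

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(workload: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run the workload `repeats` times; return wall-clock seconds per run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range of the measured run times."""
    return {
        "mean_s": statistics.mean(durations),
        "stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min_s": min(durations),
        "max_s": max(durations),
    }


def dummy_workload() -> int:
    """Hypothetical stand-in for one invocation of the benchmarked tool."""
    return sum(i * i for i in range(100_000))


durations = time_runs(dummy_workload, repeats=5)
summary = summarize(durations)
print(summary)
```

A large standard deviation relative to the mean would suggest that shared storage or network contention is affecting the reported run times.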

  • Some comments below:

    In the "Classes of multiplicity for analyses of track suites" section, the authors mention figures 3a, 3b, ... instead of 4a, 4b, ... Moreover, Figure 4 has no a), b), c) or d) labels.

    Testing https://hyperbrowser.uio.no/hb/#!mode=basic (warning: as I had a very slow internet connection, the issues mentioned here may be due to this technical limitation and not to a problem with the Galaxy instance):

    In the basic mode section, "click here to load a sample track with Multiple Sclerosis-associated regions, expanded 10kb in both directions (you will be automatically redirected back to this page if you choose this option)": I was not automatically redirected to the mentioned page, but to the home page. The same happens with step 2b, "click here to load a sample GSuite of DNaseI accessibility for different cell types (you will be automatically redirected back to this page if you choose this option)".

    As you are using "customhtml" output to let the user export the raw data table, it seems that the user can't easily create a workflow using this output table. It may be of interest to offer an option for a "classical" text export directly on the tool form.

  • The authors of the manuscript titled: "fastBMA: Scalable Network Inference and Transitive Reduction" have developed a fast and scalable gene regulatory network reconstruction algorithm which is a faster and more accurate version of their previous algorithm scanBMA. It also features a network post-processing method based on transitive reduction of graphs. Below are my comments on this manuscript.

    In general the manuscript is relevant to current research, especially in the field of systems biology and biostatistics. It is well written and clearly understandable. It is a welcome addition to the arsenal of scalable algorithms for gene regulatory network inference. However, I think the paper can be improved significantly by addressing the following comments.

    1) The authors claim that the transitive reduction based network post-processing method is a novel and important feature of their algorithm. Firstly, very similar techniques were previously used in many papers, some of which were cited by the authors in their manuscript. Therefore, I do not think it is appropriate to call it novel. Secondly, in the benchmarking studies, the transitive reduction method did not seem to improve the accuracy of the networks inferred by the fastBMA algorithm. If it does not improve the performance of fastBMA then why is it being packaged together with fastBMA and being presented as an important feature of the fastBMA algorithm?

    2) In the "Background" section (under "Findings") the authors cited many relevant research papers. However, in the regression-based methods category the authors mostly cited their own work. I think the authors should cite other similar works in the same category, e.g. doi:10.1038/srep37140, http://dx.doi.org/10.1039/C4MB00053F, https://doi.org/10.1093/bioinformatics/bti487.

    3) It seems that the underlying principles of the fastBMA algorithm are described under the heading "Related work". This is confusing since "related work" typically refers to similar work by other researchers.

    4) The authors claimed that their algorithm can incorporate prior knowledge of the network topology in the inference process. In the benchmarking studies they have shown how prior knowledge improves the performance of their algorithm. However, I did not find a description of how prior knowledge is incorporated in the core algorithm. A brief description of this process will help readers understand the algorithm in its entirety.

    5) The benchmarking studies performed in this manuscript are not convincing. The authors did not compare the performance of their algorithm with some of the most well-known methods, such as GENIE3 (http://dx.doi.org/10.1371/journal.pone.0012776) and JUMP3 (10.1093/bioinformatics/btu863), which were shown to be significantly superior to algorithms such as ARACNE, MRNET, CLR etc. The latter were used to benchmark scanBMA, which in turn is the only comparison for fastBMA in this manuscript. To gain a better understanding of where their algorithm stands in terms of accuracy, compared to the current state of the art, they should compare the performance of their algorithm with the current top performers.

    6) The authors did not properly discuss the weaknesses of their algorithm, for instance in which scenarios their algorithm is not expected to perform well?
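
To make the technique discussed in comment 1 concrete, here is a minimal sketch of transitive reduction on an unweighted directed acyclic graph: an edge is dropped when its endpoints remain connected by some other path. This is the generic textbook version and not necessarily the variant implemented in fastBMA; the adjacency-dictionary representation and the function names are assumptions for illustration:

```python
from typing import Dict, Hashable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def reachable(graph: Graph, start: Hashable,
              skip_edge: Tuple[Hashable, Hashable]) -> Set[Hashable]:
    """Nodes reachable from `start` without traversing the edge `skip_edge`."""
    seen: Set[Hashable] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if (node, nxt) == skip_edge or nxt in seen:
                continue
            seen.add(nxt)
            stack.append(nxt)
    return seen


def transitive_reduction(graph: Graph) -> Graph:
    """Remove every edge u -> v that is implied by a longer path u -> ... -> v.

    Assumes the input is acyclic; for graphs with cycles the reduction is
    not unique and this simple procedure is not appropriate.
    """
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, vs in graph.items():
        for v in vs:
            if v in reachable(graph, u, skip_edge=(u, v)):
                reduced[u].discard(v)
    return reduced


# Toy network: A regulates B, B regulates C, plus a redundant edge A -> C.
network = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
reduced = transitive_reduction(network)
print(reduced)
```

In the toy network the direct edge from A to C is removed because the path through B already explains it, which is the kind of post-processing whose benefit the benchmarks would need to demonstrate.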
