Content of review 1, reviewed on May 10, 2020
[This review follows specific guidelines provided by the Publons team to secure rapid critical appraisals of papers related to the COVID-19 outbreak. My assessment scrutinizes this paper for sound scientific practices, in the hope that this will help researchers working on the COVID-19 outbreak.]
It has been a pleasure to review this pre-print by Holland et al., entitled "BioLaboro: A bioinformatics system for detecting molecular assay signature erosion and designing new assays in response to emerging and re-emerging pathogens." I am formatting this review as per the Publons guidelines for COVID-19 pre-prints and publications. Please see below for my main points and detailed feedback.
Reviewer's note: Content coming from the Publons COVID-19 rapid review template is placed within square brackets, to distinguish it from my own remarks.
I am not currently working directly on the COVID-19 response, but I consider myself an expert in the methods and core concepts used in this paper. I work in my spare time on developing bioinformatic software for analyzing COVID-19 data.
[Please select one or more of the following sub-topics within COVID-19 research to which this paper relates:]
Epidemiology, Virology, Genetics
The corresponding author, Dr. Shanmuga Sozhamannan, has worked extensively in the biodefense industry, and has been conducting research in analyzing infectious diseases for over two decades.
Brief overview of the paper and its main findings
This pre-print discusses a bioinformatics pipeline supporting various work-flows involving qPCR primer-probe sets ("primer sets"), such as evaluating the accuracy of existing primer sets, or developing new primer sets /de novo/.
The main finding of this report is that one can build an integrated bioinformatic pipeline that develops /de novo/ candidate primer sets, ready for /in vitro/ validation, in a matter of hours. The authors also demonstrate that existing primer sets for Bombali ebolavirus have "eroded", meaning that as new genomes became available for the taxon, the previously designed primer sets lost accuracy. Lastly, they developed primer sets for the SARS-CoV-2 genome and evaluated the accuracy of these /de novo/ primer sets against the SARS-CoV-2 genomes available in NCBI GenBank.
Major and minor points
[This should be the main substance of your review. In your assessment, is the methodology sound? Please endeavor to distinguish between fundamentally weak methodological practice and limitations which may be acceptable (but are still important to flag) due to the inherently rapid nature of this research.]
My major points are as follows:
- Need for methodological validation
- Key method details are not included
- Need for more consistent and well-defined terminology
I will discuss these three points in more detail here:
This manuscript presents a bioinformatics pipeline with three key components, connected in series:
- Identifying "signature" regions using BioVelocity
- Using Primer3 to identify candidate primer sets from the signature regions
- Using a tool called PSET for evaluating the accuracy of the candidate primer sets
While the description of the components of the pipeline is clear, the paper does not present any information that robustly demonstrates how well each pipeline component performs. This is most relevant to the BioVelocity component, which uses a combination of 50-mers and 250-mers in its analysis of entire genomes. While this allows for fast performance, for species with highly similar genomes, such as Campylobacter jejuni, even a 250 bp window might be too large, leaving the pipeline unable to generate any candidate primer sets. This concern could be addressed by running BioLaboro against a large set of (pathogenic) species of varying strain diversity and showing how different k-mer thresholds perform. Unfortunately, the authors describe the pipeline being run on ebolavirus and SARS-CoV-2 only, using fixed k-mer sizes of 50 and 250. Perhaps the earlier paper on BioVelocity provides some validation, but those results were not shared in this pre-print (and I did not find any validation results in the BioVelocity paper on a quick read). Also, there are myriad tools that perform full-genome alignments, such as MUMMER. The authors should provide citations or data that explain how BioVelocity performs relative to these other available tools.
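To make the conserved-region concern concrete, here is a toy sketch of my own (not the authors' code) of the conserved-k-mer idea that BioVelocity appears to implement: a k-mer is "conserved" if it occurs in every target genome. The sequences and the k value below are hypothetical and shrunk for illustration (the paper uses k = 50).

```python
# Toy sketch (mine, not the authors' code) of the conserved-k-mer idea:
# a k-mer is "conserved" if it occurs in every target genome. The
# genomes and k below are hypothetical, shrunk for illustration
# (BioVelocity uses k = 50).

def kmers(seq, k):
    """All substrings of length k, sliding by 1 bp."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

targets = [
    "ACGTACGTTTGCA",
    "ACGTACGATTGCA",
    "TACGTACGTTGCA",
]
conserved = set.intersection(*(kmers(g, 5) for g in targets))
# For highly similar off-target genomes, subtracting their k-mers
# (conserved -= kmers(off_target, 5)) can empty this set entirely,
# which is exactly the Campylobacter jejuni worry above.
```

The point of the sketch: the larger the window, the more easily the intersection (and the subsequent off-target screen) empties out for taxa with many near-identical relatives, which is why fixed sizes of 50 and 250 need empirical justification.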
On a related but smaller point, the paper includes statements about the pipeline being 'rapid', but no data are presented to support the claims of fast performance. As an anecdote, I personally have built a similar pipeline that generates candidate primer sets for COVID-19 in approximately 30 minutes. So when the authors state that their pipeline generates primer sets "in hours", I am naturally curious about where that time goes. I am not claiming that my pipeline is a valid benchmark; rather, the comparison makes me want to understand how BioLaboro spends its time. The authors should break down the average runtime of each component of their pipeline, along with each component's average RAM and disk utilization.
Missing Methodological Details:
The pre-print needs to describe in more detail how the following third-party software was used (i.e., the parameters used):
- NCBI Blast+
- Primer3
- GLSEARCH
Specifically with Primer3: there are dozens of available input parameters, many of which have very specific behavior when sifting candidate primer sets for primer design flaws. These should be enumerated and explained in the Methods section or in the Supplementary Materials.
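For illustration, this is the kind of enumeration I would expect in the Supplementary Materials. The tag names below are standard Primer3 (Boulder-IO) parameter tags; the values are hypothetical placeholders, not the authors' settings.

```python
# Illustrative only: the kind of Primer3 parameter enumeration the
# Methods or Supplementary Materials should contain. The tag names are
# standard Primer3 (Boulder-IO) tags; the VALUES are hypothetical
# placeholders, not the authors' actual settings.
primer3_global_args = {
    "PRIMER_OPT_SIZE": 20,
    "PRIMER_MIN_SIZE": 18,
    "PRIMER_MAX_SIZE": 25,
    "PRIMER_OPT_TM": 60.0,
    "PRIMER_MIN_TM": 57.0,
    "PRIMER_MAX_TM": 63.0,
    "PRIMER_MIN_GC": 30.0,    # GC-content bounds, in percent
    "PRIMER_MAX_GC": 70.0,
    "PRIMER_MAX_POLY_X": 4,   # screens out long homopolymer runs
    "PRIMER_GC_CLAMP": 1,     # required G/C bases at the 3' end
}
```

Each of these tags changes which candidate primer sets survive screening, which is why leaving them unreported makes the results hard to reproduce.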
Another example: when providing a reference to the GLSEARCH program, the authors cite the EBI website and web APIs, but there is no mention of this in the pre-print itself. In terms of pipeline runtime performance, it is a significant detail whether a tool is run locally or via web APIs against a server on a different continent.
The authors were consistent in providing NCBI accessions for the target genomes used, but they need to state which version of NCBI GenBank was used for the local Blast+ searches.
The authors also mention several times a "penalty score" used in ranking candidate primer sets, but this penalty score is never defined. It is also unclear whether they are using one of Primer3's standard penalty scores or a custom score of their own devising.
The authors can strengthen their pre-print by providing clearly defined and consistently applied terminology. For example, they use the term 'contig' to mean something other than an assembly artifact; they use the term 'signature' to mean a genomic region in one context and a candidate primer set in another; and they use the term 'k-mer' twice in the same paragraph to mean two different steps of the same algorithm. Please see my detailed notes for the occurrences that were unclear to me. In general, I came across several parts of the pre-print where the attention of a skilled technical editor would help the text "flow" better, but I have tried not to dive into editing the text.
Below please find my detailed comments, organized by page number and line number. Please excuse my brevity, and please do not interpret it as being rude or harsh, as that is not my intent.
L34: Sensitivity and specificity are not the best match here: this is a search domain, so the class imbalance between TP and TN will cause problems (see more discussion below).
L142-144: This could be clearer. The oligos do not mutate; rather, the amplicon region of the target genomes may carry mutations or natural variation relative to the reference strain's genome.
L161-162: These job types have names that look reasonable, but I cannot figure out the differences between some of them. I advise defining the job types precisely if they are to be enumerated here; otherwise, remove them.
L165: 'Signature' seems to be used as a verb here, and I do not know what it is meant to convey. Regions that are both conserved among target genomes and also within a signature region of interest, such as a target gene?
L165-167: This seems like reinventing the wheel; there are myriad genome alignment tools, such as the MUMMER suite. The following section about splitting "contigs" does further justify the design choice, since it pre-screens these regions against "all non-target sequences" to achieve the desired specificity. Unfortunately, what "all non-target sequences" refers to is unclear.
One concern that immediately arises is that this will not work well for genera or species with many near-identical strains, since a perfect 250 bp region unique to a single variant will be rare. As this has only been assessed on two viruses, and not on a larger panel of species, it is hard to tell how it would perform in difficult cases. Campylobacter jejuni comes to mind.
I do not like the use of the term "contig" here, as it muddies the waters. The authors have not described up to this point the "status" of the genomes employed (whether draft or complete), so it is unclear whether 'contig' refers to actual assembly contigs or to their novel definition here. I would strongly recommend a different term for their extended conserved regions.
There are too many terms used inconsistently and not properly defined: sequences, passing sequences, signature contigs, contigs, target sequences, conserved k-mers. If the terms are carefully defined, fewer of them will likely be necessary, which would greatly clarify the presentation.
L171-172: Primer3 is a very complex component of this pipeline, but not enough description is given. I'm hoping that the Methods section will be detailed.
It is unclear why only the top five among all "signatures" were kept. And which "penalty score" was used? Primer3 penalty scoring concerns only heuristics and thermodynamic models of the sequence; it does not factor in "signature erosion" type concerns. In other words, a primer-probe set with the sixth-lowest (i.e., sixth-best) penalty score might end up having the best accuracy in discriminating target genomes from off-target genomes.
L172-177: Inconsistent terminology: assay, primer, primers, primer sets, primers and probes. This needs to be formally defined and used consistently throughout.
I don't like the use of the term "assay", as to me an assay is an /in vitro/ kit consisting of oligonucleotides and other reagents. This pipeline generates candidate primer sets; candidate primer sets are later validated /in vitro/ at the wet bench.
L181-182: Define similarity here: nucleotide? peptide? The following line makes these lines ambiguous.
L184: if -> whether
L185: Define the "current Ebola assay signatures". Above, signatures were regions. Now the term seems to be used for primer pairs, or a primer pair plus probe, or what I would call a qPCR primer set.
L187: assays or assay signatures? Terminology again.
L188-189: A single oligo in a primer set might be 20 bp. 90% of the length of that oligo is 18 bp, and 90% identity over those 18 bp requires 17 identical bases, so up to 3 mismatches are tolerated overall. For such small oligos, it makes more sense to specify the maximum number of mismatches directly. The criterion also says nothing about the placement of those mismatches. As is well known, specific base patterns at the so-called GC-clamp 3' end of primers, or at the 5' end of probe sequences, can have an outsized impact on hybridization and fluorescence. I hope the Methods section delves into more detail here and assuages my concerns.
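The arithmetic above can be made explicit with a small sketch of my own (not the authors' implementation), which treats unaligned bases as worst-case differences:

```python
import math

def max_mismatches(oligo_len, len_frac=0.9, ident_frac=0.9):
    """Worst-case differences tolerated by a '>= ident_frac identity
    over >= len_frac of the length' rule, counting unaligned bases as
    differences. A sketch of my arithmetic, not the authors' code."""
    aligned = math.floor(oligo_len * len_frac)   # bases that must align
    matches = math.ceil(aligned * ident_frac)    # identical bases required
    return (aligned - matches) + (oligo_len - aligned)

# A 20-mer: 18 bp must align, 17 of those must match -> 3 differences.
```

Stating the criterion directly as "at most N mismatches per oligo" would remove this ambiguity.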
L190: It doesn't make sense to say that the assays didn't have perfect matches to the primer sequences. They are the primer sequences! Better described as "none of the oligos from either primer set had 100% identity to their corresponding regions within the amplicon region of the target genomes."
L191: The previous page described the 90% identity over 90% of the length as computed once over all of the primer and probe sequences. On this line the authors write that an average is taken from the individual oligos. This should be clarified.
The legend raises two questions. First, what does "None" mean; was there really not even a trace of a possible amplicon in these genomes? Second, what happened to the 90%-of-length criterion? It was described earlier, yet it is absent here. And what is the definition of "no alignment at all"? Not a single oligo aligned? No amplicon? An amplicon but no probe sequence? These are distinct failure modes that should be described and distinguished.
L207-208: The target genome set is manually defined.
L210: This can be stated more simply: it is the NCBI reference sequence for this species.
L218: What does 'step size of 1' mean? Is it the same as a 'sliding window of 1 bp'? A consistent description would be helpful. And the phrase "split into 250 k-mers" is confusing, because it can be read as "there are 250 k-mers" rather than "split into 250-mers".
L218: Contigs, signatures, k-mers: again, the terminology is confusing. The term 'k-mer' is now overloaded, meaning 50-mers in one part of the paragraph and 250-mers in another part of the same paragraph. I would strongly recommend separate terms for clarity.
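For what it's worth, my reading of "split into 250-mers with a step size of 1" is a sliding window advancing 1 bp at a time; a toy sketch of my own, with the window size shrunk for illustration (the paper uses 250):

```python
# My reading of "split into 250-mers with a step size of 1": a sliding
# window advancing 1 bp at a time. Window size shrunk to 5 here for
# illustration (the paper uses 250).

def windows(seq, size, step=1):
    """All windows of the given size, advancing `step` bp each time."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]

region = "ACGTACGT"            # a hypothetical merged conserved region
wins = windows(region, 5)      # len(region) - size + 1 = 4 windows
```

If that reading is correct, saying so in one sentence (and reserving 'k-mer' for the 50-mer seeds) would resolve the ambiguity.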
L223: Good table. The distinction between "Conserved" and "Signature" should be formally defined earlier in the pre-print.
L227: It seems that the pipeline steps are described thrice: once at the beginning of the Results section (a high-level overview), again here where the results of each step are described, and presumably in the Methods section. If the authors feel the need to write a mini-Methods section at the beginning of the Results section, that is a sign that the paper should be restructured into a logical flow in which the methods are described before the results of those methods. That would also reduce the amount of redundancy in the pre-print.
L227: Primer3 has dozens of input parameters. The parameters used for this study should be enumerated in the supplementary materials.
L230-231: Here is an example of where the passive voice makes manuscripts unclear. Who assigned the penalty score? Primer3 outputs a few different penalty scores. Or did the authors create their own?
L236: Should clarify that the amplicon sequences shown here were taken from the BOMV reference genome.
L239: The "Penalty Points" score is not defined. Again, passive voice: generated by whom, Primer3 or the authors?
L254: The use of both NCBI Blast+ and GLSEARCH should be explained. Presumably Blast+ serves as a fast heuristic search, and GLSEARCH is then used for a non-heuristic, optimal alignment (which is computationally more expensive). Both commands have parameters that should be described in the supplementary materials.
L255: It is not clear what is meant by "assay amplicon sequences". Is this the amplicons of all BOMV genomes, or just the amplicon from the reference genome? Was any low-complexity masking of the reference database or the input sequences performed? Without it, the Blast+ step can produce many off-target hits.
L259: It is curious that the acceptance criteria do not include anything about the probe sequence's strand orientation relative to the forward and reverse primers.
L262: The NCBI Taxonomy DB identifier is abbreviated as "ID", but in the tables the "Identifier" column has a different meaning, and the column with an NCBI Taxonomy DB ID is called "Targets". I think these need to be made more clear. I'd propose "Taxon" for the column name, and "TaxonID" for the abbreviation of NCBI Taxonomy Database identifier.
L265: True negatives are not well-defined here. The definition chosen is one possibility, but others exist, such as all sequences not within Taxon 2010960. In search domains like this one, where the TNs are ill-specified and vastly outnumber the positive predictions, precision and recall avoid the need to enumerate TNs at all; summary metrics built on them, such as the F1 score, or the Matthews correlation coefficient, handle the large class imbalance inherent in search problems.
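A toy numerical example (hypothetical counts, not from the pre-print) of why specificity is uninformative here while precision-based metrics are not:

```python
# Hypothetical counts (not from the pre-print) showing why specificity
# is uninformative when true negatives vastly outnumber positives,
# while precision-based metrics expose the false-positive problem.
tp, fn = 90, 10
fp, tn = 50, 1_000_000        # huge, loosely defined negative space

sensitivity = tp / (tp + fn)  # 0.90
specificity = tn / (tn + fp)  # ~0.99995: looks near-perfect
precision   = tp / (tp + fp)  # ~0.64: reveals the 50 false positives
recall      = sensitivity
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```

However the TN set is defined, specificity will sit near 1.0; precision, recall, and F1 tell the reader something actionable without requiring the TNs to be enumerated.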
The partial-match problem is actually distinct from the classification problem and should be separated out and reported on its own. Otherwise one is not comparing apples with apples (e.g., the definition of a TP hit is not symmetric with the definition of an FN hit), and the two issues are conflated.
It also occurred to me that while the algorithmic description suggests the authors were worried about performance (e.g., k-mer matching and a heuristic Blast+ search), no performance information is reported: runtime, number of cores utilized, RAM and disk consumption, etc. The only mention is on page 3, line 47: "all within hours". It is not clear what is taking all of that time.
There is no demonstration that the conserved-region and signature-region heuristics work well in general; no validation is presented.
L303-305: Here the authors write that a large set of candidate primer sets is generated and sorted by some penalty score (still not clear which), and then manually down-selected so as to be spaced out across the genome. It is not clear why this is done. Does genome coordinate correlate with primer-set performance as evaluated by PSET? This seems to be premature optimization; why not search all 330 primer sets across NCBI GenBank?
L332: The text says 145 genomes in NCBI Taxonomy, yet 96 were downloaded from GISAID. Usually there is a lag between a sequence being uploaded to GISAID and it making its way to GenBank. Why are there more in GenBank?
L372-373: "rapidly designed", "obtains results quickly": no data are presented here to support these descriptions.
L374-375: This algorithm does contain heuristics, such as the k-mer matches and the use of Blast+ as a heuristic filter for finding NCBI GenBank entries to scan with GLSEARCH. The choices of 50-mers and 250-mers are not defended, nor is any validation data presented to justify these constants. They are most definitely heuristics.
L376: "objective penalty scoring system": this system is not described, so it is hard to judge whether it is objective.
L321-382: "dedicated large RAM system": more details on the system belong in the body of the pre-print, as "large RAM" is in the eye of the beholder. There is a table at the very end describing the large machine used; it should be introduced earlier, in the Results section.
L382: I am not sure what is meant by "discrete logins"; logins cannot be continuous. Perhaps an alternative like "unique" or "distinct" would be clearer?
L384: "immediate": only if you can synthesize your oligos in-house!
L402-403: The WHO had several COVID-19 assay kits listed on their site, even back when this manuscript was released. There are more national labs listed on their website than just Germany and the CDC.
L407: "pan assays". 'Pan' in terms of what? It was just mentioned that the primer sets do not cover bat and pangolin coronaviruses related to SARS-CoV-2.
L409-411: This analysis pipeline should have been described in greater detail in the Results section; it does not make sense to introduce it for the first time in the Discussion.
L418: Need data to support "rapidly" description.
L424: MCMs are defined on page 3 to include more than just diagnostics, so shouldn't be equated with diagnostic primer sets.
L445: "various k-mer lengths". Previously just 50-mers were mentioned. This needs to be made clear.
L447: The authors need to present data to support their claim of "rapid" performance.
L449: There are heuristics being used in this pipeline (see above).
L453-454: The parameters used by Primer3 are important for assessment of how the primers were screened for common design flaws. This topic is not adequately addressed in this manuscript.
L445-456: There is no description of the penalty scoring system. This needs to be addressed.
L469: Unique, or distinct, logins
L475: The authors first need to establish the context in which this software runs. Is it open source? Proprietary? Can it be used by people outside the organization, or is it strictly for internal use? Is there a way to license it? If there is no way for anyone other than the authors to use this software, it does not make sense to describe it in a pre-print.
L478: Database accessions are not numbers, they are identifiers.
I see no indication that this report failed to adhere to high standards of research integrity.
Conflicts of interest
I have a bioinformatics consulting business that works with clients to develop qPCR primer sets. I am also currently developing an open-source bioinformatics pipeline for both evaluating primer sets and generating them /de novo/. I have also posted reports on primer design problems found among the primer sets designed by the WHO-affiliated national labs.
[Please select one of these categories. We appreciate a single box does not reflect sufficient nuance or detailed scrutiny. However, it will greatly help COVID-19 researchers to screen and prioritize new COVID-19 papers.]
[Incremental. As best as you can assess, this paper validates or summarizes other research to-date and does not promise any substantial breakthrough/s in COVID-19 research.]
I would add that while this is an incremental step, it is novel and interesting work.
© 2020 the Reviewer (CC BY 4.0).