Review of Torix Rickettsia are widespread in arthropods and reflect a neglected symbiosis

Content of review 1, reviewed on September 21, 2020

This study relies heavily on secondary data usage, identifying the presence of Rickettsia symbionts in host samples using discarded data from the BOLD database. This is great, and we should have more studies like this. However, the authors largely fail to discuss the limitations of their study that come from secondary data usage: for example, the lack of control for cross-contamination of samples, the possibility of incomplete taxon sampling, and other biases in the underlying database. In particular, they did not carry out a comprehensive analysis of batch effects to ensure that samples were not systematically contaminated in data deposited by one organization. I also have significant concerns over the lack of detail in the methods and not having access to the multiple sequence alignment used.

Other concerns/criticisms I had include:

There are no methods for how samples were binned in Figure 1, either in the manuscript or in the figure. For example, how were bacterial contaminants v. non-bacterial contaminants determined? Was it a BLAST search? If so, what were the criteria? I suspect, based on the results presented in Figures 2 and 3, that the criteria were not stringent enough.

Line 154: Phylogenetic placement does not demonstrate these are of microbial origin. If I put a random sequence into the multiple sequence alignment, it would align and it would be in the phylogeny, by nature of the methods. Nothing about the tree or the topology suggests that didn't happen. In fact, some of the long branches may indicate that it did.

Since COI is derived from the mitochondrial genome, which is a microbe, language about "microbial origin" needs to be fixed throughout. Many consider organelles to still be microbes. If nothing else, their sequences (including COI) are of microbial origin.

The letters in Figure 2 are supposed to be the Wolbachia supergroups, but their placement seems quasi-random. The sequences don't appear to be assigned to supergroups. If their placement corresponds to representative sequences, please specify that this is the case, and make clear what the representative sequences are and where they are on the tree. Regardless, the phylogeny shows issues with very long branches around "A" from around 7 o'clock to 9 o'clock if the phylogeny were a 12-hour clock. This is peculiar. Is this an artifact of the tree rendering? Or the outgroup selection? Or some other problem, like the presence of Wolbachia lateral gene transfers that are no longer under selection? Or were sequences included in the analysis that aren't really from bacteria, making this a methodological artifact?

In general, there is no discussion or acknowledgement of the extensive literature on bacterial DNA integrations in host genomes, which for Wolbachia is extensive.

How much support is there for branches/nodes in the tree? I can see bootstrapping in the methods, but I don't see any indication of bootstrap support.

The multiple sequence alignment and unmodified phylogenetic files need to be made available to the reviewers and the readers either as online supplementary material or in a public repository with a permanent DOI.

Lines 215-227: using the term "prevalence" is not correct. You do not know the full extent of prevalence of any of these organisms, since you weren't targeting them with more specific primers and rigorous sampling. It is easy for this to be misconstrued, and alternate terminology is needed.

Line 224: "indicating". There are other explanations as well, so I think using the word "suggesting" is more appropriate.

Line 235: The statement is too definitive for the data used. Yes, the stated p-value may be significant, but the statement and conclusions do not take into account the significant sampling bias in the SRA. In addition, when I do the Fisher's Exact test I get 0.0550, which is not significant. The methods for the Fisher's Exact test and a summary of the matrix are missing. My two-by-two matrix that yields a p-value of 0.0550 used presence/absence in the taxa in the table:

                         Aquatic   Terrestrial
Has Torix Rickettsia     9         7
Does not have            49        107

Intuitively, it isn't surprising that it wouldn't be significant: the difference is 20% v. 10%, with more limited sampling of one than the other and low levels of detection overall.
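The check described above can be reproduced with a few lines of code. The following is a minimal sketch of a two-sided Fisher's exact test using only the Python standard library; the function name is illustrative, and it assumes the common two-sided definition (summing the probabilities of all tables no more likely than the observed one), which may or may not be the variant used in the manuscript.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return Fraction(comb(col1, k) * comb(n - col1, row1 - k), denom)

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return float(sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs))


# Matrix from the table above: rows = has Torix Rickettsia / does not have;
# columns = aquatic / terrestrial.
p_value = fisher_exact_two_sided(9, 7, 49, 107)
print(f"p = {p_value:.4f}")
```

Under this definition the printed value should land close to the 0.0550 quoted above, so the disagreement with the manuscript is more likely to come from how the matrix was built than from the test itself.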

Lines 300-301: what were the minimum criteria to say that a taxon has it? Merely a COI sequence? Or more? Given cross-contamination of sequencing projects and other issues, it seems that you need more than just the COI sequence in the BOLD database. Making this clear here is important to the discussion and interpretation of results.

Line 310: I'm not sure I agree with your logic. It might be that they fail because of Rickettsia or other bacterial DNA replication.

Line 329: these conclusions seem premature given the data presented, since bootstrap support values are missing in the version reviewed.

Please check the legends in the additional files. I think Additional File 3 has a legend stating it is "Additional File 2". Likewise, Additional File 2 has a legend stating it is "Additional File 1".

Declaration of competing interests: I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

Reviewer #1

My only concern is that Torix group Rickettsia and their relatives have also been identified in protists, such as nucleariid amoebae. So I wonder how many of these Rickettsia, particularly in aquatic hosts, are symbionts of protists residing in animal guts. Have the authors tried to pull out protist 18S sequences from the SRA datasets (or tried to amplify protist genes via PCR, although that would be much more difficult)?

We thank the reviewer for this insight, with which we agree. phyloFlash analysis retrieved 16S (microbe) and 18S (eukaryote) sequences for each SRA dataset where present, and we have now included this information on the FTP server under the directory name “phyloFlash html files”. One instance of an assembled parasitoid 18S rRNA sequence was found in dataset ID SRR6313831 from Bemisia tabaci. However, a true B. tabaci-Rickettsia endosymbiosis has already been confirmed through FISH imaging (Wang et al. 2020; doi:10.1111/1462-2920.14927), suggesting the parasitoid is likely not responsible for the presence of Rickettsia in this case.

Protist sequences were also identified in some of the SRA datasets but these were a significant minority of reads compared to Rickettsia reads (doi:10.6084/m9.figshare.12801140). Intriguingly, one of the highest numbers of protist reads came from our previous study (SRA dataset SRR5298327) which was shown by FISH to be a true endosymbiosis between insect and Rickettsia (Pilgrim et al. 2017; doi:10.1111/1462-2920.13887). Overall, these data suggest that detecting contamination from Rickettsia-infected protists or parasitoids is uncommon. This new information has been added on lines 274-281, 355-364 and 576-578.

Minor comments:

Line 194 - Psyllidae spelling

Line 242 & Table 2 - Chaoboridae spelling

Line 251 - Simulium spelling

Spellings of these taxa have now been rectified.

Line 340 - I would replace refs 49 and 50 with Gehrer & Vorburger, Biol. Lett., 2012

The references have now been changed per the reviewer’s suggestion.

Line 362 - this sentence is confusing because the citations refer to Rickettsia in the bellii group

For clarity, the sentence has been changed to specify that the references refer to the bellii group only (line 417).

Table 2 - Siphonaptera spelling

Line 819 - Parentheses spelling

These spellings have now been changed.

Reviewer #2

Abstract 38, 42-43: the introduction of the "aquatic hotspot" hypothesis and that the results were supporting this hypothesis was very appealing (l38), yet this was not addressed in the conclusion, which instead claimed that Rickettsia was associated with a number of habits (l42-43). As these habits were not linked to aquatic, and not introduced previously in the background, the logic flow here is rather difficult to follow.

We thank the reviewer for flagging this. We have now changed the conclusion of the abstract to show that new hotspots of infection were revealed as well as confirming a bias towards aquatic insects (lines 44-47).

69: Rickettsia has been estimated as being present in 20-24% of species. One would be very interested in learning whether this is confirmed/disproved by the findings of the current study. Which part of the experimental design is set to answer this question? If none, what needs to be done to get a better idea?

The 20-42% prevalence figure for terrestrial arthropod species is derived from model-based estimation techniques which assume populations infected have a minimum of 1/1000 individuals infected. Thus, our figure of ~9% from the targeted PCR screen is likely lower due to small within-species sample sizes. This has been highlighted in lines 366-370.
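The sample-size argument can be made concrete with a short calculation. This is a minimal sketch, assuming independent sampling of individuals and a perfectly sensitive PCR assay (both simplifications that are not stated in the response); the function name is illustrative.

```python
def detection_probability(prevalence, n_sampled):
    """Chance that at least one of n_sampled individuals is infected."""
    return 1.0 - (1.0 - prevalence) ** n_sampled


# Within-species prevalence of 1/1000, the minimum assumed by the model-based estimates.
for n in (1, 5, 20, 100):
    print(n, round(detection_probability(0.001, n), 4))
```

At a within-species prevalence of 1/1000, even 100 screened individuals leave most infected species undetected, which is consistent with a targeted screen returning a lower figure than model-based estimates.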

79-88: It might be a good idea to add something here about the diversity of subgroups of Torix. The results later on revealed two subgroups (Leech and Limoniae), but are these good representatives of the diversity within Torix? How many subgroups are already known?

Previous studies on Torix Rickettsia have highlighted two subgroups: “Leech” and “Limoniae”. This was initially based on limited phylogenetic markers, but by extending the analysis to multiple markers we confirm in this study that a majority of Torix strains fall into these two subgroups. We have highlighted this on line 85.

90-102: The use of the terms Rickettsia CoxA, COI, and Rickettsia COI is confusing. If Rickettsia CoxA and Rickettsia COI actually refer to the same Rickettsia gene, the term needs to be standardized.

We thank the reviewer for making this point. We agree that terms should be standardised as much as possible. Therefore, we have removed any reference to ‘CoxA’ in the manuscript.

106: does the "template" here refer to DNA extract/aliquot? "Template" in the context of DNA template is primarily used in the description of amplification reaction, which doesn't seem to be the case here. This term is somewhat confusing. As you used "DNA extract" later in the text, I would suggest that these terms be unified.

The term “template” has been swapped for “DNA extract” throughout the manuscript.

109: "function more broadly" here is also vague. Do you mean that the primers used in these PCR assays are more degenerate or specifically designed to target Rickettsia genes? Please clarify.

The primers function more broadly as they were designed from our previous work based on Rickettsia genomes from multiple clades, including the first available Torix genome. This information has been removed from the introduction and is instead clarified in the data description (lines 153-155) and methods (lines 478-480).

123-125: "...deemed as contaminant sequences as a result of not matching initial morphotaxa assignment". I don't think that this is entirely accurate. A significant proportion of barcodes in BOLD are not matching initial morphotaxa assignment, at varied taxonomic levels. These include mis-identification, ambiguous/unstable taxonomic status, lab contaminations, etc. I would assume that BOLD uses an algorithm to confirm the sequence as being contaminants, only when they are matched to the most common non-target contaminants, e.g., bacteria, human etc.

We thank the reviewer for their comment. Yes, this dataset contained both contaminant sequences and misidentified taxa, and we have now changed the wording of this sentence to reflect this on lines 130-132 and in Figure 1. Information on how contaminants were confirmed as bacterial is also now described in lines 450-465.

125-128: the term "specimens" needs to be clarified. Do these include those that didn't yield a DNA sequence?

Yes, this included some specimens where barcoding had failed to yield a DNA sequence. This has now been clarified on line 126.

142: Explain the targeted PCR Rickettsia screen. Does it employ specific primer sets designed for Rickettsia? Although this was described in the method section, a brief explanation of the method would help readers to understand the context.

Yes, as mentioned above, the primers function more broadly as they were designed from our previous work based on Rickettsia genomes from multiple clades and including the first available Torix genome. This has now been clarified in lines 153-155 and 478-480.

149: Should "Analyses" be "Results"?

The formatting of GigaScience uses “analyses” in place of “results”.

160-161: "further unique bacteria contaminants were also detected", where are these results? Please cite.

These results have now been added in Additional file 1 (graphic representation of taxonomic classification as bacteria) and the FTP server file “Kaiju_misc_bacteria_detection” (sequence information). These were sequences flagged as bacterial by the bioinformatics tool Kaiju (lines 173-176).

167-170: if the BOLD results do not seem to support the aquatic hotspot theory, why?

Both the BOLD and SRA datasets have inherent biases which make them unsuitable to assess whether Torix Rickettsia are more common in aquatic or terrestrial biomes. For example, most SRA submissions are from lab-reared terrestrial insects. Likewise, a majority of the specimens from BOLD containing Rickettsia have limited taxonomic/ecological information, by virtue of not returning an mtDNA COI sequence. Therefore, a PCR-based study targeting both terrestrial and aquatic taxa was implemented in order to specifically test this ‘aquatic hot spot hypothesis’ (lines 149-158).

170-172: the predominance report of Rickettsia from Canada seems meaningless, given the strongly biased sampling in BOLD (supplementary Fig. 1)

The authors agree. This has now been removed.

180: this is confusing, does it mean that the Torix sequence is identical to that of C_LepFolR at the 3' end? Or does it have a SNP but different from that of other bacteria?

The Torix sequence has a SNP at the same site as all the other Wolbachia/Rickettsia genomes compared to C_LepFolR at the 3’ end. However, all the Wolbachia/Rickettsia genomes assessed apart from the Torix Rickettsia have a SNP at the 3’ priming end for C_LepFolF. For clarity, this can be viewed in Additional file 4.

185: How were these 186 Rickettsia-containing samples selected from 753 samples?

These DNA extracts were chosen based on assorted geographic location, host order and diverse phylogenetic placement. This has been clarified on lines 196-198.

192: So how many subgroups of Torix are known? How well do the findings represent the diversity?

As noted in a previous reply, to date only two subgroups of Torix Rickettsia have been uncovered: “Leech” and “Limoniae”. This was initially based on limited phylogenetic markers, but by extending the analysis to multiple markers we confirm in this study that a majority of Torix strains fall into these two subgroups. We have highlighted this on line 85.

207: define attempted barcodes

In this context, an “attempted barcode” is an attempt to retrieve a mtDNA COI barcode from the approximately 185,000 arthropods in the study. As mentioned above and indicated in figure 1, not all DNA extracts produced a COI sequence to interpret. Now that the term “specimen” has been clarified on line 126 we have replaced “attempted barcodes” with “specimens” to avoid confusion.

211: Here you used "genomic extracts", is this equivalent to "template"? Try to standardize terms.

We have standardised terms to only “DNA extracts” throughout the manuscript.

217: again, why BOLD taxa with the most presence of Rickettsia NOT associated with aquatic lifestyle? 233-235: why did the comparison between aquatic/terrestrial arthropods only consider the targeted Rickettsia screen results, NOT that of SRA search?

We refer the reviewer back to our earlier response (167-170) to address both of these points.

269-270: This is somewhat misleading. This might imply that these two groups of bacteria co-occur in the same organisms, and that the amplification of R is easier than W. I don't think the current experimental design is able to prove or disprove this possibility.

The wording has now been changed on lines 310-312 to avoid this confusion.

308-310: we know that there are many other possibilities that might cause barcoding failure. At least provide some alternative causes to avoid biased argument.

We have deleted this argument from the paragraph.

415-416: what are the exact criteria when choosing these DNA templates?

This point has been addressed above (reviewer comment 185).

428: does "linear" mean non-recombined sequence?

In this context, “linear” refers to a parameter of the recombination detection program which refers to the sequences not being circular.

438-439: does this mean that the hosts were NOT identifiable by morphology?

That is correct, the metadata provided for specimens before barcoding is a general morphological classification usually down to the order level. Subsequently, more refined classification can only be achieved from the mtDNA barcode. This has been highlighted on lines 501-504.

459-461: What if the sequence was matched to more than one barcode at >98% identity?

This did not occur.

489-497: Please provide more details on the analysis of phyloFlash, e.g., parameters used. I am a bit concerned about the assembling process employed here. 16S assembling can be difficult/impossible when metagenomics data contain more than 1 bacterial species or multiple variable copies of 16S, both of which might be the case for Rickettsia.

Default parameters were used for phyloFlash (lines 567-578). phyloFlash uses a combination of SPAdes and BBMap to assemble SSU rRNA and references a curated database (SILVA). The BBMap cut-off for identification is a minimum identity of >70%, and phyloFlash recommends SPAdes as the best method for cases where there may be a lack of close relatives in the reference database. The recent paper (Gruber-Vodicka et al. 2020; doi:10.1128/mSystems.00920-20) goes into further detail about chimeras, false positives and dataset preparation. While the defaults do what they can to minimise the risk of false positives, this risk cannot be entirely eliminated.

We have attempted to address this by flagging the instances where Wolbachia sequences or other symbionts were also found in the phyloFlash notes, though these sequences were not always assembled. This information can be seen in the phyloFlash html files on the FTP server.

Table 1: for species without a definite identification to the species level (e.g., Pachycrepoideus sp.), do we know that all specimens analyzed here actually belong to the same species? I assume this can be confirmed using barcodes.

Some arthropods without a definite identification were referred to as “sp.” because barcoding was not successful or did not match any known species in the database (lines 546-547).

Figure legends for Figs. 2 and 3: the term "No colour" is misleading. I thought these would refer to those without any background colors (e.g., Rickettsia lineage in Fig. 2).

We have removed the term “no colour” from the legend.

Fig. 2: So all Rickettsia in this tree were not from non-BOLD reference (says the Fig legend)? If the number in parenthesis represent the number of sequences, why is there only a single tip for Rickettsia? Are they collapsed? If yes, does it mean that the genetic divergence within Rickettsia is much smaller than that within Wolbachia?

Yes, Rickettsia is collapsed and this is now mentioned in the legend (Line 890). Genetic divergence of Rickettsia is deliberately shown in Figure 3 (and Additional file 2) and not in Figure 2 for ease of presentation, due to the number of taxa in the phylogenies.

Fig. 5: Is the lineage distribution associated with methodology used in discovering these sequences (SRA vs. targeted PCR screening)? Provide statistics.

The SRA datasets contain more Bellii strains than the targeted screen, but this seems irrelevant because the two datasets cannot be reasonably compared. As mentioned above, the SRA dataset contains very few aquatic insects, with most depositions deriving from terrestrial and/or lab-cultivated insects. In contrast, the targeted screen represents mostly wild-caught arthropods, with a mixture of aquatic and terrestrial taxa. Consequently, even if specific lineages were shown to be associated with one of the two methods (SRA or targeted screen), it is just as likely that this is due to sampling bias as to other methodological biases. Thus, our conclusions are measured: 1) The BOLD screen demonstrates that Rickettsia (specifically from the Torix group) are overrepresented in barcoding projects and can help identify new hosts. 2) The SRA screen demonstrates that both Torix and Bellii clades of Rickettsia are common. 3) The targeted screen provides evidence to suggest Torix Rickettsia are more common in aquatic insects.

Fig. 6: Move the vertical bars representing Typhus, Transitional, Spotted fever, and Bellii, further to the right so that they are in line with that of Torix. My understanding is that these lineages belong to the same hierarchic level under Rickettsia.

We thank the reviewer for pointing this out and have changed figure 6 accordingly.

Reviewer #3

This study relies heavily on secondary data usage, identifying the presence of Rickettsia symbionts in host samples using discarded data from the BOLD database. This is great, and we should have more studies like this. However, largely, the authors fail to discuss the limitations of their study which come from secondary data usage. For example, lack of control for cross-contamination of samples, the fact that there may be incomplete taxa sampling, and other biases in the underlying database used. For example, they failed to do a comprehensive analysis looking for batch effects to ensure that samples were not systematically contaminated in data deposited from one organization.

We thank the reviewer for highlighting this. Although this study does use secondary data in the BOLD and SRA screens, our own primary dataset was generated via the targeted screen to prevent an overreliance on secondary data and of course its biases. Regarding the prospect of cross-contamination, this is unlikely for two reasons. 1) A majority of the multilocus profiles assessed from BOLD tend to give unique profiles which is reflected in our phylogenetic trees. Significant cross-contamination would tend to give identical strains. 2) If cross-contamination occurred between DNA extracts then it is likely that an mtDNA COI sequence would be retrieved (either from the original DNA extract or the contaminating one) rather than a Rickettsia COI sequence, as mtDNA is far more likely to amplify than Rickettsia when in competition.
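The first argument above, that cross-contamination would show up as identical strains, lends itself to a simple check. The following is a minimal sketch that groups specimens by multilocus profile and reports any profile shared by more than one specimen; the specimen IDs, locus labels and function name are hypothetical and are not taken from the manuscript's data.

```python
from collections import defaultdict


def shared_profiles(profiles):
    """Return groups of specimen IDs that carry an identical multilocus profile.

    `profiles` maps a specimen ID to a tuple of per-locus sequences or allele labels.
    """
    groups = defaultdict(list)
    for specimen, profile in profiles.items():
        groups[tuple(profile)].append(specimen)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical toy data, invented purely for illustration.
example = {
    "spec_A": ("locus1_a1", "locus2_a1", "locus3_a1"),
    "spec_B": ("locus1_a2", "locus2_a1", "locus3_a3"),
    "spec_C": ("locus1_a1", "locus2_a1", "locus3_a1"),
}
print(shared_profiles(example))  # [['spec_A', 'spec_C']]
```

Shared profiles are not proof of contamination, since closely related hosts can carry the same strain, but cross-referencing such groups against the depositing organization or plate would be one way to look for the batch effects the reviewer raises.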

Additionally, given the aforementioned biases of using secondary data, we have tried to be measured in our conclusions. Specifically, we are not trying to claim that the Rickettsia sequences discovered in these databases are completely representative of Torix hosts in nature. Rather, they allow for the discovery of new putative hosts, and through combining several methods there is an indication that Torix Rickettsia are more widespread than previously thought and are overrepresented in aquatic insects.

I also have significant concerns over the lack of detail in the methods and not having access to the multiple sequence alignment used.

Sequence alignments, tree files etc. should already be available to the reviewer via the data management team (in the FTP server) at the journal. If this is not the case, we are happy to reupload the relevant data.

Other concerns/criticisms I had include:

BOLD compares COI sequences to common contaminants (e.g. human, bacteria) using BLAST; details can be found in Ratnasingham and Hebert, 2007 (doi:10.1111/j.1471-8286.2007.01678.x). The designation of bacterial contaminants by BOLD, from the dataset containing 3,817 non-target sequences, was confirmed by the taxonomic classification program Kaiju, using default parameters. We took the sequences provisionally identified as bacterial before placing them phylogenetically with reference bacteria suggested by Kaiju. This has been highlighted in lines 450-465.

We have now included the usage of Kaiju, a software program designed to assign taxonomic classifications to sequences. All sequences in the alignment used to create Figure 2 were identified as bacterial except one, which was erroneously identified as eukaryotic and was later identified as Rickettsia on our phylogeny. Kaiju also allowed us to choose more specific reference sequences to include in our phylogenies. Aside from Rickettsia and Wolbachia, a significant minority of sequences formed a monophyletic clade with the order Legionellales. In addition, we have now also included mitochondria in the tree in figure 2 to further verify that the sequences are bacterial. This is discussed in lines 163-168 and 450-465.

With regards to long branches being problematic, Figures 2 and 3 were constructed as cladograms and not phylograms for neat presentation: branch lengths tell us nothing about clade designation. For transparency we have now included phylograms of figures 2 and 3 in Additional file 2 which demonstrate no long branches.

We thank the reviewer for noting this. “Microbial origin” references have now been removed and we now refer to “bacteria” to distinguish from mitochondria throughout the manuscript.

The supergroup letters are for individual sequences. This has now been noted in the legend of figure 2, with accession details for sequences also clarified as being available in additional file 10.

Regardless, the phylogeny shows issues with very long branches around "A" from around 7 o'clock to 9 o'clock if the phylogeny were a 12-hour clock. This is peculiar. Is this an artifact of the tree rendering? Or the outgroup selection? Or some other problem, like the presence of Wolbachia lateral gene transfers that are no longer under selection? Or were sequences included in the analysis that aren't really from bacteria, making this a methodological artifact?

As mentioned above, branch lengths do not say anything about genetic distance on cladograms. We have included phylograms in Additional file 2 for transparency and to show a lack of long branches within clades.

In general, there is no discussion or acknowledgement of the extensive literature on bacterial DNA integrations in host genomes, which for Wolbachia is extensive.

This has now been addressed in lines 352-355.

How much support is there for branches/nodes in the tree? I can see bootstrapping in the methods, but I don't see any indication of bootstrap support.

Bootstrap support is present on all trees in this manuscript and is graphically represented as black, white and grey circles in figures 2, 3, 4 and 5 and as coloured circles in figure 6. This is indicated in the top left corner of all figures.

As mentioned above, all of these files should already be available to reviewers via the FTP server of the journal.

“Prevalence” has now been changed to “frequency” throughout the manuscript when referring to the proportion of Rickettsia and Wolbachia deposits within the BOLD dataset.

Line 224: "indicating". There are other explanations as well, so I think using the word "suggesting" is more appropriate.

This has now been changed accordingly.

                         Aquatic   Terrestrial
Has Torix Rickettsia     9         7
Does not have            49        107

Intuitively, it isn't surprising that it wouldn't be significant: the difference is 20% v. 10%, with more limited sampling of one than the other and low levels of detection overall.

We appreciate the reviewer’s diligence in checking the Fisher’s Exact test. However, the matrix presented by the reviewer does not consider Rickettsia subgroup and fails to account for multiple rows containing the same species (be it from a different population).

Subsequently, when taking these factors into account, this is the matrix which was used in the submitted manuscript:

                         Aquatic   Terrestrial
Has Torix Rickettsia     9         5
Does not have            49        106

Note that only 5 Torix Rickettsia are present in this matrix for terrestrial species because 2 of the 7 Rickettsia positive strains from the terrestrial species are not from the Torix group.

Since submission of the initial manuscript, table 1 has been updated to reflect previously missing Rickettsia positives detected in 3 spiders. With the addition of these spider positives, there is no significant difference between aquatic taxa and terrestrial taxa (p=0.1038).

However, when considering insects alone, this results in a p value of 0.0131. When controlled for taxonomic group (not all insect orders are represented in terrestrial and aquatic pools) the p value is still significant at 0.025. Subsequently, we have now suggested that the aquatic hotspot for Torix Rickettsia appears to apply for insects but not invertebrates in general. It should also be noted that the within-species sample sizes of terrestrial taxa in this study are often greater than aquatic suggesting that p values are conservative (positives are more likely to be found with greater sample sizes).

Details of Fisher’s exact analyses have now been included in Additional file 7 and discussed in lines 245-261 and 554-564.

The issue of cross-contamination has been addressed in our first response to the reviewer. Of course, ideally to confirm a true endosymbiosis, direct visualisation of the symbiont in the host’s tissues is needed due to potential for the bacteria to come from ingested food or parasitism. However, previous studies have predominantly relied solely on PCR to identify putative hosts (as demonstrated in Table 2). To reflect this, we have changed the language accordingly to mention “putative hosts” where appropriate (lines 287, 296, 342, 389, 427). Additionally, we direct the reviewer to our response to reviewer 1, where we have screened SRA datasets to assess how likely contamination from ingested biota and parasitism is. Rickettsia-insertions into the host nuclear genome is also unlikely because all protein-coding genes from this study showed no signs of a frameshift, suggesting a lack of pseudogenization. Further, there are no well supported cases of Rickettsia inserts in the nuclear genome in the literature to date, a marked contrast to Wolbachia.
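A crude version of the frameshift screen mentioned above can be sketched as follows. This is an illustration only, not the procedure used in the study: it assumes bacterial coding sequences on the forward strand, uses the stop codons of the bacterial genetic code (translation table 11), and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the bacterial code (table 11)


def internal_stops(seq, frame=0):
    """Count stop codons in one reading frame, ignoring a terminal stop."""
    seq = seq.upper().replace("-", "")
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    return sum(codon in STOP_CODONS for codon in codons)


def has_open_frame(seq):
    """True if at least one forward frame is free of internal stop codons."""
    return any(internal_stops(seq, frame) == 0 for frame in range(3))


print(has_open_frame("ATGAAACCCGGG"))     # True: frame 0 has no stop codon
print(has_open_frame("TAAGTAAGTAAGAAA"))  # False: every frame contains TAA
```

For a gene fragment of several hundred bases, an indel-induced frameshift would normally leave no single frame open along the full length, which is why an uninterrupted frame is weak evidence against pseudogenization. It does not rule out a recent insertion that has not yet accumulated mutations.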

We agree with the reviewer that these points are important for the interpretation of the results and now mention them in lines 337-350.

Line 310: I'm not sure I agree with your logic. It might be that they fail because of Rickettsia or other bacterial DNA replication.

This argument has been removed from the paragraph.

Line 329: these conclusions seem premature given the data presented, since bootstrap support values are missing in the version reviewed.

We refer the reviewer to our previous response to bootstrapping.

Please check the legends in the additional files. I think Additional File 3 has a legend stating it is "Additional File 2". Likewise Additional File 2 has a legend stating it is "Additional File 1"

We thank the reviewer for flagging this. We have changed the legends accordingly.

Content of review 2, reviewed on December 02, 2020

I apologize for my previous comments about long branch lengths, etc. I am not sure why it didn't register in my brain that they were cladograms rather than phylograms. It really is quite obvious. Sorry. O_o

Thanks for making me aware of the ftp site as I was able to request the information from the journal. It isn't clear to me whether the journal would provide the ftp site to readers, but this data should be publicly available. Given GigaScience's commitments to open data science and the importance of these files to the conclusions of this study and manuscript, I would recommend putting them into Figshare if the journal is not going to host the data, particularly since you already have data for this paper on Figshare. (In fact, I think they would be better on Figshare than on the FTP site, although I did appreciate a good old-fashioned ftp site, with well organized files, long informative file names, and text readme files!)

The alignments seem to interchange N/n/-, but I checked the IQTree manual and IQTree treats them all the same. But I don't see how IQTree handles phylogenetic inference at positions with N/n/-, which it calls "unknowns". For example, in the multigene alignment, I don't believe there is any position without gaps, so presumably it handles them. I'm assuming the alignments I'm looking at are the ones fed into IQTree and not the ones coming out of MAFFT and before Gblocks, but it might be good for you to confirm using this example: looking at the alignment BOLD_just_Rciekttsia_COI_contaminants_alignment.fas, these two sequences have absolutely no overlap, and are significantly truncated relative to the complete alignment (and these aren't the only two with this problem). How is the algorithm dealing with this, and does it introduce any artifacts? Do you get the same result if you remove sequences like these and use only sequences where all positions of the alignment have a character (both by removing short sequences like this and also trimming the ends of the alignment)? Would it be better to do that?

BIOUG10973-G06_Diptera_Canada ---------attatatttgccatatttgctggaattgttggtgggttattttctgttatttttagattagaattagcaatgcctggtcatatattagctaattatcaactatataatgtattaattaccgctcatgcaataattatggtgtttttcatgattatgccagccttatttggtggatttggtaattactttgtaccaatttta------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Anopheles_plumbeus_C10-34 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ttttggctgcttgttcctgcttttatgttattaatgctttcggcttttgttgatggtggtgcaggtactggctggacgctttacccccctctaagtactctagtgggacatcctggggcagcagtcgatatggctattttaagtttacatattacagggctttcttctattcttggttcaatcaatatgattgttactatctttaatatgagaaccgatggtatggggttatttgaaatgcctctatttatttggtcaattttggttactgccttcttgttgatactagcaattccagtatta
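A check for pairs of aligned sequences that share no informative column, like the two above, could be sketched as follows. This is only an illustrative sketch: the function names and the toy alignment are hypothetical, and treating `-`, `n` and `N` as missing data follows the description given in this comment rather than any particular tool's internals.

```python
def covered_columns(aligned_seq, missing="-nN"):
    """Return the set of alignment columns holding a real character."""
    return {i for i, ch in enumerate(aligned_seq) if ch not in missing}


def non_overlapping_pairs(alignment):
    """List pairs of sequence names that share no informative column.

    `alignment` maps sequence name -> aligned sequence (equal lengths assumed).
    """
    names = list(alignment)
    cover = {name: covered_columns(alignment[name]) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if not cover[first] & cover[second]:
                pairs.append((first, second))
    return pairs


# Toy alignment (hypothetical): seq_a and seq_b occupy disjoint columns.
toy = {"seq_a": "acgt----", "seq_b": "----ttga", "seq_c": "--gttt--"}
print(non_overlapping_pairs(toy))  # [('seq_a', 'seq_b')]
```

The pairwise loop is quadratic in the number of sequences, which should still be manageable for an alignment of a few hundred taxa.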

How do double infections confound the results? Do they behave erratically in the cladograms, like a chimera would? Can this be a wider problem in interpreting the results? It seems like it would have to be, but you have chimeras you can use to examine this. Using those, is there a way to address this? With SRA data it seems like you should be able to look for sequence heterogeneity in the reads. Not sure about BOLD data.

Line 254-261: I think you need to add in a correction for multiple testing. You do at least two tests that are in the main text, but it sounds from the response to reviewer's comments that you did many more than that and are only reporting the ones that are P<0.05. However, you should report how many tests you did to find those two and adjust for multiple testing. Otherwise, if you do 20 tests, you would expect to have 1 that is "significant" (for more information on multiple testing: http://www.biostathandbook.com/multiplecomparisons.html). In addition, I think it is important for people to know what comparisons were done that were not significant as these are also results. Addressing multiple testing seems like an issue throughout. In the methods other statistical tests were clearly undertaken where there was multiple testing.

Line 354: Several, maybe many, of the Wolbachia integrations have no mutations or frameshifts, particularly in insects. Those with frameshifts and mutations are easier to find and identify as integrations such that the number of integrations without frameshifts and mutations is likely an underestimate, particularly given how many groups are still screening Wolbachia sequences out before assembling insect genomes. I have no idea how often that happens with Rickettsia, but it seems like it could, particularly as more groups use tools like blobplots.

Line 366-379: This section still has issues with respect to the study design being secondary data analysis. These lines are in the discussion, so it is the time to say things like on line 377 that the over-representation here in BOLD data (if that is the data you are referring to, because I can't remember which one was 17/19 and there is nothing here that clarifies) could be the result of an amplification bias: in not producing the host copy of the gene, amplifying the Rickettsia gene, or both. Those issues are profound in secondary data usage and need to be addressed head-on so that others who read the paper do not misconstrue the results. Likewise, the SRA data is not random, so I am not sure the statement on line 379-381 is correct, and at the very least it needs qualifications. If it is correct, you need to better argue in the manuscript why it is correct, like that you used a sampling scheme to reduce bias, or something like that. Personally, I think it is better to acknowledge the limitations than try to justify, as even if you have a sampling scheme, it can be biased. The PCR screen listed in the table of this manuscript seems biased from my quick look (e.g. an over-representation of mosquitoes).

Line 387-388: please provide more details. I don't remember reading that. Pointing to an exact result, for instance of how many strains of the same MLST type are in different insect orders, is necessary. It should have been in the results if it is in the discussion. In fact, if I look at the figures, the Wolbachia in Figure 2 actually seem to be grouped by insect host taxa at this level. The same is true for Figure 3 for the Rickettsia. There are a few interleaved colors, but without knowing more I'm not convinced that it can't be explained in another way (like a mite on a host or in the gut from a carnivorous insect or even a double infection); I also can't tell which ones are identical and which ones are just similar. But even if I should infer it from the figures, it should be reported in the results and I didn't find it there. Maybe you are trying to state it in the subsequent sentence if one assumes that all blood feeders are the same taxa and all phloem-feeders are the same taxa, but that isn't clear. (And at least blood feeding is a trait found in multiple diverse taxa). And once again, I'm left wondering if there is a sampling bias. Are mosquitoes over-represented in the database? In some of the tables they seem over-sampled. Blood feeders and phloem feeders are often well sampled, given their importance to human health and agriculture, respectively. But maybe more problematically, these results are being described but they are not clearly described in the results section. If I search for blood, I do not find any results that support this statement. When I search for phloem, there is a mention of them being found in phloem-feeding insects, but not that they are diverse, and I have no way of assessing that as a reviewer or reader. Yet, this result for phloem-feeding is also in the abstract as a taxon that is a hot-spot. I don't see it in the figures in a way that I understand (e.g. phloem-feeding isn't annotated). Additionally, there is no assessment here that convinces me it is a hot spot; there are no statistics to suggest it is overabundant, which would be required to be a hotspot (and any such statistics would need correction for multiple testing or some sort of FDR calculation).

"as previously described" Is this in a different manuscript, or earlier in the manuscript? I suspect this means earlier in the paper, but it needs to be clear. Was the same alignment method used with MAFFT and Gblocks? Same for ModelFinder? If it is and all ML trees were inferred the same way, I would recommend that you have a methods section that describes this one for all alignments, maybe concluding with what is different (like the model).

I've outlined some examples where the statements don't reflect what is presented in the results and the limitations of a secondary data analysis. But they are actually more numerous and pervasive than this. For instance, lines 39-41 and 41-43 in the abstract have these issues. Likewise, Line 161-162 should read "Torix Rickettsia is the most common bacterial contaminant sequence currently in BOLD, a major barcoding project". This change reflects that this only holds for what has been barcoded thus far, the fact that you need both failed host amplification and successful bacterial amplification, and that the biodiversity represented in such projects has its own biases. It is so pervasive, I am not sure I found all the instances. Honestly, I think the paper would really benefit from a large clearly labeled section that more explicitly deals with all the limitations of the study, so others do not misconstrue the results for years into the future. It would make the paper much stronger and definitely more rigorous.

Authors' response to reviews: Please find below a point-by-point response from the authors.

“The alignments seem to interchange N/n/-, but I checked the IQTree manual and IQTree treats them all the same. But I don't see how IQTree handles phylogenetic inference at positions with N/n/-, which it calls "unknowns". For example, in the multigene alignment, I don't believe there is any position without gaps, so presumably it handles them. I'm assuming the alignments I'm looking at are the ones fed into IQTree and not the ones coming out of MAFFT and before Gblocks, but it might be good for you to confirm using this example: looking at the alignment BOLD_just_Rciekttsia_COI_contaminants_alignment.fas, these two sequences have absolutely no overlap, and are significantly truncated relative to the complete alignment (and these aren't the only two with this problem). How is the algorithm dealing with this, and does it introduce any artifacts? Do you get the same result if you remove sequences like these and use only sequences where all positions of the alignment have a character (both by removing short sequences like this and also trimming the ends of the alignment)?”

The reasoning behind including sequences with missing character data in our alignments is based on previous work demonstrating that missing data in most cases should not decrease phylogenetic resolution for taxa with complete data (Wiens 2006, DOI: 10.1016/j.jbi.2005.04.001). To confirm this, we reran a modified alignment of the BOLD_just_Rickettsia_COI_contaminants_alignment.fas file by trimming ends by 50 nucleotides and removing any remaining truncated sequences to get rid of missing data (169 of the original 807 sequences removed), as suggested by the reviewer. In accordance with Wiens’ observations, the generated tree (see FTP file ‘BOLD Rickettsia trimmed.png’) placed taxa into the same designated groups as the phylogeny with missing data in Figure 3 (Designations can be found at DOI:10.6084/m9.figshare.12801107). Additionally, the study cited above ran simulations to show that highly incomplete data can be accurately placed in phylogenetic trees as long as at least 50% of sequences contain complete data. Furthermore, it is suggested that the inclusion of sequences with missing data can also sometimes be better than exclusion as this additional data can subdivide misleading long branches. We have now included the reasoning behind using incomplete data in the methods section (lines 480-483).
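The trimming-and-filtering step described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the procedure actually run for the response: the function name is hypothetical, the alignment is assumed to be held as a name-to-sequence mapping, and `-`, `n` and `N` are assumed to be the only missing-data characters.

```python
def trim_and_filter(alignment, trim=50, missing="-nN"):
    """Trim `trim` columns from both ends, then drop incomplete sequences.

    `alignment` maps sequence name -> aligned sequence (equal lengths assumed).
    A sequence is kept only if its trimmed core has no missing character.
    """
    kept = {}
    for name, seq in alignment.items():
        core = seq[trim:len(seq) - trim]
        if core and not any(ch in missing for ch in core):
            kept[name] = core
    return kept


# Toy alignment (hypothetical), trimmed by 2 columns instead of 50.
toy = {
    "full": "acacgtacgt",
    "short_ends": "--acgtac--",
    "internal_gap": "acac--acgt",
}
print(sorted(trim_and_filter(toy, trim=2)))  # ['full', 'short_ends']
```

Sequences that are only truncated within the trimmed margins survive, while those with gaps in the retained core are removed, which mirrors the "trim ends, then remove remaining truncated sequences" order described in the paragraph.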

“How do double infections confound the results? Do they behave erratically in the cladograms, like a chimera would? Can this be a wider problem in interpreting the results? It seems like it would have to be, but you have chimeras you can use to examine this. Using those, is there a way to address this? With SRA data it seems like you should be able to look for sequence heterogeneity in the reads. Not sure about BOLD data.”

Where double peaks were observed in 10/753 Rickettsia-associated taxa from BOLD, the base call was designated as ‘N’ (See FTP file ‘BOLD_multigene_Rickettsia_alignment.fas’). This prevents erroneous placement of chimeric strains on the phylogeny. For BOLD data we unfortunately cannot reconstruct trees by teasing apart the individual strains of the double-infections because we cannot know what phase the double-peaks are in.

The use of ‘N’ characters at double-peak sites could lead to potential problems in the interpretation of these 10 taxa at terminal branches of the phylogeny but the placement as Torix Rickettsia is not likely to be affected. Furthermore, these double-infections are a minority of the total taxa meaning their effects on interpreting results are likely to be minimal.

“Line 254-261: I think you need to add in a correction for multiple testing. You do at least two tests that are in the main text, but it sounds from the response to reviewer's comments that you did many more than that and are only reporting the ones that are P<0.05. However, you should report how many tests you did to find those two and adjust for multiple testing. Otherwise, if you do 20 tests, you would expect to have 1 that is "significant" (for more information on multiple testing: http://www.biostathandbook.com/multiplecomparisons.html). In addition, I think it is important for people to know what comparisons were done that were not significant as these are also results. Addressing multiple testing seems like an issue throughout. In the methods other statistical tests were clearly undertaken where there was multiple testing.”

Two Fisher’s exact tests comparing aquatic and terrestrial insects (one controlled for insect order and one uncontrolled) were detailed in the main text and Additional file 7, as these were the only taxonomically ‘matched’ pairs. However, one additional test was initially performed to compare terrestrial and aquatic invertebrates in general. It did not give a significant p-value because of a hotspot of Rickettsia in spiders, which are known to be a hotspot for all inherited symbionts tested to date (Wolbachia, Spiroplasma, Rickettsia: Goodacre et al. 2006, doi: 10.1111/j.1365-294X.2005.02802.x.; Cardinium: Duron et al. 2008, doi: 10.1111/j.1365-294X.2008.03689.x.). This detail has now been added in Additional file 7 and lines 266-269. Overall, only 3 tests were done (2 significant and 1 not significant), and this indicates that Torix Rickettsia are over-represented in aquatic insects but that this may not be the case for invertebrates in general.
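For readers who want to see what a multiple-testing adjustment over a small family of tests looks like, a Holm step-down correction can be sketched as below. This is an illustration only and was not part of the reported analysis: the function name is hypothetical, Holm's method is one choice among several, and the three inputs are the p-values quoted in this correspondence, assumed here to form the complete family of tests.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# p-values quoted in this correspondence: insects uncontrolled, insects
# controlled for order, and invertebrates in general.
reported = [0.0131, 0.025, 0.1038]
print(holm_adjust(reported))
```

The adjusted values can then be compared against whatever significance threshold is chosen for the family of tests as a whole.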

“Line 354: Several, maybe many, of the Wolbachia integrations have no mutations or frameshifts, particularly in insects. Those with frameshifts and mutations are easier to find and identify as integrations such that the number of integrations without frameshifts and mutations is likely an underestimate, particularly given how many groups are still screening Wolbachia sequences out before assembling insect genomes. I have no idea how often that happens with Rickettsia, but it seems like it could, particularly as more groups use tools like blobplots.”

We thank the reviewer for raising this issue. We have now put a caveat at the end of this sentence to indicate that despite no frameshifts or mutations, it is still possible the sequences from this study are host integrations (lines 376-378). The problem is likely to be smaller for Rickettsia than for Wolbachia, due to differences in the mode of vertical transmission. Wolbachia is present in the germline stem cell niche, such that DNA from the symbiont is available for incorporation into the germline. Rickettsia, in contrast, usually invades the egg after meiosis, through the follicular epithelium. Thus, Rickettsia DNA is much less present in the germline of insects, making integration less likely.

“Line 366-379: This section still has issues with respect to the study design being secondary data analysis. These lines are in the discussion, so it is the time to say things like on line 377 that the over-representation here in BOLD data (if that is the data you are referring to, because I can't remember which one was 17/19 and there is nothing here that clarifies) could be the result of an amplification bias: in not producing the host copy of the gene, amplifying the Rickettsia gene, or both. Those issues are profound in secondary data usage and need to be addressed head-on so that others who read the paper do not misconstrue the results. Likewise, the SRA data is not random, so I am not sure the statement on line 379-381 is correct, and at the very least it needs qualifications. If it is correct, you need to better argue in the manuscript why it is correct, like that you used a sampling scheme to reduce bias, or something like that. Personally, I think it is better to acknowledge the limitations than try to justify, as even if you have a sampling scheme, it can be biased. The PCR screen listed in the table of this manuscript seems biased from my quick look (e.g. an over-representation of mosquitoes).”

The “17/19 strains” being Torix is a reference to the targeted screen (not the BOLD screen), which was used alongside the BOLD data because of the aforementioned amplification biases; this has now been clarified on lines 401-402. Additionally, we have added a sentence to the results section explicitly quoting the 17 strains of Rickettsia found in the targeted screen (lines 248-250). Although 95% of Rickettsia amplifications from BOLD are Torix, we already mention that this is likely due to primer bias (lines 321-324). Consequently, the targeted screen is used in part to negate the problems of relying entirely on secondary data. Of course, many studies which aim to investigate the distribution of a symbiont will have sampling and methodological biases. However, having multiple screening strategies, as we have here, is likely to give a more nuanced and holistic view of Torix Rickettsia ecology. We believe that the combined use of several screening methods is a strength and not a weakness of the study. Despite this, we have now added a separate section detailing the limitations of the study (lines 358-388).

Specifically, regarding lines 379-381 (of the 1st revision), this statement is based not just on SRA data but also on the targeted screen from this study and Weinert’s study, as mentioned in the previous lines. Thus, the SRA is corroborating two separate targeted screens (one which lacked spiders and aquatic insects, demonstrating a high number of Belli infections, and another which included spiders and aquatic insects, demonstrating a high number of Torix infections). Consequently, for clarity we have now changed the statement “Our additional use of a bioinformatics approach based on the SRA appears to confirm that Belli and Torix are two of the most common Rickettsia groups among arthropods.” to “Our additional use of a bioinformatics approach based on the SRA appears to corroborate targeted screen data indicating that Belli and Torix are two of the most common Rickettsia groups among arthropods.” (lines 403-406).

“Line 387-388: please provide more details. I don't remember reading that. Pointing to an exact result, for instance of how many strains of the same MLST type are in different insect orders, is necessary. It should have been in the results if it is in the discussion. In fact, if I look at the figures, the Wolbachia in Figure 2 actually seem to be grouped by insect host taxa at this level. The same is true for Figure 3 for the Rickettsia. There are a few interleaved colors, but without knowing more I'm not convinced that it can't be explained in another way (like a mite on a host or in the gut from a carnivorous insect or even a double infection); I also can't tell which ones are identical and which ones are just similar. But even if I should infer it from the figures, it should be reported in the results and I didn't find it there. Maybe you are trying to state it in the subsequent sentence if one assumes that all blood feeders are the same taxa and all phloem-feeders are the same taxa, but that isn't clear. (And at least blood feeding is a trait found in multiple diverse taxa).”

The inferences related to similar strains in distantly-related hosts are best observed in the multigene tree in Figure 4 rather than the single-gene trees of Figures 2 and 3. For example, odonate strains are clearly interleaved between strains from other host orders. More specifically, the two Coenagrion strains have 100% identity to the Culicoides stigma strain, in contrast to two other odonate (Polythore) strains where multiple SNPs are observed at all loci (see FTP file ‘BOLD_multigene_Rickettsia_alignment.fas’). We thank the reviewer, as this was not mentioned in the results, but we have now included this on lines 209-211. Furthermore, regardless of exact MLST profiles for strains, taxa from most orders are represented in both Limoniae and Leech Torix subclades, indicating a lack of grouping based on insect host taxa. The authors believe this concept is better represented in a phylogeny rather than a list of MLST profiles.

“And once again, I'm left wondering if there is a sampling bias. Are mosquitoes over-represented in the database? In some of the tables they seem over-sampled. Blood feeders and phloem feeders are often well sampled, given their importance to human health and agriculture, respectively. But maybe more problematically, these results are being described but they are not clearly described in the results section. If I search for blood, I do not find any results that support this statement. When I search for phloem, there is a mention of them being found in phloem-feeding insects, but not that they are diverse, and I have no way of assessing that as a reviewer or reader. Yet, this result for phloem-feeding is also in the abstract as a taxon that is a hot-spot. I don't see it in the figures in a way that I understand (e.g. phloem-feeding isn't annotated). Additionally, there is no assessment here that convinces me it is a hot spot; there are no statistics to suggest it is overabundant, which would be required to be a hotspot (and any such statistics would need correction for multiple testing or some sort of FDR calculation).”

Mosquitoes are likely to be over-represented in the sequence read archive but as mentioned already in lines 142-144, a single dataset per species was extracted for analysis to negate oversampling of the same species. Although certain genera may still be oversampled, the only instance of mosquito Rickettsia being detected is in the Anopheles plumbeus population of the targeted screen. With regards to phloem-feeding insect strains, psyllids and other phloem-feeders are present in both Limoniae and Leech subclades suggesting again that strains are diverse within similar lifestyles. This is best seen in Figure 6 where both phloem-feeding and blood-feeding are annotated. The common patterns of infection in phloem-feeding bugs and blood-feeders are also already mentioned in lines 296-302 of the results. We agree that the common patterns or ‘hot-spots’ found in our data should come with caveats and we have now included this in the limitations part of the discussion where we clarify that common patterns of infection refer specifically to our datasets which although extensive, have some biases and may not completely represent Torix Rickettsia infection in nature (lines 358-371).

“"as previously described" Is this in a different manuscript, or earlier in the manuscript? I suspect this means earlier in the paper, but it needs to be clear. Was the same alignment method used with MAFFT and Gblocks? Same for ModelFinder? If it is and all ML trees were inferred the same way, I would recommend that you have a methods section that describes this one for all alignments, maybe concluding with what is different (like the model).”

This refers to methods described earlier in the manuscript and yes, the same methods were used for all ML trees. This suggestion is welcomed and has been included in lines 492-493.

“I've outlined some examples where the statements don't reflect what is presented in the results and the limitations of a secondary data analysis. But they are actually more numerous and pervasive than this. For instance, lines 39-41 and 41-43 in the abstract have these issues. Likewise, Line 161-162 should read "Torix Rickettsia is the most common bacterial contaminant sequence currently in BOLD, a major barcoding project". This change reflects that this only holds for what has been barcoded thus far, the fact that you need both failed host amplification and successful bacterial amplification, and that the biodiversity represented in such projects has its own biases. It is so pervasive, I am not sure I found all the instances. Honestly, I think the paper would really benefit from a large clearly labeled section that more explicitly deals with all the limitations of the study, so others do not misconstrue the results for years into the future. It would make the paper much stronger and definitely more rigorous.”

With regard to line 39-41 describing how our targeted PCR data supports the aquatic hotspot hypothesis we refer the reviewer to our response to the ‘multiple testing’ above. For lines 41-43, we have changed this sentence to include the caveat that this applies only to arthropod genome projects: “Furthermore, the analysis of 1,341 Sequence Read Archive (SRA) deposits indicates Torix infections represent a significant proportion of all Rickettsia symbioses found in arthropod genome projects.” We have also changed lines 161-162 to reflect a similar caveat for the BOLD data as suggested by the reviewer. As previously mentioned, we have now included a specific section in the discussion detailing the limitations of our datasets (lines 358-388).

References

Jack, P., Panupong, T., R., D. H., Stefanos, S., Matthew, B., V, Z. E., Sujeevan, R., R., D. J., R., M. C., Alex, S. M., D., H. G. D. Torix Rickettsia are widespread in arthropods and reflect a neglected symbiosis. GigaScience.

Pre-publication Review of

Torix Rickettsia are widespread in arthropods and reflect a neglected symbiosis

Reviewed On September 21, 2020 , and December 02, 2020
