Content of review 1, reviewed on March 14, 2016

I have a few remaining comments:

1. Usability

While I think the method is interesting, the implementation remains very difficult to use. After two hours of attempting to install and run the software (I am a proficient programmer in Python and R, but have zero Perl experience), I gave up. The installation remains complex for non-Perl experts, and the sparsity of the documentation does not help: it has been expanded somewhat, but it remains far too sparse to be useful to non-Perl programmers. Because of this, the utility of the tool for its end-users (presumably biologists with multi-locus datasets) is questionable. This is not something I see as a barrier to publication of the method - which is itself interesting - but since the primary focus of this paper is the software itself, it does seem to me to be an issue.

2. DOI

The authors provide no cogent reason not to provide a DOI for their software, and I don't know what the issue is here. Without a DOI (e.g. minted through Zenodo), there is no guarantee that the software will stay around, which is a problem both for reproducibility and for the general utility of the work. Given that link rot and lost or broken software are such huge problems in our field, and given that the primary focus of this paper is the provision of 'an open-source tool', I think it is important to properly archive a version of the software with a DOI here. Neither tagging versions on GitHub nor making a copy of the repository on Bitbucket guarantees persistence, but the ~10 minutes it takes to mint a DOI through Zenodo does. It means that, no matter what the authors decide to do with their GitHub repository, the copy of the code used for this manuscript will remain available and discoverable from the manuscript itself.

A side note: the authors state that they have tagged the current version of the software as 0.1, but there are no tags or releases in their GitHub repository. Tags and releases are specific mechanisms designed to help people get to particular versions of software: https://help.github.com/articles/creating-releases/ . Minting a DOI with Zenodo would solve this problem too, since Zenodo archives only tagged versions of the repository.

3. Simulations

Can the authors please provide data (in a figure) on the number of clusters returned by clusterflock for each of the simulated datasets, versus the number of underlying topologies that were simulated? It is not possible to extract this from the currently presented data, and it is an important part of assessing the accuracy of the algorithm on the simulated datasets.

4. Data availability

Please provide the output data from the simulations: specifically, the data that could be used to recalculate Figures 3 and 4, i.e. the identity of the simulated topology for each locus versus the topology to which clusterflock assigned that locus.

5. Discussion of performance

Figures 3 and 4 would benefit from having the expected Jaccard index under random assignment of trees to loci plotted. This way we could see which methods do no better than randomly assigning trees to groups. As far as I can tell, clusterflock with 50% missing data tracks the random expectation very closely (JI = 0.5 with 2 trees; 0.1 with 10 trees; 0.04 with 25 trees). This is in itself interesting: even with data for 50% of the species, clusterflock does not appear to gain any benefit over randomly assigning trees to groups. Can the authors comment on this particular case? It seems counterintuitive to me that with data for 50% of the species at each locus, the method gains no benefit over random assignment.
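
To make that baseline concrete, it could be estimated with a few lines of Monte Carlo, as in the Python sketch below. This assumes the quoted values correspond to the per-locus probability that a randomly assigned locus receives its true topology (roughly 1/k for k trees); if the manuscript's Jaccard calculation is defined differently, the appropriate variant should be substituted.

    import random

    def random_assignment_baseline(n_loci, n_trees, n_reps=10000):
        """Expected per-locus agreement when each locus is handed one of
        n_trees topologies uniformly at random (analytically 1/n_trees)."""
        total = 0.0
        for _ in range(n_reps):
            truth = [random.randrange(n_trees) for _ in range(n_loci)]
            guess = [random.randrange(n_trees) for _ in range(n_loci)]
            total += sum(t == g for t, g in zip(truth, guess)) / n_loci
        return total / n_reps

    for k in (2, 10, 25):
        print(k, round(random_assignment_baseline(100, k), 3))
    # prints values close to 1/k: ~0.5, ~0.1, ~0.04, i.e. the baselines quoted above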

More generally, can the authors comment on what it means (for biologists) that clusterflock achieves a JI of ~0.4 when there are 25 simulated topologies? If the algorithm correctly assigns loci to topologies less than half of the time in these simulations, what does this mean for biological inferences from the data? It seems from the simulated and empirical data that while clusterflock might be useful when the number of clusters is very small (e.g. <10), it might be much less useful with >10 clusters. For example, while the empirical test presented in the paper is compelling, the algorithm may be much less useful when there have been many recombination events (as might be the case in many empirical datasets, such as analyses of whole bird genomes from across the avian tree of life).

As above, some comparison with existing approaches to this problem is warranted here: if clusterflock does better than existing approaches (e.g. Concaterpillar, Conclustador), then that is great even if the absolute performance remains less than ideal, and biologists should prefer clusterflock because it makes the best available inferences. However, if clusterflock is consistently worse than other methods, then we know that it is a neat method that requires additional development before it is useful. In my opinion, knowing which of these situations applies would vastly strengthen the paper.

Level of interest
Please indicate how interesting you found the manuscript:

An article whose findings are important to those with closely related research interests

Quality of written English
Please indicate the quality of language in the manuscript:

Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

None.

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V2.pdf)


Source

    © 2016 the Reviewer (CC BY 4.0 - source).

Content of review 2, reviewed on August 21, 2016

The authors propose a method that uses a modified flocking algorithm to figure out how many trees are needed to represent a set of alignments of the same taxa. This is an interesting problem, and the proposed solution is a valuable contribution. Nevertheless, there are a number of places in which I think the study could, and perhaps should, be improved. I split these into two groups below: those to do with the manuscript and method, and those to do with the software.

Comments on the manuscript/method:

1. Only a passing mention is given to previous solutions to this problem. Given that various previous solutions exist, it would be useful for the reader to be given some comparison of the relative merits and shortcomings of those solutions, to motivate the current study. It would also be worth noting in the discussion whether and how this new method overcomes the limitations of previous methods. This seems like an important point for readers who are considering which tool to use.

2. There are no simulations. This seems like an important omission to me, because without simulations it is impossible to know when the method works well and when it does not. Although the authors present an analysis of one empirical dataset in which the algorithm appears to do roughly what it should, this is not sufficient to make robust judgements about the general performance of the algorithm. Thus, without simulations I would argue that the conclusions of the paper are not supported by the data; specifically, it is not possible to claim that 'we show that [clusterflock] is particularly well suited to isolating genes into discrete flocks that share a unique phylogenetic history'. Simulations could obviously take many forms, but a simple approach would be to consider 100 datasets with 1-100 trees underlying them. 100 loci could then be sampled across these trees and fed into the algorithm.

Repeating each simulation 10 times would require only 1000 analyses, and could give a quite detailed picture of the method's performance. Specific questions to ask would be: what is the false positive rate (i.e. how often is more than one cluster detected when there is only a single underlying tree)? What is the false negative rate (i.e. how often are genes with different underlying trees clustered together)? What are the detection limits (e.g. how much data is needed, and how different do two trees have to be, before the differences can be detected)? What aspects of sequence evolution can mislead the algorithm (e.g. rates of evolution; see below)? How does the ratio of the number of loci to the number of trees affect performance? This last point seems particularly important to address in a flocking algorithm: it is not obvious to me what will happen to trees that are represented by a single locus, particularly in the case where most trees are represented by a very small number of loci.
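
To make the suggested design concrete, a minimal Python sketch of the driver loop follows. The helpers simulate_random_tree, simulate_alignment, run_clusterflock and score_clustering are hypothetical placeholders for whatever tree simulator, sequence simulator, clusterflock wrapper and accuracy metrics the authors prefer.

    import random

    N_LOCI = 100
    N_REPLICATES = 10

    def simulate_random_tree():
        """Placeholder for a real tree simulator (e.g. a birth-death sampler)."""
        raise NotImplementedError

    def simulate_alignment(tree, rate=1.0):
        """Placeholder for a sequence simulator run along 'tree'; 'rate' scales branch lengths."""
        raise NotImplementedError

    def run_clusterflock(alignments):
        """Placeholder wrapper: run clusterflock and return one flock label per locus."""
        raise NotImplementedError

    def score_clustering(truth, inferred):
        """Placeholder for the chosen metrics (false positive/negative rates, etc.)."""
        raise NotImplementedError

    def simulate_dataset(n_trees, n_loci=N_LOCI):
        """Draw n_trees topologies and sample n_loci alignments across them."""
        trees = [simulate_random_tree() for _ in range(n_trees)]
        truth = [random.randrange(n_trees) for _ in range(n_loci)]  # true flock of each locus
        alignments = [simulate_alignment(trees[label]) for label in truth]
        return alignments, truth

    def run_simulation_study():
        """100 datasets (1-100 underlying trees) x 10 replicates = 1000 analyses."""
        results = []
        for n_trees in range(1, 101):
            for rep in range(N_REPLICATES):
                alignments, truth = simulate_dataset(n_trees)
                inferred = run_clusterflock(alignments)
                results.append((n_trees, rep, score_clustering(truth, inferred)))
        return results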

3. The design choices are described relatively thoroughly in the paper, but very few motivations for these choices are given. Thus, while I might be able to re-implement a similar algorithm by reading the paper, I have no idea why most of the choices were made. It would be good to explain the reasoning behind the decisions made when implementing the algorithm, because this would facilitate progress in this area.

4. The use of LD seems reasonable here, but it could also be misled by genes evolving at different rates, because higher rates will tend to exacerbate problems like long-branch attraction. Thus, under parsimony, a slow gene and a fast gene may have quite different most-parsimonious topologies. Given the vast differences in rates between many genes, this seems like a potential issue that could, at the very least, be explored with simulation, e.g. by simulating 100 genes on the same tree, where 50 evolve slowly and 50 evolve more quickly. By varying the rate ratio between the two classes of genes, one could determine whether this is an issue, and at what kinds of scales it manifests itself.
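
A rough sketch of that experiment, reusing the hypothetical placeholders from the sketch above and assuming the sequence simulator accepts a rate multiplier that scales all branch lengths:

    def rate_ratio_experiment(rate_ratios=(2, 5, 10, 20), n_loci=100):
        """All loci share one true tree, so clusterflock should ideally return a single
        flock; splitting slow and fast loci into separate flocks would indicate that
        rate variation, rather than topology, is driving the clustering."""
        tree = simulate_random_tree()
        for ratio in rate_ratios:
            slow = [simulate_alignment(tree, rate=1.0) for _ in range(n_loci // 2)]
            fast = [simulate_alignment(tree, rate=ratio) for _ in range(n_loci // 2)]
            inferred = run_clusterflock(slow + fast)
            print(ratio, len(set(inferred)))  # number of flocks returned at this rate ratio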

5. A simple question: could the authors include some information on the relative proportions of the runtime associated with different parts of the algorithm? I ask this because it is easy to think of other options (like calculating ML or NJ trees and then using any of a number of metrics of tree distance) that might improve accuracy but increase runtimes. However, without knowing the rate-limiting steps of the algorithm, it is not possible to know whether such improvements are worth even thinking about.

6. Following from point 5: given that the algorithm has to be run 100 times to get some idea of the robustness of the flocking, how does the aggregated runtime compare to other approaches to this problem, e.g. software such as Concaterpillar or Conclustador? The latter states that it is specifically designed to solve the same problem as clusterflock, so it seems worth comparing the two here. Note that I don't think it is necessary to do better than any other software - this is a very interesting approach that should be described regardless of whether it is better on any particular metric - but it does seem important to make some attempt to compare performance in terms of accuracy and speed.

Comments on the software:

1. The way that GitHub has been used is unconventional and inconvenient. The only way I could download the software was to download a whole collection of other pieces of software along with it. Please give this software its own repository. This will also facilitate future collaboration and development, since GitHub works fundamentally at the level of the single repository.

2. Please mint a DOI for the released version of the software with Zenodo or some other service. This ensures that the software will stay around even if the GitHub repository is deleted, and it also ensures that the manuscript refers to a persistent, tagged version of the software even if the repository remains and the software continues to be developed.

3. There are no tests in the software. In this case, tests seem rather vital: the paper describes clusterflock as 'an open source tool', so presumably the intention is that many others will use it. The simulations will form a useful set of tests on their own, and should be included in the repository along with a script that runs all tests and checks that they produce the expected results (the results do not have to be correct, but there should be some check that they are as expected). Given that the algorithm is stochastic, it might be useful to include an option to provide a random number seed in the code, in particular to facilitate testing. Unit tests would also be useful, to ensure that key functions behave as expected. As it stands, software with no tests does not inspire a great deal of confidence.
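
To illustrate the kind of test a seed option would enable, a minimal Python sketch follows; run_clusterflock_with_seed and the example_data/ path are hypothetical stand-ins for whatever invocation and test data the authors settle on.

    def run_clusterflock_with_seed(input_dir, seed):
        """Hypothetical wrapper: run clusterflock on input_dir with a fixed RNG seed and
        return the resulting flock assignments in some canonical, comparable form."""
        raise NotImplementedError

    def test_seed_reproducibility():
        """Two runs with the same seed on the same input must give identical flocks."""
        first = run_clusterflock_with_seed("example_data/", seed=42)
        second = run_clusterflock_with_seed("example_data/", seed=42)
        assert first == second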

4. More documentation is needed. I suspect this is particularly the case here, since the vast majority of the end-users of the tool will not know Perl. It would be worth putting together a comprehensive manual, and in particular providing detailed installation instructions and a quickstart guide. For example, although I am quite proficient in a couple of languages, I do not use Perl. Even if I had access to a Linux machine to test the software (sadly, I don't, but I hope at least one reviewer does), I'm guessing that getting it up and running would have taken me some time.

5. I searched for a license and found one in the script, but I am confused. The license states that the work is copyright of the AMNH, but also that it is released under the same terms as Perl itself. These seem incompatible, and also perhaps incompatible with the three dependencies that are packaged in the repository. Can the authors double-check this and, when they are sure they have a valid license, include it somewhere obvious in the repository and the manual?

6. Just an observation: 'Clusterflock' is a very popular name for many things, and that makes this tool very hard to find on Google. Even searching for 'clusterflock phylogenetics github' does not produce a link to the tool. It might be worth considering a name that makes the tool easier to find.

Level of interest
Please indicate how interesting you found the manuscript:

An article whose findings are important to those with closely related research interests

Quality of written English
Please indicate the quality of language in the manuscript:

Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

None

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V1.pdf)


Source

    © 2016 the Reviewer (CC BY 4.0 - source).

References

    Narechania, A., Baker, R., DeSalle, R., Mathema, B., Kolokotronis, S.-O., Kreiswirth, B., Planet, P. J. 2016. Clusterflock: a flocking algorithm for isolating congruent phylogenomic datasets. GigaScience.