Abstract

Background: Collective animal behavior, such as the flocking of birds or the shoaling of fish, has inspired a class of algorithms designed to optimize distance-based clusters in various applications, including document analysis and DNA microarrays. In a flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations the agents self-organize, and clusters emerge without the need for partitional seeds. In addition to its unsupervised nature, flocking offers several computational advantages, including the potential to reduce the number of required comparisons.

Findings: In the tool presented here, Clusterflock, we have implemented a flocking algorithm designed to locate groups (flocks) of orthologous gene families (OGFs) that share an evolutionary history. Pairwise distances that measure phylogenetic incongruence between OGFs guide flock formation. We tested this approach on several simulated datasets by varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also verified its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signals, we were able to pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold.

Conclusions: Clusterflock is an open-source tool that can be used to discover horizontally transferred genes, recombined areas of chromosomes, and the phylogenetic 'core' of a genome. Although we used it here in an evolutionary context, it is generalizable to any clustering problem. Users can write extensions to calculate any distance metric on the unit interval, and can use these distances to 'flock' any type of data.

  • Reviewer #1 reported problems with using the Clusterflock tool due to the complexity of installing the software and its dependencies. In response, the authors of Clusterflock have provided a Docker container which ships all of the code and associated software libraries in a standalone package ready for use.

    I have tested the clusterflock-0.1 Docker container and can report that I have successfully executed the clusterflock.pl and clusterflock_simulations.pl scripts to completion using the instructions available from https://github.com/narechan/clusterflock/blob/master/MANUAL. This involved:

    1. Deploying an Ubuntu-14.04 EC2 virtual server as a t2.medium instance on the AWS cloud and installing the Docker software on it.

    2. Downloading the narechan/clusterflock-0.1 Docker image from DockerHub onto the virtual server.

    3. Executing the Clusterflock scripts by running the clusterflock-0.1 Docker container with this command on the host server:
    $ docker run -v /mount/path/on/host:/home/test -it narechan/clusterflock-0.1

    The following two commands can then be executed using the clusterflock-0.1 Docker image:

    $ clusterflock.pl -i test_data/4/fastas/ -c config.boids.simulations -l test_data/4/4.lds -s all -b 1 -d -x -o /home/test/4_out

    $ clusterflock_simulations.pl -c config.boids.simulations -r 10 -p 10 -o /home/test/4_sim/ -i test_data/4/fastas/ -l test_data/4/4.lds -j /home/clusterflock/dependencies/elki-bundle-0.6.5~20141030.jar -k 4 -f 500 > /home/test/4_sim.avg_jaccard

    Both of the above commands generated outputs as described in https://github.com/narechan/clusterflock/blob/master/MANUAL.

    Level of interest
    Please indicate how interesting you found the manuscript:

    An article whose findings are important to those with closely related research interests

    Quality of written English
    Please indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interests
    Please complete a declaration of competing interests, considering the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organisation that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this paper?
    If you can answer no to all of the above, write 'I declare that I have no competing interests'
    below. If your reply is yes to any, please give details below.

    I declare that I have no competing interests.

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.


  • The authors propose a method that uses a modified flocking algorithm to work out how many trees are needed to represent a set of alignments of the same taxa. This is an interesting problem, and the proposed solution is a valuable contribution. Nevertheless, there were a number of places in which I thought the study could and perhaps should be improved. I split them into two below: those to do with the software, and those to do with the manuscript.

    Comments on the manuscript/method:

    1. Only a passing mention is given to previous solutions to this problem. Given that there are various previous solutions, it would be useful for the reader to be given some comparison of the relative merits and shortcomings of the solutions, to motivate the current study. It would also be worth noting in the discussion whether and how this new method overcomes the limitations of previous methods. This seems like an important point for readers who are considering which tool to use.

    2. There are no simulations. This seems like an important omission to me, because without simulations it's impossible to know when the method works well, and when it doesn't. Although the authors present an analysis of one empirical dataset in which the algorithm appears to do roughly what it should, this is not sufficient to make robust judgements as to the general performance of the algorithm. Thus, without simulations I would argue that the conclusions of the paper are not supported by the data; specifically, it is not possible to claim that 'we show that [clusterflock] is particularly well suited to isolating genes into discrete flocks that share a unique phylogenetic history'. Simulations could obviously take many forms, but a simple approach would be to consider 100 datasets with 1-100 trees underlying them. 100 loci could then be sampled across these trees, and fed into the algorithm.

    Repeating each simulation 10 times would require only 1000 analyses, and could give a quite detailed picture of the method's performance. Specific questions to ask would be: what is the false positive rate (i.e. how often do you detect more than one cluster when there is only a single underlying tree)? What is the false negative rate (i.e. how often do you cluster together genes with different underlying trees)? What are the detection limits (e.g. how much data and how different do two trees have to be before you can detect the differences)? What aspects of sequence evolution can mislead the algorithm (e.g. rates of evolution, see below)? How does the ratio of the number of loci to the number of trees affect performance (this seems like a particularly important point to address in a flocking algorithm - it's not obvious to me what will happen to trees that are represented by a single locus, and particularly in the case where most trees are represented by a very small number of loci)?
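    The proposed design is small enough to sketch directly. The following Python scaffold (my own illustration, not code from the clusterflock repository; all names are hypothetical) enumerates the 1000 suggested analyses, with each locus randomly assigned to one of the k underlying trees:

```python
import random

def proposed_simulation_grid(max_trees=100, n_loci=100, n_reps=10, seed=0):
    """Enumerate the analyses implied by the design above: for each
    dataset with k = 1..max_trees underlying trees, assign n_loci loci
    across those trees at random, and repeat n_reps times."""
    rng = random.Random(seed)
    analyses = []
    for k in range(1, max_trees + 1):
        for rep in range(n_reps):
            # each locus is generated under one of the k underlying trees
            locus_trees = [rng.randrange(k) for _ in range(n_loci)]
            analyses.append({"n_trees": k, "rep": rep,
                             "assignment": locus_trees})
    return analyses

grid = proposed_simulation_grid()
assert len(grid) == 1000  # 100 tree counts x 10 replicates
```

    Each `assignment` list would then be used to simulate sequences and fed to the algorithm, and the recovered flocks compared against the known assignment to estimate false positive and false negative rates.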

    3. The design choices are described relatively thoroughly in the paper, but very few motivations for these choices are given. Thus, while I might be able to re-implement a similar algorithm by reading the paper, I have no idea why most of the choices were made. It would be nice to include the background to the decisions made when implementing the algorithms, because this would facilitate progress in this area.

    4. The use of LD seems reasonable here, but it seems like it could also be misled by genes evolving at different rates. This is because higher rates will tend to exacerbate problems like long-branch attraction. Thus, under parsimony, a slow gene and a fast gene may have quite different most-parsimonious topologies. Given the vast differences in rates between many genes, this seems like a potential issue that could at the very least be explored with simulation, e.g. by simulating 100 genes on the same tree, where 50 evolve slowly and 50 evolve more quickly. By varying the rate ratio of the two genes, one could determine whether this is an issue, and at what kinds of scales it manifests itself.

    5. A simple question - could the authors include some information on the relative proportion of the runtimes that are associated with different parts of the algorithm. I ask this because it's easy to think of other options (like calculating ML or NJ trees, and then using any of a number of metrics of tree distances) which might improve accuracy but increase runtimes. However, without knowing what the rate-limiting steps of the algorithm are, it's not possible to know whether such improvements are worth even thinking about.

    6. Following from point 5: given that you have to run the algorithm 100 times to get some idea of the robustness of the flocking, how does the aggregated runtime compare to other approaches to this problem? E.g. what about software such as concaterpillar or conclustador? The latter states that it is specifically designed to solve the same problem as clusterflock, so it seems worth comparing the two here. Note that I don't think it's necessary to do better than any other software - this is a very interesting approach that should be described regardless of whether it's better on any particular metric - but it does seem important to make some attempt to compare performance in terms of accuracy and speed.

    Comments on the software:

    1. The way that github has been used is unconventional, and inconvenient. The only way I could download the software was to download a whole collection of other pieces of software along with it. Please give this software its own repository. This will also facilitate future collaboration and development, since github works fundamentally at the level of the single repository.

    2. Please mint a DOI for the released version of the software with Zenodo or some other service. This ensures that the software will stay around if the github repo is deleted, and it also ensures that the ms refers to a persistent and tagged version of the software even if the repo stays around and the software continues to be developed.

    3. There are no tests in the software. In this case, tests seem rather vital. The paper describes clusterflock as 'an open source tool', so presumably the intention is that many others will use it. Simulations will form a useful set of tests on their own, and should be included in the repository with a script to run all tests and check that they produce the expected results (note: the results don't have to be correct, but there should be some checking to make sure that they are as expected). Given that the algorithm is stochastic, it might be useful to include an option to provide a random number seed in the code, in particular to facilitate testing. Unit tests would also be useful, to ensure that key functions are behaving as expected. As it stands, software with no tests does not inspire a great deal of confidence.
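    To illustrate the point about seeding, here is a minimal Python sketch (a toy update rule, not clusterflock's actual agent logic) showing how a user-supplied seed makes a stochastic run reproducible, which is what a regression test needs:

```python
import random

def noisy_step(positions, seed=None):
    """Toy stochastic position update; with a fixed seed the run is
    fully deterministic and can be checked against stored output."""
    rng = random.Random(seed)
    return [p + rng.uniform(-1.0, 1.0) for p in positions]

run_a = noisy_step([0.0, 1.0, 2.0], seed=42)
run_b = noisy_step([0.0, 1.0, 2.0], seed=42)
assert run_a == run_b  # identical seeds give identical trajectories
```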

    4. More documentation is needed. I suspect this is particularly the case here, since the vast majority of the end-users of the tool will not know Perl. It would be worth putting together a comprehensive manual, and in particular providing detailed installation instructions and a quickstart guide. For example, although I am quite proficient in a couple of languages I do not use Perl. Even if I had access to a linux machine to test the software (sadly, I don't, but I hope at least one reviewer does), I'm guessing that getting it up and running would have taken me some time.

    5. I searched for a license, and found one in the script. But I am confused. The license states that the work is copyright of the AMNH, but also that it is released under the same terms as Perl itself. These seem incompatible, and also perhaps incompatible with the three dependencies that are packaged in the repo. Can the authors double check this, and when they are sure they have a valid license, include it somewhere obvious in the repository and the manual.

    6. Just an observation: 'Clusterflock' is a very popular name for many things, and that makes this tool very hard to find on google. Even typing 'clusterflock phylogenetics github' does not produce a link to the tool. It might be worth considering a name that makes the tool easier to find.

    Level of interest
    Please indicate how interesting you found the manuscript:

    An article whose findings are important to those with closely related research interests

    Quality of written English
    Please indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interests
    Please complete a declaration of competing interests, considering the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organisation that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this paper?
    If you can answer no to all of the above, write 'I declare that I have no competing interests'
    below. If your reply is yes to any, please give details below.

    None

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V1.pdf)


  • The manuscript presents a method to identify phylogenetically congruent genes through an agent-based modelling approach originally designed to model bird flocking. The method is concisely presented and applied to a set of Staphylococcus aureus genomes known to have evolved via large hybridisation events.

    The problem is relevant and the idea has merit, but as I elaborate below, I am concerned that no effort was made to compare the approach to standard clustering approaches or to existing methods for the same problem. I am also concerned that the authors only provide results for a single dataset. The minimum standard in the field is to validate one's approach on a variety of simulated data, showing that the method performs well under these ideal conditions at least.

    Major points:

    1. Method only validated on a single problem instance. This is inadequate for a new method. Instead, the authors should at least show on simulated datasets covering a variety of scenarios that the algorithm is able to cluster the data correctly.

    2. No comparison with other methods: As the authors correctly point out, their approach boils down to a clustering method. There are many such methods, so why should the proposed approach be preferred? Contrary to the claim two paragraphs prior to the conclusions (please number your ms pages), there are other clustering methods that do not require specifying the number of clusters. Even for those that do, there are heuristics available (elbow, silhouette, etc.). At the very least, it seems that embedding the genes in a space using a standard multidimensional scaling procedure followed by clustering (e.g. using the OPTICS algorithm used by the authors) would provide a reasonable baseline to gauge how useful the flocking approach is.
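    The suggested baseline is straightforward to assemble. As a sketch (my own, with a random stand-in matrix for the real pairwise incongruence distances), classical Torgerson MDS embeds the loci directly from the distance matrix, after which a density-based method such as OPTICS could be run on the coordinates:

```python
import numpy as np

def classical_mds(d, k=2):
    """Torgerson classical MDS: embed points in k dimensions from a
    pairwise distance matrix via double centering of squared distances."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    b = -0.5 * j @ (d ** 2) @ j               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)            # eigenvalues, ascending
    top = np.argsort(vals)[::-1][:k]          # keep the k largest
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# toy symmetric "incongruence" matrix on the unit interval
rng = np.random.default_rng(0)
d = rng.random((10, 10))
d = (d + d.T) / 2
np.fill_diagonal(d, 0.0)
coords = classical_mds(d)   # 10 x 2 embedding, ready for e.g. OPTICS
```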

    Minor Points:

    3. What genomes were used as input? (accession number/date)

    4. How were the orthologous groups computed?

    5. How were the single-gene trees computed?

    6. Given that orthologous groups were inferred, why did the authors need to map genes to USA300/TCH1516 via profile HMM? In any case, this needs to be described.

    7. Paragraph right before conclusions: "The LDs of these genes with respect...". The authors probably mean ILD here. In the context of recombination, LD usually means linkage disequilibrium, which could be confusing.

    8. Same sentence: the conjecture that genes that are both in the "core cluster" and hybridisation region could have *reverted back* to the core phylogeny seems highly improbable to me. Assuming these indeed follow the core phylogeny, it seems more likely that they were translocated to that region *after* the hybridisation event.

    9. The labels on Fig. 3 are illegible.

    Level of interest
    Please indicate how interesting you found the manuscript:

    An article of limited interest

    Quality of written English
    Please indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interests
    Please complete a declaration of competing interests, considering the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organisation that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this paper?
    If you can answer no to all of the above, write 'I declare that I have no competing interests'
    below. If your reply is yes to any, please give details below.

    By way of full disclosure, I am the senior author of a loosely related manuscript submitted to
    another journal. However, the two manuscripts use different approaches, and have very different
    focuses, so they are not in competition.

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V1.pdf)


  • I have a few remaining comments:

    1. Usability

    While I think the method is interesting, the implementation remains very difficult to use. After two hours of attempting to install and run the software (I am a proficient programmer in python and R, but have zero Perl experience) I gave up. The installation remains complex for non-perl experts, and the sparsity of the documentation does not help (the documentation has been expanded somewhat, but it remains far too sparse to be useful to non-perl programmers). Because of that, the utility of the tool for the end-users (presumably, biologists with multi-locus datasets) is questionable. This is not something I see as a barrier to publication of the method - which itself is interesting - but since the primary focus of this paper is the software itself, this does seem to me to be an issue.

    2. DOI

    The authors provide no cogent reason not to provide a DOI for their software. I don't know what the issue is here. By not providing a DOI (e.g. through Zenodo), there is no guarantee that the software will stay around. This is a problem for reproducibility and for the general utility of the work. Given that link rot and lost or broken software are such huge problems in our field, and given that the primary focus of this paper is the provision of 'an open-source tool', I think it's important to properly archive a version of the software with a DOI here. Neither tagging versions in github nor making a copy of the repo on bitbucket guarantees persistence. But the ~10 minutes it takes to provide a DOI through Zenodo does guarantee persistence. It means that, no matter what the authors decide to do with their github repository, the copy of the code used for this ms will be around and will be discoverable from the manuscript itself.

    A side note: the authors state that they have tagged the current version of the software as 0.1. However, there are no tags or releases on their github repository. Tags and releases are specific things designed to help people get to particular versions of software: https://help.github.com/articles/creating-releases/ . Minting a DOI with Zenodo would solve this problem too - Zenodo works with tagged versions of the repository only.

    3. Simulations

    Can the authors please provide data (in a figure) on the number of clusters returned by clusterflock in each of the simulated datasets, versus the number of underlying topologies that were simulated. It's not possible to get this from the currently-presented data, and this is an important part of assessing the accuracy of the algorithm on the simulated datasets.

    4. Data availability

    Please provide the output data from the simulations: specifically, the data that could be used to recalculate figures 3 and 4 on the identity of the simulated topology versus the topology to which clusterflock assigned that locus.

    5. Discussion of performance

    Figures 3 and 4 would benefit from having the expected Jaccard index under random assignment of trees to loci plotted. This way we could see which methods do no better than randomly assigning trees to groups. As far as I can tell, clusterflock with 50% missing data tracks the random expectation very closely (JI = 0.5 with 2 trees; 0.1 with 10 trees; 0.04 with 25 trees). This in itself is interesting - even with data for 50% of the species, clusterflock does not appear to gain any benefit over randomly assigning trees to groups. Can the authors comment on this particular case? It seems counterintuitive to me that with data for 50% of the species at each locus, the method gains no benefit over randomly assigning trees.

    More generally, can the authors comment on the meaning (for biologists) of the fact that clusterflock gets a JI of ~0.4 when there are 25 simulated topologies. If the algorithm correctly assigns loci to topologies less than half of the time in these simulations, what does this mean for biological inferences from the data? For example, it seems from the simulated and empirical data that while clusterflock might be useful when the number of clusters is very small (e.g. <10) it might be much less useful with >10 clusters. For example, while the empirical test presented in the paper is compelling, it seems likely that the algorithm may be much less useful if there had been a lot of recombination events (as might be the case in many empirical datasets, such as the analysis of whole-bird genomes from across the avian tree of life).

    As above, some comparison with existing approaches to this problem is warranted here: if clusterflock does better than existing approaches (i.e. Concaterpillar, conclustador, etc), then that's great even if the absolute performance remains less than ideal. In this case, biologists should prefer clusterflock because it makes the best inferences. However, if clusterflock is consistently worse than other methods, then we know that it is a neat method that requires additional development before it is useful. In my opinion, knowing which of these situations is the case would vastly strengthen the paper.

    Level of interest
    Please indicate how interesting you found the manuscript:

    An article whose findings are important to those with closely related research interests

    Quality of written English
    Please indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interests
    Please complete a declaration of competing interests, considering the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organisation that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this paper?
    If you can answer no to all of the above, write 'I declare that I have no competing interests'
    below. If your reply is yes to any, please give details below.

    None.

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V2.pdf)


  • The authors have satisfactorily addressed my remarks. I only have two small comments on the new analyses:

    1) The comparison with other clustering methods is a good addition. The fragility of hierarchical methods and "partitioning around medoid" with respect to missing data is surprising. The authors should make the data and scripts available.

    2) The legend of new figures 3 and 4 should be clearer. As it stands, one needs to read the main text to understand that "zero, ten, twenty" refers to percentages of missing data.

    Level of interest
    Please indicate how interesting you found the manuscript:

    An article of limited interest

    Quality of written English
    Please indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interests
    Please complete a declaration of competing interests, considering the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organisation that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this paper?
    If you can answer no to all of the above, write 'I declare that I have no competing interests'
    below. If your reply is yes to any, please give details below.

    I declare that I have no competing interests.

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0152-3/13742_2016_152_AuthorComment_V2.pdf)

