Content of review 1, reviewed on November 13, 2017

This manuscript is a review of recent k-mer counters. The description of the tools is passable overall; in some places it could be made less vague, and there are some factual inaccuracies (I list some specific ones below). I also recognize that it is difficult to estimate the right level of detail to give for each k-mer counter.

Where the article shines is in its benchmark section. The authors did a painstaking job of reporting relevant results for all the k-mer counters. They even commented on the CPU usage across the execution of the tools. The benchmark looks good to me; the tools behaved somewhat as I expected.

As it is, the manuscript really needs one more good pass of proof-reading by a native speaker, and another pass by a senior computer scientist. I therefore recommend a major revision but would be happy to review a revised version later, as I believe that such a benchmark will be valuable.

signed,

Rayan Chikhi

Major remarks:

  • Another k-mer counting benchmark is cited: "Computational Performance Assessment of k-mer Counting Algorithms" from Perez et al, 2016. I will agree that the current manuscript is much more up to date with respect to the software and versions. Please nevertheless discuss whether your conclusions are in line with what Perez et al report. Please also recall what the conclusions of Perez et al are, in your introduction.

  • The authors put much similar emphasis on the memory usage and disk usage as on running time. However one should note that, from a user perspective, when it comes to counting large datasets such as human, it does not matter so much if a tool uses 100 GB of disk or 150 GB of disk, as this resource is generally plentiful. Thus, I would have appreciated a more pragmatic viewpoint when it comes to comparing tools with respect to the various metrics.

  • The middle of the results section, for instance the specific details regarding how each tool was run, should be moved to the appendix.

  • Tables 4, 5, 6, 7 too. The authors nicely spotted that MSPKmerCounter and Gerbil apparently have bugs, yet all these tables in the main text contain a lot of redundant information.

Minor remarks:

grammar/typo: "till" in Introduction, "approache" page 5.

  • Formatting, lines 17 and 19 page 3, e.g. "S = {{ACGTTA},{ACGTTT}}.A": lack of space after comma. Also why is the set "s" a set of sets? Counting k-mers implies returning the set s = {ACGT, CGTT, GTTT}, as it does not matter which read the k-mer comes from. The following sentence is awkward: "A 4-mer is a 4-character long substring of every such reads in input set S is obtained as set."
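To make the reviewer's point concrete, exact k-mer counting returns a single flat multiset of k-mers with their counts, independent of which read each k-mer came from. A minimal Python sketch (with hypothetical example reads) illustrates this:

```python
from collections import Counter

def count_kmers(reads, k):
    """Exact k-mer counting: one flat multiset over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical reads; the 4-mers form a single set, not a set of sets.
reads = ["ACGTTA", "ACGTTT"]
print(sorted(count_kmers(reads, 4).items()))
# → [('ACGT', 2), ('CGTT', 2), ('GTTA', 1), ('GTTT', 1)]
```

Real counters use far more compact structures (hash tables, sorting, tries), but the output contract is the same: k-mer to count, with read membership discarded.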

  • Page 4 line 10, reference [15] is Jellyfish, and it does not perform k-mer frequency statistics

  • Page 5, regarding the sentence "In the light of this, many heuristic techniques and 2 approaches have been implemented in the various research works". Not all k-mer counters are heuristics, in fact the majority of them return exact results.

  • Page 5, the sentence lacks citations: "Many memory efficient data structures used by various researchers can be listed as, enhanced suffix array, burst trie, lock free hash table, membership query data structure like bloom filter, pattern block bloom filter, counting quotient filter (CQF) and so on". Although I realize each of these objects will be described later, it is confusing to list them all here.

  • Page 5, please define "primary/secondary" memory.

  • Reference 24 is badly formatted.

  • Page 5, the sentence "The presented benchmark study covers all the areas of evaluation where as the existing literature review is incomplete or inadequate" is quite vague.

  • Page 5, the terms "capability to deal with enormous real world data" and "various categories" are vague.

  • GTester4 is actually named GenomeTester4. I would argue that the term "efficiently" in "efficiently uses the sort and count approach" page 6 is disputable, given that other tools also do that and yet are more efficient.

  • Page 7, the sentence "For k-mer counting, a typical data structure which can hold k-mer against its count is needed such as hash table [37] and is found suitable." reads awkwardly.

  • Page 7, the terms "proper hash function" and "best possible location" are vague.

Page 8 and onwards, there are several more minor remarks regarding the use of the English language as well as the precision of technical terms, but I would rather read a revised manuscript after it has been proof-read by a native speaker. I'll now focus on more technical remarks.

  • Bloom should be capitalized.

  • scTurtle is not an exact k-mer counter, please mention it.

  • Page 9, "Ability to modify [..] k-mers" is inaccurate, k-mers are not "modified".

  • The description of CQF is unclear. For instance: What are the values of the array? What does it mean for an array to have a "key"? (dictionaries have keys, not arrays). How is CQF "better" in terms of loss-less compression of k-mers than count-min sketches? (Count-min sketches do not compress k-mers at all and are not even exact data structures.) Why are CQFs sensitive to high counts? ("highly distorted" is not the right term)

  • Page 10: no, a burst trie is not a modified suffix trie.

  • The KCMBT section overall reads quite awkwardly, it needs to be revised carefully.

  • Page 11: no, DSK does not use a single large hash table. (it uses multiple ones, as k-mers are partitioned)

  • MSPKC is actually MSPKmerCounter.

  • Page 7: the same tool (khmer) is described 4 times.

  • Note: DSK handles k > 127 when it is recompiled (as per the github readme).

Level of interest Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests

Quality of written English Please indicate the quality of language in the manuscript:
Not suitable for publication unless extensively edited

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal. Authors' response to reviews: Reviewer #1:

Comment 1.1 In the testing of the k-mer counting tools, k = 22 and 55 were used, but later k sizes ranging from 28 to 200 were used for testing. Firstly, it would be nice if the authors could explain more about the applications of counting long k-mers like 150, 175, 200. In most situations in bioinformatics, we do not count k-mers of such long length.

We thank the reviewer for a detailed review. The applications for counting of long k-mers (for k = 150 to 200) are appended in the manuscript with related references.

Text added in manuscript on page no. 5, lines 16-23, and page no. 6, lines 1-3. Supporting references are added on page no. 33, lines 10-20.

Comment 1.1.2 Secondly, if some of the k-mer counting tools make it clear that they do not support long k-mers counting, the authors may not need to actually test them and make them fail.

We agree with the point raised here, and hence we have modified the text accordingly: KCMBT, which does not support longer k lengths, has been removed from Figure 1 (old), titled "Analysis of time (second), memory (GB) and disk (GB) utilization of counting algorithms on AT and GT datasets for longer k length with, k = 28, 40, 55, 65, 100, 125, 150, 175 and 200", along with its description. Although tools like BFCounter and KAnalyze 2.0.0 support long k-mer counting, they were unable to execute completely on the AT and NC datasets; the reason could be that they cannot handle longer reads.

Comment 1.2 One important application of k-mer counting is to get the frequency of a specific k-mer. It will be nice to test the performance of these tools to retrieve the frequencies of a group of k-mers. For example, this may be tested by retrieving the frequency of the first 100000 k-mers (or randomly selected k-mers) in a reads data set and measure the time, memory and disk usage of this process.

Out of all the tools considered in this study, only five support online retrieval of k-mer frequencies, namely: 1) KCMBT, 2) Jellyfish, 3) MSPKmerCounter, 4) Tallymer, and 5) Squeakr. These tools do not support the same forms of retrieval of k-mer frequencies. For example: 1) KCMBT supports retrieval of k-mers within a range of frequencies, but not of a single k-mer or a group of k-mers; 2) MSPKmerCounter supports retrieval of one k-mer at a time, but not of a set of k-mers; 3) Jellyfish, Tallymer, and Squeakr retrieve a set of k-mers. Tallymer is not considered in our analysis, as we have only studied tools released after 2010. Squeakr is also not considered in this study, as its exact k-mer counter code has not been released yet. Hence, we could not test the performance of these tools in retrieving the frequencies of a group of k-mers. As these tools support retrieval of varying types of input, a fair comparison amongst them cannot be achieved. Secondly, Jellyfish is the only tool in this study that supports retrieval of the frequencies of a set of k-mers.

Comment 1.3 Related to the previous points, it will be nice to have a more comprehensive table listing more details of each k-mer counting tools, like the algorithm/data structure used, if the tool support long k-mers, or the limit of k-size, if the tool support online k-mer frequency retrieval, etc.

In the original manuscript, the section titled "Overview of k-mer counting approaches" covers the information regarding the data structure and algorithm used by each tool. Hence, a table with comprehensive information on all the points mentioned above has been added in the supplementary material (please see Table S1 along with its description). We believe that the original table titled "Approaches for k-mer counting" is the best way to show the ontology of k-mer counting approaches, as it categorizes the tools according to the underlying principles they use as well as the relationships between these approaches (especially for tools spanning multiple approaches or borrowing ideas from several). Hence, we have kept the original Table 1 as it is in the manuscript.

Comment 1.4 In Tables such as Table 4, Table 5, Table 6, some tools have varying results compared to other tools. Are the results purely wrong? If it is, what is the reason? Is it because the tools do not support the experiment by design? Here more discussions may be needed since the accuracy of the counting of these tools is important information to the readers.

As suggested by Reviewer #2, Tables 4-7 have been merged and are now reflected in two tables, numbered A1 and A2, in the appendix, with redundant data removed. The varying results may be due to bugs in the recent versions of the tools considered. We cannot state the exact reason for the non-matching results, as there can be any number of causes at the implementation level, such as synchronization problems in multithreading or incorrect estimation of array sizes. We took special care while executing the tools, carefully setting all parameters as per the guidelines given in each tool's documentation. The answers to the above queries are reflected in the manuscript.

Text Added in manuscript is on the page no.18 line no.14-23, page no.19 line no 1-7.

Comment 1.5 On Page 17, the authors made it clear that "The tools which give an approximation of k-mer counts histogram by streaming analysis of the data are not considered in this paper." However on Page 20 Line 14, the authors mentioned that "scTurtle is itself having some false positive results". What causes such false positive results? Is it some kind of approximation of k-mer counts? Does this contradict the previous statement?

We agree with your point that scTurtle is not an exact k-mer counter. scTurtle reports only an approximate count of the genuine k-mers because of its probabilistic nature (false positives), owing to the underlying Bloom filter it uses. Although BFCounter also uses a Bloom filter, it makes one more pass to remove all possibilities of false positives (additional filtering to obtain exact results). To remove the contradiction pointed out in the above comment, the scTurtle results have been replaced by single-threaded aTurtle (one of the variants of the Turtle implementation) results, as aTurtle provides exact counting. This benchmark study now considers only exact k-mer counters. A paragraph about aTurtle has been added to the manuscript.

Text Added in manuscript is on the page no.9 line no. 5-14.
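The two-pass scheme described above (approximate Bloom-filter flagging, then an exact second pass that removes false positives) can be sketched as follows. This is a deliberately tiny, simplified illustration of the general idea, not the actual scTurtle or BFCounter code:

```python
import hashlib
from collections import Counter

class BloomFilter:
    """Toy Bloom filter over an m-bit mask; membership tests may return
    false positives but never false negatives."""
    def __init__(self, m, hashes=3):
        self.m, self.hashes, self.bits = m, hashes, 0
    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p
    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def kmers(reads, k):
    for r in reads:
        for i in range(len(r) - k + 1):
            yield r[i:i + k]

def count_repeated(reads, k, m=64):
    # Pass 1: a k-mer already "seen" by the filter is flagged as a
    # likely repeat; collisions can flag false positives.
    bf, candidates = BloomFilter(m), set()
    for km in kmers(reads, k):
        if km in bf:
            candidates.add(km)
        bf.add(km)
    # Pass 2: exact recount of the candidates discards false positives.
    exact = Counter(km for km in kmers(reads, k) if km in candidates)
    return {km: c for km, c in exact.items() if c > 1}
```

Skipping the second pass (as in scTurtle) leaves any candidates flagged by filter collisions in the output, which is the source of the approximate counts; the extra exact pass (as in BFCounter) is what restores exactness.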

Comment 1.6 In Table2, "Genome size (M Base)", may be "M Bases".

We have modified the above term in Table 2 titled as “Dataset specifications”, on page number 16.

Comment 1.7 On Page 3, the description of the k-mer counting problem is a little bit confusing and may need to be rewritten, like the sentence on Line 10: "Every read of set S indicated by r with r[i] accounts for every ith character of r with index starting at 0 to l-1, where l is the length of r."

In response to the above comment, the paragraph explaining the k-mer counting process in the introduction section has been carefully revised; the changes made in the manuscript are as follows.

Text Added in manuscript is on the page no.3 line no. 6-16.

Reviewer #2: Major remarks:

Comment 2.1 Another k-mer counting benchmark is cited: "Computational Performance Assessment of k-mer Counting Algorithms" from Perez et al, 2016. I will agree that the current manuscript is much more up to date with respect to the software and versions. Please nevertheless discuss whether your conclusions are in line with what Perez et al report. Please also recall what the conclusions of Perez et al are, in your introduction.

Our conclusions are in line with Perez et al., who reported on parameters like time, memory utilization, and parallelization. However, we have also tested the performance of the tools with respect to additional parameters, e.g., the impact of compressed input (as suggested by Reviewer #3), the scalability of different tools for long k-mers, and accuracy. We have also tested the performance of the tools on a range of datasets, from small (D. melanogaster) to large (human). We have considered only exact k-mer counters, as opposed to stochastic k-mer counters, to provide a fair comparison. These parameters not only provide a good assessment of the actual performance of the k-mer counting programs but also give a clear picture of the current state-of-the-art programs for counting k-mers. Our study also considers the latest tools to date. We have added supporting text for each such additional parameter in the manuscript.

Text added in manuscript in the Introduction section, page no. 5, lines 1-16, and in the Result and discussion section, page no. 28, lines 20-21.

Comment 2.2 The authors put much similar emphasis on the memory usage and disk usage as on running time. However one should note that, from a user perspective, when it comes to counting large datasets such as human, it does not matter so much if a tool uses 100 GB of disk or 150 GB of disk, as this resource is generally plentiful. Thus, I would have appreciated a more pragmatic viewpoint when it comes to comparing tools with respect to the various metrics.

We agree with your point, and hence we have mainly focused on the prime parameters of time and memory usage rather than on disk usage. To support this, we have removed the old Figure 2, titled "Analysis of disk (GB) utilization of the disk based algorithms for k-mer counting for increasing value of k, k = 8 and k = 55", from the manuscript along with its description. For the newly added parameters, the impact of compressed input files and the multithreading analysis, disk utilization is not discussed. We responded to this comment by adding paragraphs to the manuscript that give a more pragmatic viewpoint on the comparison of tools with respect to the various metrics.

Text added in the Abstract and in the Result and discussion section; text added in manuscript on page no. 5, lines 3-16.

Comment 2.3 The middle of the results section, for instance the specific details regarding how each tool was run, should be moved to the appendix.

Thank you very much for your valuable comment. The specific details regarding how each tool was run have been moved to the appendix (page no.2, 3 and 4).

Comment 2.4 Tables 4, 5, 6, 7 too. The authors nicely spotted that MSPKmerCounter and Gerbil apparently have bugs, yet all these tables in the main text contain a lot of redundant information.

We are grateful for your comment regarding Tables 4-7 and the redundant data. We have merged the old Tables 4, 5, 6, and 7 into two tables (Tables A1 and A2) in the appendix and have done our best to remove the redundant data. As suggested by Reviewer #1, the scTurtle results have been replaced by single-threaded aTurtle (one of the variants of the Turtle implementation) results; we used aTurtle so that the benchmark study includes only exact k-mer counters. A paragraph about aTurtle has also been added to the manuscript.

Minor remarks:

Comment 2.1* grammar/typo: "till" in Introduction, "approache" page 5.

Typo errors have been corrected carefully.

Comment 2.2* Formatting, lines 17 and 19 page 3 e.g. "S = {{ACGTTA}, {ACGTTT}}.A": lack of space after comma. Also why is the set "s" a set of sets? Counting k-mer implies returning the set s = {ACGT, CGTT, GTTT}, as it does not matter which read the k-mer comes from.

The paragraph in the introduction section that explains the k-mer counting process has been rewritten carefully as the same was also suggested by Reviewer #1 and is now reflected in the manuscript.

Text Added in manuscript is on the page no.3, line no. 6-16.

Comment 2.3* Page 4 line 10, reference [15] is Jellyfish, and it does not perform k-mer frequency statistics.

We agree with your point on Jellyfish; the sentence has been corrected and a proper citation added to the manuscript, as follows.

Text added in manuscript on page no.4, line no.2-3 and reference is added on page no.32, line no.35

Comment 2.4* Page 5, regarding the sentence "In the light of this, many heuristic techniques and 2 approaches have been implemented in the various research works". Not all k-mer counters are heuristics, in fact the majority of them return exact results.

We agree with your point and have hence removed the confusing sentences from the manuscript.

Comment 2.5* Page 5, the sentence lacks citations: "Many memory efficient data structures used by various researchers can be listed as, enhanced suffix array, burst trie, lock free hash table, membership query data structure like bloom filter, pattern block bloom filter, counting quotient filter (CQF) and so on". Although I realize each of these objects will be described later, it is confusing to list them all here.

Sir, we agree with your comment. We have removed the sentence mentioned above from the manuscript. Citations for each of the data structures referred to have been added in the manuscript at the respective places. Also, to be in line with Reviewer #1 (Comment: to have a more comprehensive table listing more details of each k-mer counting tool, like the algorithm/data structure used, whether the tool supports long k-mers, the limit on k-size, whether the tool supports online k-mer frequency retrieval, etc.), we have added Table S1, titled 'Overview of various k-mer counting tools', in the supplementary material.

Comment 2.6* Page 5, please define "primary/secondary" memory.

To maintain uniformity throughout, we have used the terms memory and disk instead of primary and secondary memory respectively. The line including the above terms is added in the manuscript.

Text added in manuscript on page no.4, line no.17-18.

Comment 2.7* Reference 24 is badly formatted.

The said reference is now correctly formatted and added to the manuscript.

Reference added in manuscript on page no.33, line no.8-9

Comment 2.8* Page 5, the sentence "The presented benchmark study covers all the areas of evaluation where as the existing literature review is incomplete or inadequate" is quite vague.

We removed the inconsistency by modifying and adding supporting sentences in the manuscript.

Text added in manuscript on page no.5, line no.1-15.

Comment 2.9* Page 5, the terms "capability to deal with enormous real world data" and "various categories" are vague.

By the term "capability to deal with enormous real-world data" we mean that the scalability of the tools for processing large datasets, like human datasets, has also been considered in this benchmark study. We have modified the sentence containing the above terms, with an additional sentence, to improve the manuscript.

Text added in manuscript on page no.5, line no.1-15.

Comment 2.10* GTester4 is actually named GenomeTester4. I would argue that the term "efficiently" in "efficiently uses the sort and count approach" page 6 is disputable, given that other tools also do that and yet are more efficient.

We have modified the sentence containing the above terms to improve the manuscript.

Text added in manuscript on page no. 7, line no.1.

Comment 2.11* Page 7, the sentence "For k-mer counting, a typical data structure which can hold k-mer against its count is needed such as hash table [37] and is found suitable." reads awkwardly.

We have modified the sentence containing the above terms to improve the manuscript.

Text added in manuscript on page no. 7, line no.8.

Comment 2.12* Page 7, the terms "proper hash function" and "best possible location" are vague.

The sentence containing the above terms has been removed from the manuscript. To avoid repetition, the whole paragraph has been removed, because the use of a hash table for k-mer counting is already covered in the subsequent paragraph.

Comment 2.13* Page 8 and onwards, there are several more minor remarks regarding the use of the English language as well as the precision of technical terms, but I would rather read a revised manuscript after it has been proof-read by a native speaker.

We have rigorously worked on the suggestions recommended by you and have worked hard to improve the quality of written English and precision of technical terms throughout the manuscript.

Technical remarks: Comment 2.1** Bloom should be capitalized.

Bloom is capitalized in the manuscript at the respective places.

Comment 2.2** scTurtle is not an exact k-mer counter, please mention it.

We have mentioned that scTurtle is not an exact k-mer counter. As suggested by Reviewer #1 (Comment: On Page 17, the authors made it clear that "The tools which give an approximation of k-mer counts histogram by streaming analysis of the data are not considered in this paper." However, on Page 20 Line 14, the authors mentioned that "scTurtle is itself having some false positive results". What causes such false positive results? Is it some kind of approximation of k-mer counts? Does this contradict the previous statement?), we have replaced the results of scTurtle with those of single-threaded aTurtle (one of the variants of the Turtle implementation), as aTurtle provides exact counting.

Text Added in manuscript is on the page no.9 line no. 5-14.

Comment 2.3 and 2.4

2.3** : Page 9, "Ability to modify [..] k-mers" is inaccurate, k-mers are not "modified".

2.4** : The description of CQF is unclear. For instance: What are the values of the array? What does it mean for an array to have a "key"? (dictionaries have keys, not arrays). How is CQF "better" in terms of loss-less compression of k-mers than count-min sketches? (Count-min sketches do not compress k-mers at all and are not even exact data structures.) Why are CQFs sensitive to high counts? ("highly distorted" is not the right term)

Necessary changes have been made in the sentence containing the above terms, and we have removed the vague terms from the paragraph, explaining the CQF.

Text added in manuscript on page no. 9, line no.15-20

Comment 2.5** Page 10: no, a burst trie is not a modified suffix trie.

We agree with your point, Sir, and the necessary change has been made in the sentence containing the above terms.

Text added in manuscript on page no.10, line no.18-19

Comment 2.6** The KCMBT section overall reads quite awkwardly, it needs to be revised carefully.

We have carefully revised the KCMBT section, to improve the manuscript.

Text added in manuscript on page no.10, line no.18-22 and page no.11, line no.1-17

Comment 2.7** Page 11: no, DSK does not use a single large hash table. (it uses multiple ones, as k-mers are partitioned).

The sentence corresponding to the above point has been rewritten.

Text added in manuscript on page no.12, line no.11

Comment 2.8** MSPKC is actually MSPKmerCounter.

MSPKC is replaced with MSPKmerCounter at respective places in the manuscript, appendix and supplementary.

Comment 2.9** Page 7: the same tool (khmer) is described 4 times.

We accept your point, Sir. Only the latest implementation/approach of khmer, published under the title "Efficient cardinality estimation for k-mers in large DNA sequencing datasets", is now described in the manuscript. The older khmer implementations/approaches have been removed.

Text added in manuscript on page no. 16, line no.14-16

Comment 2.10** Note: DSK handles k > 127 when it is recompiled (as per the github readme).

Sir, we recompiled the code and took the readings for k = 28, 40, 55, 60, 100, 125, 150, 175, and 200 once again for datasets NC and AT; the changes are reflected in Figure 1 on page no. 28, along with the modified description in the manuscript.

Reviewer #3 Comment 3.1 As a general, minor comment; though the manuscript is nicely laid out, the quality of the written English should be improved. There are a number of places with awkward phrasing, typos, etc.

We have rigorously worked on the suggestions recommended by you and have worked hard to improve the quality of written English throughout the manuscript. We have also removed the awkward phrasing, typos, etc. as suggested by you.

Major technical comments

Comment 3.1* For example, the authors state that input datasets are decompressed and "concatenated" into a single input file before running each of the tools. While this may seem a reasonable approach to normalize for potential differences (e.g., if not all of the tools support decompression directly), it may have a non-trivial ( significant.)effect on the results as they could deviate from how the tools might be used following best practices. Specifically, it is common to run most k-mer counting tools directly on the compressed files without first decompressing the reads. This has the benefit, especially on "traditional" hard drives, of improving the I/O throughput and allowing the actual counting algorithms to consume data faster. This isbecause certain compression schemes like gzip have quite limited computational overhead, and so the cost of having to decompress the data in memory is overcome by the corresponding increase in data throughput that results from reading compressed data from disk. Since this is likely a common strategy for running k-mer counters, it should be included in the existing benchmarks (at least in addition to the current benchmarks). Another potential issue stems from normalizing the data by concatenating the reads into a single file. Some k-mer counting tools (e.g. KMC2, though I am not entirely certain of the changes made in KMC3) effectively perform parallelization in their first phase by reading from individual input files using separate threads. This means that restricting the input to a single file effectively limits them to 1 or 2 threads (e.g., one to parse and one to bin / partition). Given KMC3 %CPU utilization numbers this may, indeed, be occurring. It looks as though parallelism effectively increases when KMC3 enters the second phase (that doesn't depend on input), but it would likely be able to do even better on multi-file inputs if those inputs were not first concatenated into a single file. 
This may also be true of other tools.

Thank you, sir, for the above comment. The benchmarking of all the k-mer counting tools that support compressed input was performed by running them on compressed input files (gzip/bz2), and the results are reported in Tables 4, 5, 6, 7, and 8. The related description has also been added in the manuscript.

Text added in manuscript on page no.29 and line no. 2-18

Comment 3.2* My final concern is not, unfortunately, one where I am able to offer a great solution. That is, given the wide variation in adopted settings, it's not completely clear to me that the CPU resource comparisons are fair. For example, when different programs are executed with different numbers of threads, their running times aren't directly comparable. Moreover, given the difficulty in achieving "perfect scaling", it's not necessarily much better to examine non-wall-clock-time metrics. I realize that it is not always possible to compare all programs with exactly equal parameters --- as some programs, for example, restrict the specific number of threads that can / must be used. Perhaps, in this case, the best approach to analyzing CPU-related performance comprehensively is to run each program using a few configurations in terms of the number of threads. Though things may still not be directly comparable, it will at least give some insight into the scaling properties of the different tools with respect to the number of threads they are allowed to use.

We agree with the suggestion made herein, and we have performed the benchmark study of all the tools using the following numbers of threads (threads = 1, 2, 4, 6, 8, and 12) on two different datasets, FV and MB. The results (speedup and memory plots) are shown in Figure 2, and the corresponding readings are added to the supplementary document (Tables S8-S9). The related description is added in the manuscript.

Text added in manuscript on page no. 29 and 30

Comment 3.3* As an aside, when certain in-memory tools were not able to complete a task in the specified time or resources (e.g., running out of memory), I wonder if the authors considered the strategies such programs had to minimize RAM usage. For example, Jellyfish 2 supports filtering "erroneous" k-mers using a Bloom filter. This capability is mentioned in the introduction. However, I wonder if the authors attempted to use this feature when Jellyfish 2 exceeded the allocated resources when processing certain datasets. It is possible that other programs also expose such options. Perhaps it is reasonable to avoid benchmarking all tools in such "non-standard" configurations, nonetheless, it would be useful to know if it is at least possible to allow such tools to complete processing these datasets on the benchmarking machine.

Thank you for this valuable comment. On our benchmarking machine, Jellyfish in its default mode could not execute on HS1, so we tried to run Jellyfish in BF-based mode on HS1 for both values of k (28 and 55). However, phase 1 of Jellyfish's BF-based mode did not finish within 15 hours for either k value, and since the system froze, we had to terminate its execution forcefully. Jellyfish was also tested in BF-based mode on dataset MB (M. balbisiana), where its performance degraded in time (by 21% for k = 28 and 45% for k = 55) and memory (by 25% for k = 28 and 45% for k = 55) compared with Jellyfish's default mode. The results of Jellyfish's BF-based mode on dataset MB for k = 28 and k = 55 have been added to the supplementary documents.

Text added to the manuscript on page 20, lines 1–3.
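As an aside for readers unfamiliar with BF-based mode: the general idea behind Bloom-filter-based filtering of "erroneous" k-mers (as supported by Jellyfish 2) is that a k-mer is only admitted to the exact counting table on its second sighting, so singleton k-mers, which are mostly sequencing errors, never consume table memory. The sketch below is a hedged illustration of this idea, not Jellyfish's implementation; the names `BloomFilter` and `count_nonsingleton_kmers` are ours, and the filter parameters are arbitrary.

```python
# Illustrative sketch only -- NOT Jellyfish's code. Shows the general
# Bloom-filter-first strategy for skipping singleton k-mers.
import hashlib

class BloomFilter:
    def __init__(self, size=1 << 16, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive n_hashes positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def count_nonsingleton_kmers(reads, k):
    """Count only k-mers seen at least twice, using a Bloom filter
    to remember first sightings without storing the k-mer itself."""
    seen_once = BloomFilter()
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif kmer in seen_once:   # second sighting: start exact count at 2
                counts[kmer] = 2
            else:                     # first sighting: record only in the filter
                seen_once.add(kmer)
    return counts
```

For example, `count_nonsingleton_kmers(["ACGT", "ACGA"], 3)` reports only `{"ACG": 2}`, because CGT and CGA occur once each and stay in the filter. The trade-off observed above (slower runtime in BF mode) is consistent with the extra hashing work each k-mer incurs.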

Source

    © 2017 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on March 14, 2018

The authors have made extensive corrections and responded to all my comments in a satisfactory fashion, except one: I had recommended that the paper get an extensive proof-reading by a competent English speaker, yet there are still numerous mistakes.

Major remarks:

1. Reading the first two pages, part of the results and the conclusion, it is apparent that the usage of English is still largely flawed throughout the paper.
2. The "Result and discussion" section has become very long (12 pages with figures); please use subsections.

English mistakes found on page 1:

"A read say r" -> "a read r"

"a sequence of nucleotide" -> "a sequence of nucleotides"

"denotes alphabet " -> "denotes the alphabet" or "denotes an alphabet"

" Let R denote the dataset having n number of reads such that R = {ri; 1 ≤ i ≤ n}." -> "having n reads"

"applied in the de novo genome assembly viz., the overlap layout consensus approach [3, 4] and the de Bruijn graph [5-8] based assembly" -> "applied in de novo genome assembly, e.g. using the overlap layout consensus approach [3,4] or the de Bruijn graph approach [5-8]".

"The probable misalignment in the reads is either due to errors or genuine nucleotide variations can be estimated using k-mer frequencies" -> This sentence does not make sense to me. What is "the misalignment" in this context, is it related to multiple sequence alignment or error-correction or both? Could you give a citation that shows an approach that perform such "misalignment" estimation task? (but I did not understand what this task is)

on page 2:

"de novo repeat annotation techniques like ReAS makes" -> "make"

"high-frequency k-mer as a seed" -> "high-frequency k-mers as seeds"

"to annotate repetitive plant genome" -> "to annotate repetitive plant genomes"

"billions of next-generation sequencing (NGS) data needs to be processed" -> "billions [..] need"

"disked based" -> "disk based" throughout the paper

pages 9 and 12: "till" should be replaced by "until"

page 19: "within15 hours" space missing

page 25: "it pass" -> it passes

page 26: "When it's time " -> When its time

page 26: "[..], it is still remarkable" -> this is a non-quantitative personal judgement. Same for "astonishing" in the next sentence.

page 31: "several gigabytes of genomic data is generated." -> are generated

Note: this is not an exhaustive list of style corrections for the paper, just a few that I spotted.

Source

    © 2018 the Reviewer (CC BY 4.0).

Content of review 3, reviewed on September 05, 2018

See attached file.

Declaration of competing interests. Please complete a declaration of competing interests, considering the following questions:

  • Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
  • Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
  • Do you hold or are you currently applying for any patents relating to the content of the manuscript?
  • Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
  • Do you have any other financial competing interests?
  • Do you have any non-financial competing interests in relation to this paper?

If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews: Reviewer #2:

Comment 2.1 page 3: "For instance, k-mer frequencies are used to assess a probable misalignment among reads, which is either a sequencing error or a genuine nucleotide variation [10]." -> I am still unsure about that sentence, as I couldn't find where this matter is discussed in the Quake paper. Could you please point me to it, or remove this sentence? It is not clear to me whether any alignment "among reads" is being made; it appears instead to be alignment between reads and a reference genome.

We thank the reviewer for the detailed review. We agree with the point raised here and have removed this sentence from the manuscript as suggested.

Comment 2.2 page 4: "Statistics on the number of frequencies of all the k-mers" ->The meaning of "number of frequencies" is unclear.

In this sentence, "number of frequencies" has been replaced with "number of occurrences" in the manuscript.

Comment 2.3 page 4: "use arrays with substring {k-mer} indexing." ->How is a flat array indexed by substrings? isn't it rather a dictionary?

We agree with the point raised here and have modified the text accordingly. Modified Text: A naive approach for k-mer counting is to use a dictionary, with k-mers as keys and their counts as values.
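For concreteness, the naive dictionary approach described in the modified text can be sketched as follows; this is an illustrative example of ours, not code from the manuscript or from any of the benchmarked tools.

```python
# Illustrative sketch of naive dictionary-based k-mer counting:
# k-mers are keys, their occurrence counts are values.
from collections import defaultdict

def count_kmers(reads, k):
    """Count every k-mer occurrence across a set of reads."""
    counts = defaultdict(int)
    for read in reads:
        # Each read of length L contributes L - k + 1 k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)

reads = ["ACGTAC", "CGTACG"]
print(count_kmers(reads, 3))  # {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```

As the manuscript notes, this approach overwhelms memory on real datasets, since the dictionary must hold every distinct k-mer (including error k-mers) simultaneously.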

Comment 2.4 page 4: "the approach will soon overwhelm" ->More definitive wording can be used, I suggest "the approach overwhelms".

"the approach will soon overwhelm" is modified to "the approach overwhelms".

Comment 2.5 page 4: "a magnitude" ->"an order of magnitude"

We have replaced "a magnitude" with "an order of magnitude" in the sentence on page 4, line 18.

Comment 2.6 page 5: "The disk-based approaches achieve very high efficiency at a marginal increase in the cost." ->"Cost" in what resource?

The cost is in regard to I/O. As opposed to a disk-based approach, an in-memory approach puts everything in memory and therefore involves almost no I/O cost [this sentence is adapted from the Jellyfish {Memory} article]. We have modified the text accordingly. Modified Text: The disk-based approaches achieve very high efficiency at a marginal increase in the I/O costs.

Comment 2.7 page 5: Just a remark; on top of what is written there, long reads are also notoriously used at resolving repetition in genome assemblies, not just structural variants.

We agree with the point raised here and have modified the text accordingly. Modified Text: Long reads are used to resolve repeats in genome assemblies; they also facilitate better resolution of structural variants present in DNA samples and genomic repeat content [27], along with many other advantages [26].

Comment 2.8 page 5: "Large values of k {k values up to 200} facilitate improvement of the accuracy of long sequencing reads {particularly of repeat-overlapping reads} and contig assembly [28]." ->Please note that this article isn't about "long reads" in the PacBio/Nanopore sense, but rather slightly longer Illumina reads.

We agree with the point raised here and have modified the text accordingly. Modified Text: Large values of k {k values up to 200} facilitate improvement of the accuracy of longer Illumina reads {particularly of repeat-overlapping reads} and contig assembly [28].

Comment 2.9 page 6: "highest N50 are obtained at an optimal choice of k, which seems to be larger values of k" ->No, not necessarily for larger values of k. {Only when there is sufficiently high coverage.}

We agree with the point raised here and have modified the text accordingly. Modified Text: Empirically, the best assemblies {without misassembly} and the highest N50 {only when there is sufficiently high coverage} are obtained at an optimal choice of k, which tends to be a larger value of k [29, 30].

Comment 2.10 page 7 typo: "reprove"

We have incorporated the suggested change: "reprove" has been replaced with "reprobe" in the corresponding sentence.

Comment 2.11 page 7: "Jellyfish follows a 'quotienting' technique" ->This seems to me that this quotienting technique is just a regular insertion procedure in a hash table.

We agree with the point raised here and have removed this text, as it is already covered in the subsequent sentences.

Comment 2.12 page 8: I suggest that the original Turtle algorithms may be presented in "k-mer counting using the sorting approach" for greater clarity that it is indeed a sorting-based algorithm.

We agree that Turtle follows a sorting-based approach. The Turtle algorithms are now presented in "k-mer counting using the sorting approach" for greater clarity, with the references rearranged accordingly.

Comment 2.13 page 9: "The k-mers are hashed using a one-way hash function," ->I understand where this sentence is coming from, but the general concept of Squeakr could in principle also use an invertible hash function {which is in fact the main ingredient in Squeakr-exact}.

We agree with the point raised here and have modified the text accordingly. Modified Text: Squeakr {exact}, which counts the frequency of each k-mer exactly using an invertible hash function, is not considered in this benchmark study, as its code is not yet available {and it is therefore currently not suitable for this benchmark study}.

Comment 2.14 page 10: "Burst trie manages large sets [..]"; grammar, perhaps revise to "Burst tries manage large sets [..]"

We have modified the corresponding sentence as suggested: "Burst trie manages" has been replaced with "Burst tries manage".

Comment 2.15 page 10: "extended k-mers, similar to KMC2", for historical accuracy, please mention that {k+x}-mers were introduced by KMC2.

We have added a sentence to the manuscript as suggested by the reviewer. Modified Text: Extended k-mers {{k + x}-mers for x > 0} are substrings of length greater than k and were introduced by KMC2.
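To make the definition concrete: one (k + x)-mer covers x + 1 consecutive k-mers, which is why emitting extended k-mers reduces the number of strings that must be written to disk and later expanded. The snippet below is a hedged illustration of the expansion step only, not KMC2's code; the function name `kmers_in_extended` is ours.

```python
# Illustrative sketch (not KMC2's implementation): expanding one
# extended (k + x)-mer into its x + 1 constituent k-mers.
def kmers_in_extended(extended, k):
    """Return every k-length substring of an extended k-mer, in order."""
    return [extended[i:i + k] for i in range(len(extended) - k + 1)]

# A (3 + 2)-mer contains 3 overlapping 3-mers:
print(kmers_in_extended("ACGTA", 3))  # ['ACG', 'CGT', 'GTA']
```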

Comment 2.16 page 11: "uses hashing approach for k-mer counting such as DSK" grammar, please revise to "uses a hashing approach for k-mer counting similar to DSK".

We agree and have modified the text as suggested. Modified Text: Gerbil [31] uses a hashing approach for k-mer counting similar to DSK.

Comment 2.17 page 11: what is a "part hash function" ?

The correct term is "partHash", not "part hash". We have therefore replaced "part hash" with "partHash" and added the appropriate citation. The partHash function is explained in the Gerbil article {page 4}.

Comment 2.18 page 11: "The algorithm avails high parallelization." I am unsure if this is correct in English.

We agree with the point raised here and have modified the text accordingly. Modified Text: The algorithm is designed to make optimal use of the hardware with multiple threads running concurrently.

Comment 2.19 page 17: "adopted" typo -> "adapted" and the beginning of the sentence "The commands used to run all the programs" is repeated twice

The corrections suggested here are reflected in the manuscript, and the repetition has been removed. Modified Text: The commands used to run all the programs were adapted from their documentation and the publications of KMC3 and KMC2. These commands are given in the Supplementary Material.

Comment 2.20 page 17: "are available with the histogram subroutine." is awkward wording {- "can create histograms directly"}

The correction suggested here is reflected in the manuscript: "are available with the histogram subroutine" has been replaced with "can create histograms directly". Modified Text: Some tools, such as Jellyfish, DSK, Gerbil and MSPKmerCounter, can create histograms directly.

Comment 2.21 page 18: "The results are not entirely wrong for [..]," is vague

We agree that this wording was vague and have made it precise. Modified Text: Not all k-mers and their counts in the outputs of Turtle, MSPKmerCounter and Gerbil {only for k = 55} match the outputs of the other tools on the respective datasets.

Comment 2.22 note: in my version of the manuscript, Figures 1 & 2 are blurry because in JPG format, they should be uploaded in EPS or PDF format

Figures 1 and 2 have now been uploaded in PDF format with better clarity.

Comment 2.23 page 29: "in such large set of reads {massive datasets}" {redundant formulation}

The redundant formulation has been removed from the sentence as suggested. Modified Text: There is a need to continue developing systems that realize a memory-time trade-off for k-mer counting in such large sets of reads.

Note: Since we were not able to submit a "Response to Reviewers" containing opening parentheses, we have used curly braces in the responses above.

Source

    © 2018 the Reviewer (CC BY 4.0).