Content of review 1, reviewed on May 08, 2015

The authors describe GenomeTester4, a software package for set operations on k-mer frequency lists in genomic sequences. It consists of three components: the first, GListMaker, builds the list of k-mers and their counts from the input sequences; the second, GListCompare, implements the set operations (union, intersection, and complement, allowing for some sequence variation); and the third, GListQuery, allows querying a list for counts of k-mers in a set of sequences, or a k-mer list.

The key data structure is a list of sorted k-mers and their counts, in binary format, which allows fast set operations on the input lists.
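To illustrate why a sorted representation makes set operations fast, here is a minimal sketch in Python. It is not GenomeTester4's actual code: the function names, the (k-mer, count) tuple layout and the rule of summing counts are assumptions made for illustration. Two sorted lists can be intersected or united in a single linear merge pass, without random access:

```python
def intersect_sorted(a, b):
    """Intersect two lists of (kmer, count) tuples sorted by kmer.

    Counts of shared k-mers are summed (an assumed rule, for illustration).
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1  # k-mer only in a: skip it
        else:
            j += 1  # k-mer only in b: skip it
    return out


def union_sorted(a, b):
    """Unite two lists of (kmer, count) tuples sorted by kmer, summing counts."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of the two lists still has entries left; append them.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

Both functions visit each entry once, so the cost is linear in the combined length of the two lists.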

The software, albeit straightforward, is useful and needed. The usage scenarios described are representative and illustrate the need for such a tool. For index building, the tool performs similarly to Jellyfish when run multi-threaded and is faster when run with a single thread. Memory and disk requirements are not listed in Table 1, but in our testing the index size and RAM usage for the human genome with k=31 were comparable between the two programs (~29 GB index on disk, ~40 GB RAM).

Discretionary revisions:

There are a few limiting features, which can be addressed in time:
1. The k-mer size is limited to 32, but some applications may require larger values.
2. It is not clear why a sorted list was preferred over a hash index (or a variation), which can store the count in the entry and might be a better choice for the implementation (faster random access and more compact representation).
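The 32-nucleotide ceiling in point 1 is what one would expect if each k-mer is packed into a single 64-bit integer at 2 bits per base. The manuscript's exact encoding is not restated in this review, so the following Python sketch works under that assumption; the base-to-code mapping and function names are hypothetical:

```python
# Hypothetical 2-bit code per nucleotide (an assumption for illustration).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
MAX_K = 32  # 32 bases * 2 bits = 64 bits


def encode_kmer(kmer):
    """Pack a k-mer into an integer that fits in 64 bits when k <= 32."""
    if len(kmer) > MAX_K:
        raise ValueError("k-mer longer than 32 does not fit in 64 bits")
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value


def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    chars = []
    for _ in range(k):
        chars.append(BASES[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(chars))
```

Under this scheme, supporting k > 32 would require a wider key (for example, two 64-bit words per k-mer), which roughly doubles the key storage.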

Minor essential revisions:

3. A rather significant limitation is performance on NGS reads, especially as read error correction is listed as one of the primary applications for k-mer counters. If NGS reads are intended as input, then the authors should include a test case in the evaluation; otherwise, they should explicitly state that the tool is not suitable for this type of data.
4. The software crashed when I tried to build the index for the human genome with 8 threads and a table size of 5 GB on a 512 GB RAM machine:
“Segmentation fault glistmaker /home/florea/hg38c.fa -w 31 --num_threads 8 -o hg38.gt4 --table_size 5000000000”
5. The use of MiB, KiB and GiB in Table 2 and throughout the manuscript should be explained.
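For readers unfamiliar with the binary prefixes in point 5, the difference from the decimal units can be shown with a short Python sketch (the helper names are hypothetical). KiB, MiB and GiB are powers of 1024, whereas kB, MB and GB are powers of 1000:

```python
KIB = 2**10  # 1,024 bytes
MIB = 2**20  # 1,048,576 bytes
GIB = 2**30  # 1,073,741,824 bytes
GB = 10**9   # 1,000,000,000 bytes


def bytes_to_gib(n_bytes):
    """Convert a byte count to binary gibibytes (GiB)."""
    return n_bytes / GIB


def bytes_to_gb(n_bytes):
    """Convert a byte count to decimal gigabytes (GB)."""
    return n_bytes / GB


# 5,000,000,000 bytes is exactly 5 GB but only about 4.66 GiB.
print(round(bytes_to_gib(5_000_000_000), 2))
```

The roughly 7% gap at the giga scale is large enough to matter when comparing memory and disk figures between tools.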

Level of interest: An article whose findings are important to those with closely related research interests
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

 

Authors' response to reviews (http://www.gigasciencejournal.com/imedia/1477433267178627_comment.pdf)


The reviewed version of the manuscript can be seen here:
http://www.gigasciencejournal.com/imedia/1844723828168502_manuscript.pdf
All revised versions are also available:
Draft - http://www.gigasciencejournal.com/imedia/1844723828168502_manuscript.pdf

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

References

    Kaplinski, L., Lepamets, M., Remm, M. 2015. GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists. GigaScience.