Content of review 1, reviewed on October 18, 2014

In this study Siretskiy et al. provide an analysis of the runtime and scalability characteristics of an example Hadoop genomics workflow and compare it to a more traditional HPC architecture and workflow. In large part their work measures the performance of Crossbow (Langmead et al.), which the authors have extended slightly by replacing its preprocessing module. The authors show the scalability of Hadoop and assert that it is performance-competitive for datasets over 100 Gbases. The novel contribution of this part of the paper is the empirical comparison against processing in a traditional HPC setting. In addition, the authors conduct a second test that shows the benefits of Hadoop/HDFS’s data-locality strategy. Although these second results are not terribly surprising in showing that non-data-local architectures can saturate their network connections to storage, it is useful to see empirically where this happens in a typical HPC setup. Commendably, replicates of the experiments were conducted. The basic scripts used to conduct the analyses are available on GitHub.

Major Compulsory Revisions

  1. The authors of Crossbow conducted some measurements of scalability and reported them in their paper. It would be good for this manuscript to acknowledge that work more explicitly, discuss its own results in relation to those reported by Langmead et al., and clearly state the novelty of this work. In addition, some portions of the manuscript do not make sufficiently clear what is novel (the testing in comparison to HPC) versus what has already been built and published. In particular, the second paragraph of the Methods section should make it clear that the workflow being described is Crossbow - Crossbow is referenced in the next paragraph, but at that point it sounds like the description of a novel pipeline.

  2. Methods pp 1: Why are the pipelines different between the HPC approach and the Hadoop approach? In particular, if the use of SOAPsnp is mandated by Crossbow in the Hadoop approach, why use Samtools for variant calling in the HPC approach? It seems more difficult to make an apples-to-apples comparison if the tools used are different.

  3. Methods pp 3: "The input data and the indexed genome reference were copied to Hadoop’s storage system (HDFS) before starting the experiments. Specifically, the storage volume containing the input data was mounted with SSHFS to one of the Hadoop nodes, from where the bzip2-compressed data were transferred to the HDFS.”

The authors should report and discuss the time spent transferring data into HDFS, or address why they do not do so. This has been a bottleneck in several of the Hadoop-based applications that I am familiar with, and seems relevant to this study, which is attempting to characterize the end-to-end performance of Hadoop vs a traditional HPC setup. This also applies to the T_comm time reported for Hadoop in the Test II section of the Results.

  4. Methods pp 3, Preprocessing results, and Table 2: "Unfortunately, unlike Crossbow’s main program, its preprocessor does not scale well to large datasets (it is not implemented as a MR program).”

In the Crossbow paper (2009), Langmead et al. say, "Crossbow is distributed with a Hadoop program and driver scripts for performing these bulk parallel copies while also preprocessing the reads into the form required by Crossbow.” Why does this program not scale? It sounds as though it can perform parallel copies; a more detailed explanation would help.

  5. Results - Test I: Even given that different SNP callers were used (which is not optimal; see comment 2), it does not seem sufficient to compare the outputs based on only a single variant call. It would be ideal to characterize the similarity between the two call sets; at the least, the number of calls made in each dataset should be reported so that we can be reasonably confident that the callers were doing similar amounts of work.

  6. Results - Efficiency of Calculations, pp 3:

"Hadoop, even when running in the virtualized environment of a private cloud assembled on moderate hardware, is competitive with the HPC approach run on “bare metal” of modern hardware for datasets larger than 100Gbases”

The manuscript should define what is meant by “competitive”. Hadoop still takes more CPU hours according to these results, if I understand them correctly. One could justify the extra CPU hours with an argument relating to throughput or elapsed time, but it is up to the manuscript to make that argument. It’s also not clear to me from figure 2b why 100 Gbases was chosen as the cutoff - it seems to lie in a region where the slope is gradually changing. Dataset V does not seem to be present in fig 2b - if it were there, the cutoff might be clearer.

  7. Results - User Interface

While the interface presented for running Crossbow on the UPPMAX cluster seems user-friendly, it does not seem particularly novel, considering that Crossbow itself provides a friendly UI for running on AWS. The manuscript should probably reference the Crossbow UI and differentiate the work done here.

  8. Discussion pp 1: Again, as with the discussion of “competitive” above: "Hadoop becomes comparable with HPC when processing datasets larger than 100Gbases”. It’s not clear quite what is meant by “comparable”. The manuscript should be more specific about what the benefits and tradeoffs of Hadoop are and why the authors believe 100 Gigabases to be the turning point.

  9. Discussion pp 6: "Tests at CRS4 have shown that on large distributed file copy operations – which effectively mimic the typical I/O pattern for Hadoop jobs – their specific shared storage volume can match the throughput of an HDFS volume with up to at least 60 of their specific HPC nodes (unpublished data)."

This sentence is unclear: what does "up to at least" mean? Also, using Hadoop as described here seems to lose the benefits of data locality, which was a point of emphasis earlier in the paper.

Minor Essential Revisions

  1. "For the tools which do not support OMP natively – e.g., Samtools – variant calling can be parallelized by creating a separate process for each chromosome or using the GNU Parallel Linux utility [9]."

Depending on the tool and application, parallelization need not be done by chromosome; for example, genomic intervals can often be specified for some applications.
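To make the interval-based alternative concrete, here is a minimal sketch of the fan-out pattern (the function names and regions are invented for illustration; a real pipeline would run the actual caller, e.g. Samtools with a region argument, inside the worker, which is exactly what GNU Parallel would also do, one process per interval):

```python
from multiprocessing import Pool

def call_variants(region):
    """Stand-in for a per-region variant-calling command; here it
    just returns the region it would have processed."""
    return f"called:{region}"

def parallel_call(regions, workers=4):
    # Fan the regions out over a pool of worker processes; each
    # region is handled independently, so the work split is not
    # tied to whole chromosomes.
    with Pool(workers) as pool:
        return pool.map(call_variants, regions)

if __name__ == "__main__":
    # Sub-chromosome windows give a finer-grained, better-balanced
    # split than one task per chromosome.
    regions = ["chr1:1-50000000", "chr1:50000001-100000000", "chr2", "chrX"]
    print(parallel_call(regions))
```

The same fan-out can of course be expressed as a GNU Parallel one-liner; the point is only that the unit of work is an arbitrary interval, not a chromosome.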

  2. Introduction pp 6: “MapReduce … is successfully used by many well-known data-driven operations [16] to process huge datasets with sizes of up to several terabytes [17]"

MapReduce can certainly handle datasets much larger than several TB for some applications. Petabyte-scale Hadoop installations have been described - see e.g. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/

  3. Introduction pp 8: "Since the Hadoop framework poses a significant performance overhead, we seek to estimate the average data size at which point it starts to pay off from a performance perspective.”

This sentence sounds contradictory. It would be good to clarify what is meant - i.e. a tradeoff between additional CPU hours and sample throughput.

  4. Introduction pp 3: "The most common approach to accelerate NGS tools is to parallelize within a compute node (multi-core parallelization) using shared memory parallelism (OMP) [7], but this approach is naturally limited by the number of cores per node, which nowadays does not usually exceed 16 [8]." ...and later... “The Bowtie aligner natively implements OMP”

Unless I am misunderstanding something, the manuscript appears to conflate any sort of shared memory parallelism with the OpenMP API. Bowtie does not use OpenMP, as far as I can tell from looking at the source code - it appears to use pthreads and mmap to share memory. Using OMP as an acronym for shared memory parallelism in general seems confusing. I may simply be incorrect here, but the terminology doesn’t seem right to me.

  5. Results, Efficiency of Calculations, pp 3: The text reports the extrapolation of FHadoop/FHPC as 1.34 for Hadoop II, but the figure legend says 1.43.

  6. Results - Efficiency of Calculations, pp 4: The manuscript should take more care to distinguish between the BAM sorting and SNP calling tools in SAMtools. Sorting is a bottleneck for BAM file processing, and is a built-in strength of the Hadoop engine, as shown by performance on sorting benchmarks like Terasort. Calling this out would make the manuscript stronger.

Discretionary Revisions

  1. Some of the tools used in the analysis pipelines are quite old. It would make the analysis more relevant for users if more up-to-date tools could be used, although this would likely require changes to the Crossbow codebase.

  2. re Discussion pp 4

It would be good to cite several recently published libraries that attempt to ease adoption of Hadoop in genomics, for example BioPig (Nordberg 2013) and SeqPig (which shares an author with this manuscript!), in relation to the difficulty of developing Hadoop software.

Level of interest: An article whose findings are important to those with closely related research interests
Quality of written English: Needs some language corrections before being published
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/2119042624155274_comment.pdf)

Source

    © 2014 the Reviewer (CC BY 4.0 - source).

Content of review 2, reviewed on February 05, 2015

The authors have made substantial revisions to the paper and addressed most of my concerns. This new version of the manuscript is quite a bit clearer and more understandable. My remaining major comments are all matters of clarifying the writing and interpretation of the results.

Major Compulsory Revisions:

Results Section 3.2, “Scalability": At the end of this section the authors state that “The measurements highlight that one of the advantages of using the Hadoop framework is its almost linear scalability with respect to input size, regardless of the scaling nature of the underlying program”. It’s not really clear to me which measurements this refers to; the immediately preceding discussion was of the results in Figure 6, and the data in Figures 1 and 2 show that Hadoop exhibits scaling problems similar to those of the HPC pipeline when it comes to the SNP-calling stage, which seems to contradict the quoted statement. I think this statement should be clarified.

Results Section 3.2, “Efficiency of Computation”: Although I see what is meant by “competitive” more clearly now, I think that the fact that the ratio F_hadoop/F_HPC extrapolates to 1.4 for the original Hadoop pipeline and 1 for the Hadoop II pipeline deserves some more explanation. If the ratio is 1, then the number of CPU-hours is the same for the two pipelines. It seems to me that the benefits from Hadoop in this case come from being able to decrease the total runtime, and therefore improve throughput, by increasing the number of cores - an option not available to the HPC setup - but that is not explicitly stated. That may not be the interpretation that the authors have in mind, but I think that it is necessary to provide some further comment and/or discussion of these ratios one way or another.
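To illustrate the interpretation I have in mind, a toy calculation (the numbers below are invented for illustration, not taken from the manuscript): equal CPU-hours can still mean very different elapsed times when one platform can enlist more cores.

```python
def wall_time_hours(cpu_hours, cores):
    """Idealized elapsed time assuming perfect parallel scaling."""
    return cpu_hours / cores

# Suppose both pipelines spend the same 400 CPU-hours (ratio = 1.0).
cpu_hours = 400

# An HPC job confined to one 16-core node, versus a Hadoop job
# spread across, say, 8 nodes of 16 cores each (128 cores total).
hpc_elapsed = wall_time_hours(cpu_hours, 16)      # 25.0 hours
hadoop_elapsed = wall_time_hours(cpu_hours, 128)  # 3.125 hours

print(hpc_elapsed, hadoop_elapsed)
```

The CPU-hour bill is identical, but the sample is finished roughly eight times sooner; that throughput argument is the one I think the manuscript should make explicit.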

Table 2: The caption says "The “Hadoop II” data are explained in the text,” but I’m not seeing where it is explained, or any reference to “Hadoop II” in the main text.

Minor Essential Revisions:

Abstract p1: Should be “the challenges of increasingly large datasets” or something similar

Methods Section 2.2, “Data Preparation”, p2: the manuscript says that input data was copied to HDFS "before starting the experiments”. But in this version of the manuscript time spent copying data to HDFS is counted in the communications measurements. I think this needs to be clarified.

Results Section 3.1, “Efficiency of the preprocessing stage”: The manuscript highlights a linear relationship in the data in Table 2 - the same data are present in Figures 1 and 2, and I think the relationship is easier to see there, so it would help to point to those figures from this paragraph.

Figure 6 is mentioned before Figures 3+, so should probably be renumbered as Figure 3.

Discretionary Revisions:

Results Section 3.2, “Concordance of Hadoop and HPC results”: Is the mapping efficiency of 59% reported here typical for A. thaliana datasets? It seems low relative to, for example, efficiency in human experiments. A small comment might give some context.

Results Section 3.2, “Concordance of Hadoop and HPC results”: It would be good to explain the differences seen in the results from the two platforms. Is it the result of, for example, data being processed in a different order? Or, is there some sort of small bug in one of the systems that is causing some reads to be lost? I don’t think this is completely necessary for the paper to make its point about Hadoop efficiency, but it would make it a bit tighter.

Results Section 3.2, Scalability: The authors note that the failures in scalability of the SNP calling stage are due to “Crossbow task scheduling”. It would be interesting to comment on what in particular is causing this. Are there regions (in genomic coordinates) of the pileup with very high depth that take much longer to process? If so, what could be done to mitigate this?

Level of interest: An article of importance in its field
Quality of written English: Needs some language corrections before being published
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/2898289071650429_comment.pdf)

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

Content of review 3, reviewed on March 16, 2015

Major Compulsory Revisions

Results sections "Scalability" and "Efficiency of calculations"

The authors have addressed some of my concerns about the clarity and arguments made in the manuscript. However, the revised text contains many spelling and grammatical errors. For example:

"are due to ther round-robin-like fashion the jobs are distributed between the nodes"

"the Hadoop framework can help to engage dosens of nodes"

The manuscript needs to be proof-read and revised for clarity.

Results section "Concordance of Hadoop and HPC results"

In the newly added explanation for the differences between the HPC and Hadoop implementations, the authors state that the dataset "is being split by the Hadoop platform into chunks (128 Mbytes each), resulting in 10-11 cut reads...". Does that mean that the process of splitting is discarding reads? If so, this should be addressed - I don't think that there is a reason that processing in Hadoop should necessarily mean discarding data, even if the difference in outcomes is small. Hadoop pipelines I have dealt with typically make some effort to put the raw data into a format that is splittable without the loss of records. I think that this should either be fixed or explained.
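For reference, the usual way Hadoop-friendly formats avoid this is to align split boundaries with record boundaries rather than raw byte offsets. A toy sketch of the idea (simplified four-line FASTQ records; real splittable input formats, such as those in Hadoop-BAM, are more involved):

```python
def fastq_records(lines):
    """Group FASTQ text lines into 4-line records (@header, seq, +, qual)."""
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

def split_record_aligned(lines, chunk_records):
    """Split into chunks whose boundaries always fall between records,
    so no read is ever cut in half and none need to be discarded."""
    recs = fastq_records(lines)
    return [recs[i:i + chunk_records] for i in range(0, len(recs), chunk_records)]

# Two toy reads: a byte-oriented split at an arbitrary offset could land
# mid-record, but a record-aligned split cannot.
lines = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGG", "+", "IIII"]
chunks = split_record_aligned(lines, 1)
total = sum(len(c) for c in chunks)
print(total)  # every record survives the split
```

If Crossbow's chunking instead truncates a handful of reads at each 128 MB boundary, as the quoted sentence suggests, that is exactly the behavior a record-aligned split would avoid.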

Level of interest: An article of importance in its field
Quality of written English: Needs some language corrections before being published
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

References

    Siretskiy, A., et al. 2015. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. GigaScience.