Review of Comparative analysis of 7 short-read sequencing platforms using the Korean Reference Genome: MGI and Illumina sequencing benchmark for whole-genome sequencing

Content of review 1, reviewed on May 29, 2020

The submitted study has characterized sequencing quality, uniformity of coverage, %GC coverage, and variant accuracy of seven sequencing platforms. The authors found that MGI platforms showed a higher concordance rate of SNP genotyping than the HiSeq series. The study is of interest to the genomics and sequencing technology fields. Two concerns must be addressed prior to acceptance. 1) The authors defined low-quality reads as those that had more than 30% of bases with a sequencing quality score lower than 20. I am wondering whether the results are stable if the definition is changed. 2) It looks as though the authors ignored that the highest duplicate ratio was found in MGISEQ-T7. More discussion and analysis should be performed to make this clear. The authors claimed that duplicates and adapter contamination may be more affected by the process of sample preparation than by the sequencing instrument. However, again, no evidence was provided.

Declaration of competing interests: I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews: Reviewer #1: In this manuscript, Kim et al. compared seven sequencing platforms, including 2 MGI platforms (BGISEQ-500 and MGISEQ-T7) and 5 Illumina platforms (HiSeq2000, HiSeq2500, HiSeq4000, HiSeqX10, and NovaSeq6000), using one human genome. The sequencing quality of the different platforms was assessed by basic sequencing statistics, mapping statistics, and variant statistics. Overall, the manuscript is suitable for publication in GigaScience after a major revision. There are several major issues with the work presented in the manuscript, as listed below: => Thank you for the precise and critical feedback. We have modified the text and added further analysis to accommodate the reviewer’s suggestions (see below for point-by-point responses).

This work only contains samples from one human individual. It is really hard to reach a confident conclusion based on such a small sample size. => This is a generally correct point. However, both sets of platforms produce massive amounts of sequence, and we expect the sample number not to affect the conclusion much, as our study relies on how similar or dissimilar the two sets of platforms are in terms of variant calling.

This work still needs more samples and even replicates (both cross-platform and intra-platform replicates) for further analysis and to provide confident evidence. => We think this is a practically important point. Unfortunately, we have not generated replicates for each sequencer. This study is based on years of sequencing history with one reference sample, and each sequencing batch may or may not contain multiple replicates. Because each platform has a different amount of sequence output per run, it was not possible to produce a controlled amount of sequence with a common number of replicates. We stated these limitations in the discussion part of the manuscript. The purpose of this benchmarking work was to compare two major platforms (MGI and Illumina).

The samples for sequencing were extracted from the individual at different points in time, so we wonder whether the differences between the mutation sets of the seven sequencing platforms were caused by the different sampling times and bias in the sampling process. => There may well be some effects caused by the different sampling times and the sampling process mentioned by the reviewer. We used a Korean male sample, and the difference between the first and the last sampling time is about 7 years. It is known that the human germline mutation rate is approximately 0.5×10^-9 per base pair per year (Scally A, 2016. [10.1016/j.gde.2016.07.008]), which means that about 10.5 germline mutations can accumulate in 7 years. Although the mutation rate in the DNA of leukocytes, which are somatic cells, is expected to be higher than that of germline cells, the number of mutations accumulated over the 7 years would be much lower than the difference between platforms. Therefore, we think that the different sampling times had no significant effect on the results. Regarding sampling process bias, we stated in the discussion part of the manuscript that there is a clear limitation in the sampling process. Although there are some limitations, as the reviewer mentioned, we think our study is still meaningful in that it provides data generated by short-read whole-genome sequencing platforms, which are the most used in the field. We compared the long-established Illumina platforms with the relatively new MGISEQ-T7 platform using whole-genome sequence (WGS) data from one human, which has not been done before.
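The arithmetic behind the figure of 10.5 mutations quoted above can be checked with a short sketch. The genome size of roughly 3×10^9 bp is an assumption made here for illustration; it is not stated in the response, although it is the value that reproduces the quoted number.

```python
# Back-of-the-envelope check of the germline mutation estimate quoted above.
MUTATION_RATE_PER_BP_PER_YEAR = 0.5e-9  # from the response (Scally, 2016)
YEARS_BETWEEN_SAMPLES = 7               # from the response
GENOME_SIZE_BP = 3.0e9                  # ASSUMPTION: approximate haploid human genome size


def expected_germline_mutations(rate_per_bp_per_year, genome_size_bp, years):
    """Expected new mutations = rate x genome size x elapsed years."""
    return rate_per_bp_per_year * genome_size_bp * years


expected = expected_germline_mutations(
    MUTATION_RATE_PER_BP_PER_YEAR, GENOME_SIZE_BP, YEARS_BETWEEN_SAMPLES
)
print(f"Expected germline mutations over {YEARS_BETWEEN_SAMPLES} years: {expected:.1f}")
```

Under this assumption the estimate is tiny relative to the millions of SNVs in a typical human genome, which is the scale argument the response relies on.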
This manuscript needs to show more detail about the sequencing process, such as the number of the flow cell and sequencing cycle, the run time of the sequencing process, the amount of DNA each sequencing platform needs. => We added the detailed methods for DNA extraction, library preparation, and sequencing process in the Materials and Methods section.
In order to compare, the sequencing data of seven sequencing platforms need to have the same genome coverage. => Very good point. As pointed out by the reviewer, we set the same genome coverage of the seven platforms and updated all subsequent analyses after analyzing the whole data. Please see Figure S5 and Table S4.
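One common way to match genome coverage across platforms is to subsample each data set in proportion to a shared target depth. The sketch below illustrates that idea only; the response does not describe the authors' actual procedure, and the platform names and depths used here are hypothetical placeholders.

```python
def downsample_fractions(mean_depths, target_depth=None):
    """Return the fraction of reads to keep per platform to reach a common depth.

    mean_depths: dict mapping platform name -> mean genome coverage (x).
    target_depth: common depth to reach; defaults to the lowest observed depth.
    """
    if target_depth is None:
        target_depth = min(mean_depths.values())
    fractions = {}
    for platform, depth in mean_depths.items():
        if depth < target_depth:
            raise ValueError(f"{platform} has only {depth}x, below target {target_depth}x")
        fractions[platform] = target_depth / depth
    return fractions


# HYPOTHETICAL depths for illustration; these are not values from the manuscript.
example_depths = {"platform_A": 45.0, "platform_B": 30.0, "platform_C": 60.0}
fractions = downsample_fractions(example_depths)
for platform, fraction in sorted(fractions.items()):
    print(f"{platform}: keep {fraction:.3f} of reads")
```

The resulting fractions could then be passed to whichever subsampling tool is in use; the choice of tool is left open here.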
The results of the manuscript make me worry about the quality of the sequencing data generated from HiSeq2000 and HiSeq4000. More samples or replicates are needed to show that the results the authors found are normal. => The HiSeq2000 and HiSeq4000 platforms are old, and their quality is not good compared to the other platforms in our case. Currently it is not possible to obtain more replicates, as these machines are often no longer available in sequencing centers and are quite expensive to run. Still, to compare with the MGI platforms, we decided to add as many Illumina platforms as possible.
According to the official information, MGI platforms have a lower duplicate rate than any sequencing platform that requires PCR. However, this work showed that MGISEQ-T7 had the highest duplicate rate. I suggest the authors verify their finding using other samples or individuals. => The official information showed a duplicate rate of less than 3% when using a PCR-free library kit. However, we used the FS library kit, which includes a PCR step. Therefore, it seems that the duplicate rate is higher than in the manufacturer’s official information. We provide a table presenting the mapping rates and duplicate rates of other human samples produced simultaneously with the KOREF sample. We found that the duplicate rates of the other human samples that were sequenced simultaneously with the KOREF sample were also high (see link below).

https://github.com/howmany2/SequencingPlatformComparison/raw/master/Mapping%20and%20duplicate%20rate%20of%20samples%20using%20PE100%20protocol%20and%20MGISEQ-T7.xlsx

An FS library kit containing PCR steps was used for MGISEQ-T7 sequencing of the KOREF sample. Furthermore, according to the sequencing vendor, the PE100 (paired-end 100 bp) protocol has a high duplication rate, and the new PE150 (paired-end 150 bp) protocol has a duplication rate of less than 3%. We used the PE100 protocol for the KOREF sample, which may explain why relatively many duplicated reads were found among the reads generated by the MGISEQ-T7 platform. However, we think the duplicate rate does not affect the variant results much, because variants were analyzed after removing duplicate reads and matching the seven sequencing platforms to the same genome coverage.

The methods for identifying the platform-specific covered region are unreasonable as different sequencing platforms had different coverage. => We agree with the reviewer's comment. We set the same genome coverage for the seven platforms and updated the result. As a result, the number of platform-specific covered regions of the MGI platforms decreased from 1,516 to 1,436, and that of the Illumina platforms increased from 2,264 to 2,881. However, we confirmed that the %GC ratio of the platform-specific covered regions is the same as before, meaning that the MGI platforms cover regions of higher GC content (see Figure S10).
The comparison of variants detected among the seven platforms needs further analysis. The authors need a standard SNP and indel list for the Korean reference genome, verified by Sanger sequencing or other methods, to replace dbSNP and the SNP genotyping chip as the object of comparison. What is the relationship between FP, FN, and the sequencing errors? => We agree with the reviewer's comment that comparing the variants to a gold standard variant set is a powerful approach. However, to our knowledge, there is no gold standard variant set for KOREF that can give FP, FN, and sequencing error information, and for this reason we could not design this study to allow a more precise and accurate comparison among the NGS platforms. As an alternative, we examined how much difference exists among the sequences generated by different NGS platforms that are in general use for genome sequencing.
The introduction of this manuscript is too simple. => We added several sequencing platform comparative studies to the introduction section.

Minor revisions: 1. The coverages of BGISEQ-500 and HiseqX10 were not mentioned in the first section. => We added the coverages of BGISEQ-500 and HiSeqX10 in the first section.

Using the ratio of singletons may help you to bring out your findings more clearly. => We agree with the reviewer's comment. We examined the concordance rate of the singleton variants with SNP genotyping data to determine the accuracy of the singleton variants (see link below). However, it was difficult to obtain statistically significant results because there were very few overlapping positions between the singleton variants and the SNP chip data.

https://github.com/howmany2/SequencingPlatformComparison/raw/master/Comparison%20between%20singleton%20variant%20and%20SNP%20genotyping%20chip.xlsx

Reviewer #2: The submitted study has characterized sequencing quality, uniformity of coverage, %GC coverage, and variant accuracy of seven sequencing platforms. The authors found that MGI platforms showed a higher concordance rate of SNP genotyping than the HiSeq series. The study is of interest to the genomics and sequencing technology fields. Two concerns must be addressed prior to acceptance. => Thank you for the feedback. We have modified the text and added further analysis to accommodate the reviewer’s suggestions (see below for point-by-point responses).

1) The authors defined low-quality reads as those that had more than 30% of bases with a sequencing quality score lower than 20. I am wondering whether the results are stable if the definition is changed. => As a supplementary analysis, we repeated the analysis without the filtering step to see how much the read filtering step affects the results of this study. The supplementary analysis was conducted by matching the number of unfiltered reads to the number of clean reads in the prior analysis. The two tables linked below compare the read mapping and variant statistics between the clean (filtered) and unfiltered sequences (see link below).

https://github.com/howmany2/SequencingPlatformComparison/raw/master/Mapping%20rate%20and%20Variant%20statistics%20between%20clean%20reads%20and%20unfiltered%20reads.xlsx

When the unfiltered sequences were used, there was no notable difference in mapping and duplicate rates. The number of SNVs increased by 0.8% on average, and as the number of heterozygous SNVs increased, the hetero/homo ratio increased by 0.02 on average. Interestingly, the differences in total SNVs between clean and unfiltered reads were smaller for the two MGI platforms than for the Illumina platforms. For the Illumina platforms, on average, 44,000 additional SNVs were discovered with unfiltered reads compared with clean reads, while the increase for the MGI platforms was 800 SNVs on average.
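The low-quality read definition discussed in point 1) (more than 30% of bases with a quality score below 20) can be sketched as a small predicate. This is an illustration of the stated rule, not the authors' filtering code; the Phred+33 encoding and the treatment of empty reads are assumptions made here, and the thresholds are exposed as parameters so that the stability question raised by the reviewer can be explored.

```python
def is_low_quality(quality_string, min_quality=20, max_low_fraction=0.30, phred_offset=33):
    """Return True if more than max_low_fraction of bases have quality < min_quality.

    quality_string: the FASTQ quality line for one read.
    phred_offset: ASSUMPTION of Phred+33 encoding; older data may use a different offset.
    """
    if not quality_string:
        return True  # ASSUMPTION: an empty read is treated as low quality
    low_bases = sum(1 for ch in quality_string if ord(ch) - phred_offset < min_quality)
    return low_bases / len(quality_string) > max_low_fraction


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
all_high = "I" * 10
forty_percent_low = "#" * 4 + "I" * 6
exactly_thirty_percent_low = "#" * 3 + "I" * 7

print(is_low_quality(all_high))
print(is_low_quality(forty_percent_low))
print(is_low_quality(exactly_thirty_percent_low))
```

Because the rule says "more than 30%", a read with exactly 30% low-quality bases is kept in this sketch. Varying `min_quality` or `max_low_fraction` gives alternative definitions against which the downstream statistics could be compared.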

2) It looks as though the authors ignored that the highest duplicate ratio was found in MGISEQ-T7. More discussion and analysis should be performed to make this clear. The authors claimed that duplicates and adapter contamination may be more affected by the process of sample preparation than by the sequencing instrument. However, again, no evidence was provided. => We agree with the reviewer’s concerns about the high duplicate ratio. We provide a table presenting the mapping rates and duplicate rates of other human samples produced simultaneously with the KOREF sample. We found that the duplicate rates of other human samples that were sequenced simultaneously with the KOREF sample were also high (see Table below). An FS library kit containing PCR steps was used for MGISEQ-T7 sequencing of the KOREF sample. Furthermore, according to the sequencing vendor, the PE100 (paired-end 100 bp) protocol has a high duplication rate, and the new PE150 (paired-end 150 bp) protocol reduces the duplication rate to less than 3%. We used the PE100 protocol for the KOREF sample sequencing, which may explain why relatively many duplicated reads were found among the reads generated by the MGISEQ-T7 platform. However, we think the duplicate rate does not affect the variant calling results, because variants were analyzed after removing the duplicate reads and matching the seven sequencing platforms to the same genome coverage (see link below).

https://github.com/howmany2/SequencingPlatformComparison/raw/master/Mapping%20and%20duplicate%20rate%20of%20samples%20using%20PE100%20protocol%20and%20MGISEQ-T7.xlsx

There are three main causes of duplicate reads generated by NGS technology:
1. Natural duplication
2. PCR duplicates (occur in the library preparation step)
3. Optical duplicates (occur in the sequencing step)
Natural duplications are not discussed in this section because it is difficult to distinguish them from PCR duplicates and optical duplicates. The following table shows the ratio of PCR duplication and optical duplication of the seven platforms (see link below).

https://github.com/howmany2/SequencingPlatformComparison/raw/master/Statistics%20of%20PCR%20duplicate%20and%20optical%20duplicate%20in%20seven%20sequencing%20platforms.xlsx

This result showed that PCR duplication occurs at least twice as often as optical duplication. (Unfortunately, optical duplication could not be calculated for the two MGI platforms.) This means that most duplication occurs during library preparation rather than during the sequencing step. Adapter contamination is caused by the sequencing of DNA fragments that are shorter than the read length (Turner FS, 2014. 10.3389/fgene.2014.00005). For this reason, adapter contamination can be expected to be mainly affected by the library preparation step, because size selection of DNA fragments is part of library preparation; improper size selection can introduce shorter DNA fragments into the DNA library for sequencing.
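The distinction between optical and PCR duplicates drawn above is often made from the physical position of each read on the flow cell: duplicates that sit very close together on the same tile are counted as optical, and the rest as PCR duplicates. The sketch below illustrates that idea under stated assumptions. The response does not say which tool or threshold was used, so the colon-separated read-name layout (instrument:run:flowcell:lane:tile:x:y), the 100-pixel distance, and the example read names are all hypothetical choices made here.

```python
def parse_position(read_name):
    """Extract (lane, tile, x, y) from a colon-separated read name, or None.

    ASSUMPTION: names follow instrument:run:flowcell:lane:tile:x:y.
    """
    tokens = read_name.lstrip("@").split()
    if not tokens:
        return None
    fields = tokens[0].split(":")
    if len(fields) != 7:
        return None
    try:
        lane, tile, x, y = (int(value) for value in fields[3:7])
    except ValueError:
        return None
    return lane, tile, x, y


def is_optical_pair(name_a, name_b, max_pixel_distance=100):
    """Return True/False for optical duplication, or None if positions are unknown.

    max_pixel_distance is a HYPOTHETICAL threshold chosen for illustration.
    """
    pos_a, pos_b = parse_position(name_a), parse_position(name_b)
    if pos_a is None or pos_b is None:
        return None
    same_tile = pos_a[:2] == pos_b[:2]
    close = (abs(pos_a[2] - pos_b[2]) <= max_pixel_distance
             and abs(pos_a[3] - pos_b[3]) <= max_pixel_distance)
    return same_tile and close


# HYPOTHETICAL read names for illustration only.
near = is_optical_pair("M1:7:FCX:2:1101:1000:2000", "M1:7:FCX:2:1101:1050:2040")
far = is_optical_pair("M1:7:FCX:2:1101:1000:2000", "M1:7:FCX:2:1102:1000:2000")
unknown = is_optical_pair("M1:7:FCX:2:1101:1000:2000", "readname_without_coordinates")
print(near, far, unknown)
```

Read names that do not follow the assumed layout yield `None` (undetermined) rather than a guess, so such reads would need to be reported separately from both categories.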

Reviewer #3: The authors compare various short-insert, short-read whole-genome sequencing platforms used by academic researchers and clinical scientists.

My minor comments and suggestions are:

● As stated by the authors, Illumina platforms are indeed now considered 'historical.' However, many Illumina sequencers are still heavily used - in particular in pathology labs. This manuscript may prove very useful when arguing for an instrument upgrade in such a setting.

● You may like to comment on single tube long fragment read (stLFR), which enables the sequencing of long transcripts by sequencing bar-coded reads on the BGISEQ-500 platform (and, thus, probably also MGISEQ-T7) (10.1101/gr.245126.118). This technology is relatively cheap and is likely to decrease in cost - another argument for the adoption of MGI platforms in the laboratory.

● You may want to comment on Illumina library kits. It is possible that revisions [in the five-six years since the data in your study were generated] to these kits could improve the sequencing results (e.g., see 10.1371/journal.pone.0113501). I realize the effect may be minor, but it may nevertheless be useful to remind the reader about the potential for slightly better raw read statistics. => Thank you for your positive feedback and the suggestions. We added the idea suggested in your comments to the discussion part of the manuscript. (See Discussion section lines 209-210)

Content of review 2, reviewed on September 21, 2020

The authors addressed my comments and those of the other reviewers; however, many of the changes were quite minimal. It is suggested that they put the additional tests in the main text and clarify all the limitations of their study (not simply mention them) in the discussion section, for example, the high duplicate ratio in MGISEQ-T7 and the use of a single individual.

Authors' response to reviews: Reviewer #1: Much improved manuscript. I only have minor comments: 1) The examination of platform-specific covered regions between MGI and Illumina platforms is still problematic. A single fold change threshold is unreliable. The authors should additionally apply a statistical test to identify platform-specific covered regions. ==> As pointed out by the reviewer, we re-analyzed the platform-specific covered regions between MGI and Illumina platforms. We now use a statistical test (the edgeR method for group comparison followed by Benjamini-Hochberg correction for p-value adjustment) rather than the single fold change threshold to identify the platform-specific covered regions. As a result, the number of platform-specific covered regions of the MGI platforms increased from 1,436 to 1,778, and that of the Illumina platforms increased from 2,881 to 2,967. We updated the manuscript and the supplementary figure and table (See Results section lines 143-145; Figure S10 and Table S6).
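The Benjamini-Hochberg adjustment named in the response above can be sketched in a few lines. This covers only the p-value correction step; it does not reproduce edgeR, which is an R/Bioconductor package, and the input p-values below are made-up numbers for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# HYPOTHETICAL p-values, one per candidate region.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_p)
significant = [i for i, q in enumerate(example_adjusted) if q < 0.05]
print(example_adjusted, significant)
```

Regions whose adjusted p-value falls below the chosen false discovery rate would then be reported as platform-specific; the 0.05 cutoff used in the example is an assumption, not a value taken from the response.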

2) Since the standard variant data set is not available, I think it is necessary to discuss the potential reasons for the platform-specific SNVs and the singletons. Is their distribution associated with platform-specific covered regions, or with other factors related to low sequencing quality? ==> We speculate that repetitive regions with low mappability were one of the reasons for the platform-specific SNVs and singletons. To investigate the potential reasons for the platform-specific SNVs and the singletons, we compared these SNVs to the platform-specific covered regions. First, we compared platform-specific SNVs to platform-specific covered regions. We found that only 2.8% of Illumina platform-specific SNVs and 1.6% of MGI platform-specific SNVs fall within the platform-specific covered regions (Table S8). In addition, most of the platform-specific SNVs were located in regions of sufficient depth (>10×), and about 74% of platform-specific SNVs fell within repeat regions (Table S9). The singletons showed a similar pattern to the platform-specific SNVs. There were very few overlapping positions between the singleton variants and the platform-specific covered regions (0.5% on average, Table S10), and most of the singletons were located in regions of relatively high depth (>10×). About 74% of singletons fell within repeat regions (Table S9). We added these results to the manuscript (See Results section lines 179-194).
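The overlap percentages reported above amount to counting how many variant positions fall inside a set of genomic intervals. A minimal sketch of that calculation follows; the 0-based half-open coordinate convention, the requirement that regions on a chromosome do not overlap one another, and the example coordinates are assumptions made here rather than details from the response.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome and sort them by start.

    ASSUMPTION: coordinates are 0-based half-open and regions do not overlap.
    """
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_regions(index, chrom, pos):
    """Return True if pos lies inside any region on chrom."""
    intervals = index.get(chrom, [])
    # Find the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def fraction_in_regions(snvs, regions):
    """Fraction of (chrom, pos) SNVs that fall inside the given regions."""
    if not snvs:
        return 0.0
    index = build_index(regions)
    hits = sum(1 for chrom, pos in snvs if in_regions(index, chrom, pos))
    return hits / len(snvs)


# HYPOTHETICAL coordinates for illustration only.
example_regions = [("chr1", 100, 200), ("chr1", 300, 400)]
example_snvs = [("chr1", 150), ("chr1", 250), ("chr2", 10), ("chr1", 350)]
overlap_fraction = fraction_in_regions(example_snvs, example_regions)
print(f"{overlap_fraction:.1%} of SNVs fall inside the regions")
```

The same function could be applied with repeat annotations as the regions to obtain a repeat-overlap fraction of the kind reported in Table S9.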

Reviewer #2: The authors addressed my comments and those of the other reviewers; however, many of the changes were quite minimal. It is suggested that they put the additional tests in the main text and clarify all the limitations of their study (not simply mention them) in the discussion section, for example, the high duplicate ratio in MGISEQ-T7 and the use of a single individual. ==> Thanks for the comment. We have now added additional results for the platform-specific SNVs, the singletons, and the high duplicate ratio of the MGISEQ-T7 platform to the manuscript (See Results section lines 134-135; Tables S4, S8, S9, and S10). Furthermore, we added a list of sequencing platform comparison studies that used a single individual to the discussion section (See Discussion section lines 221-228; Table S14).

Content of review 3, reviewed on January 14, 2021

The title could be shorter.

References

Hak-Min, K., Sungwon, J., Oksung, C., Hoon, J. J., Hui-Su, K., Asta, B., Hwang-Yeol, L., Youngseok, Y., Sung, C. Y., M., B. D., Jong, B. Comparative analysis of 7 short-read sequencing platforms using the Korean Reference Genome: MGI and Illumina sequencing benchmark for whole-genome sequencing. GigaScience.
