Review of StoatyDive: Evaluation and classification of peak profiles for sequencing data

Content of review 1, reviewed on August 25, 2020

Review Journal: GigaScience Manuscript ID: GIGA-D-20-00218 Title: StoatyDive: Evaluation and Classification of Peak Profiles for Sequencing Data Category: Technical Note The manuscript by Florian Heyl and Rolf Backofen presents StoatyDive, a new peak calling tool to identify binding sites from sequencing data such as CLIP-seq. The paper shows that the StoatyDive can identify different classes of binding sites, based on their peak shape profile. This information can be very useful to study binding specificity of RNA-binding proteins (RBPs). This tool can also be used as a quality control measure to identify specific and unspecific binding sites to filter out false positive bindings based on their peak shape profiles. The paper presents several interesting analyses, including SLBP protein, which shows interesting binding patterns but also a low reproducibility between the replicates. Overall, the manuscript is well written, the method is of broad utility, and all the analyses are well documented and their code is available online. However, as listed below, the manuscript has several shortcomings. 1. comment: I am surprised that the authors focused their analyses on the SLBP protein and not on the more well studied RBP. However, the authors did find interesting observations for the SLBP protein by analysing eCLIP samples in K562 cell lines, including low reproducibility between the replicates, which haven't been fully investigated in this manuscript. I think it would be important to address these observation in a more systematic manner: * Have you checked how much the Coefficient of Variation (CV) changes between other replicates in other eCLIP samples, including eCLIP HepG2 samples? It would also be important to include CV results for all other samples that you analysed in this study (RBM22, U2AF2, PTBP1…). Do other eCLIP samples also show similar differences between the replicates or is the SLBP protein one of the exceptions? Is this variation between replicates more specific for the eCLIP protocol or is it also common in other CLIP-seq methods such as iCLIP, irCLIP or PAR-CLIP? * How many peaks identified in replicate one overlap with replicate two? * Is the library size similar between the replicates and have you done any more detailed quality controls between the replicates (number of uniquely mapped reads, sequencing errors etc.)? Maybe a poor quality or low coverage could explain the difference. * Would it be possible that one replicate has more cytoplasmic binding sites than the other? One way of testing this would be to check how many peaks that you identify are in intronic regions vs exonic regions, for each replicate separately. Also, how does the CVs and shape profile of the peaks change between exonic and intronic peaks? Maybe this sample has a higher binding specificity in the nucleus than in cytoplasm or vice versa. 2. comment: "It is important to note that peaks shaped like plateaus might be false positives. These peaks most likely corresponded to PCR duplicates that were not real binding sites. We have removed PCR duplicates during the pre-processing of the read library, but some duplicates might still be in the data. Sequencing errors in the unique molecular identifiers (UMI) are a common reason." * Are there any changes in shape profiles between intronic and exonic peaks? To filter out false positive peaks, you could use ENCODE RNA-seq data for the same cell line and analyse peaks and their shapes in expressed transcripts. How many False positives are present in non-expressed transcripts and do they show different motifs and different shapes of their peaks? 3. comment: "We investigated also the sequence motifs of sharp and broad peak profiles of the protein RBFOX2 (eCLIP data from the study by Van Nostrand et al. [5]), because it has the conserved sequence motif TGCATG, which is enriched in the RBPFOX2's binding sites [13, 14, 15]. We have found the conserved motif only in sharp peaks (Table 3), but the broader profiles also had an enriched motif. It is worth investigating if RBFOX2 has some unspecific binding preferences with that specific motif. The results for RBFOX2 again demonstrated that different peak shape groups result in different sequence motifs." * A recent paper by Bridget E. Begg et al. identified secondary binding motifs from their RBFOX2 CLIP experiments (Begg et al. 2020). These motifs are specific for intronic and 3'UTR regions (Bridget E. Begg et al., Figure 3). Could the authors also identify these motifs by separating their peak shape profiles into intronic and 3'UTR region? 4. comment: "All peaks were therefore extended or shrank to a length of 77 nucleotides. This was based on the observation that the third quartile of all peaks (from all proteins) was 77 nucleotides long." * The 77 nucleotides window size for all RBPs could be too broad to identify sharp peaks for proteins with specific bindings. Anob M. Chakrabarti et al. has shown in their review (Chakrabarti et al. 2018), that the window size parameters can modify the sensitivity and specificity of the data and it needs to be adapted for each RBP. For example, PTBP1 can have multiple binding sites just next to each other, where the broad window size would detect all the protein binding sites as one large binding site instead of multiple sharp ones. Could the authors demonstrate and comment on at least one example of how the window size can affects peak finding specificity and their shape profiles? 5. comment: "So we investigated the number of peaks that fall into introns (90% overlap) for the proteins that are involved in the splicing process. The proteins PTBP1 (85%), RBM22 (62%), TARDBP (79%), and HNRNPM (87%) had more than 50% of peaks in introns, which detected the assumption of split peaks. However, the proteins U2AF1 (17%), and U2AF2 (15%) had more peaks in exon regions, where the possibility of split peaks might still occur. We found for U2AF1 only 27 peaks (0.72%) and for U2AF2 5 peaks (0.40%) that are potential split peaks, again defecting the assumption of a technical artefact in the peak set of splicing factors. Thus, the sharpness of the peaks of U2AF1 and U2AF2 was potentially not the result of the peak calling." * The authors examined RBPs that are involved in the splicing process, including U2AF1 and UAF2, which recognises 3' splice sites and recruits the spliceosome. The authors claim that the majority of the identified peaks from both proteins fall into the exonic region. This statement contrasts with previous studies including my preliminary analysis (see Figure ??? below). The figure shows a normalised coverage of cDNA-starts for the same eCLIP U2AF1 and U2AF2 K562 and HepG2 samples relative to the 3' splice site region, showing a dominant enrichment of the intronic region for both proteins which also agrees with the previous CLIP studies (Zarnack et al. 2013; Sutandy et al. 2018; Briese et al. 2019). The authors should examine their peak analyses in these results further to identify the reason for such a difference.

References Begg, BE, M Jens, PY Wang, CM Minor, and CB Burge. 2020. "Concentration-Dependent Splicing Is Enabled by Rbfox Motifs of Intermediate Affinity.". Nat Struct Mol Biol. Chakrabarti, Anob M., Nejc Haberman, Arne Praznik, Nicholas M. Luscombe, and Jernej Ule. 2018. "Data Science Issues in Studying ProteinRNA Interactions with CLIP Technologies". Annual Review of Biomedical Data Science 1 (1): 235-61. https://doi.org/10.1146/annurev-biodatasci-080917-013525. Zarnack, Kathi, Julian König, Mojca Tajnik, Iñigo Martincorena, Sebastian Eustermann, Isabelle Stévant, Alejandro Reyes, Simon Anders, Nicholas M. Luscombe, and Jernej Ule. 2013. "Direct Competition between HnRNP C and U2AF65 Protects the Transcriptome from the Exonization of Alu Elements". Cell 152 (3): 453-66. https://doi.org/10.1016/j.cell.2012.12.023. Sutandy, F.X. Reymond, Stefanie Ebersberger, Lu Huang, Anke Busch, Maximilian Bach, Hyun-Seo Kang, Jörg Fallmann, et al. 2018. "In Vitro ICLIP-Based Modeling Uncovers How the Splicing Factor U2AF2 Relies on Regulation by Cofactors". Genome Research 28 (5): 699-713. https://doi.org/10.1101/gr.229757.117. Briese, M, N Haberman, CR Sibley, R Faraway, AS Elser, AM Chakrabarti, Z Wang, et al. 2019. "A Systems View of Spliceosomal Assembly and Branchpoints with ICLIP.". Nat Struct Mol Biol 26: 930-40.

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published. I agree to the open peer review policy of the journal.

Authors' response to reviews: {Reviewer 1}

Major comments:

1) I find that there's an imbalance where the most important findings are stated very briefly (sometimes without figures), whereas significant text and figure space is devoted to more descriptive analyses that lack a clear conclusion.

For example, to me the most important finding from the method is that the use of different peak shapes enables separation of real versus artifact peaks, which is aided by the authors use of SLBP (which has a well-characterized specific role in binding a specific hairpin structure in histone 3' UTRs), However, this result is stated in Table 1 without statistical analysis (enrichment, significance, etc) and a couple sentences of text. The manuscript would be stronger with additional analysis and description here (what is the sensitivity/specificity tradeoff in using the shape group / CV cutoffs described here? How do those change with different cutoffs?)

\begin{answer} Thank you very much for your suggestion as it contributed a lot to the paper.

PureCLIP already filters for significant peaks and does not give any enrichments (no fold changes or p-values). Thus all peaks of SLBP that are mentioned in table 1 are significant peaks. So it was not possible to define sensitivity/specificity or other features for table 1.

We made the effort and applied the peak shape clustering on the CLIPper peaks for replicate 2 (all peaks). Therefore, we add the new Supplements 3 in the paper and a whole new section called "Optimizing StoatyDive with the data of SLBP", where we go deeper about StoatyDive based on different clusters, CV cutoffs and peak sizes.

As a short summary, we show that the best CV cutoff is at 0.2 and that the cluster has to be chosen carefully in order to reduce some noise. We also show, that we can remove some noise from the initial CLIPper peak set. Furthermore, we demonstrate that the best peak length for StoatyDive is around 70 nucleotides.

The section includes more findings, and we would invite you to read the new section.

In addition we included also more statistical analysis to clarify our findings. So we state for the different CV distributions between the replicates and the control of the SLBP data: "Although the CV distributions of the input control and replicate 1 of the SLBP data differed significantly (one-sided Wilcoxon test p-value = $0.03$), both contained a lot of regions with a CV close to zero (Figure 2, both with a mean CV of 0.47). In contrast, the CV distribution of replicate 2 was distinct (p-value $< 0.05$ to input control and replicate 1) since it had more peaks with a higher CV (mean CV of $1.41$) and thus more specific binding events (e.g., Figure 1a, $CV \approx 5.3$)."

We also calculated the ACC for the tool comparison: "StoatyDive classified most peaks correctly into the three peak shape groups (Figure 5a, ACC = 0.87)." and later "SIC-ChIP identified up to six different peak shapes (Figure 5b), whereas FunChIP found three (Figure 5c and d, ACC = 0.6)." \end{answer}

In contrast, two paragraphs are devoted to a fairly superficial motif analysis - it's not clear from the text which motifs the authors believe are real and which are false-positives, which makes it descriptive but unclear what the conclusion is (other than that they are 'different'). Although this does not appear to be discussed in the text, it seems that the GCUCUUU motif matches the known SLBP hairpin (e.g. Fig 2 from Tan et al, PMID 23329046), which would again suggest that the sharp and plateau classes are real whereas the broad class motifs are artifacts; this should be more clear.

\begin{answer} So far we do not believe any of the motifs are true or false. We initially started to analyse SLBP and tried to show that different sequence motifs can be found with different peak shapes. SLBP is still a good candidate to show, that different peak shapes are either artifacts or real binding sites. Yet, it is important that the user checks the biological correctness of those clusters also for other proteins. We made a statement in the paper to clarify this issue by noting:

"Yet, we cannot confirm the biological truth behind those motifs as it requires further experiments to verify them." \end{answer}

2) The authors use the ENCODE data, but then re-perform peak calling with PureCLIP to perform their analyses. This isn't necessarily a problem; however, in the context of the major point of the manuscript being discussion of peak shapes, I find it perplexing that this is done with no further discussion, as one of the main differences between PureCLIP and the CLIPper algorithm (which was used in the peak calls provided by ENCODE) is that CLIPper explicitly uses peak shape as a feature to perform spline-fitting as part of the cluster identification process (Lovci et al. PMID 24213538). I can thus see an argument for using a peak caller that does not incorporate this feature in order to enable downstream analyses comparing peak shape; but I do think there should be some ultimate discussion of whether the classification and clustering described here provides additional power over the spline-fitting used in the initial CLIPper calls.

\begin{answer} We showed and analyzed also the CLIPper peaks (peaks which are significant and not significant) of SLBP in the paper. We demonstrated that also CLIPper's results could benefit from StoatyDive. Furthermore, the intention behind StoatyDive is an analysis independent of the peak caller. Of course it can be used to remove potential artifacts of the data processing and peak calling, but it is also worth to investigate a set of significant peaks with StoatyDive. You can cluster those peaks in different groups, which might reflect different biological properties. That to say, peak shape clustering comes after the peak calling and is a refinement or selection step for a deeper analysis of the peak set. \end{answer}

Specific comments:

The authors conclude a hypothesis that "peak profiles shaped like plateaus were mainly PCR duplicates that were less informative." - this is a conclusion that is not supported by data, but is both confusing (the eCLIP datasets contain unique molecular identifiers and thus shouldn't contain significant PCR duplicates) and also is easy to test from the data used (those peaks should have higher mutation rates or PCR duplication rates among pre-PCR-duplicate-removed reads than other peaks).

\begin{answer} We removed this statement as we saw that those peaks are indeed not PCR duplicates and basically a distinctive shape of the data. \end{answer}

I'm confused by the sentence "We looked further and found sharper peaks located on stem loops targeted by SLBP such as RNU7-1 RNA ..." - do the authors believe these are real or artifact peaks? Either way there should be some discussion of why (and why they are specifically called out here)

\begin{answer} RNU7-1 is a potential target of SLBP. We have removed the other stem loop protein from the paper because we could not find any literature to verify it as a target of SLBP. The other stem loop was an assumption based on our own past findings.

The paper now states: "For example, we found a sharper peak located on RNU7-1 RNA (U7 small nuclear 1) that contains a stem loop that might be potentially targeted by SLBP [citation]. The peak got a CV of $3.9$ and was classified into the peak profile cluster 3 of replicate 2." \end{answer}

Many figures need editing to be more visually clear. In addition to fonts (Fig. 3 and 4 in particular have multiple panels that illegibly small at standard resolution), the figures would be significantly clearer if additional labels are provided to enhance flow through the analyses in the paper. For example - Table 1 refers to separation of the 7 classes in Fig. 3c/d into 4 shape 'groups'; it would be helpful to have those labeled in Fig 3 as well.

\begin{answer} We included your recommendations and changed the font as well as layouts for all figures. Therefore we invite you to check the new figures. We hope we could improve the readability and intuitive understanding. \end{answer}

I think it needs to be clearer exactly what the analysis being done in the last section (of other RBPs) - after multiple reads I'm still a bit confused how the earlier analysis (with unsupervised clustering and manual annotation of classes) connects with the last analysis (with a hard cutoff of 'sharp' vs 'broad' peaks). The text in this section (Fig 7) refers to 'broader' versus 'sharper' shaped peaks; however I'm not clear on what defines those terms (the methods text suggests that for standard analysis the authors set a CV cutoff of 0.5 to define 'specific' versus 'unspecific'; I'm assuming but not clear that it's that and not manual annotation of unsupervised clustering classes as was done for SLBP?)

\begin{answer} It is true, we have not done a manual classification into \textbf{broad} and \textbf{sharp} as we have done for the data of SLBP. SLBP was a detailed example and finer analysis, whereas the last part was done more automatically. We stayed very conservative and looked at the CV distribution of each protein to find a good cutoff for broad vs sharp. It is to note that the CV distribution depends on the size of the peaks and the protein being analyzed. The initial cutoff of 0.5, was done on initial observations. Now, we know that a very conservative cutoff is at 0.2. We state in the paper in section "Investigation of eCLIP Protein Profiles":

"All peaks were therefore extended or shrunk to a length of 77 nucleotides. This was based on the observation that the third quartile of all peaks from all proteins was 77 nucleotides long (see Supplementary Figure 4). In addition, StoatyDive achieved for the SLBP data a better TPR and MCC with a peak size of 70 and a CV threshold of 0.2 (see Supplementary Figure 3). The results of SLBP showed that it may be wise to combine the clustering and the CV threshold to assess the profile landscape of other proteins. We therefore defined a peak as sharp if it had a CV $> 0.2$ and it fell into a cluster that was generally sharper. A cluster was declared as sharp if the median CV of the cluster was bigger than the median CV of the whole peak set. All other peaks were classified as broad." \end{answer}

\section{Reviewer 2}

comment: I am surprised that the authors focused their analyses on the SLBP protein and not on the more well studied RBP.

\begin{answer} It is true that there is no information about a sequence or structure motif for SLBP, but it is well known (see reviewer 1), that SLBP binds distinctively histone mRNAs at the 3' UTR region. Thus, it was interesting to see if we can find differences in peak shapes and show that we can differentiate between specific and unspecific binding with StoatyDive. Furthermore, it was interesting to see if we can find other distinct binding shapes for SLBP that might reflect different biological mechanics since the protein has two functions in eukaryotic cells, serving as transport and translation factor. SLBP is one of the proteins that can be used to optimize CLIP-Seq data analysis and is therefore a good candidate for StoatyDive. \end{answer}

However, the authors did find interesting observations for the SLBP protein by analysing eCLIP samples in K562 cell lines, including low reproducibility between the replicates, which haven't been fully investigated in this manuscript. I think it would be important to address these observation in a more systematic manner: * Have you checked how much the Coefficient of Variation (CV) changes between other replicates in other eCLIP samples, including eCLIP HepG2 samples? It would also be important to include CV results for all other samples that you analysed in this study (RBM22, U2AF2, PTBP1…). Do other eCLIP samples also show similar differences between the replicates or is the SLBP protein one of the exceptions?

\begin{answer} We included the CV distributions of the other proteins for the K562 cell line in the supplementary of the paper (Supplementary Figure 1) and made a short statement in the section "Peak Profile Landscape Reveals Low Reproducibility of Binding Sites", where we state: "Checking the CV distributions of other CLIP-Seq datasets such as TAF15, TARDBP, and HNRNPA1, which we will analyze further in a later section, (Supplementary Figure 1), we saw that the CV distributions also had differences between the replicates (two-sided Wilcoxon test p-value $< 0.05$). However, this does not mean a low quality of the data and just highlights that it is important to do replicates in order to quantify biological and technical variance as noted in a previous CLIP study [citation]."

We think that this is already enough for a demonstration that the CV might vary between replicates. An even deeper analysis including the HepG2 cell line is beyond the scope of the paper. \end{answer}

Is this variation between replicates more specific for the eCLIP protocol or is it also common in other CLIP-seq methods such as iCLIP, irCLIP or PAR-CLIP?

\begin{answer} A variation between replicates can of course happen (see AC Jungkamp et al. (2011), In vivo and transcriptome-wide identification of RNA binding protein target sites, and X. Chen et al. (2015), Statistical issues in binding site identification through CLIP-seq). This can come from different sources: execution of the experiment, changes in lab staff, technical variances, biological variances and more. It is hard to figure out what the actual reason was, especially since we used data that we did not generated in our group. It is beyond the scope of this paper to investigate other CLIP protocols, since we wanted to check the power of StoatyDive. However, we would analyze this in the future, when we will update StoatyDive and give it even more power. \end{answer}

How many peaks identified in replicate one overlap with replicate two?

\begin{answer} Replicate one had 659 peaks and replicate two 935. The overlap is the mentioned 899 peaks. We made a robust peak detection (intersecting and merging the two peak sets) to define a robust set of peaks. So we use these 899 peaks to evaluate both replicate one and replicate two. Replicate two has perhaps more peaks, but some of those peaks can be merge into a bigger peak region that is covered by one peak of replicate one.

From those peaks, we have 236 exonic and 151 intronic peaks for replicate one, and 374 exonic and 289 intronic peaks for replicate two. So replicate two has for both 136 exonic and 136 intronic peaks more. So there is a difference in the number of peaks between replicate one and two but the difference between exonic and intronic is the same. \end{answer}

Is the library size similar between the replicates and have you done any more detailed quality controls between the replicates (number of uniquely mapped reads, sequencing errors etc.)? Maybe a poor quality or low coverage could explain the difference.

\begin{answer} We checked the raw data quality and there are no quality breeches, except the high duplication level and adapters you usual see in CLIP-data.

Furthermore, the supplementary data table by Van Nostrand et al. (2016, Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP)) state for the aligned files:

\textbf{Replicate 1}\ # Input read: 13,553,232\ # Uniquely mapped read: 3,558,228\ # Usable read: 2,664,051\ \% usable (out of uniquely mapped): 0.748701601\ \

\textbf{Replicate 2}\ # Input read: 15,493,437 \ # Uniquely mapped read: 3,354,237 \ # Usable read: 2,520,409 \ \% usable (out of uniquely mapped): 0.751410529

So the library sizes are almost the same. \end{answer}

Would it be possible that one replicate has more cytoplasmic binding sites than the other? One way of testing this would be to check how many peaks that you identify are in intronic regions vs exonic regions, for each replicate separately. Also, how does the CVs and shape profile of the peaks change between exonic and intronic peaks? Maybe this sample has a higher binding specificity in the nucleus than in cytoplasm or vice versa.

\begin{answer} Thank you very much for this interesting question and idea.

Indeed there was a big difference for exonic and intronic peaks. We dedicated a new chapter in the paper, where we state: "To investigate further differences between the two replicates, we split the peak set into peaks overlapping with exons and introns (see Figure~3). SLBP is a translation and transport factor, which is present in the cytoplasm as well as nucleus [citations]. Therefore, the replicates could have different binding events, where one replicate might have more events in cytoplasm and the other more in the nucleus. For replicate 1 we got $236$ exonic and $151$ intronic peaks for replicate 1, and $374$ exonic and $289$ intronic peaks for replicate 2. So replicate 2 had $136$ exonic and $136$ intronic peaks more. Notably, we can see a huge CV difference when comparing the intronic peaks of the two replicates, with a mean CV of $0.23$ for replicate 1 and $1.26$ for replicate 2 (one-sided Wilcoxon test P-value $< 0.05$). The exonic peaks on the other hand were more similar (mean CV = $0.48$ and $0.90$, respectively), but still the CV distributions were significantly different (one-sided Wilcoxon test P-value $< 0.05$). It is not clear why this difference appeared, but our observations stress out how important it is to perform an experiment with replicates." \end{answer}

comment: "It is important to note that peaks shaped like plateaus might be false positives. These peaks most likely corresponded to PCR duplicates that were not real binding sites. We have removed PCR duplicates during the pre-processing of the read library, but some duplicates might still be in the data. Sequencing errors in the unique molecular identifiers (UMI) are a common reason."
Are there any changes in shape profiles between intronic and exonic peaks? To filter out false positive peaks, you could use ENCODE RNA-seq data for the same cell line and analyse peaks and their shapes in expressed transcripts. How many False positives are present in non-expressed transcripts and do they show different motifs and different shapes of their peaks?

\begin{answer} We removed this statement in the paper as we observed that those peaks are not PCR duplicates and a distinct peak shape.

While your suggestion still pose an interesting question for future research, it goes beyond the scope of this paper and we would do this part with a newer version of StoatyDive. We want to expand StoatyDive with a comparison mode to allow two or more samples to be compared (cluster them together) in their profile shape landscape (e.g., control vs CLIP). This would give a better distinction of false positives. \end{answer}

comment: "We investigated also the sequence motifs of sharp and broad peak profiles of the protein RBFOX2 (eCLIP data from the study by Van Nostrand et al. [5]), because it has the conserved sequence motif TGCATG, which is enriched in the RBPFOX2's binding sites [13, 14, 15]. We have found the conserved motif only in sharp peaks (Table 3), but the broader profiles also had an enriched motif. It is worth investigating if RBFOX2 has some unspecific binding preferences with that specific motif. The results for RBFOX2 again demonstrated that different peak shape groups result in different sequence motifs."
A recent paper by Bridget E. Begg et al. identified secondary binding motifs from their RBFOX2 CLIP experiments (Begg et al. 2020). These motifs are specific for intronic and 3'UTR regions (Bridget E. Begg et al., Figure 3). Could the authors also identify these motifs by separating their peak shape profiles into intronic and 3'UTR region?

\begin{answer} Thank you very much for this questions, however we have removed this chapter from the paper as it was too short and disrupted the flow of the paper.

We did a short analysis for your question, but there where no interesting results or significant differences of the shapes between exonic, intronic and 3'UTR regions. \end{answer}

comment: "All peaks were therefore extended or shrank to a length of 77 nucleotides. This was based on the observation that the third quartile of all peaks (from all proteins) was 77 nucleotides long."
The 77 nucleotides window size for all RBPs could be too broad to identify sharp peaks for proteins with specific bindings. Anob M. Chakrabarti et al. has shown in their review (Chakrabarti et al. 2018), that the window size parameters can modify the sensitivity and specificity of the data and it needs to be adapted for each RBP. For example, PTBP1 can have multiple binding sites just next to each other, where the broad window size would detect all the protein binding sites as one large binding site instead of multiple sharp ones. Could the authors demonstrate and comment on at least one example of how the window size can affects peak finding specificity and their shape profiles?

\begin{answer} It is true that different window sizes affect the results. We did an additional analysis to show the effect of the peak length. We dedicated now a complete new chapter "Optimizing StoatyDive with the data of SLBP", where we go deeper about StoatyDive based on different clusters, CV cutoffs and peak sizes.

The section includes more findings, and we would invite you to read the new section.

A size of 77 nucleotides might be long, yet it is good to discriminate the profiles that you have mentioned. We state later in the paper now: "All peaks were therefore extended or shrunk to a length of 77 nucleotides. This was based on the observation that the third quartile of all peaks from all proteins was 77 nucleotides long (see Supplementary Figure 4). In addition, StoatyDive achieved for the SLBP data a better TPR and MCC with a peak size of 70 and a CV threshold of 0.2 (see Supplementary Figure 3). The results of SLBP showed that it may be wise to combine the clustering and the CV threshold to assess the profile landscape of other proteins. We therefore defined a peak as sharp if it had a CV $> 0.2$ and it fell into a cluster that was generally sharper. A cluster was declared as sharp if the median CV of the cluster was bigger than the median CV of the whole peak set. All other peaks were classified as broad." \end{answer}

comment: "So we investigated the number of peaks that fall into introns (90\% overlap) for the proteins that are involved in the splicing process. The proteins PTBP1 (85\%), RBM22 (62\%), TARDBP (79\%), and HNRNPM (87\%) had more than 50\% of peaks in introns, which detected the assumption of split peaks. However, the proteins U2AF1 (17\%), and U2AF2 (15\%) had more peaks in exon regions, where the possibility of split peaks might still occur. We found for U2AF1 only 27 peaks (0.72\%) and for U2AF2 5 peaks (0.40\%) that are potential split peaks, again defecting the assumption of a technical artefact in the peak set of splicing factors. Thus, the sharpness of the peaks of U2AF1 and U2AF2 was potentially not the result of the peak calling."
The authors examined RBPs that are involved in the splicing process, including U2AF1 and UAF2, which recognises 3' splice sites and recruits the spliceosome. The authors claim that the majority of the identified peaks from both proteins fall into the exonic region. This statement contrasts with previous studies including my preliminary analysis (see Figure ??? below). The figure shows a normalised coverage of cDNA-starts for the same eCLIP U2AF1 and U2AF2 K562 and HepG2 samples relative to the 3' splice site region, showing a dominant enrichment of the intronic region for both proteins which also agrees with the previous CLIP studies (Zarnack et al. 2013; Sutandy et al. 2018; Briese et al. 2019). The authors should examine their peak analyses in these results further to identify the reason for such a difference.

\begin{answer} Thank you for your suggestion. It was not our intention to make this claim. We just wanted to investigate a potential shape bias, because of a peak phenomenon that we called "split peaks". We state now in the paper: "We also checked whether our result that RBPs involved in splicing have sharper peaks can have technical reasons. In this case, a splicing related protein could have more peaks that are split over two exons, which are detected by the peakcaller as two separate but sharp peaks (split peak). So we investigated the number of peaks that fall into introns ($90\%$ overlap) for the proteins that are involved in the splicing process. We used bedtools set to a strict overlap (intersect -u -s -f 0.9) to investigate potential split peaks." and later "It is important to note that this does not mean that the aforementioned proteins bind generally more introns or exons."

We checked and assume that our constraint for bedtools with $90\%$ overlap is the reason for this different observation to your findings. When we decrease the overlap, more peaks overlap with introns. \end{answer}

\section{Reviewer 3 -- finished}

Figure 2 indicates that replicate 1 of SLBP CLIP-seq is indistinguishable from the negative control from the perspective of the coefficient of variation. However, this replicate is then used throughout the analyses of Figure 3 and Tables 1 and 2. If Replicate 1 did work experimentally, it would contradict the authors' proposal that evaluating the coefficient of variations across peak-profiles (Figure 2) is a valuable method of examining CLIP-seq quality control.

\begin{answer} Thank you for your question. We state in the paper: "Although the CV distributions of the input control and replicate 1 of the SLBP data differed significantly (one-sided Wilcoxon test p-value = 0.03), both contained a lot of regions with a CV close to zero (Figure 2, both with a mean CV of 0.47). In contrast, the CV distribution of replicate 2 was distinct (p-value $< 0.05$ to input control and replicate 1) since it had more peaks with a higher CV (mean CV of 1.41) and thus more specific binding events (e.g., Figure 1a, CV $\approx 5.3$). Yet, some potential binding sites were more unspecific with a CV closer to zero (e.g., Figure 1b, CV $\approx 0.006$)."

Even though the distribution between the input control and replicate one looks quite similar the p-value is $< 0.05$, thus replicate one is significantly different in the CV distribution in comparison to the input control.

Furthermore, the CV distribution is just one quality factor for a CLIP experiment. We state in the paper: "The CV is just one quality factor and we recommend to test other features as well, such as the read coverage correlation." \end{answer}

Figure 3 shows the distinct patterns present across two replicates of SLBP CLIP-seq. It is unclear how unique these patterns are to SLBP CLIP-seq or if they are an artifact of CLIP-seq assay. It would be important to show the average tag distribution of the control experiment at sample peaks (separated by distinct peak profile) in order to demonstrate that the discovered peak profiles exist only within sample-targeted CLIP-seq.

\begin{answer} We included the results of the input control in the paper and state: "We also checked the uniqueness of the shapes by analyzing the peaks based on the reads from the size-matched input control (see Supplementary Figure 2). StoatyDive had just identified four different clusters, encompassing mountain shaped peaks, as well as plateaus, and constant peaks. It was to be expected to find similar shapes in the control, because the biggest challenge for peak calling is the identification of enriched sites with different shapes between control and CLIP data [12, 13]. In a future version of StoatyDive, we will include a mode to check peak shapes between samples to see if we could improve peakcalling results with a peak shape comparison." \end{answer}

- The underlying RNA sequence of the peaks should also be investigated with regards to whether mappability (or a specific A/U/G/C content) is responsible for producing broad vs narrow peaks. Are broad peaks a function of less mappable reads producing unnatural peaks? It would important to demonstrate that StoatyDive is not just modeling the underlying sequence mappability at the region.

\begin{answer} We can immediately excluded this because we have used only significant peaks from CLIPper and PureCLIP. PureCLIP searches for significant peaks (combining both CLIP replicates) with an HMM model. CLIPper from the CLIPper pipeline is run on each replicate separately and both peaks sets are then tested for robustness with an irreducible discovery rate (IDR) analysis. That is to say, mappability and low quality of the peaks can be excluded since (a) only significant and robust peaks are used from the peakcalling, and (b) only high-quality (quality score $>$ 20) and unique mappable reads were used. So the comparison of broad and narrow peaks really is just done on the CLIP-Seq enrichment. \end{answer}

\section{Final remark:}

We thank all reviewers for their comments, critics, ideas and questions. It was very helpful and improved the paper a lot. You will find changes in the paper marked in red. Furthermore we added a lot of new supplements. Please take a look. Thank you very much for your time and effort.

Source

Content of review 2, reviewed on December 15, 2020

The resubmitted manuscript by Florian Heyl and Rolf Backofen shows a good improvement. The authors addressed all the reviewer's comments in detail and restructured the manuscript accordingly. There are still a lot of interesting observations that would need further investigation but as the authors already mentioned, it would be beyond the scope of this paper. Overall, the manuscript is well written, and the method would be an interesting tool for the CLIP community. There are just few minor comments/suggestions, which are listed below: 1. Adding a library complexity distribution (number of peaks or a number of uniquely mapped reads) from each sample into the final CV distribution figures (Figure 2, Supplements_1.) would be very helpful to interpret the results. 2. Same for the Figure 3, if the authors could add a number of exonic and intronic peaks from each sample or the ratio. 3. Page 6: "Furthermore, the proteins HNRNPM, U2AF2, and PTBP1 have all more than 2 RRMs (3, 3, and 4, respectively), which led to the assumption that an increasing number of RRMs results in more sharper peaks. Figure 6b shows more clearly that proteins with more than one RRM tend to have more sharper peaks." - The authors assumption could be misleading without supporting literature or further analyses since there are not enough RBPs in their analysis to be confident enough. It would be better if authors could discuss this as one of the possibilities in the discussion section. There are also specific RBPs which are known to interact with RBP complexes, which could also affect the shape of the binding profile.

I declare that I have no competing interests.

Authors' response to reviews: {Reviewer 1} I appreciate the effort of the authors to provide additional analyses and address previous concerns, and the sections with specific edits are greatly improved. However, while the authors have clarified in many locations, there remain others (especially in the added sections) where it remains very confusing exactly what dataset is being used for analyses or plotted in figures. I also have remaining concerns (particularly about the newly added sections), where the authors seem to assume by default that their CV metric is accurate and reliable and then use that to draw conclusions about the underlying data, but I think this assumption is something that needs to be shown by analysis in the paper and cannot be assumed.

I think the new section on replicates "Peak Profile Landscape Reveals Low Reproducibility of Binding Sites" is weak and detracts from the paper. Broadly, my major concern is that it's completely unclear whether the low reproducibility seen is a feature of the experimental eCLIP data or a technical property of their CV metric. Essentially, this section defines a new metric (the CV for each peak), sees low reproducibility in this metric across replicates, and then assumes that this must imply low reproducibility in data "However, we have to stress out that we are unsure if this is the result of the specificity of SLBP, the different quality of replicate 1 and 2, or a different source, for example, a bad immunoprecipitation." To me, this statement requires some sort of analysis (downsampling / replicate splitting?) to show what the reproducibility of their CV metric is to random sampling variability. It would also be helpful to show here real examples of peaks that the authors claim show low reproducibility across replicates, to visualize what exactly their analysis is claiming.

\begin{answer} Thank you very much for your suggestions. We have changed the term "low reproducibility". We rephrased the title to "Peak Profile Landscape Reveals Variability of Binding Sites". We also rephrased the chapter saying: "The distribution was an indicator for some variability in the binding events. This was also striking when we compared the distribution of replicate 1 and 2. However, we have to stress out that this does not mean a poor quality of replicate 1 and 2. For a downstream analysis, for example the prediction of sequence motifs, it is worth investigating why the CV distribution of replicate 1 was very different to replicate 2. A sequence motif prediction depends on the selected binding sites, thus StoatyDive's inspection allows the user to assess the binding sites using the CV distributions. This helps to appraise if SLBP has protein domains that bind to RNA in an unspecific manner. The user can also test if the unspecific peak Figure~1b might be a false positive of PureCLIP."

To clarify, we wanted to investigate the variability of the binding events that are common for both replicates. Therefore we merged all binding sites that we have obtained from replicate 1 and 2. So we have a set of 899 peaks. We looked how these 899 peaks behaved (peak profiles shapes) for replicate 1 and 2. We took also the same 899 peaks and analyzed the profiles shapes for the control data to see if the shapes are similar to the replicates (reference point).

The chapter was designed to investigate if differences in the peak profiles occur on a set of common binding sites for the replicates. This is crucial, because in the end, the user has to know if different shapes occur for different replicates and that this variability can be inspected with our tool, which ultimately leads to a better down-stream analysis. \end{answer}

In addition, the conclusion that their analysis indicates that the replicates are of differing quality is also a computationally testable question (for example, comparing the fold-enrichment at previously validated histone 3'UTR peaks), which would help to convince that this isn't simply reflecting low reproducibility of their metric itself. The authors do this somewhat in Sup Fig 1, but then just use that to conclude that many other RBPs are also variable across replicates; SLBP definitely seems worse than those other RBPs, but this difference should be quantified. If the authors simply mean 'there is high variability in the CV metric between replicates for some RBPs', this should be worded as such and avoid claims about reproducibility of the underlying data without further analysis.

\begin{answer} We have removed the word reproducibility as stated in our first answer. It is true, we wanted more to clarify that there is some variability in the CV between replicates. \end{answer}

I'm also unclear what the 'control' dataset used in Figure 2 (and some later analyses) is exactly - the author state that they ran PureCLIP "for each CLIP replicate separately, taking the input control into account", so what is the control peak set in Figure 2 (as I assume it can't have been PureCLIP run on the input with itself as a control)?

\begin{answer} Please refer to the first answer for the explanation. \end{answer}

The conclusions here are also vague and non-specific - "For a downstream analysis, for example the prediction of sequence motifs, it is worth to either exclude replicate 1 or to investigate, why the CV distribution of replicate 1 was very different to replicate 2. Thus given the inspection of StoatyDive, the user can now decide to investigate the binding sites of replicate 1 and compare them with the input control or replicate 2." - why was it worth investigating replicates separately? I'm unclear what the authors think was improved by performing this analysis - they show that the replicates have differences in CV distribution, with replicate 1 resembling the control (as above, I don't understand what this control peak set represents), but I don't see anything other than the CV analysis to indicate that Rep1 is of worse quality than Rep2 (in peak enrichment for histone RNAs, motifs, peak enrichment above input, etc).

\begin{answer} We rephrased the statement, which we explained in the first answer. Indeed it is better to point out that the user just observes a variability in the CV distribution between the replicates, which does not say anything about the reproducibility or that replicate 1 should be excluded from further investigations. \end{answer}

Similarly, the authors then split the peaks by exonic and intronic and claim "It is not clear why this difference appeared, but our observations stress out how important it is to perform an experiment with replicates." - how do they stress this? I don't see any benefit that the replicates have provided beyond the ability to say that they're different; other than Replicate 2 having more peaks, it's not described whether those additional peaks are real or false positives, and thus whether replicate 2 is better or worse.

\begin{answer} We have rephrased this sentence slightly: "It is not clear why this difference appeared, but StoatyDive showed a variability between the different peak sets."

It was not our intention to state that replicate 2 was better than replicate 1. We thought the sentence benefits the data since it had replicates. Our results showed that some variability in the binding events might occur and highlighted that it was very good that SLBP had two replicates. \end{answer}

I similarly think there needs to be improved clarity in the "Optimizing StoatyDive with the data of SLBP" section. I don't understand this sentence "This was achieved by the second cluster, which pointed out that the cluster had to be carefully chosen in order to remove some noise in the data (artifacts)." - does this mean that the authors removed some peaks in this set? Moreover, what are 'main cluster' and 'second cluster'? I don't see these defined anywhere.

\begin{answer} In the section "Optimizing StoatyDive with the data of SLBP" we wrote: "We investigated the cluster with the highest (Main Cluster) and second highest (Second Cluster) number of peaks in histones." So basically we just looked at each cluster and checked how many peaks in this cluster overlap with histones. Ideally the peaks that overlap with histones should be contained in the main cluster but StoatyDive is of course not flawless and perhaps the profile shape is different for some histone peaks. That is why we investigated also the cluster (second cluster) with the second highest number of peaks overlapping with histones.

Main and second clusters are therefore just different subsets of the entire peak set. \end{answer}

Similarly, "Furthermore, the same size achieved the highest TPR (0.61) and MCC (0.68) with a CV threshold of 0.2. However, the enrichment was not as good as with the clustering resulting in a mean LFC of 1.6 (median P-value = 0.108)." - I am completely lost as to what 'the same size' or 'the clustering' are referring to here in terms of datasets. This section needs to be edited to clearly state what exact set of peaks is being referred to at each point.

\begin{answer} We rewrote the paragraph saying: "\textcolor{red}{Furthermore, the same peak length of 70 nucleotides achieved the highest TPR ($0.61$) and MCC ($0.68$) when we filtered the peaks based on a CV threshold of $0.2$. However, the enrichment was not as good as with the clustering by StoatyDive, where we achieved a mean LFC of $3.4$ in the second cluster that was higher than the resulting mean LFC of $1.6$ (median P-value $= 0.108$) with the CV cutoff of $0.2$.}

The clustering refers to the aforementioned results, where we discuss the main and second cluster. StoatyDive has generally two modes, which you see in figure 7: evaluation (CV distribution) and classification (clustering). Both can be used to filter for peak profile shapes. The table assess the power of both modes for different peak length (sizes). \end{answer}

Figure 5 (and related text) - I don't understand what Fig. 5a and b are plotting, and perhaps more broadly what is being done here? The text seems to indicate that the authors used 10 peaks defined as broad, 10 as sharp, and 10 as plateau and considered what the other tools did, but then state that they ran StoatyDive on these (with 0.87 accuracy), so I'm confused where the initial definition of broad / sharp / plateau came from (my initial impression was that they were from the StoatyDive analysis above, but that doesn't make sense with it giving different classification here).

\begin{answer} We explain in the revised version of the paper: "We defined peaks as sharp (CV $>$ 1.0) and broad (CV $<$ 0.05) based on the calculated CV of StoatyDive. We selected peaks shaped like plateaus by inspecting them in a genome browser." We tried not to include a bias by our findings by StoatyDive. That is why, we did not take a pre-classified set by StoatyDive, just to show that StoatyDive classifies its own clusters correctly. However, we also needed a good way to decide what a sharp and broad peak is, thus we took the CV of StoatyDive and stayed very restrictive with the choice of a broad and sharp peak. However as seen in our results, if a broad peak resembles in its shape a sharp peak it can happen that StoatyDive makes some mistakes. \end{answer}

I'm then confused what is being displayed in Fig. 5a and b - how does Fig. 5a show a few peaks sorted incorrectly? What do the dimensions/axes mean? Similarly, for 5c and 5d, what are the colored lines versus grey lines?

\begin{answer} Figure 5a shows the dimensionally reduced space by umap of StoatyDive. Thus Dim1 and dim2 are the features found by umap, like PC1 and PC2 for principal component analysis. It is maybe not clearly visible, since all three clusters are clearly separated with 10 peaks in each clusters. But StoatyDive indeed has two peaks in cluster 1 which should belong to cluster 2 and 2 peaks in cluster 2 which should belong to cluster 1.

Figure 5b is a different tool called SIC-ChIP. They used in their analysis pre-defined features, such as half of the width of the peak at half of the maximum $w_{h/2}$. They defined for SIC-ChIP 5 features to classify the profiles. We showed in the paper only one plot with the highest cluster separation (variance), but the others can be found in the Supplementary 5.

Regarding the colored lines we state now in the caption of figure 5: "The coloured lines show the profiles of the related cluster, whereas the grey lines correspond to all other profiles." \end{answer}

"Figure 6b shows more clearly that proteins with more than one RRM tend to have more sharper peaks." - this needs some sort of significance test

\begin{answer} We included a significance test for this statement saying: "Figure 6b shows more clearly that proteins with more than one RRM tend to have sharper peaks (two-sided Wilcoxon test P-value $\approx 0.046$)." \end{answer}

Minor comments:

The text (particularly the new sections) would benefit from additional editing to improve text clarity - statements like "huge CV difference" should be avoided

\begin{answer} Thank you very much for your suggestion. We looked into the text and removed or changed fuzzy adjectives and adverbs. \end{answer}

\section{Reviewer 2}

Adding a library complexity distribution (number of peaks or a number of uniquely mapped reads) from each sample into the final CV distribution figures (Figure 2, Supplements1.) would be very helpful to interpret the results.

\begin{answer} Thank you very much, indeed this would benefit the paper. We included the number of uniquely mapped reads for each sample.

The library complexity of replicate two of the SLBP data had about $0.4$ times more uniquely mapped reads than replicate one. However, experiments such as TAF15 or RBM22 had almost a similar complexity, but different CV distributions. Furthermore, the number of uniquely mapped reads could not be linked to a higher or lower mean CV since experiments such as CPSF6 and HNRNPA1 (higher number of reads leads to a higher mean CV) show an opposite behavior than experiments such as TARDBP and U2AF1 (higher number of reads leads to a lower mean CV). Furthermore, StoatyDive also normalizes for the read coverage and investigates just the shape of the profile and the variance that is normalized by the mean coverage. \end{answer}

Same for the Figure 3, if the authors could add a number of exonic and intronic peaks from each sample or the ratio.

\begin{answer} We added the number of exonic and intronic peaks as well as the ratio in the caption of the picture.

We also randomly sub-sampled the exonic and intronic peaks for both replicates to 150 to test if the sample size has an influence on the difference. We got for the exonic peaks a mean CV of 0.53 for replicate 1 and 0.93 for replicate 2, and a p-value of the Wilcoxen test of < 0.05, which was quite similar to the results without sub-sampling (with mean CVs 0.48 and 0.90). The same was observable for the intronic peaks with a mean CV of 0.23 for replicate 1 and 1.11 for replicate 2 (p-value < 0.05). The results in the paper were 0.23 and 1.26 as a mean CV for the replicates. \end{answer}

Page 6: "Furthermore, the proteins HNRNPM, U2AF2, and PTBP1 have all more than 2 RRMs (3, 3, and 4, respectively), which led to the assumption that an increasing number of RRMs results in more sharper peaks. Figure 6b shows more clearly that proteins with more than one RRM tend to have more sharper peaks." - The authors assumption could be misleading without supporting literature or further analyses since there are not enough RBPs in their analysis to be confident enough. It would be better if authors could discuss this as one of the possibilities in the discussion section. There are also specific RBPs which are known to interact with RBP complexes, which could also affect the shape of the binding profile.

\begin{answer} Thank you very much for pointing out that there is an issue with the discussion. We now state in the paper: "Furthermore, the proteins HNRNPM, U2AF2, and PTBP1 have all more than 2 RRMs (3, 3, and 4, respectively), thus it might be possible that an increasing number of RRMs results in sharper peaks. Figure~\ref{fig:proteins}b shows more clearly that proteins with more than one RRM tend to have sharper peaks (two-sided Wilcoxon test P-value $\approx 0.046$). It is possible that RNA-proteins interactions become more specific with $> 1$ RRM and so the separation between more specific and unspecific binding sites is stricter. However, we had not taken any other RNA binding domains into account apart from RRMs and we would need to investigate more proteins to be confident about this trend."

We hope that this fits your suggestions. \end{answer}

\section{Final remark:}

We thank all reviewers for their comments, critics, ideas and questions. It was very helpful and improved the paper a lot. You will find changes in the paper marked in red. Furthermore we changed the supplements 1. Please take a look. Thank you very much for your time and effort.

Source

References

Florian, H., Rolf, B. StoatyDive: Evaluation and classification of peak profiles for sequencing data. GigaScience.

Pre-publication Review of

StoatyDive: Evaluation and classification of peak profiles for sequencing data

Reviewed On August 25, 2020 , and December 15, 2020

Submitted to

Reviewed by

Actions

Content of review 1, reviewed on August 25, 2020

Source

Content of review 2, reviewed on December 15, 2020

Source

References