Content of review 1, reviewed on January 21, 2020

This work describes a new method based on an attributed network representation that uses both interaction networks and protein features to predict protein function, where the functions are represented as GO terms. This method of integrating heterogeneous information is, to my knowledge, novel and interesting. The methodology and results are mostly well described (but see comments below).

Major comments:

My main concern is the rather low F1 scores compared with the apparently better performance indicated by the AUPR scores, as shown in Table 1. (Incidentally, instead of showing only F1 scores, why not show average precision and recall so that the reader can establish the contribution of precision and recall to the F1 score?) The F1 scores are quite low (0.29 for CC and MF, and 0.07 for BP). I believe the authors could explain this, especially in light of the overall good performance in Table 2. Again, Figure 3 shows low F1 performance (0.25-0.35). How does that square with the F1 of 0.717 reported in Table 2?

I could not find definitions for m-AUPR and M-AUPR. It is my understanding that micro-averaged area under the precision-recall curve (m-AUPR) is computed by first vectorizing and then computing the AUPR; Macro-AUPR (M-AUPR) is computed by first computing the AUPR for each function separately, and then averaging these values across all functions. I would like to know how this was implemented here, and how the differences in performance are explained. While these performance metrics may be well-understood in the machine learning community, I think that the authors should elaborate a bit more on the choice of these metrics and their meaning, as Gigascience aims at a broader audience than those well-versed in machine learning.

In Table 2, the authors show that the overall performance of Graph2GO is good when compared with a BLAST baseline. It does seem, however, that the BLAST performance is rather high. CAFA typically reports Fmax (not F1) values between 0.4 and 0.5. I am wondering how such a high BLAST F1 score was achieved. One possible issue may be the random split of training and testing sets as reported on p. 6, "Comparison with BLAST". A random split does not ensure that the test set contains no homologs of training-set sequences, which may mean that the authors are effectively testing on part of their training set. The authors should ensure that no sequence in the test set is too similar to the sequences in the training set; see, e.g., https://www.biorxiv.org/content/10.1101/626507v4.full

Minor comments:

Figure 3 is too small to be legible.

Table 3: seemingly low F1 performances, per organism.

Declaration of competing interests

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

We thank the editors and reviewers for their insightful comments and suggestions and believe that the manuscript is stronger after addressing each comment. We have highlighted all our answers in red.

Editor’s comments: Both reviewers point out several issues with the evaluation and comparison procedures, which may have introduced bias. Before making a decision on a revised manuscript we will seek further advice from our reviewers and I urge you to make an effort to fully address their comments.

Authors’ answer: We appreciate the feedback. We have worked carefully to resolve the issues concerning the evaluation and comparison procedures that might introduce bias. We have added a new evaluation metric, F-max, which is widely used by the community, to replace our previous F1-score (based only on the top three predictions) and make the evaluation more reliable. We include detailed definitions of these evaluation metrics so that general readers can understand them. We also elaborate more on our DNN classification model and on the models we compare with, as suggested by the reviewers. For the experiments, we have added more details to clarify any confusion. In addition, we made changes to remove any variations that might impede the comparability of the results. We also followed the other suggestions carefully and explain how we addressed them in the following point-by-point response.

Reviewer #1’s comments:

This work describes a new method based on an attributed network representation that uses both interaction networks and protein features to predict protein function, where the functions are represented as GO terms. This method of integrating heterogeneous information is, to my knowledge, novel and interesting. The methodology and results are mostly well described (but see comments below).

Major comments: My main concern is the rather low F1 scores compared with the apparently better performance indicated by the AUPR scores, as shown in Table 1. (Incidentally, instead of showing only F1 scores, why not show average precision and recall so that the reader can establish the contribution of precision and recall to the F1 score?) The F1 scores are quite low (0.29 for CC and MF, and 0.07 for BP). I believe the authors could explain this, especially in light of the overall good performance in Table 2. Again, Figure 3 shows low F1 performance (0.25-0.35). How does that square with the F1 of 0.717 reported in Table 2?

Authors’ answer: Thanks for pointing out this problem. We apologize for the inconsistent use of the F1-score. In Table 1, Table 3 and Figure 3, the F1-score we used is based only on the top three predictions, following Mashup and deepNF, the methods we compare with. In Table 2, when comparing with BLAST, we used the common F1-score, since an F1-score based on the top three predictions cannot be calculated for BLAST. The F1-score based on the top three predictions tends to be small across all models compared with other metrics. To resolve this inconsistency, we have decided to change the evaluation metric to F-max, the maximum F1-score over all possible thresholds. This metric is also adopted by CAFA (Critical Assessment of protein Function Annotation algorithms). The detailed definition of this metric, along with the others, is introduced in the section “Evaluation metric” under “Results”. With F-max, the performance is more consistent with the other metrics, and our model remains better than the other methods under this metric.
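For readers who want the metric spelled out, the following is a minimal sketch of a protein-centric F-max computation in Python/NumPy. It is an illustration written for this response rather than the exact implementation used in the study; the array shapes and the threshold grid are assumptions.

```python
import numpy as np

def f_max(y_true, y_score, thresholds=np.linspace(0.01, 0.99, 99)):
    """Maximum F1 over all score thresholds (protein-centric, CAFA-style).

    y_true  : (n_proteins, n_terms) binary label matrix
    y_score : (n_proteins, n_terms) predicted probabilities
    """
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        covered = pred.sum(axis=1) > 0          # proteins with >= 1 prediction
        if not covered.any():
            continue
        tp = np.logical_and(pred, y_true == 1).sum(axis=1)
        precision = (tp[covered] / pred[covered].sum(axis=1)).mean()
        recall = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Averaging precision only over proteins that receive at least one prediction at a given threshold follows the CAFA convention.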

I could not find definitions for m-AUPR and M-AUPR. It is my understanding that micro-averaged area under the precision-recall curve (m-AUPR) is computed by first vectorizing and then computing the AUPR; Macro-AUPR (M-AUPR) is computed by first computing the AUPR for each function separately, and then averaging these values across all functions. I would like to know how this was implemented here, and how the differences in performance are explained. While these performance metrics may be well-understood in the machine learning community, I think that the authors should elaborate a bit more on the choice of these metrics and their meaning, as Gigascience aims at a broader audience than those well-versed in machine learning.

Authors’ answer: Thanks for pointing out this issue. We have added a new section, “Evaluation metric”, under “Results” to elaborate on the metrics we used. Since protein function prediction is a multi-label task, we need to evaluate performance over all labels (GO terms). One of the most popular metrics in this setting is the macro- or micro-averaged area under the receiver operating characteristic (ROC) curve. However, given that the dataset is skewed (many more negative samples than positive samples), the precision-recall curve gives a more informative picture of an algorithm’s performance than the ROC curve, as pointed out by Davis and Goadrich (reference 42 in the manuscript). Therefore, we adopt micro-AUPR (m-AUPR) and macro-AUPR (M-AUPR) as two evaluation metrics. m-AUPR is computed by first vectorizing all labels (treating them as one label) and then computing the AUPR, while M-AUPR is computed by first calculating the AUPR for each label individually and then taking the unweighted mean of all AUPRs. Both metrics properly represent the overall performance of a multi-label classifier.
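As a concrete illustration of the difference between the two averaging schemes, here is a sketch using scikit-learn's average precision as a stand-in for the AUPR, with synthetic data in place of real predictions; it is not the code used in the study.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic stand-ins: binary label matrix and predicted probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 20))   # 100 proteins x 20 GO terms
y_score = rng.random((100, 20))

# m-AUPR: flatten all protein/term pairs into a single vector, one curve
m_aupr = average_precision_score(y_true, y_score, average="micro")

# M-AUPR: one curve per GO term, then the unweighted mean over terms
per_term = [average_precision_score(y_true[:, j], y_score[:, j])
            for j in range(y_true.shape[1]) if y_true[:, j].any()]
M_aupr = float(np.mean(per_term))

print(f"m-AUPR = {m_aupr:.3f}, M-AUPR = {M_aupr:.3f}")
```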

In Table 2, the authors show that the overall performance of Graph2GO is good when compared with a BLAST baseline. It does seem, however, that the BLAST performance is rather high. CAFA typically reports Fmax (not F1) values between 0.4 and 0.5. I am wondering how such a high BLAST F1 score was achieved. One possible issue may be the random split of training and testing sets as reported on p. 6, "Comparison with BLAST". A random split does not ensure that the test set contains no homologs of training-set sequences, which may mean that the authors are effectively testing on part of their training set. The authors should ensure that no sequence in the test set is too similar to the sequences in the training set; see, e.g., https://www.biorxiv.org/content/10.1101/626507v4.full

Authors’ answer: Thanks for pointing out this question. We have added a new comparison setting in which all training samples sharing more than 50% identity with any test sample are removed. We thus end up with two datasets: (1) a full dataset that uses all randomly split training samples; and (2) a partial dataset from which potential homologs have been removed. The results on the partial dataset are now similar to those reported in CAFA. From the updated Table 2, we can see that Graph2GO is more robust than BLAST, showing a smaller drop than BLAST when moving from the full-dataset to the partial-dataset condition. As a result, Graph2GO outperforms BLAST by an even larger margin under the partial-dataset condition across all ontologies. The detailed results and analysis are given in the section “Comparison with BLAST” and Table 2.
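A minimal sketch of the homolog-removal step described above is given below. The pairwise-identity lookup (for example, parsed from BLAST or CD-HIT output) and the 50% cutoff are illustrative assumptions, not the exact code used for the revision.

```python
def remove_homologs(train_ids, test_ids, identity, threshold=0.5):
    """Keep only training proteins sharing <= `threshold` sequence identity
    with every test protein.

    identity: dict mapping (train_id, test_id) -> fractional identity,
    assumed to come from an external alignment tool; missing pairs are
    treated as dissimilar.
    """
    return [tr for tr in train_ids
            if all(identity.get((tr, te), 0.0) <= threshold for te in test_ids)]
```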

Minor comments:

Figure 3 is too small to be legible.

Authors’ answer: Thanks for pointing out the problem. We have updated Figure 3 with the updated evaluation metrics and have made the figure clearer this time.

Table 3: seemingly low F1 performances, per organism.

Authors’ answer: The results for each organism are consistent with the results on human, including the low F1 scores. As explained in our answer to the first point, the low F1 scores were due to the definition of our previously used F1-score (based only on the top three predictions). We have replaced this F1-score with F-max, as defined in the section “Evaluation metrics”. The updated results remain consistent with the results on human.

Reviewer #2’s comments:

This paper presents an approach to predicting protein function that leverages heterogeneous sources of knowledge about proteins, including relational information derived from protein interactions and sequence similarity relationships. The authors have clearly presented their computational architecture, which is interesting and, as far as I know, represents a novel contribution. However, I have some concerns. My main concerns with this manuscript are (a) a number of (apparent) variations in experimental settings that impede comparability of results and might suggest some cherry-picking, and (b) the lack of comparison with the tasks and data sets used in the community evaluation CAFA (Critical Assessment of Function Annotation). Together, this makes it difficult to fully judge the contributions of the work.

Major point 1:

  • Few details of the DNN are provided; this should be elaborated.

Authors’ answer: Thanks for pointing out this problem. We have added more details about the DNN in the section “Deep neural network classifier” to make it clearer.

  • Is the CNN used in the 'Feature transfer' experiments the same as the DNN of the original experiments? If not, what are the differences, and why is the same algorithm not used for comparability? Is retraining needed in any case? Can't the comparison of the sparsity vs. non-sparsity group be done as an analysis over the results data? I don't feel that this is explained well.

Authors’ answer: Thanks for raising this question. The CNN model used in the “Feature transfer” experiment is not the same as the DNN we used in the second part of our Graph2GO model. We have added a detailed description of this model in the supplementary material, in the section “Convolutional Neural Networks (CNNs)”.

The hypothesis we want to test here is that Graph2GO, a model making use of feature transfer within the network, is better at dealing with proteins that lack sufficient features due to annotation bias. In other words, we want to show that with Graph2GO the performance on the sparsity group is not much worse than that on the non-sparsity group. That is why we use the ratio between the performance on the sparsity group and on the non-sparsity group as a measure. We want to compare against a method that does not use the concept of feature transfer. There are many possible choices, for example SVM, logistic regression, DNN or CNN. The reason we chose CNN is that it performs better than DNN or SVM in this case, since it is good at extracting features. However, we could also have chosen other methods to compare with, regardless of how strong the method itself is, since what we care about is the ratio of performance between the sparsity group and the non-sparsity group.

Retraining is not needed for Graph2GO in this analysis, since we simply analyzed the results data by dividing the test results into two groups and calculating the corresponding metrics.
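A sketch of this post-hoc analysis is shown below, assuming a per-protein feature count is available for splitting the test set; the cutoff and the choice of m-AUPR as the per-group metric are illustrative, not the exact analysis in the manuscript.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def sparsity_ratio(y_true, y_score, feature_counts, cutoff=5):
    """Compare performance on feature-sparse vs. feature-rich test proteins.

    y_true, y_score : (n_proteins, n_terms) labels and predicted scores
    feature_counts  : per-protein number of available attributes (assumed)
    Returns m-AUPR on each group and the sparse / non-sparse ratio.
    """
    sparse = np.asarray(feature_counts) <= cutoff
    aupr_sparse = average_precision_score(y_true[sparse], y_score[sparse],
                                          average="micro")
    aupr_dense = average_precision_score(y_true[~sparse], y_score[~sparse],
                                         average="micro")
    return aupr_sparse, aupr_dense, aupr_sparse / aupr_dense
```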

  • In the comparison with deepNF and Mashup, the authors do not explain how the embeddings produced by these methods are used. The authors have explained that these methods produce embeddings that are analogous to the embeddings derived by the authors, but then do not use the same approach in the subsequent stages of the classification process; they use a multi-layer NN for Graph2GO and an SVM for the other two methods. For direct comparability, the overall architecture for learning would be kept the same, and only the input embeddings would be changed. How else can we know whether the key improvement comes from the embedding approach or the classification model? The current comparison does not appear to be like-for-like.

Authors’ answer: Thanks for pointing out this issue. We previously described these two methods only briefly, in the fourth paragraph of the “Introduction”. We have now added further description of these two methods.

We previously used SVM as the classification model for these methods because it was used in their original implementations. To make the comparison more direct, we have changed their classification model to a DNN as well, with the same architecture as in Graph2GO. We have updated Figure 3 accordingly with the new comparison results, which show that our method outperforms the other two methods in the CC and MF ontologies and achieves comparable results in the BP ontology.
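For concreteness, a generic multi-label classifier of the kind described here (sigmoid outputs with binary cross-entropy, applied on top of the embeddings) could be sketched in Keras as follows; the layer sizes and hyperparameters are placeholders, not the architecture reported in the manuscript.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(embedding_dim: int, n_go_terms: int) -> keras.Model:
    """Multi-label DNN on top of protein embeddings: one sigmoid per GO term."""
    model = keras.Sequential([
        layers.Input(shape=(embedding_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_go_terms, activation="sigmoid"),  # independent labels
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```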

  • In the comparison across species, the authors do not explain their experimental framework sufficiently. Do they control for overlap (very high sequence similarity) with the human training data they have used in developing their model? Do they 'hide'/hold back annotations [all, or just some?] for the proteins in these data sets? Furthermore, they claim "decent" performance (compared to what?) and yet the performance on these data sets is far lower than on the other test data they present. This does not at all 'prove' the robustness; it needs to be explained.

Authors’ answer: Thanks for pointing out this problem. In the comparison across species, we actually perform the experiments for each species individually. That is to say, for each species we download its data and then perform cross-validation within that species, just as we did for the human dataset. We did not work in a transfer-learning manner, training on human data and testing on other species.

By conducting experiments in multiple species, we want to see whether the performance is consistent across all species, which would demonstrate the robustness of our method. We have also included the results on the human dataset in Table 3, and we can see that the performance across these six species is consistently decent, except that the results for S. cerevisiae and B. subtilis in the BP ontology are considerably better than those of the other species.

Major point 2:

There have been a number of studies over several years that explore methods for automatic assignment of protein function (CAFA4 has just started). The authors cite some of this work (Reference 1) but then go on to claim that homology-based methods are still the norm. That was true 10 years ago, but I am not convinced that it is still true today. Importantly, data sets have been prepared for CAFA, as well as many methods, and the authors clearly need to discuss how their work relates to that work. Indeed, I would suggest that without a direct comparison to the data sets used in CAFA, it is not possible for the authors to claim state-of-the-art performance. BLAST is still a meaningful baseline method to compare with, but it is certainly not representative of the state-of-the-art. This comparison may not be entirely feasible given that CAFA test sets are large and only a fraction of them are used for the final evaluation, but IMO it is important to explore this and at least attempt to come close to a similar experimental framework to test the authors' approach. At a minimum, the authors should be using the same metrics for evaluation as CAFA -- perhaps in addition to their own metrics -- to enable meaningful comparison, and of course discussing results on that task in relation to theirs. A comparison that only considers embedding methods is incomplete, in particular given that there are several methods used in CAFA that explicitly support integration of multiple sources of information, including relational information (see [1] for instance).

Authors’ answer: Thanks for the great suggestions. For the evaluation metrics, we now use metrics similar to those of CAFA: F-max, macro-AUPR and micro-AUPR. We have added the section “Evaluation metrics” under “Results” to describe the metrics in detail. AUROC is used for term-centric evaluation in CAFA. Here we adopt AUPR instead because, as suggested by Davis and Goadrich (reference 42 in the manuscript), when the dataset is highly skewed, as is the case in this protein function prediction task, AUPR is a more informative metric than AUROC.

Mashup and deepNF are both state-of-the-art network-based embedding methods for predicting protein function by integrating information from multiple networks. By comparing with these two methods, we have shown that Graph2GO is competitive with state-of-the-art network-based embedding methods for protein function prediction. We agree that it would be ideal to compare with a larger pool of methods, including the integrative CAFA methods that are not based on embeddings. However, given the scope of the study, we would like to focus on several methods that are recently published and well received in the community. We would like to continue development and comparison along the lines of CAFA in the future, which would require our own implementations of CAFA methods and considerably more time. We also think that participating in a large-scale competition like CAFA in the future would benefit the comparison and development of our method.

In summary, our method not only outperforms Mashup and deepNF in performance but also offers some improvements in method design. For example, Graph2GO can take into account both heterogeneous network structural information and protein attributes (including sequences, locations and protein domains), while the other two methods consider only network features. In addition, the concept of feature transfer adopted by Graph2GO is advantageous for addressing annotation bias, which is a novel contribution.

On metrics, the use of the groupings of GO terms based on sparsity of annotations is not well justified; I do not understand what this represents, in particular given that the number of proteins annotated to a given GO term is not reflective of any biological reality -- it simply reflects areas of study and our current level of biological understanding. GO is organised hierarchically along semantic/ontological lines, and the authors do not appear to recognise this characteristic. Furthermore, there exists a grouping of GO terms called GO Slims which already seeks to reduce the number of classes that should be considered [2] (now called subsets: http://geneontology.org/docs/go-subset-guide/).

Authors’ answer: Thanks for raising this question. We group GO terms based on the sparsity of annotations to assess the impact of annotation bias on function prediction and to show the performance of our method in more detail. Typically, the more data we have (i.e., the less sparse the annotations), the better a function prediction model can perform. By dividing the GO terms into groups with different numbers of annotated proteins and analyzing the results separately, we can show how our model works when the data are sparse and when they are abundant.

As for GO Slims, we found that they are organized mainly by species. Since we already analyze the results for each species individually, we feel it is better to stay with the original GO groupings. Thanks for the suggestion.

In addition, why are only the top three predictions considered for evaluation purposes when calculating the F1-score? Using an absolute number, rather than a cutoff based on confidence/probability etc., requires justification.

Authors’ answer: Thanks for pointing out this problem. We have changed this metric to F-max as suggested; its definition is provided in the section “Evaluation metrics”. Previously we used the F1-score based on the top three predictions because it was used by Mashup and deepNF, and we adopted it for comparability. We now use F-max to compare all these methods, which is more principled and more widely used.

In the methods, the authors do not adequately explain their output representation -- is it framed as a multi-label classification problem, or as a multi-class problem (i.e. can a single input have multiple labels in the output)? Is it just a flat set of GO terms, so that the semantic interdependencies between the terms are not considered? (I later found this detail in the Results; it should come earlier.) It is actually a hierarchical classification problem [3]. This hierarchical nature also interacts with metrics and evaluation; please see [4] and [5]. The authors should consider/discuss their evaluation protocols in this context.

Authors’ answer: Thanks for raising this issue. Our task is a multi-label problem, as mentioned in the section “Deep neural network classifier” under “Methods”. Our output representation is a flat set of GO terms. In our current implementation we did not consider the semantic interdependencies between terms, which might cause inconsistencies between the predictions for leaf GO terms and those for their parent GO terms. However, when generating the dataset for model training and testing, we do assign the parent GO terms to a protein whenever a child GO term is present. We think this partially addresses the issue of incomplete GO term annotations in the dataset. In addition, since we report a probability in the range [0, 1] for each GO term, users can judge the reliability of each GO term (whether a leaf or a parent term) from its probability.
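The up-propagation of annotations mentioned above (assigning parent terms whenever a child term is present) can be sketched as follows; the `parents` map is assumed to be parsed from the GO OBO file, and this is an illustration rather than the code used in the study.

```python
def propagate_annotations(annotations, parents):
    """Apply the true-path rule: a protein annotated with a GO term is
    also annotated with every ancestor of that term.

    annotations : dict protein_id -> set of GO term ids
    parents     : dict GO term id -> set of direct parent term ids
    """
    def ancestors(term, seen):
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    propagated = {}
    for prot, terms in annotations.items():
        full = set(terms)
        for t in terms:
            full |= ancestors(t, set())
        propagated[prot] = full
    return propagated
```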

We have also added some discussion under the “Discussion” section about how this could be treated as a hierarchical classification problem in the future. One approach would be to add hierarchical constraints in the output layer to force the predictions for leaf nodes to be consistent with the predictions for the corresponding parent nodes. Another potential approach is to explicitly add the GO hierarchy into the model, alongside our other network components, to inform the training process, since the GO hierarchy can itself be regarded as a network. We leave this integration to future work.
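As one simple, post-hoc alternative to output-layer constraints, predicted scores can be made hierarchically consistent after the fact by raising each parent's score to at least the maximum of its children's scores. The sketch below assumes a precomputed child-index map and a children-before-parents ordering; it is an illustration, not part of the current Graph2GO implementation.

```python
import numpy as np

def make_consistent(scores, order, children_idx):
    """Post-hoc hierarchical consistency for multi-label GO predictions.

    scores       : (n_proteins, n_terms) predicted probabilities
    order        : term column indices in reverse topological order
                   (children before parents)
    children_idx : dict column index -> list of child column indices
    """
    out = scores.copy()
    for j in order:
        kids = children_idx.get(j, [])
        if kids:
            out[:, j] = np.maximum(out[:, j], out[:, kids].max(axis=1))
    return out
```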

In the "Graph2GO pipeline" section the authors mention two alternative architectures. Ideally, they would test both architectures (possibly moving the detail to Supplementary) and indicate which worked better.

Authors’ answer: Thanks for pointing out this problem. We have added the comparison in the supplementary material. From the comparison, we see that the model that trains an independent VGAE for each network and then combines their embeddings is better than the model that first combines the networks and trains a single VGAE to obtain the overall embeddings. This result supports our Graph2GO architecture.

On language: the writing is in general very good. There are a few colloquialisms that could be replaced: "plenty of" -> "substantial"; "really good" -> "very good" or "excellent". "noises" should just be "noise". "well suitable" should be "well suited". I found at least one typo, "nueral".

Authors’ answer: Thanks for pointing these out. We have corrected the language problems.

Source

    © 2020 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on May 29, 2020

I am satisfied that my concerns were addressed. One minor comment: I would re-title "Evaluation metric" (p. 5) to "Evaluation metrics".

Declaration of competing interests

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Source

    © 2020 the Reviewer (CC BY 4.0).

References

    Fan, K., Guan, Y., Zhang, Y. 2020. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions. GigaScience.