Content of review 1, reviewed on January 06, 2021

Comments on abstract, title, references

This paper explores two common methods for speech emotion recognition (SER): CNN and LSTM. The abstract states the aim of the paper (investigating the pros and cons of the two methods and observing the stability of results under cross-corpus evaluation), but it lacks the motivation for conducting the experiments. The goal should be stated more clearly: is it cross-corpus evaluation, or determining which method is better (LSTM vs. CNN)? There is a gap between the motivation and the research aim.

The title is clear and represents the contents of the paper; no more suitable title comes to mind.

The references are excellent, including one closely related to this paper (17. Schmitt, M., Cummins, N., Schuller, B.W.: Continuous emotion recognition in speech - do we need recurrence? In: Proceedings of Interspeech, pp. 2808–2812 (2019)).

Comments on introduction/background

The goal of the paper is very clear, but its motivation and importance are lacking. For instance, cross-corpus evaluation is one of the big issues in SER that needs deep evaluation. However, the paper's introduction only highlights the extension of the previous finding (ref. [17]) to other datasets. Moreover, the paper only evaluates multi-corpus, not cross-corpus, performance. The motivation for using multi-corpus instead of cross-corpus evaluation should be clearly stated.

The motivation to evaluate LSTM vs. CNN is not clearly explained, beyond being a continuation of a previous study. The reasons for studying this topic should be made clear. A hypothesis should be provided: which method will be better, LSTM or CNN, and what is the argument for that hypothesis?

As previously mentioned, there is no explanation of the importance of evaluating recurrent versus convolutional methods for multi-corpus SER. Although the aim is clear, the motivation and importance of the research should be provided in the introduction.

Comments on methodology

The methodology used in the paper follows previous research (e.g., ref. [17]). The datasets and feature sets seem easy for other researchers to follow. Although full technical details are not given, it would be better if the code to reproduce the experiments were stored in a public repository. If that is not an option, more technical details should be given, e.g., how the 47th feature in eGeMAPS-47, denoting the presence of the speaker, is extracted.

As mentioned in Section 4.1, the authors propose to use RMSE as the loss function. The stated reason for choosing RMSE is to neutralize the effect of slow variance in the reference. This choice makes sense, but its effectiveness needs to be confirmed against other loss functions.
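To illustrate why such a comparison matters, here is a minimal sketch (not the authors' implementation) contrasting RMSE with a concordance-correlation-based loss, which is commonly used in continuous emotion recognition: a prediction with a constant bias relative to the reference is penalized by RMSE exactly in proportion to the bias, while the CCC loss also accounts for how well the prediction tracks the reference trajectory.

```python
import numpy as np

def rmse_loss(y_true, y_pred):
    """Root mean square error: penalizes pointwise deviation only."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def ccc_loss(y_true, y_pred):
    """1 - Concordance Correlation Coefficient: penalizes both
    location/scale mismatch and poor correlation with the reference."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    ccc = 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
    return 1.0 - ccc

# Toy example: a prediction that tracks the reference perfectly,
# except for a constant offset of 0.3.
t = np.linspace(0, 1, 100)
ref = np.sin(2 * np.pi * t)
pred = ref + 0.3

print(rmse_loss(ref, pred))  # exactly 0.3: RMSE sees only the bias
print(ccc_loss(ref, pred))   # small but nonzero: CCC discounts the
                             # bias because the shape is preserved
```

This kind of side-by-side evaluation on the same validation data would make the case for RMSE (or against it) concrete.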

Comments on data and results

The data (tables and figures) presented in the paper are sufficient to report the experimental results. Table 4 is not necessary since it contains only one row; it could be replaced by a paragraph. If the table is kept, additional feature evaluations are needed to allow a meaningful comparison.

Comments on discussion and conclusions

The discussion and conclusions are sufficient except for the CNN part. The conclusion stating that "CNNs perform better on arousal, valence, and liking but seem to be very sensitive to filter initialization" is not backed by evidence. The CNN experiment was conducted only once; it achieves the best results on valence, arousal, and liking prediction, but no evidence supports the claim that CNNs are very sensitive to filter initialization. In the current version, the authors should focus on the recurrent model only; if the CNN is included, such evaluations (e.g., investigating the effect of filter initialization) should be performed.

Source

    © 2021 the Reviewer.

References

    Macary, M., Lebourdais, M., Tahon, M., Estève, Y., Rousseau, A. 2020. Multi-corpus Experiment on Continuous Speech Emotion Recognition: Convolution or Recurrence? Lecture Notes in Computer Science: 304.