Content of review 1, reviewed on September 09, 2014
This study aims to test the rather unusual hypothesis that the brains of two individuals separated geographically by almost 200 km can form a telepathic link that is measurable with EEG. While this is arguably an implausible hypothesis, it is certainly testable. Unfortunately, I believe this is not an adequate experiment to test such a hypothesis. There are major problems both with the experimental design and the statistical analysis. While the authors may be able to address some of my comments with additional analyses and control experiments, the sheer number of issues demands a complete overhaul of the entire study. I broke up my concerns into three sections:
1. Non-naive participants and predictability of the protocol:
- There were only 7 participants in this experiment. This means that, in order to collect the 20 data sets (ignoring the 2 excluded ones), every person participated in multiple recording sessions, both as “sender” and as “receiver” (except for one person, subject A, who acted only once as sender but thrice as receiver). Therefore, even if participants were naive about the experimental protocol before their first recording session, in subsequent sessions they would be very familiar with both the arousing auditory stimulus (the sound of a crying baby) and the sequence of events (30s stimulus/signal periods interspersed with 60s periods of silence). In fact, the “senders” were even explicitly told about this sequence at the start of each session. Moreover, the roles of “sender” and “receiver” were typically reversed for a pair of participants right after their first recording, so the knowledge of the design would have been fresh in the second “receiver’s” mind.
- On this note, based on the subject initials in the data spreadsheets I wonder if the first author participated in this experiment? This is not explicitly stated, so perhaps it is simply a coincidence that the initials match. However, all of the other subject initials also match those of other authors, so at the very least it should be acknowledged who (if any) among the participants were authors. Of course, it can be entirely acceptable to take part in your own study, but this should always be reported in the methods, and its acceptability very much depends on the experimental design. Certainly, for a study of this kind, where the predictability of experimental events is critical, I would be very concerned about how in-depth knowledge of the experimental protocol affects the results. It would certainly cast doubt on the claim that participants could not have known the randomisation of the protocols.
- Perhaps the biggest improvement over the pilot experiment, in which the sequence and duration of the stimulus protocol was always fixed and thus completely predictable, is the fact that the overall duration of the protocol was randomised (from three options, i.e. protocols with 1, 2, or 3 stimulus periods) and that the duration of the initial silence period was apparently randomised between 1-3 minutes (but see point 1.4). Thus the “receiver” should have been less able to predict the exact onset of the first stimulation period. However, after this initial onset the protocol was always fixed (i.e. 30s stimulus periods separated by 60s silence), so provided they could make a reasonable guess about the onset of the first stimulus period, the rest of the session would still have been rather predictable.
- Inspection of the traces of the stimulus protocols suggests that the sequence of stimulation events was always perfectly aligned across the three different protocols. This is because the randomisation of the initial silence period was not somewhere between 1-3 minutes as implied in the methods (actually this states “seconds”, but the corresponding author already acknowledged that this is a typo). Instead, as far as I can tell, the initial silence period appears to have been either 1 min or 2 min. This of course means that the onset of the first stimulus period was either at 1 min (12/20 sessions) or 2 min (8/20 sessions). As discussed in point 1.3, after this onset the sequence would have been fairly predictable to the participant. The duration of the initial silence was therefore the most unpredictable part of the experiment for the “receiver”, provided they had no other cues (see point 1.5) or prior knowledge of the randomisation (see points 1.2 and 1.6).
- Is it conceivable that there were any cues telling the “receiver” whether a session had 1 min or 2 min of initial silence? It is unclear from the description of the methods whether the picture of the “sender” appeared in the goggles from the beginning of the recording session, including the initial silence period (but I assume this was the case). What could the participants hear or feel from the experimental room, e.g. noises made by the attending research assistants?
- Assuming that the timestamps on the spreadsheets indicate the timing of recording, it can be seen that for two thirds of the pairs the duration of the initial silence period in the second recording was the opposite of that in the first recording. Since these pairs were just reversing the roles of “sender” and “receiver”, the “receiver” could then be predisposed to expect a shorter/longer initial silence period in the second recording compared to the first. So if “receivers” simply assumed that the onset of the stimulus was the opposite of that in the first session, they would have been correct more than half of the time: with two thirds of the pairs reversed, this guessing strategy succeeds with probability 2/3 ≈ 67% (see also my discussion of incorrect statistical assumptions in point 3). Moreover, blocks of sessions typically had one participant in common (sessions 1-4 subject F, sessions 5-10 subject D, sessions 11-13 subject A, sessions 14-30 subject PT). It thus seems likely that “receivers” were implicitly aware of the randomisation of the onsets.
- Regarding the predictability of the sequence, is it possible that there were any time cues helping the “receiver” keep the timing and thus predict the sequence of stimulus events? While the participants were wearing headphones and goggles, could they have heard the ticking of a clock, a dripping tap, or other regular noises (perhaps from the experimental equipment)? Could there have been signals in other sensory modalities (floor vibrations, air flow in the room)? Such cues need not even be external, as the participants could have kept time themselves, e.g. by using their respiration. In particular, considering that all participants had experience with meditation, yoga, or similar practices, this does not seem unrealistic.
- As discussed, the only aspect of the experiment that was comparatively unpredictable (except for the potential caveats discussed in the previous points) was the duration of the initial silence period. Subsequently, the sequencing of stimulus and silence was fixed, and it was actually fairly unimportant whether the overall protocol duration was short (1 stimulus), medium (2 stimuli), or long (3 stimuli), because participants would only have had to maintain the fixed rhythm of 30s stimulus followed by 60s silence. The fact that decoding becomes progressively worse for segments later in the protocol (as shown by Tables 1a-c) may thus be a result of the “receiver’s” inability to maintain the rhythm as the session progressed. This is in part supported by some of the traces in which the classifier detected stimulus periods that considerably exceeded 30s in the latter half of the session (in particular, session “tLrPT”). The deterioration of decoding accuracy could also be due to uncertainty about whether there would be more stimulus periods or not, because the participant could not be sure whether they were in a short, medium, or long session.
In summary, there were multiple problems with the participants’ familiarity with the experimental paradigm and the predictability of the rhythm of stimulus and silence periods. To address this, the experiment should have been made much more unpredictable, with properly randomised onsets and jittered durations for all the silence events (see the sketch below).
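To make this concrete, here is a minimal sketch of how such a protocol could be generated; the duration ranges and the randomised 1-3 stimulus count are my own illustration, not parameters taken from the paper:

```python
import numpy as np

rng = np.random.default_rng()

def jittered_protocol(n_stimuli, silence_range=(30.0, 120.0), stim_dur=30.0):
    """Return stimulus onset times (in seconds) where every silence
    period, including the initial one, is drawn uniformly at random,
    so no event is predictable from the preceding ones."""
    t = rng.uniform(*silence_range)  # randomised initial silence
    onsets = []
    for _ in range(n_stimuli):
        onsets.append(t)
        t += stim_dur + rng.uniform(*silence_range)  # jittered gap
    return onsets

# Example: the number of stimulus periods (1-3) is itself randomised
print(jittered_protocol(rng.integers(1, 4)))
```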
2. The nature of the decoded signal:
- The participants’ familiarity with the crying baby stimulus also raises further questions. First, it somewhat undermines the whole idea of transmitting information between two brains: except for the first session, all participants already knew that the stimulus was a crying baby, which makes transmitting that information redundant. More importantly, it also means that participants could have been imagining (or at least thinking about) the crying baby sound at the regular intervals prescribed by the experiment. In that case, the classifier algorithm would have simply decoded the thoughts/imagery or the mental effort of the “receiver” to receive the crying baby sound. Combined with the issues with predictability of the sequence discussed in point 1, this would make the results hardly surprising. One way to address this would be to use two very distinct signal events and to train the classifier to distinguish those in addition to the silence period (e.g. a crying baby vs a calming surf). I can, however, understand that the authors focused only on binary events (stimulus or silence), but in that case they would at the very least have to address the concerns with the predictability of the stimulus sequence discussed in point 1.
- A related problem with the decoding analysis is that there is no way of knowing whether the decoded signal has anything to do with the “sender’s” experience of a crying baby. Was there any debriefing of participants? Did any of the “receivers” hear a crying baby during the recording session? Or perhaps that is expecting too much. Did they at the very least report the feeling of receiving any information from the sender? One way to control for this would have been to have sessions both with and without a “sender” (obviously randomised so that the “receiver” could not know) and to see if the classifier still identifies stimulus and silence periods at these regular intervals.
- The previous suggestion would also help to address another concern about the nature of the decoded signal. As the authors themselves (briefly) acknowledge in the discussion, the alpha and gamma frequency bands are markers of attentional engagement, arousal, or mind wandering. Thus the decoding might instead simply exploit the temporal evolution of the EEG signal, driven by these factors, over the course of the session. While this does not entirely explain why the decoding is so high for the first stimulus period (but see discussion of this problem in point 1), it certainly would be an alternative explanation for why decoding becomes progressively worse over the course of the session (see also point 3.5).
3. Incorrect statistical assumptions and questions about analysis:
- The statistical analysis used for testing whether decoding performance was above chance level is incorrect, because the authors did not take into account that this is an unbalanced design. Contrary to the authors’ description in the methods, the expected chance level is therefore not 50%. Because stimulus periods made up less of the overall duration than silence periods, even if the classifier consistently (and incorrectly) assigned the silence label, decoding accuracy would exceed 70% (i.e. the proportion of silence within a session). The propensity of the classifier to choose one class label over another is also not necessarily 50%, especially in unbalanced designs. The use of a standard binomial test against 50% chance performance is therefore not correct. Instead, the authors should have used a permutation test that estimates the true chance performance under these conditions (a sketch follows below).
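A minimal sketch of such a permutation test, using made-up labels that mimic the paper's roughly 70/30 silence/stimulus imbalance (the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p(truth, predicted, n_perm=10000):
    """Estimate chance-level accuracy by shuffling the predictions,
    which preserves both class imbalances, and return the observed
    accuracy plus a permutation p-value."""
    observed = np.mean(truth == predicted)
    null = np.array([np.mean(truth == rng.permutation(predicted))
                     for _ in range(n_perm)])
    p = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p

# A classifier that always answers "silence" scores ~70% "accuracy"
# on a 70/30 split, yet is not above chance: the p-value here is ~1.
truth = np.array([0] * 70 + [1] * 30)   # 0 = silence, 1 = stimulus
print(permutation_p(truth, np.zeros(100, dtype=int)))
```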
- The “coincidences” measure used by the authors is also questionable. It is not immediately clear whether the definition of overlapping segments could have inflated the decoding results. It seems odd that this measure was used at all, considering that it should be straightforward to compare the traces directly. It also seems strange that there was a subjective disagreement between the two raters, given that the definition of coincidences sounds quite simple.
- The methods do not provide nearly sufficient detail to understand how the decoding analysis was performed. The authors state that they used PCA to reduce the dimensionality of the EEG channels but they do not state what data were actually used for classification. Was it the band-pass filtered EEG signal trace within short time windows? Was it the frequency-power spectrum within each time window? How long were the time windows? Or was a sliding time window used?
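To illustrate the level of detail that is missing, the sketch below shows one plausible pipeline of this general kind; everything in it (the windowing, the band-power features, the number of PCA components, the split) is hypothetical and is exactly the sort of information the methods should state:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical feature matrix: one row per time window; columns could
# be band power per EEG channel (random placeholders, 14 ch x 5 bands).
X = np.random.randn(200, 14 * 5)
y = np.random.randint(0, 2, 200)  # placeholder stimulus/silence labels

clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
clf.fit(X[:100], y[:100])
print(clf.score(X[100:], y[100:]))  # ~0.5 on random data, as it should be
```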
- In this context it is also quite odd that the classifier performs so consistently. It seems implausible that the classifier would hardly ever misclassify the initial silence period, or that there are never any gaps within stimulus periods. Physiological data are typically very noisy, so it is surprising to see such reliable classification even for the “sender”, let alone the “receiver”. Again, this is difficult to understand without a clear idea of how exactly the classification was performed.
- It is also unclear what the classifier was trained on. The authors state that a randomly selected “fifty percent of these data” were used for training. Did the authors use 50% of the data for training separately for each participant in the pair and then test the classifier on the remaining 50%? This would be incorrect because there are likely to be temporal correlations between adjacent data points (again, this would be a lot clearer if we knew what data were actually used). Or were these 50% taken from the “sender” and then used to classify 100% of the data from the “receiver”? It seems more defensible to assume that the “sender” and “receiver” are statistically independent (unless of course the hypothesis of a telepathic link is true). However, the temporal proximity of data points might still be a concern even in this case (see also point 2.3). Especially considering that the authors used one of the most powerful non-linear SVM kernels (the radial basis function), it is very unclear what attributes in the data the classifier exploited. One of the main problems with such multivariate decoding analyses is in fact that they are entirely opportunistic: the algorithm will find the most diagnostic information about the class labels in the data to produce an accurate classification, without any regard to whether this diagnostic information is actually meaningful to the hypothesis. So without any better understanding of what was done, it is quite plausible that the classification simply decoded how much time had passed since the start of the experiment. One way to reveal this would be to rerun the classification with different class labels that are orthogonal to the stimulation sequence (see the sketch below). If the classifier exploited some attribute of the temporal evolution of the signal, it should still perform well under those circumstances.
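A minimal sketch of this control analysis, using placeholder data; on real recordings, a high score on these time-based labels would indicate that the classifier can piggy-back on slow temporal drift:

```python
import numpy as np
from sklearn.svm import SVC

# Relabel the session with classes orthogonal to the stimulation
# sequence: here simply first half vs second half of the recording.
X = np.random.randn(300, 20)        # hypothetical (n_windows, n_features)
time_labels = np.repeat([0, 1], 150)

train = np.arange(300) % 2 == 0     # interleaved split, for illustration
clf = SVC(kernel="rbf").fit(X[train], time_labels[train])
# ~0.5 on this random data; well above 0.5 on real data would mean the
# classifier decodes elapsed time rather than any transmitted signal.
print(clf.score(X[~train], time_labels[~train]))
```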
- The lack of methodological detail also makes it impossible to understand the description of the correlation analyses. It is stated that recordings were broken up into 4s time bins. How does this relate to the correlations that were calculated? In Figure 2, alpha power in different channels is plotted. This does not, however, indicate which of the periods (i.e. the first, second, or third stimulus, or the average across them?) these power values came from (at least according to my count, pair 15 should have had three stimulus periods). It also does not explain whether this is just the data from one of the 4s bins, from an entire segment, or an average across all segments.
- How were the correlations listed in Table 2 averaged across pairs? Was it taken into account that the same participants contributed to several of these correlations? Moreover, was any correction for multiple comparisons applied to account for the number of frequency bands? (A sketch of both steps follows below.)
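For reference, a sketch of the two steps I would expect: Fisher z-averaging of correlations and a simple Bonferroni correction across frequency bands (the input values are invented). Note that neither step addresses the non-independence arising from shared participants, which would require e.g. a mixed-effects model:

```python
import numpy as np

def average_r(rs):
    """Average correlation coefficients via Fisher's z-transform,
    which is preferable to averaging r values directly."""
    return np.tanh(np.mean(np.arctanh(rs)))

def bonferroni(p_values):
    """Simplest correction for testing several frequency bands."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

print(average_r([0.42, 0.31, 0.55]))         # invented per-pair correlations
print(bonferroni([0.01, 0.04, 0.20, 0.03]))  # e.g. 4 frequency bands
```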
- Several of the decoding traces for the “receiver” contain zeros. Two examples are in fact shown in the top and middle rows of Figure 1. What does that mean? Was the recording simply stopped at that point? If so, why, given that the stimulus protocol with the “sender” was still running at that time?
- The authors state that the recording for the “receiver” was triggered manually by the research assistant after receiving the signal via the internet from the lab with the “sender”. Would this not introduce an uncontrolled lag in the recordings? Surely it should be technically feasible to automate this and trigger the recording simultaneously (or at least with a fixed lag due to the internet transmission), as sketched below.
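A minimal sketch of such an automated trigger over a TCP connection; the hostname, port, and message format are hypothetical, and measuring the lag precisely would additionally require synchronised clocks between the two labs:

```python
import socket
import time

def send_trigger(host="receiver-lab.example.org", port=5005):
    """Sender lab: open a connection and transmit a start marker."""
    with socket.create_connection((host, port)) as s:
        s.sendall(f"START {time.time()}".encode())

def wait_for_trigger(port=5005):
    """Receiver lab: block until the marker arrives, then start the
    recording immediately; returns the apparent transmission lag."""
    with socket.create_server(("", port)) as srv:
        conn, _ = srv.accept()
        with conn:
            msg = conn.recv(64).decode()
    t_sent = float(msg.split()[1])
    return time.time() - t_sent  # fixed, measurable network lag
```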
© 2014 the Reviewer (CC BY 3.0).
References
Tressoldi, P. E., Pederzoli, L., Bilucaglia, M., Caini, P., Fedele, P., Ferrini, A., Melloni, S., Richeldi, D., Richeldi, F., Accardo, A. 2014. Brain-to-Brain (mind-to-mind) interaction at distance: a confirmatory study. F1000Research, 3: 182.