4.0 | Quality
2.0 | Significance
Content of review 1, reviewed on January 31, 2014

Quality Comment

Media reporting on this study has ranged from (mostly) non-critical repetition of the conclusions to (frequently) silly scatological humor to (thankfully) even a little skeptical investigation. The basic conclusion of the authors is that dogs line themselves up with the magnetic field of the Earth when defecating (and, sometimes, when urinating). As usual, the devil is in the details. While the findings of this study are hardly earth-shattering news, whether true or not, the questionable practices used to generate the conclusions most people are reading in the news illustrate how weak hypotheses can be made to seem stronger than they are.

The Study

The hypothesis in this case was that because there is evidence some animals can sense and orient to the Earth’s magnetic field (convincing evidence in some cases, less convincing in others), it might be the case that dogs can do so. This would be cool in itself but also convenient for researchers interested in “magnetoreception” since dogs are a plentiful and convenient research subject.

The researchers recruited 37 private dog owners with a total of 70 dogs in Germany and the Czech Republic and had the owners determine the alignment of their dogs’ spine from tail to head during defecation and urination using a hand-held compass. The dogs were off leash and in locations where they were not obviously constrained in their movements by human structures or activities. Routes the dogs were walked on were supposed to be changed haphazardly by owners.

The Results

The authors hypothesized the dogs would orient preferentially in a North-South direction while urinating and defecating, suggesting they were detecting and responding to the orientation of the Earth’s magnetic field. This didn’t turn out to be true, and the dogs’ orientation was essentially random. This is where the authors decide to go fishing. As they put it,

"After sampling and the first analysis (which yielded negative or at least ambiguous results) had been completed, we decided to sort the data according to the geomagnetic conditions predominating during the respective sampling times….Second order analysis was performed on the data which yielded the higher significance in the first order analysis."

In other words, the authors started thinking of all sorts of different ways to group and organize the data for analysis, and then they threw out those ways that didn’t yield the statistically significant results they were looking for and kept the ones that did. After lots of re-arranging and multiple statistical tests on various arrangements, they finally found a significant result:

"The relative declination change proved to be the best predictor of alignment, i.e.,. sorting the data according to this parameter provided the most significant results. Analysis of pooled recordings as well as of mean vectors of recordings in dogs sampled during calm magnetic field conditions (relative change in declination = 0%; minimum of 5 observations per dog) revealed a highly significant axial preference for North–South alignment during defecation."

The technical term for this is data-dredging, and it’s kind of cheating. Testing numerous hypotheses only thought up after the initial hypothesis has turned out to be wrong and the data have already been explored virtually guarantees one will find something that can be spun into a validation of the original idea and justification for the current study and others in the future. This is the stock-in-trade of alternative medicine research, though it is by no means limited to that field, and it is one of the most important reasons bad ideas are able to survive in science.

The authors, to be fair, are quite up front about how they approached their analyses:

"…No one, not even the coordinators of the study, hypothesized that expression of alignment could have been affected by the geomagnetic situation, and particularly by such subtle changes of the magnetic declination. The idea leading to the discovery of the correlation emerged after sampling was closed and the first statistical analyses (with rather negative results, cf. Figure 1) had been performed.

The relative declination change proved to be the best predictor of alignment, i.e., sorting the data according to this parameter provided the most significant results."

Unfortunately, they don’t seem to realize that this makes most of their statistical analyses invalid, and rather seriously undermines their conclusions. The dividing up of the observations into tiny categories reduces the power of the statistical tests used, and failing to compensate for performing multiple unplanned comparisons renders the results unreliable.
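The multiple-comparisons problem can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reconstruction of the authors' analysis: the headings are purely random, the "groupings" are arbitrary random labels standing in for post-hoc sorting variables, the Rayleigh test with its first-order p-value approximation is my choice of test, and all function names and parameter values are hypothetical.

```python
import math
import random

def rayleigh_p(angles_deg, axial=True):
    """Approximate Rayleigh-test p-value for uniformity of directions.

    Axial data (a North-South line has no preferred end) are handled by
    doubling the angles. Uses the first-order approximation p ~ exp(-n * R^2).
    """
    k = 2 if axial else 1
    n = len(angles_deg)
    c = sum(math.cos(math.radians(k * a)) for a in angles_deg)
    s = sum(math.sin(math.radians(k * a)) for a in angles_deg)
    r_bar = math.hypot(c, s) / n  # mean resultant length
    return math.exp(-n * r_bar ** 2)

def best_p_after_dredging(n_obs, n_groupings, n_bins, rng):
    """Sort random headings n_groupings different ways into n_bins bins
    and return the smallest p-value found in any bin."""
    headings = [rng.uniform(0, 360) for _ in range(n_obs)]
    best = 1.0
    for _ in range(n_groupings):
        # Each grouping is an arbitrary labelling unrelated to the headings.
        labels = [rng.randrange(n_bins) for _ in range(n_obs)]
        for b in range(n_bins):
            subset = [h for h, lab in zip(headings, labels) if lab == b]
            if len(subset) >= 5:
                best = min(best, rayleigh_p(subset))
    return best

def false_positive_rate(n_trials=200, n_obs=300, n_groupings=5,
                        n_bins=3, alpha=0.05, seed=1):
    """Fraction of simulated null studies reporting at least one p < alpha."""
    rng = random.Random(seed)
    hits = sum(
        best_p_after_dredging(n_obs, n_groupings, n_bins, rng) < alpha
        for _ in range(n_trials)
    )
    return hits / n_trials

if __name__ == "__main__":
    print("one planned test:", false_positive_rate(n_groupings=1, n_bins=1))
    print("5 groupings x 3 bins:", false_positive_rate())
```

With one planned test the false-positive rate should sit near the nominal 5%. With 15 unplanned tests it should be many times higher; for 15 fully independent tests the expected value is 1 - 0.95^15, or about 54%, even though there is no real effect in the simulated data.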

The specific arrangement of data that was found to yield a non-random pattern of alignment during defecation was one in which only measurements taken during periods of stability in the magnetic field were considered. The Earth’s magnetic field fluctuates slightly much of the time, and the authors were only able to find a statistically significant pattern in the dogs’ position during periods when it was not fluctuating. As challenging as it is to explain why dogs would have evolved to orient themselves according to the magnetic field of the Earth while defecating, it is even harder to explain why they would only be able to do this during relatively brief and unpredictable periods when the field was not fluctuating.

There are obviously a large number of factors that one can guess might influence how dogs position themselves during elimination besides the magnetic field of the Earth. The authors acknowledge a few of these, but they don’t do a very convincing job of accounting for many of them in their design, analysis, or discussion.

Despite the haphazard and unrecorded attempts of dog owners to vary their routes, there is no way to determine whether the habitual walking patterns of either dogs or owners influenced the direction the dogs faced during elimination. The authors address the possibility of the position of the sun influencing the dogs, but only with speculation.

"…generally, there are on average 1,450 sunshine hours per year at maximum in the Czech Republic and in Germany, on localities where measurements were done. Even if we would assume that these sunshine hours were evenly distributed over the daylight period and the year (as our observations were), there would only be a probability of 33% that the observation was made when the sun was visible. Hence, with high probability (67%) most walks during the daylight period were made when it was cloudy."

The sun position wasn’t actually recorded, just assumed to be irrelevant since it is cloudy there most of the time. The dogs’ orientation didn’t appear to vary significantly with time of day, but again, given all the subdividing of data for analysis, this doesn’t reliably exclude the sun as a possible influence.
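The 33% figure in the quoted passage can be reproduced with simple arithmetic. The sketch below assumes an average of 12 daylight hours per day over the year; that value is my assumption (the quote does not state the denominator), and the function name is hypothetical.

```python
def sunshine_probability(sunshine_hours, daylight_hours_per_day=12.0, days=365):
    """Chance that a randomly timed daylight observation falls in a sunny hour,
    assuming sunshine is spread evenly over all daylight hours of the year."""
    return sunshine_hours / (daylight_hours_per_day * days)

p_sun = sunshine_probability(1450)  # 1,450 sunshine hours per year, from the quote
print(f"sun visible: {p_sun:.0%}, cloudy: {1 - p_sun:.0%}")
```

Even taken at face value, this means roughly one observation in three was made with the sun visible, which is not a negligible share.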

The authors also don’t do much to account for the possible influence of other sights, sounds, or most importantly smells on the position of the dogs. All of these variables seem at least as plausible as magnetic field orientation as influences on how dogs position themselves when defecating.

And finally, the question of what possible reason there might be for dogs to align themselves with the magnetic field of the Earth while eliminating isn’t convincingly addressed. The general suggestion is made that the ability to detect the Earth’s magnetic field might have some evolutionary benefits in terms of navigating through one’s territory, and that aligning with this field when pooping might be a way of “calibrating” the system against visual or other landmarks. Of course, if the results of this study are accurate, they could only do this, or use the field for positioning at all, during the times when it is not fluctuating.


Despite the significant and rather obvious problems with the design and analysis of this study, it does touch on an interesting subject. Some animals do use magnetic field detection for orientation and navigation, and it would be interesting if dogs proved to have such an ability. At best, this study might be considered useful in generating some hypotheses for further testing, though the largely negative results don’t justify much optimism about the outcome of additional studies. Predictably, however, the authors spin their results in the most positive possible terms to claim a groundbreaking achievement.

"In this study, we provide the first clear and simply measurable evidence for influence of geomagnetic field variations on mammal behavior. Furthermore, it is the first demonstration of the effect of the shift of declination, which has to our knowledge never been investigated before."

The authors engage in some spectacular mental contortions when using the positive results wrung out of their data to suggest that they have not only discovered a revolutionary new phenomenon but explained the failure of past research to support the magnetic field detection abilities of animals.

"the findings that already small fluctuations in Earth’s magnetic field elicit a behavioral response and the fact that “normal” magnetic conditions under which dogs express their orientation behavior occur only in about 30% of all cases call for caution. When extrapolated upon other animals and other experiments and observations on animal magnetoreception, this might explain the non-replicability of many findings and high scatter in others."

Rather than acknowledge that the most likely explanation for the failure of their a priori hypothesis and most of the analyses they conducted, as well as the negative results of other studies, was that there is no underlying relationship to find, the authors choose to conclude everyone has simply been looking for the wrong thing, and that their creative data mining has finally stumbled across the right variable. Time (or replication, really) will tell, of course, but it strikes me as a bit of a stretch.

Since it is theoretically impossible to definitively prove a negative, the positive findings of poorly controlled research and data dredging unfortunately make it possible to argue for more research on almost any topic, regardless of how implausible the underlying theory or how consistently negative the results (yes, Homeopathy, I’m looking at you).

This study, while not dealing with as serious a subject as a medical treatment, exemplifies some of the ways in which research can be structured and analyzed to eliminate any chance of actually falsifying a hypothesis and to justify continuing research even in the face of repeated and consistent negative findings. Though I would be pleasantly surprised if the findings turn out to be correct, it wouldn’t alter the fact that the approach represents some common and significant problems in the process of finding the truth about nature through science.


    © 2014 the Reviewer (CC BY-SA 3.0).

Comments

Hynek Burda

9:57 a.m., 14 May 14 (UTC)

The summary of the reactions of the media to our paper is very fitting and we agree. The critique of our study is, however, biased and indicates that the author did not read the paper carefully, misinterpreted it in some cases, and, in any case, is so "blinded" by statistics that he forgets biology. Statistics is just a helpful means to prove or disprove observed phenomena. The problem is that statistics can "prove" phenomena and relations which actually do not exist, but it can also "disprove" phenomena which objectively exist. So, not only approaches which ignore proper statistics might be wrong but also uncritical sticking to statistical purity and ignoring real life.

The author of this critique accuses us of "data mining". Well, first we should realize that there is nothing wrong with data mining. This is an approach normally used in current biology and a source of many interesting and important findings. We would like to point out that we have not "played" with statistics in order to eventually find some "positive" results. And we have definitely not sorted data out. We just tested several hypotheses, and whenever we rejected one, we returned all the cards (i.e. data) into the game and tested, independently, anew, another hypothesis. Note also that we performed this search for the best explanation in a single data sample of one dog only, the borzoi Diadem, for which we had the most data. When we had found a clue, we tested this final hypothesis in other dogs, now without Diadem.

Let us illustrate our above arguments about statistics and "real life" with two examples. Most medical diagnoses are made through exclusion or verification of different hypotheses in subsequent steps. Does it mean that when the physician eventually finds that a patient suffers from a certain illness, the diagnosis must be considered improbable because the physician has already tested (and rejected) several other hypotheses?

Or imagine that we want to test the hypothesis that a healthy human can run one kilometer at an average speed of 3 m/s. We find volunteers all over the country who should organize races and measure the speed. We get a huge sample of data; we have an impression that our hypothesis is correct, but the large scatter makes the result insignificant. So we try to find out what could be the factors influencing speed. We test the age and find out that indeed older people are slower than younger ones, so we divide the sample into age categories, but the scatter is still too high. So we test the effect of sex; we find a slight influence, but it still cannot explain the scatter. We test the position of the sun and time of the day, but find no effect. We test the effect of wind, but the wind was weak or it was windless during races, so we find no effect. We are desperate and we visit the places where the races took place, and we find the clue: some races were run downhill (and people ran much faster), some uphill (and people ran much slower), and those who ran on flat land ran on average at the speed we expected. So we can now conclude that our hypothesis was correct, and moreover we found an effect of the slope on running speed. We publish a paper describing these findings, and then you publish a critique arguing that our approach was just data mining and was wrong, and hence our observation is worthless and the slope has no effect on running speed at all. Absurd!

Hynek Burda and coauthors

Brennen McKenzie

7:07 p.m., 16 May 14 (UTC)

I am sorry that the authors appear to be annoyed by my critique or feel that I have misunderstood or misrepresented their work. I appreciate this response, and it answers some of the concerns expressed in the original review.

I believe the author and I agree that statistics are easily and commonly misused in science. Unfortunately, this response seems to perpetuate some of the misconceptions about the role of statistics in testing hypotheses I discussed in my original critique.

Statistics never prove or disprove anything. Schema such as Hill’s Criteria of Causation and other mechanisms for evaluating the evidence for relationships observed in research studies illustrate the fact that establishing the reality of hypothesized phenomena in nature is a complex business that must rest on a comprehensive evaluation of many different kinds of evidence. It is unfortunate that p-values have become the sine qua non of validating explanations of natural phenomena, at least in medicine (which is the domain I am most familiar with). The work of John Ioannidis and the growing interest in Bayesian statistical methods are examples of the move in medical research to address the problem of improper use of, and reliance on, frequentist statistical methods.

That said, these methods do have an important role in data analysis, and they contribute significantly to our ability to control for chance and other sources of error in research. The proper role of statistical hypothesis testing is to help assess the likelihood that our findings might be due to chance or confounding variables, which humans are notoriously terrible at recognizing. If we employ these tools improperly, then they cease to fulfill this function and instead they generate a false impression of truth or reliability for results that may easily be artifacts of chance or bias.

The authors accuse me of being “so ‘blinded’ by statistics that he forgets biology.” This is ironic since their paper uses statistics to “prove” something which a broader consideration of biology, evolution, and other information would suggest is improbable. Even if the statistical methods were perfectly and properly applied, they would not be “proof” of anything any more than improper use of statistics would be definitive “disproof” of the authors’ hypothesis. While I discussed some concerns about how statistics were used in the paper, my objections were broader than that, which the authors do not appear to acknowledge.

In terms of data mining, though I am not a statistician, I believe there is a consensus that while exploratory analysis of data is, of course, appropriate and necessary, the post-hoc application of statistical significance tests to data after patterns in the data have already been observed is incorrect and misleading. This is what the paper appeared to suggest was done, and this would fit the definition of inappropriate data-dredging.

The details of the data exploration process were not described in the original paper. If the exploratory analysis was done with one data set while the authors remained blind to the data set actually analyzed in the paper, then that would be an appropriate method of data analysis. The subsequent statistically significant results would not, of course, necessarily prove the hypothesis to be true, but they would at least reliably indicate the likelihood that they were due solely to chance effects.

This does not, however, entirely answer the concern that the study began without a defined hypothesis and examined a broad range of behaviors and magnetic variables in order to identify a pattern or relationship. As exploratory, descriptive work this is, of course, completely appropriate. But the authors then use statistical hypothesis testing to support very strong claims to have “proven” a hypothesis not even identified until after the data collection was completed. This seems a questionable way to employ frequentist statistical methods.

Brennen McKenzie

7:07 p.m., 16 May 14 (UTC)

The analogy of a doctor seeking a diagnosis is inapplicable. The process of inductive reasoning a clinician engages in to seek a diagnosis in an individual patient is not truly analogous to the process of collecting data and then evaluating it statistically to assess the likelihood that patterns seen in the data are due to chance. Making multiple statistical comparisons, particularly after one has already sought for patterns in the data, invalidates the application of statistical hypothesis testing. The fact that in other contexts, and without the use of such statistical methods, people consider possible explanations and then accept or reject them based on their observations is irrelevant.

The second hypothetical example simply describes a process for considering and evaluating multiple variables in order to explain an observed outcome, which is not the objection raised to the original paper. If the only hypothesis in a study such as described here was that at least one human being could run this fast, then a single data point would be sufficient proof and statistics would be unnecessary. However, if one is trying to explain differences in the average speed of different groups of people based on the sorts of variables mentioned, the reliability of the conclusions and the appropriateness of the statistical methods used would depend on how the data was collected and analyzed. In any case, nothing about this has any direct relevance to whether or not the data collection and analysis in the original paper was appropriate or justified the authors’ conclusions.

As I said in the original critique, this study raises an interesting possibility: that dogs may adjust their behavior to features of the magnetic field of the earth. The study was clearly a broadly targeted exploration of behavior and various features of the magnetic environment: “we monitored spontaneous alignment in dogs during diverse activities (resting, feeding and excreting) and eventually focused on excreting (defecation and urination incl. marking) as this activity appeared to be most promising with regard to obtaining large sets of data independent of time and space, and at the same time it seems to be least prone to be affected by the surroundings.” It did not apparently start with a specific, clearly defined hypothesis and prediction, so in this sense it seems an interesting exploratory project.

However, with such a broad focus, with mostly post-hoc hypothesis generation, and with a lack of clear controls for a number of possible alternative explanations, the study cannot be viewed as definitive “proof” of the validity of the explanation the authors provide for their observations, though this is what is claimed in the paper: “…for the first time that (a) magnetic sensitivity was proved in dogs, (b) a measurable, predictable behavioral reaction upon natural MF fluctuations could be unambiguously proven in a mammal, and (c) high sensitivity to small changes in polarity, rather than in intensity, of MF was identified as biologically meaningful.”

I agree with the authors that their results are interesting and should be a stimulus for further research, but I do not agree that the results provide the unambiguous proof they claim. As always, replication and research focused on testing specific predictions based on the hypothesis put forward in this report, with efforts to account for alternative explanations of these observations, will be needed to determine whether the authors’ confidence in their findings is justified.

August Pamplona

9:42 p.m., 28 May 19 (UTC)

I would also like to address the plausibility of the effect going away due to "high sensitivity to small changes in polarity, rather than in intensity". I think this explanation strains credibility.

Their claim, essentially, is that they were not able to find magnetosensitivity in dogs by looking at what direction they faced because the effect can only be seen during "calm" magnetic conditions. This condition of "calm" was determined by measuring the rate of change in magnetic declination (if you watched the direction of a compass needle for movement, how fast would it be moving?). As they put it, a dog's magnetosensitivity may be disturbed because of "high sensitivity to small changes in polarity". This measurement is misleadingly presented as a percent value (percent of what, and over what time period?). Intuitively, as a person's everyday experience is with degrees, a mental shortcut might be to consider that this percentage value should be a value consistent with that sort of measure (again, over what time period?).

They created (apparently, well after the fact) three bins (why not 2, why not 5?) to slice up the data in search of hidden correlations: 0.0% ≤ x < 0.1%, which they are calling "0%"; 0.1% ≤ x < 2.0%, which they call "0.1-2%"; and x ≥ 2.0%, which they call ">2%". They called magnetic conditions calm if relative changes of magnetic field declination fell in the first bin rather than in the other two. (However, in Table 6, for the male Borzoi M07, under separate analysis, their binning is different and the first bin is 0.0% ≤ x < 1.7%. This is justified as being the values producing roughly equal-sized bins, but bin width is not justified elsewhere.)

Let's look at what this means. For a datum to be placed in the first bin (calm magnetic field), relative change of magnetic field declination has to be less than 0.1%. What is this value, anyway, and why is it unitless?

The answer to the second question is that it shouldn't be unitless, it should be some sort of angular measure (degrees, gradians or radians) over some unit of time.

As for the answer to the first question, it is defined in the caption of Figure 4 with the help of an example where they note declination changing from 142' to 132' in the time spanned between 9:00 a.m. and 1:00 p.m. Since that is 10 arc minutes (142'-132') over 240 minutes, they are calling that 0.042 or 4.2%, which classifies this time period as magnetically not calm since it places it firmly within the third data bin rather than the first bin. A correct way of expressing this would be to say the value is 0.042 arc minutes per minute, but apparently they thought that the units could be cancelled out because both the numerator and the denominator have the word "minute" in them. That means that if you watched the needle of a compass move during this period, which they firmly consider to be "not calm", you would see it move 0.0007 arc minutes per second (divide a circle into ~30 million equal parts to get that angular distance). Note that you may get very different values depending on your measurement interval, since this is actually a derived value which is, in fact, a derivative (that is, if, for example, you measure this rate of change over a minute rather than over four hours, you will capture more detail and get a higher range of values).
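The arithmetic in the Figure 4 example can be checked in a few lines. This is just a sketch of the calculation described above, using the 142' and 132' readings and the four-hour window from the caption; the function and variable names are my own.

```python
ARCMIN_PER_CIRCLE = 360 * 60  # 21,600 arc minutes in a full circle

def declination_rate(start_arcmin, end_arcmin, elapsed_minutes):
    """Mean rate of declination change, in arc minutes per minute of time."""
    return abs(end_arcmin - start_arcmin) / elapsed_minutes

# 142' at 9:00 a.m. to 132' at 1:00 p.m. is 10 arc minutes over 240 minutes.
rate_per_min = declination_rate(142, 132, 4 * 60)
rate_per_sec = rate_per_min / 60

# How many one-second steps of this size fit into a full circle?
steps_per_circle = ARCMIN_PER_CIRCLE / rate_per_sec

print(f"{rate_per_min:.3f} arcmin per minute (reported as {rate_per_min:.1%})")
print(f"{rate_per_sec:.4f} arcmin per second")
print(f"about {steps_per_circle / 1e6:.0f} million steps per circle")
```

The value reported as a percentage is really a rate with units of arc minutes per minute; the needle's movement in one second of this "not calm" period is about one thirty-millionth of a full turn.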

Somehow we are expected to believe that dogs are unaffected by small changes in magnetic flux density (this would hardly seem shocking in the example shown, since this makes for a variation on the order of ~10 nanotesla over 4 hours, so ~0.7 picotesla per second, in a base flux density value close to 50,000 nanotesla) but are totally thrown off their game because the field infinitesimally shifts direction?
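For scale, the intensity change can be worked out the same way. The sketch below takes the round figures given here (a variation of about 10 nanotesla over 4 hours against a base field near 50,000 nanotesla) as its inputs; they are order-of-magnitude values, not measurements, and the function name is hypothetical.

```python
def flux_rate_pt_per_s(delta_nanotesla, elapsed_hours):
    """Mean rate of change of flux density in picotesla per second."""
    picotesla = delta_nanotesla * 1000.0  # 1 nT = 1000 pT
    seconds = elapsed_hours * 3600.0
    return picotesla / seconds

rate_pt_s = flux_rate_pt_per_s(10, 4)  # ~10 nT over 4 hours
relative_change = 10 / 50_000          # against a ~50,000 nT base field

print(f"{rate_pt_s:.2f} pT/s, total change {relative_change:.2%} of the base field")
```

With these inputs the rate comes out a little under 1 picotesla per second, and the total change over the four hours is about 0.02% of the base field.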



    Hart, V., Nováková, P., Malkemper, E. P., Begall, S., Hanzal, V., Ježek, M., Kušta, T., Němcová, V., Adámková, J., Benediktová, K., Červený, J., Burda, H. 2013. Dogs are sensitive to small variations of the Earth's magnetic field. Frontiers in Zoology.