September 1, 2015

Reproducing Study Results: Why It’s Hard, and Why It’s Important

An article in the August 27 edition of The New York Times reports that the results of scientific studies may not be as dependable – or at least as reproducible – as we might think.  

Hundreds of millions of people – doctors and patients around the world – use those studies to make important healthcare decisions. This blog regularly features recent study results.

In a new analysis called the Reproducibility Project, University of Virginia psychologist Brian Nosek recruited a team of 250 researchers four years ago. They identified 100 studies published in 2008 in three of psychology’s top journals (Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition).

Next, in close collaboration with those studies’ original authors, Nosek’s team undertook the daunting task of reproducing the results.

Almost 2/3 of All Study Results Didn't Hold Up
Their finding? Of the 100 studies, 35 held up while 62 did not, judged by the standard test of statistical significance – how unlikely the result would be to arise by chance alone. The remaining three studies were excluded because their statistical significance wasn’t clear.

The team asked the original study authors for guidance in replicating study design, methodology, and materials. In most cases, Nosek’s replications involved more subjects than the original studies, giving his own results more statistical power.
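To see why sample size matters, here is a rough, illustrative simulation – not from Nosek’s paper, and with an invented effect size and made-up sample sizes – showing that when a modest true effect exists, larger samples clear the conventional p < .05 bar far more often:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
TRUE_EFFECT = 0.3   # assumed small group difference, in standard-deviation units
TRIALS = 2000       # simulated replication attempts per sample size

def power(n_per_group):
    """Fraction of simulated studies that reach p < .05."""
    hits = 0
    for _ in range(TRIALS):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(TRUE_EFFECT, 1.0, n_per_group)
        _, p = ttest_ind(treated, control)
        if p < 0.05:
            hits += 1
    return hits / TRIALS

for n in (20, 50, 200):
    print(f"n = {n:>3} per group -> detects the effect in about {power(n):.0%} of runs")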

The research team also measured whether the original research groups’ expertise or academic affiliations – their “prestige” – had any effect on the likelihood that results would hold up. They didn’t. Only one factor seemed to matter: the robustness of the original finding.

No Fraud or Falsehood… Only Weakness
The Reproducibility Project team was quick to point out that they found no evidence of fraud or falsehood. But their conclusion was clear: the evidence behind most published findings wasn’t nearly as strong as originally claimed.

The 100 studies covered very broad territory. Among those that didn’t “hold up” were these three:
  • People were more likely to cheat on a test if they had just read that their behavior was predestined.
  • Partnered women favored attached men during low-fertility phases of their menstrual cycles, but preferred single men (if masculine, i.e., advertising good genetic quality) when conception risk was high.
  • Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.

Is There a Problem with Studies in Psychology?
One issue has generated lots of questions about Nosek’s review: the studies in question all fall in the arena of psychology. Some critics suggest that studies in such a “soft” science would of course tend toward ambiguity, and would not – as a result – reproduce well. Analyzing human behavior, they say, is a much fuzzier business than actually seeing what happens in a test tube or under a microscope.

But others are not so quick to dismiss the team’s findings. As the article in the Times reported:
Dr. John Ioannidis, a director of Stanford University’s Meta-Research Innovation Center, who once estimated that about half of published results across medicine were inflated or wrong, noted the proportion in psychology was even larger than he had thought. He said the problem could be even worse in other fields, including cell biology, economics, neuroscience, clinical medicine, and animal research.

So, What’s the Deal with Reproducibility?
Studies are described in many ways. Those deemed most reliable, most scientific, typically share these characteristics:
  • “Large sample” – thousands of test subjects (preferably human – not insect or rodent – if results are supposed to have ramifications for people).
  • “Double blind” – neither the study organizers nor the study subjects know who’s receiving what treatment or who’s in a control group. An uninvolved third party reveals the connections to study organizers only AFTER the data has been collected.
  • “Longitudinal” – studies conducted over a long period of time, perhaps decades.
  • “Peer-reviewed” – evaluated by other, uninvolved scientists.
  • “No conflict of interest” – researchers do not stand to benefit in any way from the results.

We just don’t hear about “reproducibility” very often. It’s too bad, because almost nothing validates a study more convincingly than its having been replicated – again and again – with the same results.

But there are reasons why we don’t hear about reproducibility more often:
  • Scientists want to conduct their own original, sexy, even groundbreaking research. They don’t want to spend their time and energy checking other scientists’ work.
  • Research journals have big incentives to publish new and fascinating material, and not to report on reproductions of earlier work.

Getting at the Truth
It’s a paradigm that critics think must change. Stefano Bertuzzi, the executive director of the American Society for Cell Biology, acknowledges the publication biases in his own area of medical expertise: “I call it cartoon biology, where there’s this pressure to publish cleaner, simpler results that don’t tell the entire story, in all its complexity.”

During an interesting interview on NPR’s Morning Edition with host David Greene, Brian Nosek commented on the difficult process of discovering what is really true:
Our best methodologies to try to figure out truth mostly reveal to us that figuring out truth is really hard. And we're going to get contradictions. One year, we're going to learn that coffee is good for us. The next year, we're going to learn that it's bad for us. The next year, we're going to learn we don't know.

1 comment:

John Margtin said...

Very nice summary, John. Add to this: Professional journals typically do not publish "negative results" (as in "we looked for a relationship between X and Y, and did not find one"). It's a world, therefore, where hundreds of studies are done and only a few are published. Given that most published research uses the "p<.05" cut-off for statistical significance (meaning you'd expect these results fewer than 5 times out of one hundred by pure chance) and all those unknown, unpublished studies, it's not hard to imagine that a *LOT* of published study results are among those 5 of 100 that are pure chance. When I was in grad school in the 1970s, my mentor/supervisor wouldn't sanction writing up a study for publication until we were able to replicate it ourselves. "If you toss the dice enough times, you're going to get double 6's 5 times in a row – this doesn't mean there's anything special about the dice."
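To make the commenter's arithmetic concrete, here is a small illustrative simulation (the numbers – how many studies get run, and what fraction test a real effect – are invented, not taken from the post): when only "significant" results reach print, chance findings from studies of non-existent effects can make up a large share of the published record.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
N_STUDIES = 1000        # hypothetical studies run, most never published
TRUE_EFFECT_RATE = 0.1  # assume only 10% investigate a real effect
N_PER_GROUP = 30

published_real, published_chance = 0, 0
for _ in range(N_STUDIES):
    real = rng.random() < TRUE_EFFECT_RATE
    effect = 0.5 if real else 0.0
    control = rng.normal(0.0, 1.0, N_PER_GROUP)
    treated = rng.normal(effect, 1.0, N_PER_GROUP)
    _, p = ttest_ind(treated, control)
    if p < 0.05:                     # the only results that get "published"
        if real:
            published_real += 1
        else:
            published_chance += 1

total = published_real + published_chance
print(f"{total} 'published' findings, of which {published_chance} "
      f"({published_chance / total:.0%}) are pure chance")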