The reproducibility crisis is, in my opinion, the biggest crisis facing medicine right now. Many of the studies that are integral to medicine have either never been replicated, or failed when replication was attempted. This matters because these studies are used to inform policy and treatment decisions.
Script (remember, I ad-lib and go off script; the only complete version is the actual podcast):
For today’s episode I am going to talk about the biggest crisis in science right now: the replication crisis. In good science, we start to accept results not when they are published, but when we have seen them repeatedly replicated. The issue is that scientists do not want to spend their time replicating someone else’s work; they would much rather be doing their own original work. This means that much published work has never been replicated, and when people do attempt to replicate, they often do not find the same results that were initially published. This crisis started in psychology but has begun to show up in medicine and drug trials too.
Before we get into the scale of the problem, there are some things we need to discuss. First, it can be difficult to replicate certain experiments. When you try to replicate an experiment, you need to follow the original methods as closely as possible; however, the published methods often do not give enough detail to do that carefully, so the best practice is to contact the original researchers and find out exactly how it was done. It can sometimes be difficult to get a hold of them, though.

The other problem we need to grapple with is what it means when we do fail to replicate a result. Does that mean the original study was erroneous? That the follow-up study was erroneous? Or is it possible that both are trending towards the same conclusion, but that is being obscured by the nature of p-values and significance? (If you want to learn more about that, listen to episode 12: Beans, Cancer, Heart Disease, and P-hacking.) Both studies could be showing a trend towards the same result, yet one could be significant and the other not because of the binary nature of p-values. When that happens, should we consider it a failure to replicate, or a flaw in the power of the replication?

There is also the issue of what I call conceptual replication, which differs from true replication. True replication, as we have established, means reproducing the original method as exactly as possible, and that is the gold standard. Conceptual replication is what I call studies that look at the same effect but whose methods differ significantly: more subjects, a longer study, etc. In those cases, if the study supports the same concept, should we consider it replicated? My general tendency is to say yes, but it is important to realize that we have not verified that the original method produces those results.
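To make the binary-significance point concrete, here is a small sketch (hypothetical numbers, not drawn from any real study) of two studies that observe the exact same effect, where only the larger one clears the usual p < 0.05 cutoff:

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided p-value for a one-sample z-test of mean `effect` against 0."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

# Hypothetical: both studies measure the identical effect size.
effect, sd = 0.5, 2.0

p_original    = two_sided_p(effect, sd, n=120)  # well-powered original
p_replication = two_sided_p(effect, sd, n=30)   # smaller replication attempt

print(f"original    (n=120): p = {p_original:.3f}")     # significant
print(f"replication (n=30):  p = {p_replication:.3f}")  # not significant
```

Both studies point at the same effect, yet by the p < 0.05 convention one “replicates” and the other does not; the difference is statistical power, not the underlying result.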
As you can see, there are a whole bunch of features of replication feeding into this crisis. The question is just how bad the problem is.
I first heard about the replication crisis in regards to psychology, so we are going to take a look at that first and try to get a feel for the scale of the problem. This was actually surprisingly difficult for me to suss out, because there is some debate in the published literature about the scale of the effect. I first read about this crisis back in 2015, when a paper published in Science attempted to replicate 100 psychology experiments that had been published in high-ranking psychology journals. They observed that only about one-third to one-half of the studies actually replicated, suggesting that up to two-thirds of the psychology studies in reputable journals may not replicate. That would obviously be a problem. However, a response article published in Science in 2016 claims that the flaws in that replication project are serious enough that its conclusion cannot stand. The response brings up some of the same issues I have already discussed today, for example effect size: just because one study is significant and the other is not, we cannot objectively conclude there was a failure to replicate, since the difference may be down to chance. They also point out that the replication project did not take care to exactly recreate the methods of the original studies, and as such is not a true test of whether those studies replicate. Reading through their criticism of the methods, many examples are quite egregious in my opinion. The most striking for me was this quote: “An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon.” Not only did they fail to match the populations, they are also looking at a different effect.
Furthermore, based on a different study, the authors claim that if we allow these kinds of infidelities, we would expect approximately one-third of studies to fail to replicate by chance alone. This estimate was based on the Many Labs project, which allowed similar infidelities but focused on only 15 experiments, each replicated across roughly 35 labs, giving a much better picture of each one. That project found that around 85% of its studies replicated. These criticisms suggest to me that the original claim was at best exaggerated, and at worst a deliberate attempt to change the methods in a way that biased the results. Based on this analysis, I would say that the replication crisis in psychology is actually much smaller than many people assume.
So now we are going to take a look at medicine, which is now having its own little replication crisis. And by little I mean massive and terrifying, to the degree where it is hard for me to continue to run this website, because so many studies are failing to replicate. I’m actually kind of disappointed in myself, because I knew of the replication crisis in psychology before I knew about the one in medicine, despite the fact that papers about the one in medicine were published earlier. The first study we are going to look at was published all the way back in 2005 and looked at clinical studies that had been cited over 1,000 times. Of the 49 studies analyzed, 16% had been contradicted by later research, 16% had found significantly stronger effects than later research, 44% had been successfully replicated, and 24% had gone unchallenged, meaning replication had not been attempted. So already about a third of these studies had conclusions that needed to be significantly modified, and another huge chunk had yet to be replicated. Furthermore, we can see the importance of some of the good study design features I pointed out in episode 22: Good Studies and Bad Studies, as randomized studies held up significantly better than non-randomized ones. And it’s important to remember that these are the highly important, often-cited studies in reputable journals; the results are likely worse for the average study. The FDA did an audit and estimated that approximately 10-20% of studies have serious issues. There was also a really interesting study, published open access in PLOS Biology so everyone can read it, that looked at the reproducibility of preclinical research; it estimated that 50% of preclinical studies are not reproducible, at a cost of approximately 28 billion dollars per year. That is an absurd amount of money.
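Turning that paper’s percentages back into rough study counts makes the scale easier to grasp (my own back-of-the-envelope arithmetic, using only the figures above):

```python
total = 49  # highly cited clinical studies analyzed

breakdown = {
    "contradicted by later research":         0.16,
    "initially stronger than later research": 0.16,
    "successfully replicated":                0.44,
    "unchallenged (no replication attempt)":  0.24,
}

# Approximate number of studies in each category.
for outcome, share in breakdown.items():
    print(f"{outcome}: ~{round(share * total)} studies")

# Roughly a third of the conclusions needed significant revision:
revised = breakdown["contradicted by later research"] \
        + breakdown["initially stronger than later research"]
print(f"needing significant revision: {revised:.0%}")
```

That 32% is where the “about a third” above comes from: contradicted plus initially-stronger effects together.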
All of these together seem to suggest that there is a serious reproducibility crisis in medicine.
I know this podcast has a chance of making more people distrust science, but that is not my goal. My goal is to make you more educated so that you are aware of the scale of this problem and you are keeping it in mind when you analyze studies and the like. Science is still our best tool to try to find accurate and useful information, but it does have flaws and we need to keep that in mind. If you enjoyed this podcast and learned something new, please share it with just one friend. I want more people to know this kind of thing.
Bibliography (I may not directly address these studies in the episode but I looked at them and thought they might be valuable):
Aarts AA, Anderson JE, Anderson CJ, Attridge PR, Attwood A, Axt JR, Babel M, Bahnik S, Baranski E, Barnett-Cowan M, et al. 2015. Estimating the reproducibility of psychological science. Science 349(6251):aac4716.
Begley CG. 2013. Reproducibility: Six red flags for suspect work. Nature 497(7450):433.
Begley CG and Ellis LM. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483(7391):531.
Freedman LP, Cockburn IM, Simcoe TS. 2015. The economics of reproducibility in preclinical research. PLoS Biology 13(6):e1002165.
Gilbert DT, King G, Pettigrew S, Wilson TD. 2016. Comment on “Estimating the reproducibility of psychological science”. Science 351(6277):1037.
Glick JL. 1992. Scientific data audit—A key management tool. Accountability in Research 2(3):153-68.
Ioannidis JPA. 2005. Contradicted and initially stronger effects in highly cited clinical research. JAMA 294(2):218-28.
Klein RA, Ratliff KA, Vianello M, Adams RBJ, Bahník S, Bernstein MJ, Bocian K, Brandt MJ, Brooks B, Brumbaugh CC, et al. 2014. Investigating variation in replicability. A “many labs” replication project. Social Psychology 45(3):152.