One of the cornerstones of scientific research is the reproducibility of findings. Novel scientific observations need to be validated by subsequent studies in order to be considered robust. This has proven to be somewhat of a challenge for many biomedical research areas, including high impact studies in cancer research and stem cell research. The fact that an initial scientific finding of a research group cannot be confirmed by other researchers does not mean that the initial finding was wrong or that there was any foul play involved. The most likely explanation in biomedical research is that there is tremendous biological variability. Human subjects and patients examined in one research study may differ substantially from those in follow-up studies. Biological cell lines and tools used in basic science studies can vary widely, depending on so many details such as the medium in which cells are kept in a culture dish. The variability in findings is not a weakness of biomedical research, in fact it is a testimony to the complexity of biological systems. Therefore, initial findings need to always be treated with caution and presented with the inherent uncertainty. Once subsequent studies – often with larger sample sizes – confirm the initial observations, they are then viewed as being more robust and gradually become accepted by the wider scientific community.
Even though most scientists become aware of the scientific uncertainty associated with an initial observation as their career progresses, non-scientists may be puzzled by shifting scientific narratives. People often complain that “scientists cannot make up their minds” – citing examples of newspaper reports such as those which state drinking coffee may be harmful only to be subsequently contradicted by reports which laud the beneficial health effects of coffee drinking. Accurately communicating scientific findings as well as the inherent uncertainty of such initial findings is a hallmark of critical science journalism.
A group of researchers led by Dr. Estelle Dumas-Mallet at the University of Bordeaux recently studied the extent of uncertainty communicated to the public by newspapers when reporting initial medical research findings in their recently published paper “Scientific Uncertainty in the Press: How Newspapers Describe Initial Biomedical Findings“. Dumas-Mallet and her colleagues examined 426 English-language newspaper articles published between 1988 and 2009 which described 40 initial biomedical research studies. They focused on scientific studies in which a new risk factor such as smoking or old age had been newly associated with a disease such as schizophrenia, autism, Alzheimer’s disease or breast cancer (total of 12 diseases). The researchers only included scientific studies which had subsequently been re-evaluated by follow-up research studies and found that less than one third of the scientific studies had been confirmed by subsequent research. Dumas-Mallet and her colleagues were therefore interested in whether the newspaper articles, which were published shortly after the release of the initial research paper, adequately conveyed the uncertainty surrounding the initial findings and thus adequately preparing their readers for subsequent research that may confirm or invalidate the initial work.
The University of Bordeaux researchers specifically examined whether headlines of the newspaper articles were “hyped” or “factual”, whether they mentioned whether or not this was an initial study and clearly indicated they need for replication or validation by subsequent studies. Roughly 35% of the headlines were “hyped”. One example of a “hyped” headline was “Magic key to breast cancer fight” instead of using a more factual headline such as “Scientists pinpoint genes that raise your breast cancer risk“. Dumas-Mallet and her colleagues found that even though 57% of the newspaper articles mentioned that these medical research studies were initial findings, only 21% of newspaper articles included explicit “replication statements” such as “Tests on larger populations of adults must be performed” or “More work is needed to confirm the findings”.
The researchers next examined the key characteristics of the newspaper articles which were more likely to convey the uncertainty or preliminary nature of the initial scientific findings. Newspaper articles with “hyped” headlines were less likely to mention the need for replicating and validating the results in subsequent studies. On the other hand, newspaper articles which included a direct quote from one of the research study authors were three times more likely to include a replication statement. In fact, approximately half of all the replication statements mentioned in the newspaper articles were found in author quotes, suggesting that many scientists who conducted the research readily emphasize the preliminary nature of their work. Another interesting finding was the gradual shift over time in conveying scientific uncertainty. “Hyped” headlines were rare before 2000 (only 15%) and become more frequent during the 2000s (43%). On the other hand, replication statements were more common before 2000 (35%) than after 2000 (16%). This suggests that there was a trend towards conveying less uncertainty after 2000, which is surprising because debate about scientific replicability in the biomedical research community seems to have become much more widespread in the past decade.
As in all scientific studies, we need to be aware of the analysis performed by Dumas-Mallet and her colleagues. They focused on analyzing a very narrow area of biomedical research – newly identified risk factors for selected diseases. It remains to be seen whether other areas of biomedical research such as treatment of diseases or basic science discoveries of new molecular pathways are also reported with “hyped” headlines and without replication statements. In other words – this research on “replication statements” in newspaper articles also needs to be replicated. It is not clear that the worrisome trend of over-selling robustness of initial research findings after the year 2000 still persists since the work by Dumas-Mallet and colleagues stopped analyzing studies published after 2009. One would hope that the recent discussions about replicability issues in science among scientists would reverse this trend. Even though the findings of the University of Bordeaux researchers need to be replicated by others, science journalists and readers of newspapers can glean some important information from this study: One needs to be wary of “hyped” headlines and it can be very useful to interview authors of scientific studies when reporting about new research, especially asking them about the limitations of their work. “Hyped” newspaper headlines and an exaggerated sense of certainty in initial scientific findings may erode the long-term trust of the public in scientific research, especially if subsequent studies fail to replicate the initial results. Critical and comprehensive reporting of biomedical research studies – including their limitations and uncertainty – by science journalists is therefore a very important service to society which contributes to science literacy and science-based decision making.
Reproducibility of findings is a core foundation of science. If scientific results only hold true in some labs but not in others, then how can researchers feel confident about their discoveries? How can society put evidence-based policies into place if the evidence is unreliable?
Recognition of this “crisis” has prompted calls for reform. Researchers are feeling their way, experimenting with different practices meant to help distinguish solid science from irreproducible results. Some people are even starting to reevaluate how choices are made about what research actually gets tackled. Breaking innovative new ground is flashier than revisiting already published research. Does prioritizing novelty naturally lead to this point?
Incentivizing the wrong thing?
One solution to the reproducibility crisis could be simply to conduct lots of replication studies. For instance, the scientific journal eLife is participating in an initiative to validate and reproduce important recent findings in the field of cancer research. The first set of these “rerun” studies was recently released and yielded mixed results. The results of 2 out of 5 research studies were reproducible, one was not and two additional studies did not provide definitive answers.
But there’s at least one major obstacle to investing time and effort in this endeavor: the quest for novelty. The prestige of an academic journal depends at least partly on how often the research articles it publishes are cited. Thus, research journals often want to publish novel scientific findings which are more likely to be cited, not necessarily the results of newly rerun older research.
Genetics researcher Barak Cohen at Washington University in St. Louis recently published a commentary analyzing this growing push for novelty. He suggests that progress in science depends on a delicate balance between novelty and checking the work of other scientists. When rewards such as funding of grants or publication in prestigious journals emphasize novelty at the expense of testing previously published results, science risks developing cracks in its foundation.
One of his main concerns is that scientific papers now inflate their claims in order to emphasize their novelty and the relevance of biomedical research for clinical applications. By exchanging depth of research for breadth of claims, researchers may be at risk of compromising the robustness of the work. By claiming excessive novelty and impact, researchers may undermine its actual significance because they may fail to provide solid evidence for each claim.
Prestigious journals often now demand complete scientific stories, from basic molecular mechanisms to proving their relevance in various animal models. Unexplained results or unanswered questions are seen as weaknesses. Instead of publishing one exciting novel finding that is robust, and which could spawn a new direction of research conducted by other groups, researchers now spend years gathering a whole string of findings with broad claims about novelty and impact.
Balancing fresh findings and robustness
A challenge for editors and reviewers of scientific manuscripts is assessing the novelty and likely long-term impact of the work they’re assessing. The eventual importance of a new, unique scientific idea is sometimes difficult to recognize even by peers who are grounded in existing knowledge. Many basic research studies form the basis of future practical applications. One recent study found that of basic research articles that received at least one citation, 80 percent were eventually cited by a patent application. But it takes time for practical significance to come to light.
A collaborative team of economics researchers recently developed an unusual measure of scientific novelty by carefully studying the references of a paper. They ranked a scientific paper as more novel if it cited a diverse combination of journals. For example, a scientific article citing a botany journal, an economics journal and a physics journal would be considered very novel if no other article had cited this combination of varied references before.
This measure of novelty allowed them to identify papers which were more likely to be cited in the long run. But it took roughly four years for these novel papers to start showing their greater impact. One may disagree with this particular indicator of novelty, but the study makes an important point: It takes time to recognize the full impact of novel findings.
Realizing how difficult it is to assess novelty should give funding agencies, journal editors and scientists pause. Progress in science depends on new discoveries and following unexplored paths – but solid, reproducible research requires an equal emphasis on the robustness of the work. By restoring the balance between demands and rewards for novelty and robustness, science will achieve even greater progress.
Murder your darlings. The British writer Sir Arthur Quiller Crouch shared this piece of writerly wisdom when he gave his inaugural lecture series at Cambridge, asking writers to consider deleting words, phrases or even paragraphs that are especially dear to them. The minute writers fall in love with what they write, they are bound to lose their objectivity and may not be able to judge how their choice of words will be perceived by the reader. But writers aren’t the only ones who can fall prey to the Pygmalion syndrome. Scientists often find themselves in a similar situation when they develop “pet” or “darling” hypotheses.
How do scientists decide when it is time to murder their darling hypotheses? The simple answer is that scientists ought to give up scientific hypotheses once the experimental data is unable to support them, no matter how “darling” they are. However, the problem with scientific hypotheses is that they aren’t just generated based on subjective whims. A scientific hypothesis is usually put forward after analyzing substantial amounts of experimental data. The better a hypothesis is at explaining the existing data, the more “darling” it becomes. Therefore, scientists are reluctant to discard a hypothesis because of just one piece of experimental data that contradicts it.
In addition to experimental data, a number of additional factors can also play a major role in determining whether scientists will either discard or uphold their darling scientific hypotheses. Some scientific careers are built on specific scientific hypotheses which set apart certain scientists from competing rival groups. Research grants, which are essential to the survival of a scientific laboratory by providing salary funds for the senior researchers as well as the junior trainees and research staff, are written in a hypothesis-focused manner, outlining experiments that will lead to the acceptance or rejection of selected scientific hypotheses. Well written research grants always consider the possibility that the core hypothesis may be rejected based on the future experimental data. But if the hypothesis has to be rejected then the scientist has to explain the discrepancies between the preferred hypothesis that is now falling in disrepute and all the preliminary data that had led her to formulate the initial hypothesis. Such discrepancies could endanger the renewal of the grant funding and the future of the laboratory. Last but not least, it is very difficult to publish a scholarly paper describing a rejected scientific hypothesis without providing an in-depth mechanistic explanation for why the hypothesis was wrong and proposing alternate hypotheses.
For example, it is quite reasonable for a cell biologist to formulate the hypothesis that protein A improves the survival of neurons by activating pathway X based on prior scientific studies which have shown that protein A is an activator of pathway X in neurons and other studies which prove that pathway X improves cell survival in skin cells. If the data supports the hypothesis, publishing this result is fairly straightforward because it conforms to the general expectations. However, if the data does not support this hypothesis then the scientist has to explain why. Is it because protein A did not activate pathway X in her experiments? Is it because in pathway X functions differently in neurons than in skin cells? Is it because neurons and skin cells have a different threshold for survival? Experimental results that do not conform to the predictions have the potential to uncover exciting new scientific mechanisms but chasing down these alternate explanations requires a lot of time and resources which are becoming increasingly scarce. Therefore, it shouldn’t come as a surprise that some scientists may consciously or subconsciously ignore selected pieces of experimental data which contradict their darling hypotheses.
Let us move from these hypothetical situations to the real world of laboratories. There is surprisingly little data on how and when scientists reject hypotheses, but John Fugelsang and Kevin Dunbar at Dartmouth conducted a rather unique study “Theory and data interactions of the scientific mind: Evidence from the molecular and the cognitive laboratory” in 2004 in which they researched researchers. They sat in at scientific laboratory meetings of three renowned molecular biology laboratories at carefully recorded how scientists presented their laboratory data and how they would handle results which contradicted their predictions based on their hypotheses and models.
In their final analysis, Fugelsang and Dunbar included 417 scientific results that were presented at the meetings of which roughly half (223 out of 417) were not consistent with the predictions. Only 12% of these inconsistencies lead to change of the scientific model (and thus a revision of hypotheses). In the vast majority of the cases, the laboratories decided to follow up the studies by repeating and modifying the experimental protocols, thinking that the fault did not lie with the hypotheses but instead with the manner how the experiment was conducted. In the follow up experiments, 84 of the inconsistent findings could be replicated and this in turn resulted in a gradual modification of the underlying models and hypotheses in the majority of the cases. However, even when the inconsistent results were replicated, only 61% of the models were revised which means that 39% of the cases did not lead to any significant changes.
The study did not provide much information on the long-term fate of the hypotheses and models and we obviously cannot generalize the results of three molecular biology laboratory meetings at one university to the whole scientific enterprise. Also, Fugelsang and Dunbar’s study did not have a large enough sample size to clearly identify the reasons why some scientists were willing to revise their models and others weren’t. Was it because of varying complexity of experiments and models? Was it because of the approach of the individuals who conducted the experiments or the laboratory heads? I wish there were more studies like this because it would help us understand the scientific process better and maybe improve the quality of scientific research if we learned how different scientists handle inconsistent results.
In my own experience, I have also struggled with results which defied my scientific hypotheses. In 2002, we found that stem cells in human fat tissue could help grow new blood vessels. Yes, you could obtain fat from a liposuction performed by a plastic surgeon and inject these fat-derived stem cells into animal models of low blood flow in the legs. Within a week or two, the injected cells helped restore the blood flow to near normal levels! The simplest hypothesis was that the stem cells converted into endothelial cells, the cell type which forms the lining of blood vessels. However, after several months of experiments, I found no consistent evidence of fat-derived stem cells transforming into endothelial cells. We ended up publishing a paper which proposed an alternative explanation that the stem cells were releasing growth factors that helped grow blood vessels. But this explanation was not as satisfying as I had hoped. It did not account for the fact that the stem cells had aligned themselves alongside blood vessel structures and behaved like blood vessel cells.
Even though I “murdered” my darling hypothesis of fat –derived stem cells converting into blood vessel endothelial cells at the time, I did not “bury” the hypothesis. It kept ruminating in the back of my mind until roughly one decade later when we were again studying how stem cells were improving blood vessel growth. The difference was that this time, I had access to a live-imaging confocal laser microscope which allowed us to take images of cells labeled with red and green fluorescent dyes over long periods of time. Below, you can see a video of human bone marrow mesenchymal stem cells (labeled green) and human endothelial cells (labeled red) observed with the microscope overnight. The short movie compresses images obtained throughout the night and shows that the stem cells indeed do not convert into endothelial cells. Instead, they form a scaffold and guide the endothelial cells (red) by allowing them to move alongside the green scaffold and thus construct their network. This work was published in 2013 in the Journal of Molecular and Cellular Cardiology, roughly a decade after I had been forced to give up on the initial hypothesis. Back in 2002, I had assumed that the stem cells were turning into blood vessel endothelial cells because they aligned themselves in blood vessel like structures. I had never considered the possibility that they were scaffold for the endothelial cells.
This and other similar experiences have lead me to reformulate the “murder your darlings” commandment to “murder your darling hypotheses but do not bury them”. Instead of repeatedly trying to defend scientific hypotheses that cannot be supported by emerging experimental data, it is better to give up on them. But this does not mean that we should forget and bury those initial hypotheses. With newer technologies, resources or collaborations, we may find ways to explain inconsistent results years later that were not previously available to us. This is why I regularly peruse my cemetery of dead hypotheses on my hard drive to see if there are ways of perhaps resurrecting them, not in their original form but in a modification that I am now able to test.
Fugelsang, J., Stein, C., Green, A., & Dunbar, K. (2004). Theory and Data Interactions of the Scientific Mind: Evidence From the Molecular and the Cognitive Laboratory. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 58 (2), 86-95 DOI: 10.1037/h0085799
Fareed Zakaria recently wrote an article in the Washington Post lamenting the loss of liberal arts education in the United States. However, instead of making a case for balanced education, which integrates various forms of creativity and critical thinking promoted by STEM (science, technology, engineering and mathematics) and by a liberal arts education, Zakaria misrepresents STEM education as primarily teaching technical skills and also throws in a few cliches about Asians. You can read my response to his article at 3Quarksdaily.
Research institutions in the life sciences engage in two types of regular scientific meet-ups: scientific seminars and lab meetings. The structure of scientific seminars is fairly standard. Speakers give Powerpoint presentations (typically 45 to 55 minutes long) which provide the necessary scientific background, summarize their group’s recent published scientific work and then (hopefully) present newer, unpublished data. Lab meetings are a rather different affair. The purpose of a lab meeting is to share the scientific work-in-progress with one’s peers within a research group and also to update the laboratory heads. Lab meetings are usually less formal than seminars, and all members of a research group are encouraged to critique the presented scientific data and work-in-progress. There is no need to provide much background information because the audience of peers is already well-acquainted with the subject and it is not uncommon to show raw, unprocessed data and images in order to solicit constructive criticism and guidance from lab members and mentors on how to interpret the data. This enables peer review in real-time, so that, hopefully, major errors and flaws can be averted and newer ideas incorporated into the ongoing experiments.
During the past two decades that I have actively participated in biological, psychological and medical research, I have observed very different styles of lab meetings. Some involve brief 5-10 minute updates from each group member; others develop a rotation system in which one lab member has to present the progress of their ongoing work in a seminar-like, polished format with publication-quality images. Some labs have two hour meetings twice a week, other labs meet only every two weeks for an hour. Some groups bring snacks or coffee to lab meetings, others spend a lot of time discussing logistics such as obtaining and sharing biological reagents or establishing timelines for submitting manuscripts and grants. During the first decade of my work as a researcher, I was a trainee and followed the format of whatever group I belonged to. During the past decade, I have been heading my own research group and it has become my responsibility to structure our lab meetings. I do not know which format works best, so I approach lab meetings like our experiments. Developing a good lab meeting structure is a work-in-progress which requires continuous exploration and testing of new approaches. During the current academic year, I decided to try out a new twist: incorporating literature and philosophy into the weekly lab meetings.
My research group studies stem cells and tissue engineering, cellular metabolism in cancer cells and stem cells and the inflammation of blood vessels. Most of our work focuses on identifying molecular and cellular pathways in cells, and we then test our findings in animal models. Over the years, I have noticed that the increasing complexity of the molecular and cellular signaling pathways and the technologies we employ makes it easy to forget the “big picture” of why we are even conducting the experiments. Determining whether protein A is required for phenomenon X and whether protein B is a necessary co-activator which acts in concert with protein A becomes such a central focus of our work that we may not always remember what it is that compels us to study phenomenon X in the first place. Some of our research has direct medical relevance, but at other times we primarily want to unravel the awe-inspiring complexity of cellular processes. But the question of whether our work is establishing a definitive cause-effect relationship or whether we are uncovering yet another mechanism within an intricate web of causes and effects sometimes falls by the wayside. When asked to explain the purpose or goals of our research, we have become so used to directing a laser pointer onto a slide of a cellular model that it becomes challenging to explain the nature of our work without visual aids.
This fall, I introduced a new component into our weekly lab meetings. After our usual round-up of new experimental data and progress, I suggested that each week one lab member should give a brief 15 minute overview about a book they had recently finished or were still reading. The overview was meant to be a “teaser” without spoilers, explaining why they had started reading the book, what they liked about it, and whether they would recommend it to others. One major condition was to speak about the book without any Powerpoint slides! But there weren’t any major restrictions when it came to the book; it could be fiction or non-fiction and published in any language of the world (but ideally also available in an English translation). If lab members were interested and wanted to talk more about the book, then we would continue to discuss it, otherwise we would disband and return to our usual work. If nobody in my lab wanted to talk about a book then I would give an impromptu mini-talk (without Powerpoint) about a topic relating to the philosophy or culture of science. I use the term “culture of science” broadly to encompass topics such as the peer review process and post-publication peer review, the question of reproducibility of scientific findings, retractions of scientific papers, science communication and science policy – topics which have not been traditionally considered philosophy of science issues but still relate to the process of scientific discovery and the dissemination of scientific findings.
One member of our group introduced us to “For Whom the Bell Tolls” by Ernest Hemingway. He had also recently lived in Spain as a postdoctoral research fellow and shared some of his own personal experiences about how his Spanish friends and colleagues talked about the Spanish Civil War. At another lab meeting, we heard about “Sycamore Row” by John Grisham and the ensuring discussion revolved around race relations in Mississippi. I spoke about “A Tale for a Time Being” by Ruth Ozeki and the difficulties that the book’s protagonist faced as an outsider when her family returned to Japan after living in Silicon Valley. I think that the book which got nearly everyone in the group talking was “Far From the Tree: Parents, Children and the Search for Identity” by Andrew Solomon. The book describes how families grapple with profound physical or cognitive differences between parents and children. The PhD student who discussed the book focused on the “Deafness” chapter of this nearly 1000-page tome but she also placed it in the broader context of parenting, love and the stigma of disability. We stayed in the conference room long after the planned 15 minutes, talking about being “disabled” or being “differently abled” and the challenges that parents and children face.
On the weeks where nobody had a book they wanted to present, we used the time to touch on the cultural and philosophical aspects of science such as Thomas Kuhn’s concept of paradigm shifts in “The Structure of Scientific Revolutions“, Karl Popper’s principles of falsifiability of scientific statements, the challenge of reproducibility of scientific results in stem cell biology and cancer research, or the emergence of Pubpeer as a post-publication peer review website. Some of the lab members had heard of Thomas Kuhn’s or Karl Popper’s ideas before, but by coupling it to a lab meeting, we were able to illustrate these ideas using our own work. A lot of 20th century philosophy of science arose from ideas rooted in physics. When undergraduate or graduate students take courses on philosophy of science, it isn’t always easy for them to apply these abstract principles to their own lab work, especially if they pursue a research career in the life sciences. Thomas Kuhn saw Newtonian and Einsteinian theories as distinct paradigms, but what constitutes a paradigm shift in stem cell biology? Is the ability to generate induced pluripotent stem cells from mature adult cells a paradigm shift or “just” a technological advance?
It is difficult for me to know whether the members of my research group enjoy or benefit from these humanities blurbs at the end of our lab meetings. Perhaps they are just tolerating them as eccentricities of the management and maybe they will tire of them. I personally find these sessions valuable because I believe they help ground us in reality. They remind us that it is important to think and read outside of the box. As scientists, we all read numerous scientific articles every week just to stay up-to-date in our area(s) of expertise, but that does not exempt us from also thinking and reading about important issues facing society and the world we live in. I do not know whether discussing literature and philosophy makes us better scientists but I hope that it makes us better people.
Here is a graphic showing the usage of the words “scientists”, “researchers”, “soldiers” in English-language books published in 1900-2008. The graphic was generated using the Google N-gram Viewer which scours all digitized books in the Google database for selected words and assesses the relative word usage frequencies.
It is depressing that soldiers are mentioned more frequently than scientists or researchers (even when the word frequencies of “scientists” and “researchers” are combined) in English-language books even though the numbers of researchers in the countries which produce most English-language books are comparable or higher than the number of soldiers.
Here are the numbers of researchers (data from the 2010 UNESCO Science report, numbers are reported for the year 2007, PDF) in selected English-language countries and the corresponding numbers of armed forces personnel (data from the World Bank, numbers reported for 2012):
United States: 1.4 million researchers vs. 1.5 million armed forces personnel
United Kingdom: 255,000 researchers vs. 169,000 armed forces personnel
Canada: 139,000 researchers vs. 66,000 armed forces personnel
I find it disturbing that our books – arguably one of our main cultural legacies – give a disproportionately greater space to discussing or describing the military than to our scientific and scholarly endeavors. But I am even more worried about the recent trends. The N-gram Viewer evaluates word usage up until 2008, and “soldiers” has been steadily increasing since the 1990s. The usage of “scientists” and “researchers” has reached a plateau and is now decreasing. I do not want to over-interpret the importance of relative word frequencies as indicators of society’s priorities, but the last two surges of “soldiers” usage occurred during the two World Wars and in 2008, “soldiers” was used as frequently as during the first years of World War II.
It is mind-boggling for us scientists that we have to struggle to get funding for research which has the potential to transform society by providing important new insights into the nature of our universe, life on this planet, our environment and health, whereas the military receives substantially higher amounts of government funding (at least in the USA) for its destructive goals. Perhaps one reason for this discrepancy is that voters hear, see and read much more about wars and soldiers than about science and research. Depictions of heroic soldiers fighting evil make it much easier for voters to go along with allocation of resources to the military. Most of my non-scientist friends can easily name books or movies about soldiers, but they would have a hard time coming up with books and movies about science and scientists. My take-home message from the N-gram Viewer results is that scientists have an obligation to reach out to the public and communicate the importance of science in an understandable manner if they want to avoid the marginalization of science.
We often laud intellectual diversity of a scientific research group because we hope that the multitude of opinions can help point out flaws and improve the quality of research long before it is finalized and written up as a manuscript. The recent events surrounding the research in one of the world’s most famous stem cell research laboratories at Harvard shows us the disastrous effects of suppressing diverse and dissenting opinions.
The infamous “Orlic paper” was a landmark research article published in the prestigious scientific journal Nature in 2001, which showed that stem cells contained in the bone marrow could be converted into functional heart cells. After a heart attack, injections of bone marrow cells reversed much of the heart attack damage by creating new heart cells and restoring heart function. It was called the “Orlic paper” because the first author of the paper was Donald Orlic, but the lead investigator of the study was Piero Anversa, a professor and highly respected scientist at New York Medical College.
Anversa had established himself as one of the world’s leading experts on the survival and death of heart muscle cells in the 1980s and 1990s, but with the start of the new millennium, Anversa shifted his laboratory’s focus towards the emerging field of stem cell biology and its role in cardiovascular regeneration. The Orlic paper was just one of several highly influential stem cell papers to come out of Anversa’s lab at the onset of the new millenium. A 2002 Anversa paper in the New England Journal of Medicine – the world’s most highly cited academic journal –investigated the hearts of human organ transplant recipients. This study showed that up to 10% of the cells in the transplanted heart were derived from the recipient’s own body. The only conceivable explanation was that after a patient received another person’s heart, the recipient’s own cells began maintaining the health of the transplanted organ. The Orlic paper had shown the regenerative power of bone marrow cells in mouse hearts, but this new paper now offered the more tantalizing suggestion that even human hearts could be regenerated by circulating stem cells in their blood stream.
A 2003 publication in Cell by the Anversa group described another ground-breaking discovery, identifying a reservoir of stem cells contained within the heart itself. This latest coup de force found that the newly uncovered heart stem cell population resembled the bone marrow stem cells because both groups of cells bore the same stem cell protein called c-kit and both were able to make new heart muscle cells. According to Anversa, c-kit cells extracted from a heart could be re-injected back into a heart after a heart attack and regenerate more than half of the damaged heart!
These Anversa papers revolutionized cardiovascular research. Prior to 2001, most cardiovascular researchers believed that the cell turnover in the adult mammalian heart was minimal because soon after birth, heart cells stopped dividing. Some organs or tissues such as the skin contained stem cells which could divide and continuously give rise to new cells as needed. When skin is scraped during a fall from a bike, it only takes a few days for new skin cells to coat the area of injury and heal the wound. Unfortunately, the heart was not one of those self-regenerating organs. The number of heart cells was thought to be more or less fixed in adults. If heart cells were damaged by a heart attack, then the affected area was replaced by rigid scar tissue, not new heart muscle cells. If the area of damage was large, then the heart’s pump function was severely compromised and patients developed the chronic and ultimately fatal disease known as “heart failure”.
Anversa’s work challenged this dogma by putting forward a bold new theory: the adult heart was highly regenerative, its regeneration was driven by c-kit stem cells, which could be isolated and used to treat injured hearts. All one had to do was harness the regenerative potential of c-kit cells in the bone marrow and the heart, and millions of patients all over the world suffering from heart failure might be cured. Not only did Anversa publish a slew of supportive papers in highly prestigious scientific journals to challenge the dogma of the quiescent heart, he also happened to publish them at a unique time in history which maximized their impact.
In the year 2001, there were few innovative treatments available to treat patients with heart failure. The standard approach was to use medications that would delay the progression of heart failure. But even the best medications could not prevent the gradual decline of heart function. Organ transplants were a cure, but transplantable hearts were rare and only a small fraction of heart failure patients would be fortunate enough to receive a new heart. Hopes for a definitive heart failure cure were buoyed when researchers isolated human embryonic stem cells in 1998. This discovery paved the way for using highly pliable embryonic stem cells to create new heart muscle cells, which might one day be used to restore the heart’s pump function without resorting to a heart transplant.
The dreams of using embryonic stem cells to regenerate human hearts were soon squashed when the Bush administration banned the generation of new human embryonic stem cells in 2001, citing ethical concerns. These federal regulations and the lobbying of religious and political groups against human embryonic stem cells were a major blow to research on cardiovascular regeneration. Amidst this looming hiatus in cardiovascular regeneration, Anversa’s papers appeared and showed that one could steer clear of the ethical controversies surrounding embryonic stem cells by using an adult patient’s own stem cells. The Anversa group re-energized the field of cardiovascular stem cell research and cleared the path for the first human stem cell treatments in heart disease.
Instead of having to wait for the US government to reverse its restrictive policy on human embryonic stem cells, one could now initiate clinical trials with adult stem cells, treating heart attack patients with their own cells and without having to worry about an ethical quagmire. Heart failure might soon become a disease of the past. The excitement at all major national and international cardiovascular conferences was palpable whenever the Anversa group, their collaborators or other scientists working on bone marrow and cardiac stem cells presented their dizzyingly successful results. Anversa received numerous accolades for his discoveries and research grants from the NIH (National Institutes of Health) to further develop his research program. He was so successful that some researchers believed Anversa might receive the Nobel Prize for his iconoclastic work which had redefined the regenerative potential of the heart. Many of the world’s top universities were vying to recruit Anversa and his group, and he decided to relocate his research group to Harvard Medical School and Brigham and Women’s Hospital 2008.
There were naysayers and skeptics who had resisted the adult stem cell euphoria. Some researchers had spent decades studying the heart and found little to no evidence for regeneration in the adult heart. They were having difficulties reconciling their own results with those of the Anversa group. A number of practicing cardiologists who treated heart failure patients were also skeptical because they did not see the near-miraculous regenerative power of the heart in their patients. One Anversa paper went as far as suggesting that the whole heart would completely regenerate itself roughly every 8-9 years, a claim that was at odds with the clinical experience of practicing cardiologists. Other researchers pointed out serious flaws in the Anversa papers. For example, the 2002 paper on stem cells in human heart transplant patients claimed that the hearts were coated with the recipient’s regenerative cells, including cells which contained the stem cell marker Sca-1. Within days of the paper’s publication, many researchers were puzzled by this finding because Sca-1 was a marker of mouse and rat cells – not human cells! If Anversa’s group was finding rat or mouse proteins in human hearts, it was most likely due to an artifact. And if they had mistakenly found rodent cells in human hearts, so these critics surmised, perhaps other aspects of Anversa’s research were similarly flawed or riddled with artifacts.
At national and international meetings, one could observe heated debates between members of the Anversa camp and their critics. The critics then decided to change their tactics. Instead of just debating Anversa and commenting about errors in the Anversa papers, they invested substantial funds and efforts to replicate Anversa’s findings. One of the most important and rigorous attempts to assess the validity of the Orlic paper was published in 2004, by the research teams of Chuck Murry and Loren Field. Murry and Field found no evidence of bone marrow cells converting into heart muscle cells. This was a major scientific blow to the burgeoning adult stem cell movement, but even this paper could not deter the bone marrow cell champions.
The skeptics who had doubted Anversa’s claims all along may now feel vindicated, but this is not the time to gloat. Instead, the discipline of cardiovascular stem cell biology is now undergoing a process of soul-searching. How was it possible that some of the most widely read and cited papers were based on heavily flawed observations and assumptions? Why did it take more than a decade since the first refutation was published in 2004 for scientists to finally accept that the near-magical regenerative power of the heart turned out to be a pipe dream.
One reason for this lag time is pretty straightforward: It takes a tremendous amount of time to refute papers. Funding to conduct the experiments is difficult to obtain because grant funding agencies are not easily convinced to invest in studies replicating existing research. For a refutation to be accepted by the scientific community, it has to be at least as rigorous as the original, but in practice, refutations are subject to even greater scrutiny. Scientists trying to disprove another group’s claim may be asked to develop even better research tools and technologies so that their results can be seen as more definitive than those of the original group. Instead of relying on antibodies to identify c-kit cells, the 2014 refutation developed a transgenic mouse in which all c-kit cells could be genetically traced to yield more definitive results – but developing new models and tools can take years.
The scientific peer review process by external researchers is a central pillar of the quality control process in modern scientific research, but one has to be cognizant of its limitations. Peer review of a scientific manuscript is routinely performed by experts for all the major academic journals which publish original scientific results. However, peer review only involves a “review”, i.e. a general evaluation of major strengths and flaws, and peer reviewers do not see the original raw data nor are they provided with the resources to replicate the studies and confirm the veracity of the submitted results. Peer reviewers rely on the honor system, assuming that the scientists are submitting accurate representations of their data and that the data has been thoroughly scrutinized and critiqued by all the involved researchers before it is even submitted to a journal for publication. If peer reviewers were asked to actually wade through all the original data generated by the scientists and even perform confirmatory studies, then the peer review of every single manuscript could take years and one would have to find the money to pay for the replication or confirmation experiments conducted by peer reviewers. Publication of experiments would come to a grinding halt because thousands of manuscripts would be stuck in the purgatory of peer review. Relying on the integrity of the scientists submitting the data and their internal review processes may seem naïve, but it has always been the bedrock of scientific peer review. And it is precisely the internal review process which may have gone awry in the Anversa group.
Just like Pygmalion fell in love with Galatea, researchers fall in love with the hypotheses and theories that they have constructed. To minimize the effects of these personal biases, scientists regularly present their results to colleagues within their own groups at internal lab meetings and seminars or at external institutions and conferences long before they submit their data to a peer-reviewed journal. The preliminary presentations are intended to spark discussions, inviting the audience to challenge the veracity of the hypotheses and the data while the work is still in progress. Sometimes fellow group members are truly skeptical of the results, at other times they take on the devil’s advocate role to see if they can find holes in their group’s own research. The larger a group, the greater the chance that one will find colleagues within a group with dissenting views. This type of feedback is a necessary internal review process which provides valuable insights that can steer the direction of the research.
Considering the size of the Anversa group – consisting of 20, 30 or even more PhD students, postdoctoral fellows and senior scientists – it is puzzling why the discussions among the group members did not already internally challenge their hypotheses and findings, especially in light of the fact that they knew extramural scientists were having difficulties replicating the work.
“I think that most scientists, perhaps with the exception of the most lucky or most dishonest, have personal experience with failure in science—experiments that are unreproducible, hypotheses that are fundamentally incorrect. Generally, we sigh, we alter hypotheses, we develop new methods, we move on. It is the data that should guide the science.
In the Anversa group, a model with much less intellectual flexibility was applied. The “Hypothesis” was that c-kit (cd117) positive cells in the heart (or bone marrow if you read their earlier studies) were cardiac progenitors that could: 1) repair a scarred heart post-myocardial infarction, and: 2) supply the cells necessary for cardiomyocyte turnover in the normal heart.
This central theme was that which supplied the lab with upwards of $50 million worth of public funding over a decade, a number which would be much higher if one considers collaborating labs that worked on related subjects.
In theory, this hypothesis would be elegant in its simplicity and amenable to testing in current model systems. In practice, all data that did not point to the “truth” of the hypothesis were considered wrong, and experiments which would definitively show if this hypothesis was incorrect were never performed (lineage tracing e.g.).”
Discarding data that might have challenged the central hypothesis appears to have been a central principle.
According to the whistleblower, Anversa’s group did not just discard undesirable data, they actually punished group members who would question the group’s hypotheses:
“In essence, to Dr. Anversa all investigators who questioned the hypothesis were “morons,” a word he used frequently at lab meetings. For one within the group to dare question the central hypothesis, or the methods used to support it, was a quick ticket to dismissal from your position.“
The group also created an environment of strict information hierarchy and secrecy which is antithetical to the spirit of science:
“The day to day operation of the lab was conducted under a severe information embargo. The lab had Piero Anversa at the head with group leaders Annarosa Leri, Jan Kajstura and Marcello Rota immediately supervising experimentation. Below that was a group of around 25 instructors, research fellows, graduate students and technicians. Information flowed one way, which was up, and conversation between working groups was generally discouraged and often forbidden.
Raw data left one’s hands, went to the immediate superior (one of the three named above) and the next time it was seen would be in a manuscript or grant. What happened to that data in the intervening period is unclear.
A side effect of this information embargo was the limitation of the average worker to determine what was really going on in a research project. It would also effectively limit the ability of an average worker to make allegations regarding specific data/experiments, a requirement for a formal investigation.“
This segregation of information is a powerful method to maintain an authoritarian rule and is more typical for terrorist cells or intelligence agencies than for a scientific lab, but it would definitely explain how the Anversa group was able to mass produce numerous irreproducible papers without any major dissent from within the group.
In addition to the secrecy and segregation of information, the group also created an atmosphere of fear to ensure obedience:
“Although individually-tailored stated and unstated threats were present for lab members, the plight of many of us who were international fellows was especially harrowing. Many were technically and educationally underqualified compared to what might be considered average research fellows in the United States. Many also originated in Italy where Dr. Anversa continues to wield considerable influence over biomedical research.
This combination of being undesirable to many other labs should they leave their position due to lack of experience/training, dependent upon employment for U.S. visa status, and under constant threat of career suicide in your home country should you leave, was enough to make many people play along.
Even so, I witnessed several people question the findings during their time in the lab. These people and working groups were subsequently fired or resigned. I would like to note that this lab is not unique in this type of exploitative practice, but that does not make it ethically sound and certainly does not create an environment for creative, collaborative, or honest science.”
Foreign researchers are particularly dependent on their employment to maintain their visa status and the prospect of being fired from one’s job can be terrifying for anyone.
This is an anonymous account of a whistleblower and as such, it is problematic. The use of anonymous sources in science journalism could open the doors for all sorts of unfounded and malicious accusations, which is why the ethics of using anonymous sources was heavily debated at the recent ScienceOnline conference. But the claims of the whistleblower are not made in a vacuum – they have to be evaluated in the context of known facts. The whistleblower’s claim that the Anversa group and their collaborators received more than $50 million to study bone marrow cell and c-kit cell regeneration of the heart can be easily verified at the public NIH grant funding RePORTer website. The whistleblower’s claim that many of the Anversa group’s findings could not be replicated is also a verifiable fact. It may seem unfair to condemn Anversa and his group for creating an atmosphere of secrecy and obedience which undermined the scientific enterprise, caused torment among trainees and wasted millions of dollars of tax payer money simply based on one whistleblower’s account. However, if one looks at the entire picture of the amazing rise and decline of the Anversa group’s foray into cardiac regeneration, then the whistleblower’s description of the atmosphere of secrecy and hierarchy seems very plausible.
The investigation of Harvard into the Anversa group is not open to the public and therefore it is difficult to know whether the university is primarily investigating scientific errors or whether it is also looking into such claims of egregious scientific misconduct and abuse of scientific trainees. It is unlikely that Anversa’s group is the only group that might have engaged in such forms of misconduct. Threatening dissenting junior researchers with a loss of employment or visa status may be far more common than we think. The gravity of the problem requires that the NIH – the major funding agency for biomedical research in the US – should look into the prevalence of such practices in research labs and develop safeguards to prevent the abuse of science and scientists.