Reproducibility of findings is a core foundation of science. If scientific results only hold true in some labs but not in others, then how can researchers feel confident about their discoveries? How can society put evidence-based policies into place if the evidence is unreliable?
Recognition of this “crisis” has prompted calls for reform. Researchers are feeling their way, experimenting with different practices meant to help distinguish solid science from irreproducible results. Some are even starting to reevaluate how choices are made about what research gets tackled in the first place. Breaking innovative new ground is flashier than revisiting already published research. Does prioritizing novelty inevitably lead to irreproducibility?
Incentivizing the wrong thing?
One solution to the reproducibility crisis could be simply to conduct lots of replication studies. For instance, the scientific journal eLife is participating in an initiative to validate and reproduce important recent findings in the field of cancer research. The first set of these “rerun” studies was recently released and yielded mixed results: two of the five studies were largely reproducible, one was not, and the remaining two did not provide definitive answers.
But there’s at least one major obstacle to investing time and effort in this endeavor: the quest for novelty. The prestige of an academic journal depends at least partly on how often the research articles it publishes are cited. Thus, research journals often want to publish novel scientific findings, which are more likely to be cited, rather than the results of newly rerun older research.
Genetics researcher Barak Cohen at Washington University in St. Louis recently published a commentary analyzing this growing push for novelty. He suggests that progress in science depends on a delicate balance between novelty and checking the work of other scientists. When rewards such as funding of grants or publication in prestigious journals emphasize novelty at the expense of testing previously published results, science risks developing cracks in its foundation.
One of his main concerns is that scientific papers now inflate their claims in order to emphasize their novelty and the relevance of biomedical research for clinical applications. By exchanging depth of research for breadth of claims, researchers may be at risk of compromising the robustness of the work. By claiming excessive novelty and impact, researchers may undermine the actual significance of their work, because they may fail to provide solid evidence for each claim.
Prestigious journals often now demand complete scientific stories, from basic molecular mechanisms to proving their relevance in various animal models. Unexplained results or unanswered questions are seen as weaknesses. Instead of publishing one exciting novel finding that is robust, and which could spawn a new direction of research conducted by other groups, researchers now spend years gathering a whole string of findings with broad claims about novelty and impact.
Balancing fresh findings and robustness
A challenge for editors and reviewers of scientific manuscripts is judging the novelty and likely long-term impact of the work in front of them. The eventual importance of a new, unique scientific idea is sometimes difficult to recognize even for peers who are grounded in existing knowledge. Many basic research studies form the basis of future practical applications: one recent study found that of basic research articles that received at least one citation, 80 percent were eventually cited by a patent application. But it takes time for practical significance to come to light.
A collaborative team of economics researchers recently developed an unusual measure of scientific novelty by carefully studying the references of a paper. They ranked a scientific paper as more novel if it cited a diverse combination of journals. For example, a scientific article citing a botany journal, an economics journal and a physics journal would be considered very novel if no other article had cited this combination of varied references before.
This measure of novelty allowed them to identify papers which were more likely to be cited in the long run. But it took roughly four years for these novel papers to start showing their greater impact. One may disagree with this particular indicator of novelty, but the study makes an important point: It takes time to recognize the full impact of novel findings.
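As a rough sketch (my own illustration, not the authors’ actual method), one can flag a paper as “novel” when it cites at least one pairing of journals that no earlier paper has cited together:

```python
from itertools import combinations

def novelty_flags(papers):
    """For papers in publication order, flag each one as novel if it
    cites at least one pairwise journal combination that no earlier
    paper in the corpus has cited. `papers` is a list of sets of
    journal names. (The published measure is more elaborate, weighting
    how atypical each pairing is; this only captures the core idea.)"""
    seen_pairs = set()
    flags = []
    for cited_journals in papers:
        pairs = {frozenset(p) for p in combinations(sorted(cited_journals), 2)}
        flags.append(bool(pairs - seen_pairs))  # any unprecedented pairing?
        seen_pairs |= pairs
    return flags

papers = [
    {"Journal of Botany", "Economics Review"},   # first pairing: novel
    {"Journal of Botany", "Economics Review"},   # seen before: not novel
    {"Journal of Botany", "Physics Letters"},    # new pairing: novel
]
print(novelty_flags(papers))  # → [True, False, True]
```

With real citation data, `papers` would be ordered by publication date, so that each paper’s novelty is judged only against the literature that preceded it.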
Realizing how difficult it is to assess novelty should give funding agencies, journal editors and scientists pause. Progress in science depends on new discoveries and following unexplored paths – but solid, reproducible research requires an equal emphasis on the robustness of the work. By restoring the balance between demands and rewards for novelty and robustness, science will achieve even greater progress.
Universities and the scientific infrastructures in Muslim-majority countries need to undergo radical reforms if they want to avoid falling by the wayside in a world characterized by major scientific and technological innovations. This is the conclusion reached by Nidhal Guessoum and Athar Osama in their recent commentary “Institutions: Revive universities of the Muslim world”, published in the scientific journal Nature. The physics and astronomy professor Guessoum (American University of Sharjah, United Arab Emirates) and Osama, who is the founder of the Muslim World Science Initiative, use the commentary to summarize the key findings of the report “Science at Universities of the Muslim World” (PDF), which was released in October 2015 by a task force of policymakers, academic vice-chancellors, deans, professors and science communicators. This report is one of the most comprehensive analyses of the state of scientific education and research in the 57 countries with a Muslim-majority population, which are members of the Organisation of Islamic Cooperation (OIC).
Here are some of the key findings:
1. Lower scientific productivity in the Muslim world: The 57 Muslim-majority countries are home to 25% of the world’s population, yet they generate only 6% of the world’s scientific publications and 1.6% of the world’s patents.
2. Lower scientific impact of papers published in the OIC countries: Not only are Muslim-majority countries severely under-represented in terms of the number of publications; the papers which do get published are also cited far less than papers from countries without a Muslim majority. One illustrative example is that of Iran and Switzerland. In the 2014 SCImago ranking of publications by country, Iran was the highest-ranked Muslim-majority country with nearly 40,000 publications, just slightly ahead of Switzerland with 38,000 publications – even though Iran’s population of 77 million is nearly ten times larger than that of Switzerland. However, the average Swiss publication was more than twice as likely to garner a citation from scientific colleagues as an Iranian publication, indicating that the actual scientific impact of research in Switzerland was far greater than that of Iran.
To correct for economic differences between countries that may account for the quality or impact of the scientific work, the analysis also compared selected OIC countries to matched non-Muslim countries with similar per capita Gross Domestic Product (GDP) values (PDF). The per capita GDP in 2010 was $10,136 for Turkey, $8,754 for Malaysia and only $7,390 for South Africa. However, South Africa still outperformed both Turkey and Malaysia in terms of average citations per scientific paper in the years 2006-2015 (Turkey: 5.6; Malaysia: 5.0; South Africa: 9.7).
3. Muslim-majority countries make minimal investments in research and development: The world average for investment in research and development is roughly 1.8% of GDP. Advanced developed countries invest 2–3 percent of their GDP, whereas the average for the OIC countries is only 0.5%, less than a third of the world average! One could perhaps understand why poverty-stricken Muslim countries such as Pakistan do not have the funds to invest in research, because their more immediate concern is to provide basic necessities to the population. However, one of the most dismaying findings of the report is the dismally low rate of research investment by the members of the Gulf Cooperation Council (GCC, the economic union of the six oil-rich Gulf countries Saudi Arabia, Kuwait, Bahrain, Oman, the United Arab Emirates and Qatar, with a mean per capita GDP of over $30,000, comparable to that of the European Union). Saudi Arabia and Kuwait, for example, invest less than 0.1% of their GDP in research and development, far below the OIC average of 0.5%.
So how does one go about fixing this dire state of science in the Muslim world? Some fixes are rather obvious, such as increasing the investment in scientific research and education, especially in the OIC countries which have the financial means but are currently lagging far behind in terms of the funds made available to improve their scientific infrastructures. Guessoum and Osama also highlight the importance of introducing key metrics to assess scientific productivity and the quality of science education. It is not easy to objectively measure scientific and educational impact, and one can argue about the significance or reliability of any given metric. But without any metrics, it will be very difficult for OIC universities to identify problems and weaknesses, build new research and educational programs and reward excellence in research and teaching. There is also a need to reform the curriculum so that it shifts its focus from lecture-based teaching, which is so prevalent in OIC universities, to inquiry-based teaching in which students learn science hands-on by experimentally testing hypotheses and are encouraged to ask questions.
In addition to these commonsense suggestions, the task force also put forward a rather intriguing proposition to strengthen scientific research and education: place a stronger emphasis on basic liberal arts in science education. I could not agree more because I strongly believe that exposing science students to the arts and humanities plays a key role in fostering the creativity and curiosity required for scientific excellence. Science is a multi-disciplinary enterprise, and scientists can benefit greatly from studying philosophy, history or literature. A course in philosophy, for example, can teach science students to question their basic assumptions about reality and objectivity, encourage them to examine their own biases, challenge authority and understand the importance of doubt and uncertainty, all of which will likely help them become critical thinkers and better scientists.
However, the specific examples provided by Guessoum and Osama do not necessarily indicate support for this kind of broad liberal arts education. They mention the example of the newly founded private Habib University in Karachi, which mandates that all science and engineering students also take classes in the humanities, including a two-semester course in “hikma” or “traditional wisdom”. Upon reviewing the details of this philosophy course on the university’s website, it seems that the course is a history of Islamic philosophy focused on antiquity and pre-modern texts which date back to the “Golden Age” of Islam. The task force also specifically applauds an online course developed by Ahmed Djebbar, an emeritus science historian at the University of Lille in France, which attempts to stimulate scientific curiosity in young pre-university students by relating scientific concepts to great discoveries from the Islamic “Golden Age”. My concern is that this is a rather Islamocentric form of liberal arts education. Do students who have spent all their lives growing up in a Muslim society really need to revel in the glories of a bygone era in order to get excited about science? Does the Habib University philosophy course focus on Islamic philosophy because the university feels that students should be more aware of their cultural heritage, or are there concerns that exposing students to non-Islamic ideas could cause problems with students, parents, university administrators or other members of society who could perceive this as an attack on Islamic values? If the true purpose of a liberal arts education is to expand the minds of students by exposing them to new ideas, wouldn’t it make more sense to focus on non-Islamic philosophy? It is definitely not a good idea to coddle Muslim students by adulating the “Golden Age” of Islam or using kid gloves when discussing philosophy in order to avoid offending them.
This leads us to a question that is not directly addressed by Guessoum and Osama: How “liberal” is a liberal arts education in countries with governments and societies that curtail the free expression of ideas? The Saudi blogger Raif Badawi was sentenced to 1,000 lashes and 10 years in prison because of his liberal views that were perceived as an attack on religion. Faculty members at universities in Saudi Arabia who teach liberal arts courses are probably very aware of these occupational hazards. At first glance, professors who teach in the sciences may not seem to be as susceptible to the wrath of religious zealots and authoritarian governments. However, the above-mentioned interdisciplinary nature of science could easily spell trouble for free-thinking professors or students. Comments about evolutionary biology, the ethics of genome editing or discussing research on sexuality could all be construed as a violation of societal and religious norms.
The 2010 study Faculty perceptions of academic freedom at a GCC university surveyed professors at an anonymous GCC university (most likely Qatar University since roughly 25% of the faculty members were Qatari nationals and the authors of the study were based in Qatar) regarding their views of academic freedom. The vast majority of faculty members (Arab and non-Arab) felt that academic freedom was important to them and that their university upheld academic freedom. However, in interviews with individual faculty members, the researchers found that the professors were engaging in self-censorship in order to avoid untoward repercussions. Here are some examples of the comments from the faculty at this GCC University:
“I am fully aware of our culture. So, when I suggest any topic in class, I don’t need external censorship except mine.”
“Yes. I avoid subjects that are culturally inappropriate.”
“Yes, all the time. I avoid all references to Israel or the Jewish people despite their contributions to world culture. I also avoid any kind of questioning of their religious tradition. I do this out of respect.”
This latter comment is especially painful for me because one of my heroes who inspired me to become a cell biologist was the Italian Jewish scientist Rita Levi-Montalcini. She revolutionized our understanding of how cells communicate with each other using growth factors. She was also forced to secretly conduct her experiments in her bedroom because the Fascists banned all “non-Aryans” from going to the university laboratory. Would faculty members who teach the discovery of growth factors at this GCC University downplay the role of the Nobel laureate Levi-Montalcini because she was Jewish? We do not know how prevalent this form of self-censorship is in other OIC countries because the research on academic freedom in Muslim-majority countries is understandably scant. Few faculty members would be willing to voice their concerns about government or university censorship and admitting to self-censorship is also not easy.
The task force report on science in the universities of Muslim-majority countries is an important first step towards reforming scientific research and education in the Muslim world. Increasing investments in research and development, using and appropriately acting on carefully selected metrics as well as introducing a core liberal arts curriculum for science students will probably all significantly improve the dire state of science in the Muslim world. However, the reform of the research and education programs needs to also include discussions about the importance of academic freedom. If Muslim societies are serious about nurturing scientific innovation, then they will need to also ensure that scientists, educators and students will be provided with the intellectual freedom that is the cornerstone of scientific creativity.
“If we achieve what we are working on, I will have a job forever.”
“We have reason to believe that there is a single gene that distinguishes those people who give funding for scientific research and those who do not give. When we find that gene, we are going to be able to introduce it into individuals and change them so that now they are going to become donors.”
“The question that really motivates my research is how can I get people to give me more money, to give money to science.”
“By using the CRISPR/Cas9 technique, we are hoping to ensure that all future generations of politicians have the ‘Fund Science’ gene.”
Created for the American Association of Immunologists 2015 Annual Meeting
Written And Directed by Matt Klinman
Shot and Edited by Alec Cohen
Here is a graphic showing the usage of the words “scientists”, “researchers” and “soldiers” in English-language books published between 1900 and 2008. The graphic was generated using the Google N-gram Viewer, which scours all digitized books in the Google database for selected words and assesses their relative word usage frequencies.
It is depressing that soldiers are mentioned more frequently than scientists or researchers in English-language books (even when the word frequencies of “scientists” and “researchers” are combined), even though the number of researchers in the countries which produce most English-language books is comparable to or higher than the number of soldiers.
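In miniature, the relative frequencies the N-gram Viewer reports are simply word counts divided by the total number of words in the corpus for a given year. A toy Python sketch (using a made-up ten-word “corpus”, not Google’s data):

```python
from collections import Counter

def relative_frequencies(tokens, words):
    # Occurrences of each word divided by the total token count,
    # analogous to what the N-gram Viewer plots for each year.
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in words}

# Hypothetical mini-corpus for illustration only
tokens = ("the soldiers marched while scientists and "
          "researchers watched the soldiers").split()
print(relative_frequencies(tokens, ["soldiers", "scientists", "researchers"]))
# → {'soldiers': 0.2, 'scientists': 0.1, 'researchers': 0.1}
```

The real viewer additionally smooths the yearly curves and normalizes per corpus year, but the underlying quantity is this kind of relative frequency.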
Here are the numbers of researchers (data from the 2010 UNESCO Science report, numbers are reported for the year 2007, PDF) in selected English-language countries and the corresponding numbers of armed forces personnel (data from the World Bank, numbers reported for 2012):
United States: 1.4 million researchers vs. 1.5 million armed forces personnel
United Kingdom: 255,000 researchers vs. 169,000 armed forces personnel
Canada: 139,000 researchers vs. 66,000 armed forces personnel
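The figures above can be tallied in a few lines of Python (numbers as quoted, in thousands); combined across the three countries, researchers slightly outnumber armed forces personnel:

```python
# Researchers (UNESCO, 2007) vs. armed forces personnel (World Bank, 2012),
# in thousands, as quoted above
data = {
    "United States": (1400, 1500),
    "United Kingdom": (255, 169),
    "Canada": (139, 66),
}

for country, (researchers, soldiers) in data.items():
    print(f"{country}: {researchers / soldiers:.2f} researchers per soldier")

total_r = sum(r for r, _ in data.values())
total_s = sum(s for _, s in data.values())
print(f"Combined: {total_r / total_s:.2f} researchers per soldier")
# → Combined: 1.03 researchers per soldier
```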
I find it disturbing that our books – arguably one of our main cultural legacies – give disproportionately greater space to discussing or describing the military than to our scientific and scholarly endeavors. But I am even more worried about the recent trends. The N-gram Viewer evaluates word usage up until 2008, and the usage of “soldiers” has been steadily increasing since the 1990s, while the usage of “scientists” and “researchers” has plateaued and is now decreasing. I do not want to over-interpret the importance of relative word frequencies as indicators of society’s priorities, but the last two surges in “soldiers” usage occurred during the two World Wars, and in 2008 “soldiers” was used as frequently as during the first years of World War II.
It is mind-boggling for us scientists that we have to struggle to get funding for research which has the potential to transform society by providing important new insights into the nature of our universe, life on this planet, our environment and health, whereas the military receives substantially higher amounts of government funding (at least in the USA) for its destructive goals. Perhaps one reason for this discrepancy is that voters hear, see and read much more about wars and soldiers than about science and research. Depictions of heroic soldiers fighting evil make it much easier for voters to go along with allocation of resources to the military. Most of my non-scientist friends can easily name books or movies about soldiers, but they would have a hard time coming up with books and movies about science and scientists. My take-home message from the N-gram Viewer results is that scientists have an obligation to reach out to the public and communicate the importance of science in an understandable manner if they want to avoid the marginalization of science.
We often laud the intellectual diversity of a scientific research group because we hope that the multitude of opinions can help point out flaws and improve the quality of research long before it is finalized and written up as a manuscript. The recent events surrounding the research in one of the world’s most famous stem cell research laboratories at Harvard show us the disastrous effects of suppressing diverse and dissenting opinions.
The infamous “Orlic paper” was a landmark research article published in the prestigious scientific journal Nature in 2001, which showed that stem cells contained in the bone marrow could be converted into functional heart cells. After a heart attack, injections of bone marrow cells reversed much of the heart attack damage by creating new heart cells and restoring heart function. It was called the “Orlic paper” because the first author of the paper was Donald Orlic, but the lead investigator of the study was Piero Anversa, a professor and highly respected scientist at New York Medical College.
Anversa had established himself as one of the world’s leading experts on the survival and death of heart muscle cells in the 1980s and 1990s, but with the start of the new millennium, Anversa shifted his laboratory’s focus towards the emerging field of stem cell biology and its role in cardiovascular regeneration. The Orlic paper was just one of several highly influential stem cell papers to come out of Anversa’s lab at the onset of the new millennium. A 2002 Anversa paper in the New England Journal of Medicine – the world’s most highly cited academic journal – investigated the hearts of human organ transplant recipients. This study showed that up to 10% of the cells in the transplanted heart were derived from the recipient’s own body. The only conceivable explanation was that after a patient received another person’s heart, the recipient’s own cells began maintaining the health of the transplanted organ. The Orlic paper had shown the regenerative power of bone marrow cells in mouse hearts, but this new paper now offered the more tantalizing suggestion that even human hearts could be regenerated by circulating stem cells in their blood stream.
A 2003 publication in Cell by the Anversa group described another ground-breaking discovery, identifying a reservoir of stem cells contained within the heart itself. This latest coup de force found that the newly uncovered heart stem cell population resembled the bone marrow stem cells because both groups of cells bore the same stem cell protein called c-kit and both were able to make new heart muscle cells. According to Anversa, c-kit cells extracted from a heart could be re-injected back into a heart after a heart attack and regenerate more than half of the damaged heart!
These Anversa papers revolutionized cardiovascular research. Prior to 2001, most cardiovascular researchers believed that the cell turnover in the adult mammalian heart was minimal because soon after birth, heart cells stopped dividing. Some organs or tissues such as the skin contained stem cells which could divide and continuously give rise to new cells as needed. When skin is scraped during a fall from a bike, it only takes a few days for new skin cells to coat the area of injury and heal the wound. Unfortunately, the heart was not one of those self-regenerating organs. The number of heart cells was thought to be more or less fixed in adults. If heart cells were damaged by a heart attack, then the affected area was replaced by rigid scar tissue, not new heart muscle cells. If the area of damage was large, then the heart’s pump function was severely compromised and patients developed the chronic and ultimately fatal disease known as “heart failure”.
Anversa’s work challenged this dogma by putting forward a bold new theory: the adult heart was highly regenerative, and this regeneration was driven by c-kit stem cells, which could be isolated and used to treat injured hearts. All one had to do was harness the regenerative potential of c-kit cells in the bone marrow and the heart, and millions of patients all over the world suffering from heart failure might be cured. Not only did Anversa publish a slew of supportive papers in highly prestigious scientific journals to challenge the dogma of the quiescent heart, he also happened to publish them at a unique time in history, which maximized their impact.
In the year 2001, there were few innovative treatments available to treat patients with heart failure. The standard approach was to use medications that would delay the progression of heart failure. But even the best medications could not prevent the gradual decline of heart function. Organ transplants were a cure, but transplantable hearts were rare and only a small fraction of heart failure patients would be fortunate enough to receive a new heart. Hopes for a definitive heart failure cure were buoyed when researchers isolated human embryonic stem cells in 1998. This discovery paved the way for using highly pliable embryonic stem cells to create new heart muscle cells, which might one day be used to restore the heart’s pump function without resorting to a heart transplant.
The dreams of using embryonic stem cells to regenerate human hearts were soon squashed when the Bush administration banned the generation of new human embryonic stem cells in 2001, citing ethical concerns. These federal regulations and the lobbying of religious and political groups against human embryonic stem cells were a major blow to research on cardiovascular regeneration. Amidst this looming hiatus in cardiovascular regeneration, Anversa’s papers appeared and showed that one could steer clear of the ethical controversies surrounding embryonic stem cells by using an adult patient’s own stem cells. The Anversa group re-energized the field of cardiovascular stem cell research and cleared the path for the first human stem cell treatments in heart disease.
Instead of having to wait for the US government to reverse its restrictive policy on human embryonic stem cells, one could now initiate clinical trials with adult stem cells, treating heart attack patients with their own cells and without having to worry about an ethical quagmire. Heart failure might soon become a disease of the past. The excitement at all major national and international cardiovascular conferences was palpable whenever the Anversa group, their collaborators or other scientists working on bone marrow and cardiac stem cells presented their dizzyingly successful results. Anversa received numerous accolades for his discoveries and research grants from the NIH (National Institutes of Health) to further develop his research program. He was so successful that some researchers believed Anversa might receive the Nobel Prize for his iconoclastic work, which had redefined the regenerative potential of the heart. Many of the world’s top universities were vying to recruit Anversa and his group, and he decided to relocate his research group to Harvard Medical School and Brigham and Women’s Hospital in 2008.
There were naysayers and skeptics who had resisted the adult stem cell euphoria. Some researchers had spent decades studying the heart and found little to no evidence for regeneration in the adult heart. They were having difficulties reconciling their own results with those of the Anversa group. A number of practicing cardiologists who treated heart failure patients were also skeptical because they did not see the near-miraculous regenerative power of the heart in their patients. One Anversa paper went as far as suggesting that the whole heart would completely regenerate itself roughly every 8-9 years, a claim that was at odds with the clinical experience of practicing cardiologists. Other researchers pointed out serious flaws in the Anversa papers. For example, the 2002 paper on stem cells in human heart transplant patients claimed that the hearts were coated with the recipient’s regenerative cells, including cells which contained the stem cell marker Sca-1. Within days of the paper’s publication, many researchers were puzzled by this finding because Sca-1 was a marker of mouse and rat cells – not human cells! If Anversa’s group was finding rat or mouse proteins in human hearts, it was most likely due to an artifact. And if they had mistakenly found rodent cells in human hearts, so these critics surmised, perhaps other aspects of Anversa’s research were similarly flawed or riddled with artifacts.
At national and international meetings, one could observe heated debates between members of the Anversa camp and their critics. The critics then decided to change their tactics. Instead of just debating Anversa and commenting on errors in the Anversa papers, they invested substantial funds and effort to replicate Anversa’s findings. One of the most important and rigorous attempts to assess the validity of the Orlic paper was published in 2004 by the research teams of Chuck Murry and Loren Field. Murry and Field found no evidence of bone marrow cells converting into heart muscle cells. This was a major scientific blow to the burgeoning adult stem cell movement, but even this paper could not deter the bone marrow cell champions.
The skeptics who had doubted Anversa’s claims all along may now feel vindicated, but this is not the time to gloat. Instead, the discipline of cardiovascular stem cell biology is now undergoing a process of soul-searching. How was it possible that some of the most widely read and cited papers were based on heavily flawed observations and assumptions? Why did it take more than a decade after the first refutation was published in 2004 for scientists to finally accept that the near-magical regenerative power of the heart had turned out to be a pipe dream?
One reason for this lag time is pretty straightforward: It takes a tremendous amount of time to refute papers. Funding to conduct the experiments is difficult to obtain because grant funding agencies are not easily convinced to invest in studies replicating existing research. For a refutation to be accepted by the scientific community, it has to be at least as rigorous as the original, but in practice, refutations are subject to even greater scrutiny. Scientists trying to disprove another group’s claim may be asked to develop even better research tools and technologies so that their results can be seen as more definitive than those of the original group. Instead of relying on antibodies to identify c-kit cells, the 2014 refutation developed a transgenic mouse in which all c-kit cells could be genetically traced to yield more definitive results – but developing new models and tools can take years.
The scientific peer review process by external researchers is a central pillar of the quality control process in modern scientific research, but one has to be cognizant of its limitations. Peer review of a scientific manuscript is routinely performed by experts for all the major academic journals which publish original scientific results. However, peer review only involves a “review”, i.e. a general evaluation of major strengths and flaws, and peer reviewers do not see the original raw data nor are they provided with the resources to replicate the studies and confirm the veracity of the submitted results. Peer reviewers rely on the honor system, assuming that the scientists are submitting accurate representations of their data and that the data has been thoroughly scrutinized and critiqued by all the involved researchers before it is even submitted to a journal for publication. If peer reviewers were asked to actually wade through all the original data generated by the scientists and even perform confirmatory studies, then the peer review of every single manuscript could take years and one would have to find the money to pay for the replication or confirmation experiments conducted by peer reviewers. Publication of experiments would come to a grinding halt because thousands of manuscripts would be stuck in the purgatory of peer review. Relying on the integrity of the scientists submitting the data and their internal review processes may seem naïve, but it has always been the bedrock of scientific peer review. And it is precisely the internal review process which may have gone awry in the Anversa group.
Just as Pygmalion fell in love with Galatea, researchers fall in love with the hypotheses and theories that they have constructed. To minimize the effects of these personal biases, scientists regularly present their results to colleagues within their own groups at internal lab meetings and seminars, or at external institutions and conferences, long before they submit their data to a peer-reviewed journal. These preliminary presentations are intended to spark discussion, inviting the audience to challenge the veracity of the hypotheses and the data while the work is still in progress. Sometimes fellow group members are truly skeptical of the results; at other times they take on the devil’s advocate role to see if they can find holes in their group’s own research. The larger the group, the greater the chance of finding colleagues with dissenting views. This type of feedback is a necessary internal review process which provides valuable insights that can steer the direction of the research.
Considering the size of the Anversa group – consisting of 20, 30 or even more PhD students, postdoctoral fellows and senior scientists – it is puzzling that internal discussions among the group members did not challenge their hypotheses and findings, especially in light of the fact that they knew extramural scientists were having difficulty replicating the work.
“I think that most scientists, perhaps with the exception of the most lucky or most dishonest, have personal experience with failure in science—experiments that are unreproducible, hypotheses that are fundamentally incorrect. Generally, we sigh, we alter hypotheses, we develop new methods, we move on. It is the data that should guide the science.
In the Anversa group, a model with much less intellectual flexibility was applied. The “Hypothesis” was that c-kit (cd117) positive cells in the heart (or bone marrow if you read their earlier studies) were cardiac progenitors that could: 1) repair a scarred heart post-myocardial infarction, and: 2) supply the cells necessary for cardiomyocyte turnover in the normal heart.
This central theme was that which supplied the lab with upwards of $50 million worth of public funding over a decade, a number which would be much higher if one considers collaborating labs that worked on related subjects.
In theory, this hypothesis would be elegant in its simplicity and amenable to testing in current model systems. In practice, all data that did not point to the “truth” of the hypothesis were considered wrong, and experiments which would definitively show if this hypothesis was incorrect were never performed (lineage tracing e.g.).”
Discarding data that might have challenged the central hypothesis appears to have been a central principle.
According to the whistleblower, Anversa’s group did not just discard undesirable data, they actually punished group members who would question the group’s hypotheses:
“In essence, to Dr. Anversa all investigators who questioned the hypothesis were “morons,” a word he used frequently at lab meetings. For one within the group to dare question the central hypothesis, or the methods used to support it, was a quick ticket to dismissal from your position.”
The group also created an environment of strict information hierarchy and secrecy which is antithetical to the spirit of science:
“The day to day operation of the lab was conducted under a severe information embargo. The lab had Piero Anversa at the head with group leaders Annarosa Leri, Jan Kajstura and Marcello Rota immediately supervising experimentation. Below that was a group of around 25 instructors, research fellows, graduate students and technicians. Information flowed one way, which was up, and conversation between working groups was generally discouraged and often forbidden.
Raw data left one’s hands, went to the immediate superior (one of the three named above) and the next time it was seen would be in a manuscript or grant. What happened to that data in the intervening period is unclear.
A side effect of this information embargo was the limitation of the average worker to determine what was really going on in a research project. It would also effectively limit the ability of an average worker to make allegations regarding specific data/experiments, a requirement for a formal investigation.”
This segregation of information is a powerful method of maintaining authoritarian rule and is more typical of terrorist cells or intelligence agencies than of a scientific lab, but it would certainly explain how the Anversa group was able to mass-produce numerous irreproducible papers without any major dissent from within the group.
In addition to the secrecy and segregation of information, the group also created an atmosphere of fear to ensure obedience:
“Although individually-tailored stated and unstated threats were present for lab members, the plight of many of us who were international fellows was especially harrowing. Many were technically and educationally underqualified compared to what might be considered average research fellows in the United States. Many also originated in Italy where Dr. Anversa continues to wield considerable influence over biomedical research.
This combination of being undesirable to many other labs should they leave their position due to lack of experience/training, dependent upon employment for U.S. visa status, and under constant threat of career suicide in your home country should you leave, was enough to make many people play along.
Even so, I witnessed several people question the findings during their time in the lab. These people and working groups were subsequently fired or resigned. I would like to note that this lab is not unique in this type of exploitative practice, but that does not make it ethically sound and certainly does not create an environment for creative, collaborative, or honest science.”
Foreign researchers are particularly dependent on their employment to maintain their visa status, and the prospect of being fired can be terrifying for anyone.
This is an anonymous whistleblower’s account and, as such, it is problematic. The use of anonymous sources in science journalism could open the doors for all sorts of unfounded and malicious accusations, which is why the ethics of using anonymous sources was heavily debated at the recent ScienceOnline conference. But the claims of the whistleblower are not made in a vacuum – they have to be evaluated in the context of known facts. The whistleblower’s claim that the Anversa group and their collaborators received more than $50 million to study bone marrow cell and c-kit cell regeneration of the heart can be easily verified at the public NIH grant funding RePORTer website. The whistleblower’s claim that many of the Anversa group’s findings could not be replicated is also a verifiable fact. It may seem unfair to condemn Anversa and his group for creating an atmosphere of secrecy and obedience which undermined the scientific enterprise, tormented trainees and wasted millions of taxpayer dollars simply on the basis of one whistleblower’s account. However, if one looks at the entire picture of the amazing rise and decline of the Anversa group’s foray into cardiac regeneration, the whistleblower’s description of the atmosphere of secrecy and hierarchy seems very plausible.
Harvard’s investigation into the Anversa group is not open to the public, and it is therefore difficult to know whether the university is primarily investigating scientific errors or whether it is also looking into such claims of egregious scientific misconduct and abuse of scientific trainees. It is unlikely that Anversa’s group is the only one that might have engaged in such forms of misconduct. Threatening dissenting junior researchers with a loss of employment or visa status may be far more common than we think. The gravity of the problem requires that the NIH – the major funding agency for biomedical research in the US – look into the prevalence of such practices in research labs and develop safeguards to prevent the abuse of science and scientists.
The cancer researchers Glenn Begley and Lee Ellis made a rather remarkable claim last year. In a commentary that analyzed the dearth of efficacious novel cancer therapies, they revealed that scientists at the biotechnology company Amgen were unable to replicate the vast majority of published pre-clinical research studies. Only 6 out of 53 landmark cancer studies could be replicated, a dismal success rate of 11%! The Amgen researchers had deliberately chosen highly innovative cancer research papers, hoping that these would form the scientific basis for future cancer therapies that they could develop. It should not come as a surprise that progress in developing new cancer treatments is so sluggish. New clinical treatments are often based on innovative scientific concepts derived from pre-clinical laboratory research. However, if the pre-clinical scientific experiments cannot be replicated, it would be folly to expect that clinical treatments based on these questionable scientific concepts would succeed.
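As a back-of-the-envelope check on the numbers cited above, the reported replication rate and its statistical uncertainty can be sketched in a few lines of Python. The 6-of-53 figure is from the Begley and Ellis commentary; the Wilson confidence interval is my own illustrative addition, not something they reported:

```python
from math import sqrt

# Replication figures reported by Begley and Ellis: 6 of 53 landmark studies
replicated, attempted = 6, 53
rate = replicated / attempted  # ~0.113, i.e. the dismal ~11%

# Wilson 95% confidence interval for a proportion (illustrative only)
z = 1.96
center = (rate + z**2 / (2 * attempted)) / (1 + z**2 / attempted)
margin = (z / (1 + z**2 / attempted)) * sqrt(
    rate * (1 - rate) / attempted + z**2 / (4 * attempted**2)
)
print(f"replication rate: {rate:.1%} (95% CI {center - margin:.1%} to {center + margin:.1%})")
```

Even allowing for sampling uncertainty, the plausible range tops out well below a comfortable replication rate.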
Reproducibility of research findings is the cornerstone of science. Peer-reviewed scientific journals generally require that scientists conduct multiple repeat experiments and report the variability of their findings before publishing them. However, it is not uncommon for researchers to successfully repeat experiments and publish a paper, only to learn that colleagues at other institutions can’t replicate the findings. This does not necessarily indicate foul play. The reasons for the lack of reproducibility include intentional fraud and misconduct, yes, but more often it’s negligence, inadvertent errors, imperfectly designed experiments and the subliminal biases of the researchers or other uncontrollable variables.
Clinical studies of new drugs, for example, are often plagued by the biological variability found in study participants. A group of patients in a trial may exhibit different responses to a new medication compared to patients enrolled in similar trials at different locations. In addition to genetic differences between patient populations, factors like differences in socioeconomic status, diet, access to healthcare, criteria used by referring physicians, standards of data analysis by researchers or the subjective nature of certain clinical outcomes – as well as many other uncharted variables – might all contribute to different results.
The claims of low reproducibility made by Begley and Ellis, however, did not refer to clinical cancer research but to pre-clinical science. Pre-clinical scientists attempt to reduce the degree of experimental variability by using well-defined animal models and standardized outcomes such as cell division, cell death, cell signaling or tumor growth. Without the variability inherent in patient populations, pre-clinical research variables should in theory be easier to control. The lack of reproducibility in pre-clinical cancer research has a significance that reaches far beyond just cancer research. Similar or comparable molecular and cellular experimental methods are also used in other areas of biological research, such as stem cell biology, neurobiology or cardiovascular biology. If only 11% of published landmark papers in cancer research are reproducible, it raises questions about how published papers in other areas of biological research fare.
Following the publication of Begley and Ellis’ commentary, cancer researchers wanted to know more details. Could they reveal the list of the irreproducible papers? How were the experiments at Amgen conducted to assess reproducibility? What constituted a successful replication? Were certain areas of cancer research or specific journals more prone to publishing irreproducible results? What was the cause of the poor reproducibility? Unfortunately, the Amgen scientists were bound by confidentiality agreements that they had entered into with the scientists whose work they attempted to replicate. They could not reveal which papers were irreproducible or specific details regarding the experiments, thus leaving the cancer research world in a state of uncertainty. If so much published cancer research cannot be replicated, how can the field progress?
Lee Ellis has now co-authored another paper to delve further into the question. In the study, published in the journal PLOS One, Ellis teamed up with colleagues at the renowned University of Texas MD Anderson Cancer Center to survey faculty members and trainees (PhD students and postdoctoral fellows) at the center. Only 15-17% of their colleagues responded to the anonymous survey, but the responses confirmed that reproducibility of papers in peer-reviewed scientific journals is a major problem. Two-thirds of the senior faculty respondents revealed they had been unable to replicate published findings, and the same was true for roughly half of the junior faculty members as well as trainees. Seventy-eight percent of the scientists had attempted to contact the authors of the original scientific paper to identify the problem, but only 38.5% received a helpful response. Nearly 44% of the researchers encountered difficulties when trying to publish findings that contradicted the results of previously published papers.
The list of scientific journals in which some of the irreproducible papers were published includes the “elite” of scientific publications: the prestigious Nature tops the list with ten mentions, but one can also find Cancer Research (nine mentions), Cell (six mentions), PNAS (six mentions) and Science (three mentions).
Does this mean that these high-profile journals are the ones most likely to publish irreproducible results? Not necessarily. Researchers typically choose to replicate the work published in high-profile journals and use that as a foundation for new projects. Researchers at MD Anderson Cancer Center may not have been able to reproduce the results of ten cancer research papers published in Nature, but the survey did not provide any information regarding how many cancer research papers in Nature were successfully replicated.
The lack of data on successful replications is a major limitation of this survey. We know that more than half of all scientists responded “Yes” to the rather opaque question “Have you ever tried to reproduce a finding from a published paper and not been able to do so?”, but we do not know how often this occurred. Researchers who successfully replicated nine out of ten papers and researchers who failed to replicate four out of four published papers would have both responded “Yes.” Other limitations of this survey include that it does not list the specific irreproducible papers or clearly define what constitutes reproducibility. Published scientific papers represent years of work and can encompass five, ten or more distinct experiments. Does successful reproducibility require that every single experiment in a paper be replicated or just the major findings? What if similar trends are seen but the magnitude of effects is smaller than what was published in the original paper?
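The information lost by a binary "ever failed?" question can be illustrated with a toy example. The two hypothetical researchers below are invented purely to make the point:

```python
# Two hypothetical researchers with very different replication track records:
researcher_a = {"attempts": 10, "failures": 1}  # replicated 9 out of 10 papers
researcher_b = {"attempts": 4, "failures": 4}   # replicated 0 out of 4 papers

def survey_answer(r):
    """The survey's yes/no question: 'Have you ever failed to reproduce a finding?'"""
    return "Yes" if r["failures"] > 0 else "No"

def failure_rate(r):
    """The quantity the survey never asked for."""
    return r["failures"] / r["attempts"]

for name, r in [("A", researcher_a), ("B", researcher_b)]:
    print(f"Researcher {name}: answer={survey_answer(r)}, failure rate={failure_rate(r):.0%}")
# Both answer "Yes", although one failed 10% of the time and the other 100%.
```

A single bit of information collapses a 10% failure rate and a 100% failure rate into the same survey response.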
Does this mean that cancer research is facing a crisis? If only 11% of pre-clinical cancer research is reproducible, as originally proposed by Begley and Ellis, then it might be time to sound the alarm bells. But since we don’t know how exactly reproducibility was assessed, it is impossible to ascertain the extent of the problem. The word “crisis” also has a less sensationalist meaning: the time for a crucial decision. In that sense, cancer research and perhaps much of contemporary biological and medical research needs to face up to the current quality control “crisis.” Scientists need to wholeheartedly acknowledge that reproducibility is a major problem and crucial steps must be taken to track and improve the reproducibility of published scientific work.
First, scientists involved in biological and medical research need to foster a culture that encourages the evaluation of reproducibility and develop the necessary infrastructure. When scientists are unable to replicate results of published papers and contact the authors, the latter need to treat their colleagues with respect and work together to resolve the issue. Many academic psychologists have already recognized the importance of tracking reproducibility and initiated a large-scale collaborative effort to tackle the issue; the Harvard psychologists Joshua Hartshorne and Adena Schachner also recently proposed using a formal approach to track the reproducibility of research. Biological and medical scientists should consider adopting similar infrastructures for their research, because reproducibility is clearly not just a problem for psychology research.
Second, grant-funding agencies should provide adequate research funding for scientists to conduct replication studies. Currently, research grants are awarded to those who propose the most innovative experiments, but few — if any — funds are available for researchers who want to confirm or refute a published scientific paper. While innovation is obviously important, attempts to replicate published findings deserve recognition and funding because new work can only succeed if it is built on solid, reproducible scientific data.
In the U.S., it can take 1-2 years from when researchers submit a grant proposal to when they receive funding to conduct research. Funding agencies could consider an alternate approach, one that allows for rapid approval of small-budget grant proposals so that researchers can immediately start evaluating the reproducibility of recent breakthrough discoveries. Such funding for reproducibility testing could be provided to individual laboratories or teams of scientists such as the Reproducibility Initiative or the recent efforts of chemistry bloggers to document reproducibility.
Lastly, it is also important that scientific journals address the issue of reproducibility. One of the most common and also most heavily criticized metrics for the success of a scientific journal is its “impact factor,” an indicator of how often an average article published in the journal is cited. Even irreproducible scientific papers can be cited thousands of times and boost a journal’s “impact.”
If a system tracked the reproducibility of scientific papers, one could conceivably calculate a reproducibility score for any scientific journal. That way, a journal’s reputation would not only rest on the average number of citations but also on the reliability of the papers it publishes. Scientific journals should also consider supporting reproducibility initiatives by encouraging the publication of papers that attempted to replicate previous papers — as long as the reproducibility was tested in a rigorous fashion and independent of whether or not the replication attempts were successful.
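To make the proposal concrete, a journal-level reproducibility score could be computed alongside citation metrics. The sketch below uses entirely made-up journals and numbers, only to show what such a metric would capture:

```python
# Hypothetical replication records for two invented journals -- illustrative only.
journals = {
    "Journal A": {"citations_per_paper": 30.0,
                  "replications_attempted": 40, "replications_successful": 12},
    "Journal B": {"citations_per_paper": 4.5,
                  "replications_attempted": 20, "replications_successful": 15},
}

def reproducibility_score(j):
    """Fraction of attempted replications of the journal's papers that succeeded."""
    return j["replications_successful"] / j["replications_attempted"]

for name, j in journals.items():
    print(f"{name}: ~{j['citations_per_paper']} citations/paper, "
          f"reproducibility {reproducibility_score(j):.0%}")
# A highly cited journal can score poorly (30%) while a modestly cited one fares better (75%).
```

The point of the metric is exactly this decoupling: citation counts and reliability need not move together.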
There is no need to publish the 20th replication study that merely confirms what 19 previous studies have found, but publication of replication attempts is sorely needed before a consensus is reached regarding a scientific discovery. The journal PLOS One has partnered with the Reproducibility Initiative to provide a forum for the publication of replication studies, and there is no reason why other journals should not follow suit.
While PLOS One publishes many excellent papers, current criteria for tenure and promotion at academic centers often require that researchers publish in certain pre-specified scientific journals, including those affiliated with professional societies and those which carry prestige in a designated field of research. If these journals also encouraged the publication of replication attempts, more researchers would conduct them and contribute to the post-publication quality control of the scientific literature.
The recent questions raised about the reproducibility of biological and medical research findings are forcing scientists to embark on a soul-searching mission. It is likely that this journey will shake up many long-held beliefs. But this reappraisal will ultimately lead to a more rigorous and reliable science.
One major obstacle in the “infotainment versus critical science writing” debate is that there is no universal definition of what constitutes “critical analysis” in science writing. How can we decide whether critical science writing is adequately represented in contemporary science writing or science journalism if we do not have a standardized method of assessing it? For this purpose, I would like to propose the following checklist of points that can be addressed in news articles or blog-posts which focus on the critical analysis of published scientific research. This checklist is intended for the life sciences – biological and medical research – but it can easily be modified and applied to critical science writing in other areas of research. Each category contains examples of questions that science writers can direct to members of the research team or institutional representatives, or address by independently reviewing the published scientific data. These questions will have to be modified according to the specific context of a research study.
1. Novelty of the scientific research:
Most researchers routinely claim that their findings are novel, but are the claims of novelty appropriate? Is the research pointing towards a fundamentally new biological mechanism or introducing a completely new scientific tool? Or does it just represent a minor incremental growth in our understanding of a biological problem?
2. Significance of the research:
How does the significance of the research compare to the significance of other studies in the field? A biological study might uncover new regulators of cell death or cell growth, but how many other such regulators have been discovered in recent years? How does the magnitude of the effect in the study compare to the magnitude of effects in other research studies? Suppressing a gene might prolong the survival of a cell or increase the regeneration of an organ, but have research groups published similar effects in studies which target other genes? Some research studies report effects that are statistically significant, but are they also biologically significant?
3. Replicability of the findings:
Have the findings of the scientific study been replicated by other research groups? Does the research study attempt to partially or fully replicate prior research? If the discussed study has not yet been replicated, is there any information available on the general replicability success rate in this area of research?
4. Experimental design:
Did the researchers use an appropriate experimental design for the current study by ensuring that they included adequate control groups and addressed potential confounding factors? Were the experimental models appropriate for the questions they asked and for the conclusions they are drawing? Did the researchers study the effects they observed at multiple time points or just at one single time point? Did they report the results of all the time points or did they just pick the time points they were interested in?
Examples of issues: 1) Stem cell studies in which human stem cells are transplanted into injured or diseased mice are often conducted with immune deficient mice to avoid rejection of the human cells. Some studies do not assess whether the immune deficiency itself impacted the injury or disease, which could be a confounding factor when interpreting the results. 2) Studies which investigate the impact of the 24-hour internal biological clock on the expression of genes sometimes perform the studies in humans and animals who maintain a regular sleep-wake schedule. This obscures the cause-effect relationship because one is unable to ascertain whether the observed effects are truly regulated by an internal biological clock or whether they merely reflect changes associated with being awake versus asleep.
5. Experimental methods:
Are the methods used in the research study accepted by other researchers? If the methods are completely novel, have they been appropriately validated? Are there any potential artifacts that could explain the findings? How did the findings in a dish (“in vitro”) compare to the findings in an animal experiment (“in vivo”)? If new genes were introduced into cells or into animals, was the level of activity comparable to levels found in nature or were the gene expression levels 10-, 100- or even 1000-fold higher than physiologic levels?
Examples of issues: In stem cell research, a major problem faced by researchers is how stem cells are defined, what constitutes cell differentiation and how the fate of stem cells is tracked. One common problem that has plagued peer-reviewed studies published in high-profile journals is the inadequate characterization of the stem cells and of the function of the mature cells derived from them. Another problem in the stem cell literature is that stem cells are routinely labeled with fluorescent markers to help track their fate, but it is becoming increasingly apparent that unlabeled cells (i.e. non-stem cells) can emit a non-specific fluorescence quite similar to that of the labeled stem cells. If a study does not address such problems, some of its key conclusions may be flawed.
6. Statistical analysis:
Did the researchers use the appropriate statistical tests to test the validity of their results? Were the experiments adequately powered (i.e., did they have a sufficient sample size) to draw valid conclusions? Did the researchers pre-specify the number of repeat experiments, animals or humans in their experimental groups prior to conducting the studies? Did they modify the number of animals or human subjects in the experimental groups during the course of the study?
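The question of adequate power can be made concrete with a standard sample-size calculation. The sketch below uses the common normal-approximation formula for a two-group comparison of means; the effect size and error rates are illustrative choices, not values drawn from any particular study:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample comparison of means,
    using the normal approximation: n = 2 * (z_{1-a/2} + z_{power})^2 / d^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # desired statistical power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A "medium" standardized effect (d = 0.5) needs roughly 63 subjects per group
# at 80% power, while a large effect (d = 1.0) needs far fewer.
print(sample_size_per_group(0.5))
print(sample_size_per_group(1.0))
```

Experiments with only a handful of animals per group are therefore only adequately powered for very large effects.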
7. Consensus or dissent among scientists:
What do other scientists think about the published research? Do they agree with the novelty, significance and validity of the scientific findings as claimed by the authors of the published paper or do they have specific concerns in this regard?
8. Peer review process:
What were the major issues raised during the peer review process? How did the researchers address the concerns of the reviewers? Did any journals previously reject the study before it was accepted for publication?
9. Financial interests:
How was the study funded? Did the organization or corporation which funded the study have any say in how the study was designed, how the data was analyzed and what data was included in the publication? Do the researchers hold any relevant patents, own stock or receive other financial incentives from institutions or corporations that could benefit from this research?
10. Scientific misconduct, fraud or breach of ethics
Are there any allegations or concerns about scientific misconduct, fraud or breach of ethics in the context of the research study? If such concerns exist, what are the specific measures taken by the researchers, institutions or scientific journals to resolve the issues? Have members of the research team been previously investigated for scientific misconduct or fraud? Are there concerns about how informed consent was obtained from the human subjects?
This is just a preliminary list and I would welcome any feedback on how to improve this list in order to develop tools for assessing the critical analysis content in science writing. It may not always be possible to obtain the pertinent information. For example, since the peer review process is usually anonymous, it may be impossible for a science writer to find out details about what occurred during the peer review process if the researchers themselves refuse to comment on it.
One could assign a point value to each of the categories in this checklist and then score individual science news articles or science blog-posts that discuss specific research studies. A more in-depth discussion of an issue would result in a higher point score for that category.
Points would not only be based on the number of issues raised but also on the quality of analysis provided in each category. Listing all the funding sources is not as helpful as providing an analysis of how the funding could have impacted the data interpretation. Similarly, if the science writer notices errors in the experimental design, it would be very helpful for the readers to understand whether these errors invalidate all major conclusions of the study or just some of its conclusions. Adding up all the points would then generate a comprehensive score that could become a quantifiable indicator of the degree of critical analysis contained in a science news article or blog-post.
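A minimal version of such a scoring scheme might look as follows. The categories mirror the checklist above; the 0-3 scale and the example scores are arbitrary placeholders of my own choosing:

```python
# Checklist categories from the article; each gets a 0-3 quality score:
# 0 = not addressed, 1 = mentioned, 2 = discussed, 3 = analyzed in depth.
CATEGORIES = [
    "novelty", "significance", "replicability", "experimental design",
    "experimental methods", "statistical analysis", "consensus or dissent",
    "peer review process", "financial interests", "misconduct or ethics",
]

def critical_analysis_score(scores):
    """Sum per-category quality scores into one overall indicator."""
    if set(scores) - set(CATEGORIES):
        raise ValueError("unknown category")
    return sum(scores.get(c, 0) for c in CATEGORIES)

# Hypothetical article: discusses methods in depth, merely lists funding sources.
example = {"experimental methods": 3, "financial interests": 1}
print(critical_analysis_score(example), "out of", 3 * len(CATEGORIES))
```

The quality-of-analysis weighting described above is what distinguishes this from a simple checklist tally: listing funding sources earns one point, while analyzing their influence on the data interpretation would earn three.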
EDIT: The checklist now includes a new category – scientific misconduct, fraud or breach of ethics.