Analysis and Correction of Inappropriate Image Duplication: The Molecular and Cellular Biology Experience

Update: This paper has now been published in mBio.

“Analysis and Correction of Inappropriate Image Duplication: The Molecular and Cellular Biology Experience” by Arturo Casadevall, Elisabeth M Bik, Ferric C Fang, Amy Kullas, Roger J Davis. [Ed note: we’re using the bioRxiv listing information because that’s what we sent for review, but the author order has changed on the published paper, where Dr. Bik is first author.]

I selected this article for review for a few reasons:

  • The integrity of the scientific literature is interesting to me.
  • I’m interested in approaches that automatically identify inappropriate image adjustments, so I was interested in the results of a manual process.
  • The manuscript was widely discussed, including on Retraction Watch, which was where I found it.

Large-scale issues with reproducibility in the scientific literature have been reported in the past, including an Amgen report that work could be reproduced in only 11% of cases. There were some caveats with the Amgen finding, namely that Amgen did not identify which findings were or were not reproduced. There are many potential drivers of irreproducibility, including a desire by certain journals to highlight surprising findings, difficulty obtaining adequate peer review for certain manuscripts, cherry picking of findings combined with inadequate statistical power, sloppy record keeping by scientists, and potential manipulation of data to strengthen support for a specific conclusion.

This article fits best into the final two categories. The authors describe an analysis of nearly one thousand papers from Molecular and Cellular Biology. They find 59 instances of inappropriate image duplication, which resulted in five retractions. I sent this paper to two reviewers: one was computational and focused on large-scale detection of image fraud, while the other was a cell biologist with expertise working with these types of images.

After reading the reviewers’ comments I agree that the work remains potentially important. Both reviewers noted substantial weaknesses as well. The lack of representative images, detail about the types of images examined, and other factors related to the design of this experiment were perceived to be problems.

As a procedural note, we sent this paper out for review when it was a preprint, though it has since been published. A quick read of the published manuscript does not show changes that would have addressed the concerns of our reviewers.

One thing that we, as editors at biOverlay, noticed was that the implication in the abstract that “as many as 35,000 papers in the literature are candidates for retraction” appears to be poorly justified. Instead, the number in the body of the paper appears to be correctly specified: “we can estimate that approximately 35,000 (CI 6,584-86,911) papers are candidates for retraction due to image duplication.” Unfortunately, the number from the abstract appears to have been picked up by many news sources including Retraction Watch, Chemistry World, and The Scientist. Some sources, such as Science Translational Medicine, do appear to have gotten things right [update: Retraction Watch notes that they dive into the nuance within the article]. The estimate from the abstract of the preprint did also make it into the abstract of the published paper.
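The width of the quoted interval is easier to appreciate with a small calculation. The sketch below is not the authors' extrapolation. It only applies a Wilson score interval to the counts mentioned in this post (59 flagged among 960 screened papers, five retractions), and it assumes, purely for illustration, that each of the 59 instances corresponds to one paper. The function name `wilson_interval` is ours.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts quoted in this post; treating each flagged instance as one paper
# is an assumption made for this illustration.
flagged, screened, retracted = 59, 960, 5

dup_rate = flagged / screened        # share of screened papers that were flagged
retract_rate = retracted / flagged   # share of flagged papers that were retracted
retract_low, retract_high = wilson_interval(retracted, flagged)

print(f"flagged: {dup_rate:.1%} of screened papers")
print(f"retracted: {retract_rate:.1%} of flagged papers "
      f"(95% interval {retract_low:.1%} to {retract_high:.1%})")
```

With only five retractions in the numerator, the upper bound of the interval is several times the lower bound, so any count extrapolated from this rate is better reported with its interval than as a single "as many as" figure.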

I want to thank the authors for sharing their work on bioRxiv before it had undergone peer review. I also want to take the opportunity to thank the reviewers who have donated their time to evaluate the novelty, strengths, and limitations of this work. One reviewer has chosen to be named. Both reviewers are faculty members. The two reviews are available in full below.

Reviewer 1 (Anonymous reviewer with expertise in cell biology)

The authors describe a survey of image duplication in the journal Molecular and Cellular Biology. The numbers reported in the abstract are substantial. Upon reading the paper I have concerns about the reproducibility of the methods and the accuracy of the analysis.

Major concerns:

  • The methods section is inadequate. The selection of papers seems adequately described, but the initial screening description just says results were inspected by eye. There is no description of the number of images flagged by EMB that were not verified by AC and FCF or the inter-individual concordance between AC and FCF. The experimental design would have been substantially stronger if randomly selected non-flagged images were also provided to AC and FCF in a blinded manner.
  • The rate at which image allegations were confirmed using ORI forensic software is not described.
  • Were the types of image duplication that occurred before 2013 and during or after 2013 different?
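One common way to report the concordance between two raters, as requested in the first concern, is Cohen's kappa. A minimal sketch follows. The function name, the variable names, and the ten example verdicts are hypothetical and do not come from the manuscript.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same verdict.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (True = duplication confirmed) on ten flagged images.
reviewer_1 = [True, True, False, True, False, False, True, True, False, True]
reviewer_2 = [True, False, False, True, False, True, True, True, False, True]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw percent agreement alone can look high when most images fall in one class, which is why a chance-corrected statistic is usually preferred for this kind of report.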

The authors state: “Consequently, we are now able to provide information as to how inappropriate image duplications occur.” I am not convinced. We do not know the sensitivity and specificity of this screening process. Even if the specificity is high, there may be types of alteration that were not detected, and they may occur differently than the types that were detected.

The main difficulty with this paper is that there is no way to verify that any of the data presented are true. The only evidence that the process does not result in false positives is the response of the authors to contact by the journal. It is typical when providing quantification of photographic data that a representative image is shown, but none are shown here. The images covered by this study are also not provided, which makes judging the accuracy of the work difficult.

Reviewer 2 (Daniel Ernesto Acuna)

This article examines the effect of reporting inappropriate image duplication, the effort required to resolve those reports, and a pilot program to perform image duplication detection during submission. This research was done in the journal Molecular and Cellular Biology (MCB), and the initial analysis of duplication detection was performed on 960 articles from 2009 through 2016. The results suggest that inappropriate image duplication is largely unintentional. Predictably, correcting errors or misconduct requires significantly less time before publication than after publication.

This article is an important step toward understanding how to correct and prevent cases of inappropriate image duplication. It complements previous work by the authors on the detection of these cases (Bik et al., 2016) and the potential causes of this type of misconduct (Costas et al., 2018). Given the reported large proportion of unintentional errors, it also strengthens the case for pre-publication scanning of publications either by staff or automated tools (e.g., Acuna et al., 2018). The authors report 30 minutes of staff time to scan a single submission, which I imagine is a significant but worthy effort. Finally, I commend the journal and editors for conducting a retrospective study of their journal. In my experience, journals, publishers, and ORIs are more interested in preventing problems for future publications and are a bit wary of reexamining the past. However, we must perform retroactive studies to rectify results that could be used in science produced today.

Reanalysis of corrections. Playing devil’s advocate, I would be interested in knowing whether the corrections submitted by the authors contained inappropriate images (again). There have been cases of researchers producing corrections that were again found to be suspicious, ultimately leading to retractions. According to the authors, most flagged images are due to errors in figure preparation. I would imagine therefore that figure preparation produces certain types of errors (e.g., exact copies), whereas misconduct produces others (e.g., rotations or changes in contrast). For example, a breakdown of image duplication categories (I, II, III) vs follow-up results (correction, retraction, or no response) may illuminate this point.
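The breakdown suggested here amounts to a simple cross-tabulation. A minimal sketch, assuming each flagged paper is recorded as a (category, outcome) pair. The records below are hypothetical placeholders, not data from the study, and `cross_tabulate` is an illustrative name.

```python
from collections import Counter

CATEGORIES = ("I", "II", "III")
OUTCOMES = ("correction", "retraction", "no response")


def cross_tabulate(records):
    """Count flagged papers for every (category, outcome) combination."""
    for category, outcome in records:
        if category not in CATEGORIES or outcome not in OUTCOMES:
            raise ValueError(f"unexpected record: {(category, outcome)!r}")
    counts = Counter(records)
    return {
        category: {outcome: counts[(category, outcome)] for outcome in OUTCOMES}
        for category in CATEGORIES
    }


# Hypothetical placeholder records, for illustration only.
records = [
    ("I", "correction"), ("I", "correction"), ("I", "no response"),
    ("II", "correction"), ("II", "retraction"),
    ("III", "retraction"), ("III", "correction"),
]

table = cross_tabulate(records)
print("category  " + "  ".join(f"{o:>11}" for o in OUTCOMES))
for category in CATEGORIES:
    row = "  ".join(f"{table[category][o]:>11}" for o in OUTCOMES)
    print(f"{category:<8}  {row}")
```

If retractions were concentrated in one category while corrections dominated another, that pattern would support the idea that preparation errors and misconduct leave different signatures.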

Generalizability. How expert was the staff that examined the figures? Were they trained? I know that some of the authors have great experience scanning images for duplication, and I am wondering how their experience compares to that of the staff. Also, how do the types of figures differ from those reviewed in Bik et al., 2016? It would be good to discuss how the experience from this journal would generalize to other journals. Perhaps more details about the categories of duplication and the types of images examined (e.g., western blots, microscopy imagery, plots, x-rays, etc.) would help to translate these results to other journals. While the rate of inappropriate images was similar to that in their previous study, further discussion would be good.

Casey Greene is an Assistant Professor in the Department of Systems Pharmacology and Translational Therapeutics at the University of Pennsylvania's Perelman School of Medicine and the Director of Alex's Lemonade Stand Foundation's Childhood Cancer Data Lab. His lab aims to develop deep learning methods that integrate distinct large-scale datasets to extract the rich and intrinsic information embedded in such integrated data.