The surge of paper mill papers in the scientific literature has prompted scientists to create new tools for identifying fake publications.
Over the last two decades, the scientific literature has been flooded with low-quality research papers produced by for-profit organizations known as paper mills. Suspected paper mill products are estimated to account for between 2 and 46 percent of manuscripts submitted to scientific journals, and the estimated rate of problematic articles in biomedical research reached nearly six percent in 2023.1,2
To churn out manuscripts, paper mills often rely on templates, resulting in scientific articles with shared features. These may include textual and layout similarities, superficial descriptions of hypotheses and experimental designs, manipulated or reused digital images, and the incorrect description of reagents.3 While these manuscript “recipes” may speed up paper mills’ production, they also act as fingerprints that science integrity researchers can identify to flag papers as potential paper mill products.
In a study published earlier this year in The BMJ, a team of scientists led by statistician Adrian Barnett from the Queensland University of Technology developed a new machine learning tool to screen publications in cancer research and flag those likely to be from paper mills.4 They found that nearly ten percent of the cancer research literature screened with the tool could have originated in paper mills—a percentage that exceeds the estimated prevalence of paper mill papers in biomedical research and indicates that cancer research is a major target of these fraudulent companies.2

Of the 2.6 million cancer papers screened, nearly ten percent (261,245 publications) showed textual signs in their abstracts and titles that suggested they might have originated in a paper mill. Gastric, bone, liver, esophageal, and ovarian cancers were the cancer types with the most flagged papers.
Image credit: Erin Lemieux
“We have few solutions and even fewer researchers trying to design solution[s] for the problem, so this is really amazing,” said João Phillipe Cardenuto, a postdoctoral researcher and digital forensics scientist from the University of Campinas, Brazil, who was not involved in the study.
To identify cancer papers that likely originated from paper mills, Barnett and his colleagues developed a machine learning-based tool that identifies patterns in text and then compares these to textual patterns present in retracted paper mill papers. While previous research suggested that text templates can be used to train machine learning models to identify paper mill products, this approach had never been tested in the cancer research field.5 “Unfortunately, cancer has been quite a target for these kinds of papers,” Barnett explained. “Partly, it’s the prestige of working in cancer. There are a lot of journals in cancer. Partly, basic science has been a bit of a target for these paper mills because it’s kind of a little bit easier to make up data.”
The team focused their analysis on the abstracts and titles of papers, as these components were easy to access. They developed their model using papers tagged as originating from paper mills in the Retraction Watch database, and then validated the tool's performance using an online list of problematic papers compiled by integrity sleuths. In the performance test runs, the machine learning tool correctly flagged problematic papers with about 90 percent accuracy.
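The core idea of screening text for template reuse can be illustrated with a minimal sketch. This is not the authors' actual model: it is a toy, stdlib-only example that flags a title or abstract when its bag-of-words cosine similarity to any known paper mill text exceeds a threshold. The example texts, function names, and threshold are all invented for illustration; real screening tools use far richer features and trained classifiers.

```python
import math
from collections import Counter

def vectorize(text):
    # Toy feature extraction: lowercase bag-of-words counts.
    # Real tools use richer features (n-grams, layout, metadata).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_paper(abstract, known_mill_texts, threshold=0.5):
    # Flag the abstract if it closely resembles any known paper mill text.
    vec = vectorize(abstract)
    return any(cosine(vec, vectorize(k)) >= threshold for k in known_mill_texts)

# Hypothetical template-like titles, invented for illustration only.
mill_texts = [
    "mir-123 promotes proliferation and invasion of gastric cancer cells by targeting gene-x",
]
suspect = "mir-456 promotes proliferation and invasion of liver cancer cells by targeting gene-y"
unrelated = "a randomized trial of exercise and diet in cardiovascular disease prevention"

print(flag_paper(suspect, mill_texts))    # True: heavy word overlap with the template
print(flag_paper(unrelated, mill_texts))  # False: little overlap
```

The sketch captures why templated manuscripts are detectable: swapping a gene name or cancer type into a fixed sentence frame leaves most of the wording intact, so the similarity to known paper mill text stays high.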
The scientists then ran their screening tool against 2.6 million cancer research articles published between 1999 and 2024. Among the publications, 261,245 papers—nearly 10 percent of the entire literature corpus analyzed—showed textual similarities with retracted paper mill papers.
While the percentage of flagged cancer papers seems high, Barnett explained that it may underestimate the actual prevalence of paper mill products in the field, as these companies have ramped up their production over the years—a trend also observed in the current study. “If it’s actually ten percent, we don’t really know. It could actually be more because we’re just detecting one particular kind of template,” he said. “If the mills have other templates that are more sophisticated, we would have missed them.”
The potential fake papers were most frequently associated with certain types of cancers, including gastric (22 percent), bone (21 percent), and liver (20 percent) cancers. The team also found that the percentage of flagged papers in top-tier journals showed a sustained increase, indicating that paper mills are not limited to low-impact journals and suggesting that impact factors might not be reliable proxies for research quality.6
The machine learning screening tool also revealed that authors from Chinese institutions accounted for the bulk of potential paper mill papers (36 percent), a finding consistent with previous data on the origins of these fraudulent papers.2 Despite the authors' efforts to balance the training datasets by language, the overrepresentation of Chinese researchers might still introduce a bias in the model, explained Cardenuto, as the tool may learn patterns frequently used by Chinese authors in scientific writing rather than features that are linked to fake publications.
While the new screening tool was designed with scientific article publishers in mind, Barnett hopes it may also bring the paper mill issue into the spotlight and raise more awareness among researchers. “It’s unfortunately now something that you need to be thinking about when you’re reading papers or reviewing papers,” he said.