J Clin Epidemiol. 2026 Mar 12:112221. pii: S0895-4356(26)00096-X. [Epub ahead of print]
OBJECTIVES: With the exponential growth of biomedical literature, conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in automating some or all steps of systematic reviews and meta-analyses.
STUDY DESIGN AND SETTING: In this systematic review, we searched PubMed, Embase, the Cochrane Library, and preprint platforms up to January 14, 2025. We included any study assessing the performance of LLMs (e.g., GPT, Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We summarized performance as median (IQR) positive percent agreement (PPA) and negative percent agreement (NPA) between LLMs and human reviewers, analogous to sensitivity and specificity, respectively.
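As a minimal illustration of the agreement metrics described above, the sketch below computes PPA and NPA for one LLM-vs-human comparison, treating the human reviewers' inclusion decisions as the reference standard. The decision vectors are invented for illustration and do not come from the review.

```python
def percent_agreement(llm, human):
    """Positive and negative percent agreement of LLM screening
    decisions against human reviewer decisions (reference standard).

    PPA = TP / (TP + FN), analogous to sensitivity.
    NPA = TN / (TN + FP), analogous to specificity.
    Decisions are truthy (include) or falsy (exclude).
    """
    tp = sum(1 for l, h in zip(llm, human) if l and h)
    fn = sum(1 for l, h in zip(llm, human) if not l and h)
    tn = sum(1 for l, h in zip(llm, human) if not l and not h)
    fp = sum(1 for l, h in zip(llm, human) if l and not h)
    ppa = tp / (tp + fn) if (tp + fn) else float("nan")
    npa = tn / (tn + fp) if (tn + fp) else float("nan")
    return ppa, npa


# Hypothetical screening decisions for 8 references (1 = include).
human = [1, 1, 1, 0, 0, 0, 0, 1]
llm   = [1, 1, 0, 0, 0, 1, 0, 1]
ppa, npa = percent_agreement(llm, human)  # 0.75, 0.75
```

In the review itself, per-assessment PPA and NPA values such as these were then aggregated across studies as medians with interquartile ranges.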
RESULTS: From 3,889 unique references, we included 63 studies, of which 52 reported performance metrics, yielding a total of 148 LLM performance assessments. Most assessments concerned GPT models (n=114, 77%). The most frequently evaluated tasks were title and abstract screening (n=78, 53%), data extraction (n=23, 16%), and full-text screening (n=20, 14%). For title and abstract screening, overall median PPA was 0.92 (IQR 0.69-0.98) and median NPA was 0.89 (IQR 0.72-0.95). For full-text screening, overall median PPA was 0.93 (IQR 0.87-1.00) and median NPA was 0.92 (IQR 0.78-0.97). Late-generation LLMs released after GPT-4 appeared to outperform earlier models. For other tasks, authors reported generally good performance, but variability in the reported metrics precluded complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median of 0.95 (IQR 0.91-0.97, n=11). For risk of bias assessment, accuracy ranged from 0.44 to 0.90 (median 0.62, IQR 0.53-0.76, n=6).
CONCLUSION: LLMs, particularly newer generations, show promise for automating some repetitive steps of systematic reviews, such as screening. However, their successful integration will require appropriate safeguards and careful implementation.
Keywords: Artificial intelligence; Large language models; Meta-analyses; Methodology; Systematic reviews