bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-05-11
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Am Med Inform Assoc. 2025 May 07. pii: ocaf063. [Epub ahead of print]
     OBJECTIVES: This study aims to summarize the use of large language models (LLMs) in the creation of scientific reviews by examining both methodological papers that describe the use of LLMs for review automation and review papers that acknowledge they were produced with the support of LLMs.
    MATERIALS AND METHODS: The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on based on the OpenAI GPT-4o model (a minimal illustration of this kind of LLM-assisted screening follows this entry). ChatGPT and Scite.ai were used to clean the data, generate the code for figures, and draft the manuscript.
    RESULTS: Of the 3788 articles retrieved, 172 studies were deemed eligible for the final review. ChatGPT and other GPT-based LLMs emerged as the dominant architecture for review automation (n = 126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n = 26, 15.1%) were actual reviews that acknowledged LLM usage. Most citations focused on automating a particular stage of the review, such as Searching for publications (n = 60, 34.9%) and Data extraction (n = 54, 31.4%). When the pooled performance of GPT-based and BERT-based models was compared, the former performed better in data extraction, with a mean precision of 83.0% (SD = 10.4) and a mean recall of 86.0% (SD = 9.8).
    DISCUSSION AND CONCLUSION: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. Despite limitations, such as lower accuracy of extraction for numeric data, we anticipate that LLMs will soon change the way scientific reviews are conducted.
    Keywords:  Covidence; large language models; review automation; scoping review; systematic review
    DOI:  https://doi.org/10.1093/jamia/ocaf063
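A minimal sketch of how a GPT-4o-style model can be asked to screen a single title/abstract record, in the spirit of the Methods above. This is not the Covidence add-on or the authors' code; the prompt wording, inclusion criteria, and model name are illustrative assumptions.

```python
# Hypothetical LLM-assisted screening of one bibliographic record.
# Not the authors' workflow; prompt and criteria are invented for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ELIGIBILITY_PROMPT = """You are screening records for a review on the use of
large language models (LLMs) in review automation.

Inclusion criteria (assumed for this example):
- The record describes LLM use in any stage of a scientific review, OR
- The record is a review that acknowledges LLM assistance.

Answer with exactly one word: INCLUDE, EXCLUDE, or UNSURE.

Title: {title}
Abstract: {abstract}
"""

def screen_record(title: str, abstract: str) -> str:
    """Return the model's screening decision for one record."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user",
                   "content": ELIGIBILITY_PROMPT.format(title=title, abstract=abstract)}],
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    decision = screen_record(
        "Automating data extraction with GPT-4",
        "We evaluate GPT-4 for extracting outcome data from randomized trials...",
    )
    print(decision)  # e.g. INCLUDE
```

In practice such calls would be batched, logged, and paired with human verification of every decision, as implied by the dual human/LLM process described in the abstract.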
  2. AI Ethics. 2025 Apr. 5(2): 1499-1521
      Using artificial intelligence (AI) in research offers many important benefits for science and society but also creates novel and complex ethical issues. While these ethical issues do not necessitate changing established ethical norms of science, they require the scientific community to develop new guidance for the appropriate use of AI. In this article, we briefly introduce AI and explain how it can be used in research, examine some of the ethical issues raised when using it, and offer nine recommendations for responsible use, including: (1) Researchers are responsible for identifying, describing, reducing, and controlling AI-related biases and random errors; (2) Researchers should disclose, describe, and explain their use of AI in research, including its limitations, in language that can be understood by non-experts; (3) Researchers should engage with impacted communities, populations, and other stakeholders concerning the use of AI in research to obtain their advice and assistance and address their interests and concerns, such as issues related to bias; (4) Researchers who use synthetic data should (a) indicate which parts of the data are synthetic; (b) clearly label the synthetic data; (c) describe how the data were generated; and (d) explain how and why the data were used; (5) AI systems should not be named as authors, inventors, or copyright holders but their contributions to research should be disclosed and described; (6) Education and mentoring in responsible conduct of research should include discussion of ethical use of AI.
    Keywords:  Accountability; Artificial intelligence; Bias; Error; Ethics; Explainability; Research; Social responsibility; Transparency; Trust
    DOI:  https://doi.org/10.1007/s43681-024-00493-8
  3. JMIR Med Inform. 2025 May 09. 13: e63267
       Background: For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs from the extensive body of public health publications is a challenge with the currently available methods.
    Objective: Our objective is to develop a fine-tuned, pretrained language model that can accurately identify, within the biomedical literature, publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design.
    Methods: We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institutes of Health. The model was trained to classify publications into three categories of clinical trials that use nested designs (see the sketch after this entry). The model's performance was evaluated on unseen data and demonstrated high sensitivity and specificity for each class.
    Results: When the proposed model was tested for generalizability on unseen data, it delivered high sensitivity and specificity, respectively, for each class: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99).
    Conclusions: Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.
    Keywords:  AI; artificial intelligence; biomedical; clinical trials; dataset; development; document classification; language model; machine learning; model; natural language processing; public health; randomized trials; tool; transformer; trial
    DOI:  https://doi.org/10.2196/63267
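A minimal sketch of the kind of fine-tuning pipeline described in entry 3, using the Hugging Face Transformers API. The checkpoint name, hyperparameters, and toy examples are assumptions, not the authors' configuration; the four labels (negatives plus the three nested-design classes) follow the Results above.

```python
# Sketch: fine-tune a BiomedBERT-style encoder to classify publications by trial design.
# Checkpoint name and settings are assumed, not taken from the paper.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["negative", "GRT", "IRGT", "SWGRT"]
CHECKPOINT = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"  # assumed

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=len(LABELS))

# Toy data: in practice, titles/abstracts labelled by design type.
train = Dataset.from_dict({
    "text": ["A cluster-randomized trial of school-based vaccination programmes ...",
             "A stepped wedge cluster randomised trial of a hand hygiene intervention ..."],
    "label": [1, 3],
})

def tokenize(batch):
    # Truncate/pad abstracts to a fixed length for the encoder.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="trial-design-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```

A held-out test split would then be scored per class (sensitivity and specificity), mirroring the evaluation reported in the Results.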
  4. Biomed Eng Online. 2025 May 03. 24(1): 52
     BACKGROUND: Despite the proliferation of clinical research that can be used to inform Clinical Practice Guidelines, there remain many areas where the number and quality of research studies vary widely. The Canadian Clinical Practice Guideline for Moderate-to-Severe Traumatic Brain Injury (MOD-SEV TBI) is a case in point: robust research evidence from randomized controlled trials, meta-analyses, and systematic reviews to inform its recommendations is lacking. Randomized controlled trials in this field often have limitations, such as smaller sample sizes and gender and racial disparities in enrollment, that reduce the level of evidence they can provide. Notably, evidence is often lacking in the priority areas identified by people with lived experience (PWLE) and guideline end-users.
    METHODS: The Canadian Clinical Practice Guideline for MOD-SEV TBI rehabilitation is a Living Guideline that implemented a robust and replicable process to mitigate these issues. This process includes: 1. Identification of Priorities by PWLE of MOD-SEV TBI and Guideline End-Users; 2. Involvement of Diverse Multidisciplinary Expert Panels, Including PWLE; 3. Compilation, Review and Evaluation of Published MOD-SEV TBI Evidence; 4. Identification of Gaps in the Published Literature; 5. Formulation of Recommendations, Rigorous Grading of Available Evidence and Formal Voting; 6. Creation of Knowledge Translation and Mobilization Tools; and 7. Publication of the Updated Living Guideline.
    RESULTS: Since 2014-15, the Canadian TBI Living Guideline has implemented and refined this process to produce high-quality expert consensus-based recommendations and knowledge translation and mobilization tools across 21 comprehensive domains of TBI rehabilitation. There are 351 recommendations in the current version of the Canadian TBI Living Guideline; 68% of these are primarily consensus-based recommendations. Developing a comprehensive guideline in areas where research may not be present or strong ensures that the Guideline is comprehensive and addresses the priority needs of clinicians and PWLE.
    CONCLUSIONS: The use of robust, transparent, and replicable evidence reviews combined with an expert consensus-building process produces clinical guidelines that are relevant and applicable even when empirical data are lacking or absent. This process of developing consensus-based recommendations can be used to develop guidelines for other content areas and populations facing similar challenges.
    DOI:  https://doi.org/10.1186/s12938-025-01385-6
  5. J Med Libr Assoc. 2025 Apr 18. 113(2): 184-188
    Prompt engineering, an emergent discipline at the intersection of Generative Artificial Intelligence (GAI), library science, and user experience design, presents an opportunity to enhance the quality and precision of information retrieval. An innovative approach applies the widely understood PICO framework, traditionally used in evidence-based medicine, to the art of prompt engineering. This approach is illustrated with the "Task, Context, Example, Persona, Format, Tone" (TCEPFT) prompt framework (see the sketch after this entry). TCEPFT lends itself to a systematic methodology by incorporating task specificity, contextual relevance, pertinent examples, personalization, formatting, and tonal appropriateness into a prompt design tailored to the desired outcome. Frameworks like TCEPFT offer substantial opportunities for librarians and information professionals to streamline prompt engineering and refine iterative processes. This practice can help information professionals produce consistent and high-quality outputs. Library professionals must embrace a renewed curiosity and develop expertise in prompt engineering to stay ahead in the digital information landscape and maintain their position at the forefront of the sector.
    Keywords:  Generative Artificial Intelligence; Information Retrieval; PICO; Prompt Engineering
    DOI:  https://doi.org/10.5195/jmla.2025.2022
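An illustrative assembly of a TCEPFT-structured prompt for a librarian-style task. Only the six framework elements come from the article; the wording of each element is invented for illustration.

```python
# Sketch: build a prompt from the six TCEPFT elements (Task, Context, Example,
# Persona, Format, Tone). The element contents below are hypothetical examples.
TCEPFT = {
    "Task": "Draft a search strategy for a systematic review question.",
    "Context": "The review asks whether mindfulness-based therapy reduces anxiety "
               "in adults with chronic pain (PICO: Population, Intervention, "
               "Comparison, Outcome).",
    "Example": "Example output line: (\"mindfulness\"[tiab] OR \"meditation\"[tiab]) AND ...",
    "Persona": "Act as a medical librarian experienced in MEDLINE searching.",
    "Format": "Return a numbered list of search lines, one concept per line.",
    "Tone": "Concise and professional.",
}

# Join the elements into one prompt, keeping the framework's order.
prompt = "\n\n".join(f"{element}: {content}" for element, content in TCEPFT.items())
print(prompt)  # paste into the GAI tool of choice, then refine iteratively
```

The value of the framework is that each element can be inspected and revised independently across iterations, which supports the consistent, repeatable outputs the article argues for.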
  6. Sci Rep. 2025 May 03. 15(1): 15493
      Identifying protein-protein interactions (PPIs) is a foundational task in biomedical natural language processing. While specialized models have been developed, the potential of general-domain large language models (LLMs) in PPI extraction, particularly for researchers without computational expertise, remains unexplored. This study evaluates the effectiveness of proprietary LLMs (GPT-3.5, GPT-4, and Google Gemini) in PPI prediction through systematic prompt engineering. We designed six prompting scenarios of increasing complexity, from basic interaction queries to sophisticated entity-tagged formats, and assessed model performance across multiple benchmark datasets (LLL, IEPA, HPRD50, AIMed, BioInfer, and PEDD). Carefully designed prompts effectively guided LLMs in PPI prediction. Gemini 1.5 Pro achieved the highest performance across most datasets, with notable F1-scores in LLL (90.3%), IEPA (68.2%), HPRD50 (67.5%), and PEDD (70.2%). GPT-4 showed competitive performance, particularly in the LLL dataset (87.3%). We identified and addressed a positive prediction bias, demonstrating improved performance after evaluation refinement. While not surpassing specialized models, general-purpose LLMs with appropriate prompting strategies can effectively perform PPI prediction tasks, offering valuable tools for biomedical researchers without extensive computational expertise.
    Keywords:  Large language model; Natural language processing; Protein–protein interaction; Relation extraction
    DOI:  https://doi.org/10.1038/s41598-025-99290-4
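A sketch of one possible "entity-tagged" prompting scenario for PPI prediction, in the spirit of entry 6. The tag convention, prompt wording, model name, and YES/NO parsing rule are assumptions, not the study's exact protocol, and the six scenarios of increasing complexity are not reproduced here.

```python
# Sketch: ask a chat-style LLM whether a tagged protein pair interacts in a sentence.
# Prompt format and parsing are hypothetical; any chat model could be substituted.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_interaction(sentence: str, protein_a: str, protein_b: str) -> bool:
    """Return True if the model judges that the tagged pair interacts."""
    tagged = (sentence.replace(protein_a, f"<p1>{protein_a}</p1>")
                      .replace(protein_b, f"<p2>{protein_b}</p2>"))
    prompt = (
        "Does the sentence state or imply a physical or functional interaction "
        "between the proteins tagged <p1> and <p2>? Answer only YES or NO.\n\n"
        f"Sentence: {tagged}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().upper()
    # Treat anything other than an explicit YES as negative, a simple way to curb
    # the positive-prediction bias noted in the abstract.
    return reply.startswith("YES")

print(predict_interaction("GRB2 binds directly to phosphorylated EGFR.", "GRB2", "EGFR"))
```

Predictions over a benchmark such as LLL or AIMed would then be scored against gold labels with precision, recall, and F1, which is how the figures quoted above were obtained.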