bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-03-16
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. JMIR Med Inform. 2025 Mar 12;13:e64682
       This study demonstrated that GPT-4 Turbo had superior specificity to GPT-3.5 Turbo (0.98 vs 0.51) and comparable sensitivity (0.85 vs 0.83) in citation screening for systematic reviews, although GPT-3.5 Turbo processed 100 studies faster (0.9 min vs 1.6 min). These results suggest that GPT-4 Turbo may be more suitable because of its higher specificity and highlight the potential of large language models for optimizing literature selection. (A toy sketch of these accuracy measures follows this entry.)
    Keywords:  AI; GPT; Japan; Japanese; LLM; accuracy; artificial intelligence; citation screening; citations; clinical practice guidelines; critical care; efficiency; large language models; reliability; review; screening; sepsis; systematic review
    DOI:  https://doi.org/10.2196/64682
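    The sensitivity and specificity figures above can be reproduced from paired human and LLM screening decisions. The sketch below is not from the paper; the function and the six citation labels are illustrative, assuming boolean include/exclude decisions.

      # Illustrative: screening accuracy from paired include/exclude decisions.
      # The hypothetical citations below are not from the study.
      def screening_accuracy(human, model):
          """Return (sensitivity, specificity) of model decisions
          against human reference labels (True = include)."""
          pairs = list(zip(human, model))
          tp = sum(h and m for h, m in pairs)          # correctly included
          tn = sum(not h and not m for h, m in pairs)  # correctly excluded
          fn = sum(h and not m for h, m in pairs)      # wrongly excluded
          fp = sum(not h and m for h, m in pairs)      # wrongly included
          return tp / (tp + fn), tn / (tn + fp)

      human = [True, True, False, False, True, False]  # reference screener
      llm = [True, False, False, False, True, False]   # hypothetical LLM output
      sens, spec = screening_accuracy(human, llm)
      print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.67, 1.00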
  2. J Med Internet Res. 2025 Mar 11;27:e67488
       BACKGROUND: Systematic reviews and meta-analyses rely on labor-intensive literature screening. While machine learning offers potential automation, its accuracy remains suboptimal. This raises the question of whether emerging large language models (LLMs) can provide a more accurate and efficient approach.
    OBJECTIVE: This paper evaluates the sensitivity, specificity, and summary receiver operating characteristic (SROC) curve of LLM-assisted literature screening.
    METHODS: We conducted a diagnostic study comparing the accuracy of LLM-assisted versus manual literature screening across 6 thoracic surgery meta-analyses. Manual screening by 2 investigators served as the reference standard. LLM-assisted screening was performed using ChatGPT-4o (OpenAI) and Claude-3.5 Sonnet (Anthropic), with discrepancies resolved by Gemini-1.5 Pro (Google); this adjudication logic is sketched after this entry. In addition, 2 open-source, machine learning-based screening tools, ASReview (Utrecht University) and Abstrackr (Center for Evidence Synthesis in Health, Brown University School of Public Health), were also evaluated. We calculated sensitivity, specificity, and 95% CIs for title and abstract screening as well as full-text screening, generating pooled estimates and SROC curves. LLM prompts were revised based on a post hoc error analysis.
    RESULTS: LLM-assisted full-text screening demonstrated high pooled sensitivity (0.87, 95% CI 0.77-0.99) and specificity (0.96, 95% CI 0.91-0.98), with an area under the curve (AUC) of 0.96 (95% CI 0.94-0.97). Title and abstract screening achieved a pooled sensitivity of 0.73 (95% CI 0.57-0.85) and specificity of 0.99 (95% CI 0.97-0.99), with an AUC of 0.97 (95% CI 0.96-0.99). Post hoc prompt revisions improved sensitivity to 0.98 (95% CI 0.74-1.00) while maintaining high specificity (0.98, 95% CI 0.94-0.99). In comparison, the pooled sensitivity and specificity of ASReview tool-assisted screening were 0.58 (95% CI 0.53-0.64) and 0.97 (95% CI 0.91-0.99), respectively, with an AUC of 0.66 (95% CI 0.62-0.70). The pooled sensitivity and specificity of Abstrackr tool-assisted screening were 0.48 (95% CI 0.35-0.62) and 0.96 (95% CI 0.88-0.99), respectively, with an AUC of 0.78 (95% CI 0.74-0.82). A post hoc meta-analysis revealed comparable effect sizes between LLM-assisted and conventional screening.
    CONCLUSIONS: LLMs hold significant potential for streamlining literature screening in systematic reviews, reducing workload without sacrificing quality. Importantly, LLMs outperformed traditional machine learning-based tools (ASReview and Abstrackr) in both sensitivity and AUC values, suggesting that LLMs offer a more accurate and efficient approach to literature screening.
    Keywords:  accuracy; large language models; literature screening; meta-analysis; thoracic surgery
    DOI:  https://doi.org/10.2196/67488
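    The dual-screening design described in the METHODS above (two LLMs, with a third adjudicating discrepancies) reduces to a simple voting rule. A minimal sketch follows; the classify_* callables are hypothetical placeholders standing in for ChatGPT-4o, Claude-3.5 Sonnet, and Gemini-1.5 Pro, not real API wrappers.

      # Illustrative: two primary LLM screeners, a third model breaks ties.
      def screen_citation(record, classify_gpt, classify_claude, classify_gemini):
          """Return an include/exclude decision (True/False) for one record."""
          vote_a = classify_gpt(record)
          vote_b = classify_claude(record)
          if vote_a == vote_b:
              return vote_a                  # the primary screeners agree
          return classify_gemini(record)     # discrepancy: third model adjudicates

      # Toy stand-ins so the sketch runs end to end.
      records = ["RCT of VATS lobectomy vs open surgery", "Case report, one patient"]
      decisions = [
          screen_citation(
              r,
              classify_gpt=lambda rec: "RCT" in rec,
              classify_claude=lambda rec: "lobectomy" in rec,
              classify_gemini=lambda rec: "RCT" in rec,
          )
          for r in records
      ]
      print(decisions)  # [True, False]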
  3. J Med Internet Res. 2025 Mar 10;27:e65651
       BACKGROUND: At the end of 2023, Bayer AG launched its own internal large language model (LLM), MyGenAssist, based on ChatGPT technology, to overcome data privacy concerns. Such a tool may reduce the burden of repetitive and recurrent tasks and free up time for activities with higher added value. Although there is ongoing worldwide debate about whether artificial intelligence should be integrated into pharmacovigilance, the medical literature provides little data on LLMs and their day-to-day applications in this setting. Here, we studied how this tool could improve the case documentation process, a duty for authorization holders under European and French good vigilance practices.
    OBJECTIVE: The aim of the study is to test whether the use of an LLM could improve the pharmacovigilance documentation process.
    METHODS: MyGenAssist was trained to draft templates for case documentation letters meant to be sent to reporters. The information provided within each template varies by case; these data come from a table sent to the LLM. We then measured the time spent on each case over a period of 4 months (2 months before the tool was introduced and 2 months after its implementation). A multiple linear regression model was built with the time spent on each case as the explained variable and all parameters that could influence this time as explanatory variables (use of MyGenAssist, type of recipient, number of questions, and user); a sketch of such a model follows this entry. To test whether the tool affects the process, we compared the recipients' response rates with and without the use of MyGenAssist.
    RESULTS: MyGenAssist saved an average of 23.3% (95% CI 13.8%-32.8%) of the time spent on each case (P<.001; adjusted R2=0.286), which could represent an average of 10.7 (SD 3.6) working days saved each year. The response rate was not modified by the use of MyGenAssist (20/48, 42% vs 27/74, 36%; P=.57), whether the recipient was a physician or a patient. No significant difference was found in the time recipients took to answer (mean 2.20, SD 3.27 days vs mean 2.65, SD 3.30 days after the last contact attempt; P=.64). Implementing MyGenAssist for this activity required only a 2-hour training session for the pharmacovigilance team.
    CONCLUSIONS: Our study is the first to show that a ChatGPT-based tool can improve the efficiency of a good vigilance practice activity without requiring a long training session for the staff involved. These first encouraging results could be an incentive to implement LLMs in other processes.
    Keywords:  ChatGPT; MyGenAssist; artificial intelligence; efficiency; large language model; pharmacovigilance
    DOI:  https://doi.org/10.2196/65651
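    The time-saving estimate above comes from a multiple linear regression with time per case as the explained variable. A minimal sketch of such a model follows, assuming pandas and statsmodels; all data values and column names are hypothetical, not from the study.

      # Illustrative: regress minutes per case on tool use and the other
      # explanatory variables named in the METHODS. Data are hypothetical.
      import pandas as pd
      import statsmodels.formula.api as smf

      df = pd.DataFrame({
          "minutes_per_case": [32, 28, 35, 22, 19, 24, 30, 21],
          "uses_mygenassist": [0, 0, 0, 1, 1, 1, 0, 1],  # 1 = drafted with the LLM
          "recipient": ["physician", "patient", "physician", "patient",
                        "physician", "physician", "patient", "patient"],
          "n_questions": [3, 2, 4, 2, 1, 3, 3, 2],
          "user": ["A", "B", "A", "B", "A", "B", "A", "B"],
      })

      # C() marks recipient and user as categorical explanatory variables.
      fit = smf.ols("minutes_per_case ~ uses_mygenassist + C(recipient)"
                    " + n_questions + C(user)", data=df).fit()
      # The coefficient on uses_mygenassist estimates the per-case time saving.
      print(fit.params["uses_mygenassist"], fit.rsquared_adj)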
  4. JDR Clin Trans Res. 2025 Mar 11:23800844251321839
      Evidence-based medicine (EBM) enhances clinical decision-making but faces implementation challenges, particularly in dentistry, where patient-specific complexities limit its effectiveness. This article examines EBM through the lens of Aristotelian logic, exploring its use of deductive and inductive reasoning and its limitations in addressing real-world variability. We then discuss how artificial intelligence (AI) can enhance EBM by synthesizing data, automating evidence appraisal, and generating personalized treatment insights. While AI offers a promising solution, it also presents challenges related to ethics, transparency, and reliability. Integrating AI into EBM requires careful consideration to ensure precise, adaptive, and patient-centered decision-making.
    Knowledge Transfer Statement: This commentary provides a critical discourse on the challenges of evidence-based medicine and how artificial intelligence could help address these shortcomings.
    Keywords:  clinical decision-making; evidence-based medicine; limitations; medicine-based evidence; software as medical device; solutions
    DOI:  https://doi.org/10.1177/23800844251321839
  5. BMC Med Res Methodol. 2025 Mar 10;25(1):66
      In this review article, we provide a comprehensive overview of current practices and challenges associated with research synthesis in preclinical biomedical research. We identify critical barriers and roadblocks that impede effective identification, utilisation, and integration of research findings to inform decision making in research translation. We examine practices at each stage of the research lifecycle, including study design, conduct, and publishing, that can be optimised to facilitate the conduct of timely, accurate, and comprehensive evidence synthesis. These practices are anchored in open science and engaging with the broader research community to ensure evidence is accessible and useful to all stakeholders. We underscore the need for collective action from researchers, synthesis specialists, institutions, publishers and journals, funders, infrastructure providers, and policymakers, who all play a key role in fostering an open, robust and synthesis-ready research environment, for an accelerated trajectory towards integrated biomedical research and translation.
    Keywords:  Animal models; Evidence synthesis; Meta-analysis; Open science; Preclinical research; Systematic review
    DOI:  https://doi.org/10.1186/s12874-025-02524-2
  6. BMC Med Inform Decis Mak. 2025 Mar 10;25(1):124
       BACKGROUND: Thematic analysis, a core part of qualitative research, is time-consuming and technical. The rise of generative artificial intelligence (AI), especially large language models, has raised hopes of enhancing and partly automating thematic analysis.
    METHODS: The study assessed the relative efficacy of conventional versus AI-assisted thematic analysis in investigating the psychosocial impact of cutaneous leishmaniasis (CL) scars. Four hundred forty-eight participant responses from a core study were analysed, comparing nine generative AI models (Llama 3.1 405B, Claude 3.5 Sonnet, NotebookLM, Gemini 1.5 Advanced Ultra, ChatGPT o1-Pro, ChatGPT o1, GrokV2, DeepSeekV3, and Gemini 2.0 Advanced) with manual expert analysis. Methodological rigour was maintained through Cohen's kappa coefficients calculated in Jamovi for concordance assessment and Jaccard index computations in Python for similarity measurement; both measures are sketched after this entry.
    RESULTS: Advanced AI models showed impressive congruence with the reference standard; some even reached perfect concordance (Jaccard index = 1.00). Gender-specific analyses demonstrated consistent performance across subgroups, allowing a nuanced understanding of psychosocial consequences. The grounded theory process yielded a "fragile circle of vulnerabilities" framework that incorporated new insights into CL-related psychosocial complexity while establishing novel dimensions.
    CONCLUSIONS: This study shows how AI can be incorporated into qualitative research methodology, particularly for complex psychosocial analysis. The AI models proved highly efficient and accurate. These findings suggest that future qualitative research should combine AI capabilities with human expertise, maintaining analytical rigour and following standardised reporting checklists that ensure full process transparency.
    Keywords:  Artificial intelligence in qualitative research; Cutaneous leishmaniasis; Grounded theory development; Large language models; Natural language processing; Research automation; Thematic analysis
    DOI:  https://doi.org/10.1186/s12911-025-02961-5
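    The two agreement measures above are standard and straightforward to reproduce. A minimal sketch follows, assuming scikit-learn for Cohen's kappa; the coded responses and theme sets are hypothetical, not from the study.

      # Illustrative: Cohen's kappa for per-response coding concordance and
      # the Jaccard index for overlap between final theme sets.
      from sklearn.metrics import cohen_kappa_score

      # Theme codes assigned to the same responses by a human analyst and one LLM.
      human_codes = ["stigma", "isolation", "stigma", "anxiety", "isolation"]
      model_codes = ["stigma", "isolation", "anxiety", "anxiety", "isolation"]
      kappa = cohen_kappa_score(human_codes, model_codes)

      # Jaccard index over the final theme sets produced by each approach.
      human_themes = {"stigma", "isolation", "anxiety", "body image"}
      model_themes = {"stigma", "isolation", "anxiety", "body image"}
      jaccard = len(human_themes & model_themes) / len(human_themes | model_themes)

      print(f"Cohen's kappa = {kappa:.2f}, Jaccard index = {jaccard:.2f}")
      # Identical theme sets give Jaccard = 1.00, i.e. perfect concordance.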