bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-10-19
nine papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Eur Urol Open Sci. 2025 Nov;81: 50-57
       Background and objective: Artificial intelligence (AI), capable of rapidly analyzing vast volumes of data, presents a promising solution for optimizing literature screening for systematic reviews (SRs). Using the INSIDE (artificial INtelligence to Support Informed DEcision making) platform, we compared the performance of AI against the "gold standard" traditional SR method in the context of prostate cancer (PC) to assess whether AI could improve the efficiency and quality of screening.
    Methods: Publications from traditional screening of four SRs (focused on PC therapies and potential cardiotoxicity) were compared with the AI-based approach. Publications were ranked by relevance score. Work saved over sampling (WSS), that is, the effort saved by automatically excluding nonrelevant publications, was used as the measure of efficiency. For the quality analysis, a scatter-plot visualization indicated the proportions of "relevant," "irrelevant," and "not screened" records.
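    For orientation, WSS is conventionally computed as the fraction of records a ranked screening approach lets the reviewer skip, minus what random sampling would save at the same recall. A minimal sketch, assuming Cohen et al.'s standard formulation and purely illustrative counts (the paper may define or compute it differently):

        def wss(n_total, n_screened_at_recall, recall):
            """Work saved over sampling at recall level R (e.g. 0.95).
            Conventional form: WSS@R = (records not screened) / N - (1 - R)."""
            return (n_total - n_screened_at_recall) / n_total - (1 - recall)

        # Illustrative numbers only, not the study's screening counts.
        print(wss(n_total=1000, n_screened_at_recall=400, recall=0.95))  # 0.55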
    Key findings and limitations: The efficiency analysis of AI-based screening used the publications retrieved by the traditional approach (n = 3363), of which 278 were relevant. Of these 3363 publications, the first, ranking-based method screened 2365 records and active learning used 3361. The AI-based approach was more efficient: fewer publications had to be screened to identify 80% and 95% of the 278 relevant publications (WSS@80% 20.3%; WSS@95% 9.4%). Screening efficiency increased further with active learning (WSS@80% 54.0%; WSS@95% 54.8%). A scatter-plot analysis of a broader search in the Dimensions database, which yielded 384 465 publications, helped identify outlier articles.
    Conclusions: This study confirms the impact of an AI-based approach in optimizing the SR process. It highlights best practices and benchmarks to assess the efficiency and possibly quality of literature screening, supporting the integration of AI into future SRs.
    Patient summary: Systematic reviews (SRs) help create a detailed and unbiased summary of a specific research question. This summary is based on published information. Developing SRs with the traditional method requires detailed manual review of the records, which takes a lot of time and effort. With the use of artificial intelligence (AI), the key data from a large amount of text are identified faster. This process requires a review of fewer records to find the most relevant ones, which saves time. The aim of this study was to understand how an AI tool, known as INSIDE PC, could help with SRs. The study looked at how well INSIDE PC worked compared with the traditional method for SRs. The AI method scored articles based on their relevance to the topic of this SR. Data visuals or graphs were used to compare data points and remove irrelevant records from the review. This process decreased the workload and saved time. The AI method also used a learning algorithm known as active learning, which helps AI tools learn from a small sample of training data. Useful records were identified much faster by this method, with less effort. The results showed that AI could improve the ease and speed of reviewing records for SRs. It is important that these AI methods are tested and improved to meet the needs of SRs.
    Keywords:  Artificial intelligence; INSIDE PC; Literature screening; Machine learning; Prostate cancer; Systematic literature review
    DOI:  https://doi.org/10.1016/j.euros.2025.09.005
  2. iScience. 2025 Oct 17. 28(10): 113559
      Systematic reviews require substantial time and effort. This study compared the results of reviews conducted by human reviewers with those conducted with artificial intelligence (AI). We identified 11 AI tools that could assist in conducting a systematic review. None of the AI tools retrieved all articles that were found with a manual search strategy. We identified tools for deduplication but did not evaluate them. AI screening tools assist the human reviewer by presenting the most relevant articles first, which can reduce the number of articles that need to be screened at the title-and-abstract and full-text stages. Inter-rater reliability between the AI tools and the human reviewers for risk of bias assessment was poor. Summary tables created by AI tools differed substantially from manually constructed summary tables. This study highlights the potential of AI tools to support systematic reviews, particularly during the screening phases, but not to replace human reviewers.
    Keywords:  Artificial intelligence; Medical research
    DOI:  https://doi.org/10.1016/j.isci.2025.113559
  3. Eur J Psychotraumatol. 2025 Dec;16(1): 2546214
      Background: The exponential growth of research literature makes it increasingly difficult to identify all relevant studies for systematic reviews and meta-analyses. While traditional search methods are labour-intensive, modern AI-aided approaches have the potential to act as a powerful 'super-assistant' during both the searching and screening phases.
    Objective: This paper evaluates how a combined, open-source approach - merging traditional and AI-aided search and screening methods - can help identify all relevant literature up to the 'last relevant paper' for a systematic review on post-traumatic stress symptom (PTSS) trajectories after traumatic events.
    Method: We applied eight search strategies, including database searches, snowballing, full-text retrieval, and semantic search via OpenAlex. All records were screened using a combination of human reviewers, active learning, and large language models (LLMs) for quality control.
    Results: On top of replicating the original 6,701 search results, we identified an additional 3,822 records using AI-aided methods. The combination of AI tools and human screening led to 126 relevant studies, with each method uncovering papers the others missed. Notably, machine-aided techniques helped find studies with missing keywords, unusual phrasing, or limited indexing. Across all AI-assisted strategies, 10 additional studies were identified, and while the overall yield was modest, these papers were unique and relevant and would likely have been missed using traditional methods.
    Conclusions: Our findings demonstrate that even when returns are low, AI-aided approaches can meaningfully enhance coverage and offer a scalable path forward when combined with screening prioritisation. A transparent, hybrid workflow where AI serves as a 'super-assistant' can meaningfully extend the reach of systematic reviews and increase the quality of the findings, but is not ready to replace humans fully.
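    To make the screening-prioritisation idea concrete, below is a minimal active-learning sketch: a classifier trained on the records screened so far repeatedly re-ranks the remaining pool so the likeliest-relevant records surface first. The toy data, TF-IDF features, logistic-regression ranker, and stopping rule are assumptions for illustration, not the authors' pipeline (which combined human reviewers, active learning, and LLM quality control):

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Toy corpus: titles plus a hidden relevance label used only to
        # simulate the human screener's decisions. All data here are made up.
        records = [
            ("PTSS trajectories after road traffic accidents", 1),
            ("Latent growth classes of post-traumatic stress", 1),
            ("Blood pressure variability in older adults", 0),
            ("Trajectory modelling of PTSD symptoms in veterans", 1),
            ("Dietary fibre and gut microbiome composition", 0),
            ("Crop rotation effects on soil nitrogen", 0),
        ]
        texts = [t for t, _ in records]
        labels = np.array([y for _, y in records])

        X = TfidfVectorizer().fit_transform(texts)
        screened = [0, 2]  # indices already labelled by a human (one relevant, one not)
        pool = [i for i in range(len(texts)) if i not in screened]

        while pool:
            clf = LogisticRegression().fit(X[screened], labels[screened])
            # Rank the unscreened pool by predicted relevance (certainty-based sampling).
            probs = clf.predict_proba(X[pool])[:, 1]
            nxt = pool[int(np.argmax(probs))]
            print(f"screen next: {texts[nxt]!r} -> label {labels[nxt]}")
            screened.append(nxt)  # simulate the human screening decision
            pool.remove(nxt)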
    Keywords:  Artificial intelligence; Large language model; Meta-analysis; Post-traumatic stress disorder; Screening prioritisation; Systematic review
    DOI:  https://doi.org/10.1080/20008066.2025.2546214
  4. J Am Dent Assoc. 2025 Oct 15. pii: S0002-8177(25)00488-X. [Epub ahead of print]
       BACKGROUND: This study aimed to compare the performance of ChatGPT-4o (OpenAI), DeepSeek-V3 (High-Flyer), and Gemini 1.5 Pro (Google), over 3 consecutive weeks, in performing full-text screening, data extraction, and risk of bias assessment tasks in systematic and umbrella reviews.
    METHODS: This study evaluated the correctness of large language model (LLM) responses in performing review study tasks by prompting 3 independent accounts. This process was repeated over 3 consecutive weeks for 40 primary studies. The correctness of responses was scored, and data were analyzed with Kendall's W, generalized estimating equations followed by pairwise comparisons with Bonferroni correction, and Mann-Whitney U tests (α = .05).
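    For reference, the Kendall's W named in the statistical analysis measures agreement among raters (here, plausibly the repeated runs or accounts) scoring the same set of items. A minimal sketch of the uncorrected statistic; the array shape, made-up scores, and absence of a tie correction are illustrative assumptions:

        import numpy as np
        from scipy.stats import rankdata

        def kendalls_w(scores):
            """Kendall's W for m raters scoring the same n items.
            scores: array of shape (m_raters, n_items); returns W in [0, 1].
            No tie correction applied, unlike most statistical packages."""
            scores = np.asarray(scores, dtype=float)
            m, n = scores.shape
            ranks = np.vstack([rankdata(row) for row in scores])  # within-rater ranks
            rank_sums = ranks.sum(axis=0)                         # per-item rank sums
            s = ((rank_sums - m * (n + 1) / 2) ** 2).sum()        # deviation from expected sum
            return 12 * s / (m ** 2 * (n ** 3 - n))

        # Three hypothetical repeated runs scoring five items (made-up numbers).
        print(kendalls_w([[3, 1, 4, 2, 5], [3, 2, 4, 1, 5], [2, 1, 5, 3, 4]]))  # ~0.84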
    RESULTS: DeepSeek achieved the highest data extraction accuracy (> 90%), followed by ChatGPT (> 88%). Moreover, DeepSeek significantly outperformed Gemini in data extraction in most pairwise comparisons (P < .0167). Gemini's data extraction performance improved over time, with significantly higher accuracy in the third week than in the first week (P < .0167). ChatGPT generally performed better in systematic reviews than in umbrella reviews (P < .05).
    CONCLUSIONS: The studied LLMs showed potential for accurate data extraction, particularly DeepSeek, but their performance in critical tasks such as full-text screening and risk of bias assessment was consistently unreliable. LLM applications in review studies require cautious expert supervision.
    PRACTICAL IMPLICATIONS: Researchers planning to use LLMs for review study tasks should be aware that LLM responses to full-text screening and risk of bias assessment are unreliable. DeepSeek is the preferred LLM for data extraction in both systematic and umbrella reviews, whereas ChatGPT is recommended for systematic reviews.
    Keywords:  Artificial intelligence; ChatGPT; DeepSeek; Gemini; large language model; systematic review; umbrella review
    DOI:  https://doi.org/10.1016/j.adaj.2025.08.011
  5. BMJ Open. 2025 Oct 15. 15(10): e099921
       OBJECTIVES: Systematic literature reviews (SLRs) are essential for synthesising research evidence and guiding informed decision-making. However, SLRs require significant resources and a substantial workload. The introduction of artificial intelligence (AI) tools can reduce this workload. This study aims to investigate preferences regarding AI tools for SLR screening, focusing on trade-offs among tool attributes.
    DESIGN: A discrete choice experiment (DCE) was performed in which participants completed 13 or 14 choice tasks featuring AI tools with varying attributes.
    SETTING: Data were collected via an online survey, where participants provided background on their education and experience.
    PARTICIPANTS: Professionals who had published SLRs indexed in PubMed, or who were affiliated with a recent Health Economics and Outcomes Research conference, were included as participants.
    INTERVENTIONS: Participants considered the use of a hypothetical AI tool with different attributes in SLRs. Key attributes for AI tools were identified through a literature review and expert consultations. These attributes included the AI tool's role in screening, required user proficiency, sensitivity, workload reduction and the investment needed for training.
    PRIMARY OUTCOME MEASURES: The participants' adoption of the AI tool, that is, the likelihood of preferring the AI tool in the choice experiment under different configurations of attribute levels, as captured through the DCE choice tasks. Statistical analysis was performed using a conditional multinomial logit model. An additional analysis included demographic characteristics (such as education, experience with SLR publication, and familiarity with AI) as interaction variables.
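    For reference, the conditional multinomial logit underlying a DCE models the probability of choosing alternative j in choice task t from its attribute levels; a minimal sketch of the standard form (the authors' exact specification, e.g. which terms enter, is not given in the abstract):

        P(y_t = j \mid C_t) = \frac{\exp(x_{tj}^{\top}\beta)}{\sum_{k \in C_t} \exp(x_{tk}^{\top}\beta)}

    where x_{tj} collects the attribute levels of alternative j in task t (role in screening, required proficiency, sensitivity, workload reduction, training investment), \beta are the preference weights estimated from the observed choices, and C_t is the set of alternatives shown in task t; demographic interaction terms would enter as additional columns of x_{tj}.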
    RESULTS: The study received responses from 187 participants with diverse experience in performing SLRs and in AI use. Familiarity with AI was generally low, with 55.6% of participants being (very) unfamiliar with AI. An AI tool requiring intermediate proficiency was positively associated with adoption (p=0.030), and workload reduction was also strongly linked to adoption (p<0.001). Interestingly, when expert proficiency was required to use the AI tool, authors with more scientific experience were less likely to adopt it (p=0.009), whereas more experience specifically with SLR publications increased the likelihood of adoption (p=0.001).
    CONCLUSIONS: The findings suggest that workload reduction is not the only consideration for SLR reviewers when using AI tools. The key to AI adoption in SLRs is creating reliable, workload-reducing tools that assist rather than replace human reviewers, with moderate proficiency requirements and high sensitivity.
    Keywords:  Artificial Intelligence; Decision Making; Systematic Review
    DOI:  https://doi.org/10.1136/bmjopen-2025-099921
  6. Am J Pharm Educ. 2025 Oct 14. pii: S0002-9459(25)00528-5. [Epub ahead of print] 101882
       OBJECTIVE: Qualitative research remains underutilized in health professions education in part due to insufficient training and time-intensive analytic methods. Recent advances in generative artificial intelligence offer new opportunities to streamline the qualitative research process using large language models, such as GPT-4. However, the accuracy of GPT-4-generated codes and themes remains underexplored in health professions education research. This study characterizes qualitative analyses assisted by a general-purpose GPT-4 model in comparison with traditional human-conducted analyses.
    METHODS: Two health professions datasets were previously analyzed using content or thematic analysis and then re-analyzed using a version of GPT-4. Researchers compared the accuracy, alignment, relevance, and appropriateness of codebooks and themes produced by GPT-4 with the prior findings. Dichotomous numerical ratings and explanations were assessed independently and then discussed collaboratively to identify strengths and weaknesses associated with GPT-4 qualitative analysis.
    RESULTS: Thirty-six survey responses and seven 1-hour interview transcripts were analyzed using GPT-4. The codebooks and themes generated by GPT-4 generally aligned with human-identified concepts. Challenges included failure to detect low-frequency codes, difficulty constructing coherent code relationships, and a lack of nuance in theme descriptions and quote selection.
    CONCLUSION: GPT-4 can support, though not replace, human-led qualitative analysis. A general understanding of qualitative research processes and the dataset is necessary for researchers to identify potential gaps, limitations, and redundancies in qualitative findings generated by GPT-4.
    Keywords:  artificial intelligence (AI); health professions education; qualitative research
    DOI:  https://doi.org/10.1016/j.ajpe.2025.101882
  7. J Am Coll Radiol. 2025 Oct 14. pii: S1546-1440(25)00599-X. [Epub ahead of print]
      Actionable findings requiring follow-up with additional imaging or other diagnostic procedures are frequently reported for a wide variety of radiology exams. Completion of recommended follow-up can lead to new diagnoses, including cancer. However, completion of recommended follow-up is inconsistent, particularly when the follow-up is for findings unrelated to the initial reason for the exam. Follow-up recommendation tracking systems, using a combination of information technology tools and human navigators, can facilitate completion of recommended follow-up, but they often require significant effort for manual chart review and direct communication with providers and patients. Artificial intelligence (AI), including large language models (LLMs) able to process vast and diverse unstructured text data, offers the opportunity to improve efficiency in data extraction and aggregation tasks, such as those required for follow-up recommendation management. In this article, we review the key components of follow-up recommendation management systems: (1) identification of follow-up recommendations within radiology reports, (2) communication of these recommendations, (3) tracking of follow-up recommendations to completion, and (4) outcomes tracking. For each component, we explore how AI can improve efficiency and expand the capabilities of robust management systems that ensure the loop is closed on follow-up recommendations.
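    As a minimal illustration of component (1), a rule-based sentence flagger is sketched below; the patterns and the sample report are made up for illustration, and the LLM-based approaches the review discusses go well beyond this keyword baseline:

        import re

        # Deliberately simplistic, rule-based stand-in for identifying
        # follow-up recommendations in report text (illustrative only).
        FOLLOWUP_PATTERNS = [
            r"\brecommend(?:ed|s)?\b.*\bfollow[- ]?up\b",
            r"\bfollow[- ]?up\b.*\b(CT|MRI|ultrasound|imaging)\b",
            r"\b(repeat|interval)\b.*\b(imaging|CT|MRI)\b",
        ]

        def flag_followup_sentences(report_text: str) -> list[str]:
            """Return sentences that look like follow-up recommendations."""
            sentences = re.split(r"(?<=[.!?])\s+", report_text)
            return [
                s for s in sentences
                if any(re.search(p, s, flags=re.IGNORECASE) for p in FOLLOWUP_PATTERNS)
            ]

        report = ("Incidental 7 mm pulmonary nodule in the right upper lobe. "
                  "Recommend follow-up chest CT in 6-12 months per Fleischner criteria.")
        print(flag_followup_sentences(report))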
    DOI:  https://doi.org/10.1016/j.jacr.2025.10.019
  8. J Clin Med. 2025 Oct 04. 14(19): 7030
      Background: Chronic pain affects nearly one in five adults worldwide and is increasingly recognized not only as a disease but as a potential risk factor for neurocognitive decline and dementia. While some evidence supports this association, existing systematic reviews are static and rapidly outdated, and none have leveraged advanced methods for continuous updating and robust uncertainty modeling.
    Objective: This protocol describes a living systematic review with dose-response Bayesian meta-analysis, enhanced by artificial intelligence (AI) tools, to synthesize and maintain up-to-date evidence on the prospective association between any type of chronic pain and subsequent cognitive decline.
    Methods: We will systematically search PubMed, Embase, Web of Science, and preprint servers for prospective cohort studies evaluating chronic pain as an exposure and cognitive decline as an outcome. Screening will be semi-automated using natural language processing models (ASReview), with human oversight for quality control. Bayesian hierarchical meta-analysis will estimate pooled effect sizes and accommodate between-study heterogeneity. Meta-regression will explore study-level moderators such as pain type, severity, and cognitive domain assessed. If data permit, a dose-response meta-analysis will be conducted. Living updates will occur biannually using AI-enhanced workflows, with results transparently disseminated through preprints and peer-reviewed updates.
    Results: This is a protocol; results will be disseminated in future reports.
    Conclusions: This living Bayesian systematic review aims to provide continuously updated, methodologically rigorous evidence on the link between chronic pain and cognitive decline. The approach integrates innovative AI tools and advanced meta-analytic methods, offering a template for future living evidence syntheses in neurology and pain research.
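    For reference, the Bayesian hierarchical meta-analysis named in the protocol typically takes a random-effects form along these lines; the priors and parameterization shown are illustrative assumptions, not the protocol's specification:

        y_i \sim \mathcal{N}(\theta_i, s_i^2)    \quad \text{(observed effect and standard error in study } i\text{)}
        \theta_i \sim \mathcal{N}(\mu, \tau^2)   \quad \text{(study-specific true effects)}
        \mu \sim \mathcal{N}(0, 1), \qquad \tau \sim \mathrm{HalfNormal}(0.5)  \quad \text{(weakly informative priors, assumed)}

    where \mu is the pooled effect and \tau captures between-study heterogeneity; a dose-response extension would replace the single \theta_i with a study-level curve over the pain exposure.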
    Keywords:  AI-aided evidence synthesis; Bayesian meta-analysis; chronic pain; cognitive decline; living systematic review
    DOI:  https://doi.org/10.3390/jcm14197030
  9. Int J Public Health. 2025;70: 1608572
       Objective: To assess the competence of students and academic staff to use generative artificial intelligence (GenAI) as a tool in epidemiological data analyses in a randomised controlled trial (RCT).
    Methods: We invited postgraduate students and academic staff at the Swiss Tropical and Public Health Institute to the RCT. Participants were randomized to analyse a simulated cross-sectional dataset using either ChatGPT's code interpreter (integrated analysis arm) or statistical software (R/Stata) with ChatGPT as a support tool (distributed analysis arm). The primary outcome was the trial task score (out of 17, using an assessment rubric). The secondary outcome was the time to complete the task.
    Results: Of 338 invited individuals, 31 participants were randomized equally to the two study arms, and 30 submitted results. Overall, there was no statistically significant difference in mean task scores between the distributed analysis arm (8.5 ± 4.6) and the integrated analysis arm (9.4 ± 3.8), with a mean difference of 0.93 (p = 0.55). Mean task completion time was significantly shorter in the integrated analysis arm than in the distributed analysis arm.
    Conclusion: While ChatGPT offers advantages, its effective use requires a careful balance of GenAI capabilities and human expertise.
    Keywords:  ChatGPT; data analysis; epidemiology; generative artificial intelligence; higher education
    DOI:  https://doi.org/10.3389/ijph.2025.1608572