bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-02-08
twenty-two papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Res Synth Methods. 2025 Jan;16(1): 194-210
      Systematic reviews play important roles but manual efforts can be time-consuming given a growing literature. There is a need to use and evaluate automated strategies to accelerate systematic reviews. Here, we comprehensively tested machine learning (ML) models from classical and deep learning model families. We also assessed the performance of prompt engineering via few-shot learning of GPT-3.5 and GPT-4 large language models (LLMs). We further attempted to understand when ML models can help automate screening. These ML models were applied to actual datasets of systematic reviews in education. Results showed that the performance of classical and deep ML models varied widely across datasets, ranging from 1.2 to 75.6% of work saved at 95% recall. LLM prompt engineering produced similarly wide performance variation. We searched for various indicators of whether and how ML screening can help. We discovered that the separability of clusters of relevant versus irrelevant articles in high-dimensional embedding space can strongly predict whether ML screening can help (overall R = 0.81). This simple and generalizable heuristic applied well across datasets and different ML model families. In conclusion, ML screening performance varies tremendously, but researchers and software developers can consider using our cluster separability heuristic in various ways in an ML-assisted screening pipeline.
    Keywords:  active learning; embedding large; language models; machine learning; systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2024.16
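    A minimal Python sketch of the cluster-separability idea in this entry: score how cleanly relevant and irrelevant records separate in embedding space. The silhouette score and the synthetic embeddings below are illustrative assumptions; the study's exact heuristic and embeddings may differ.

      # Hedged illustration: measure separability of relevant vs. irrelevant
      # records in embedding space. Silhouette score is one plausible measure;
      # the embeddings here are synthetic stand-ins for title/abstract vectors.
      import numpy as np
      from sklearn.metrics import silhouette_score

      rng = np.random.default_rng(0)
      dim = 384  # e.g., a sentence-embedding dimension (assumption)

      irrelevant = rng.normal(loc=0.0, scale=1.0, size=(500, dim))
      relevant = rng.normal(loc=0.8, scale=1.0, size=(50, dim))  # shifted cluster

      X = np.vstack([irrelevant, relevant])
      labels = np.array([0] * len(irrelevant) + [1] * len(relevant))

      sep = silhouette_score(X, labels, metric="cosine")
      print(f"separability (silhouette, cosine): {sep:.3f}")
      # Higher values suggest the relevant records form a distinct cluster,
      # which the study found to predict when ML-assisted screening helps.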
  2. Res Synth Methods. 2025 Mar;16(2): 308-322
      When conducting a systematic review, screening the vast body of literature to identify the small set of relevant studies is a labour-intensive and error-prone process. Although there is an increasing number of fully automated tools for screening, their performance is suboptimal and varies substantially across review topic areas. Many of these tools are only trained on small datasets, and most are not tested on a wide range of review topic areas. This study presents two systematic review datasets compiled from more than 8,600 systematic reviews and more than 540,000 abstracts covering 51 research topic areas in health and medical research. These datasets are the largest of their kind to date. We demonstrate their utility in training and evaluating language models for title and abstract screening. Our dataset includes detailed metadata of each review, including title, background, objectives and selection criteria. We demonstrated that a small language model trained on this dataset with additional metadata has excellent performance, with an average recall above 95% and specificity over 70% across a wide range of review topic areas. Future research can build on our dataset to further improve the performance of fully automated tools for systematic review title and abstract screening.
    DOI:  https://doi.org/10.1017/rsm.2025.1
  3. Res Synth Methods. 2025 May;16(3): 491-508
      Systematic reviews are essential for evidence-based health care, but conducting them is time- and resource-consuming. To date, efforts have been made to accelerate and (semi-)automate various steps of systematic reviews through the use of artificial intelligence (AI), and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias (RoB) of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for RoB assessment of 100 randomized controlled trials, published in English from 2013 onwards, using the revised Cochrane risk of bias tool ('RoB 2'; involving judgements for five specific domains and an overall judgement). We assessed the agreement of RoB judgements by Claude with human judgements published in Cochrane reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 ('outcome measurement'). Cohen's κ was lowest for domain 5 ('selective reporting'; 0.10 (95% confidence interval (CI): -0.10 to 0.31)) and highest for domain 3 ('missing data'; 0.31 (95% CI: 0.10 to 0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen's κ: 0.22 (95% CI: 0.06 to 0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude's RoB 2 judgements cannot replace human RoB assessment. However, the potential of LLMs to support RoB assessment should be further explored.
    Keywords:  GPT; artificial intelligence; automation; large language models; risk of bias; systematic review as topic
    DOI:  https://doi.org/10.1017/rsm.2025.12
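    The agreement statistics above are Cohen's κ with 95% confidence intervals. A small self-contained sketch of that calculation (the RoB judgements below are invented, and the simple approximate standard error is an assumption, not necessarily the authors' method):

      # Cohen's kappa with an approximate 95% CI for two raters (e.g., an LLM
      # versus Cochrane authors). Labels are illustrative RoB 2 judgements.
      from collections import Counter
      from math import sqrt

      def cohens_kappa(rater_a, rater_b):
          n = len(rater_a)
          po = sum(a == b for a, b in zip(rater_a, rater_b)) / n       # observed agreement
          ca, cb = Counter(rater_a), Counter(rater_b)
          pe = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / n ** 2  # chance agreement
          kappa = (po - pe) / (1 - pe)
          se = sqrt(po * (1 - po) / (n * (1 - pe) ** 2))               # approximate SE
          return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)

      llm   = ["high", "low", "some", "low", "high", "low", "some", "low"]
      human = ["high", "some", "some", "low", "low", "low", "high", "low"]
      k, (lo, hi) = cohens_kappa(llm, human)
      print(f"kappa = {k:.2f} (95% CI {lo:.2f} to {hi:.2f})")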
  4. Res Synth Methods. 2026 Mar;17(2): 332-347
      Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer potential to automate this process, balancing time/cost requirements and accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction achieved by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and the LLMs were asked to rate each record from 0 to 100 for inclusion, using a structured prompt. Predefined rating thresholds of 25, 50, and 75 were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload-saving). Processing time and costs were registered. Across the systematic reviews, LLMs achieved high sensitivity (up to 100%) and low precision (below 10%) for records included by full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering optimal trade-offs. GPT-4o-mini, accessed via an application programming interface, was the fastest model (~40 minutes max.) and had usage costs of $0.14-$1.93 per review. Llama 3.1-8B and Gemma 2-9B were run locally with longer processing times (~4 hours max.) and were free to use. LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for significant workload savings, at reasonable costs and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are key factors for their usage in the title/abstract screening phase of systematic reviews.
    Keywords:  GPT-4o mini; Gemma 2 9B; Llama 3.1 8B; artificial intelligence; large language models; title and abstract screening
    DOI:  https://doi.org/10.1017/rsm.2025.10044
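    A sketch of the threshold-based evaluation described above: each record receives a 0-100 inclusion rating from an LLM, and screening metrics are computed at the 25, 50, and 75 cut-offs. Ratings and labels are simulated, and the workload-saving definition used here is one plausible reading of the abstract.

      # Compute screening metrics at predefined rating thresholds.
      import numpy as np

      def metrics_at_threshold(ratings, labels, threshold):
          ratings, labels = np.asarray(ratings), np.asarray(labels).astype(bool)
          predicted = ratings >= threshold
          tp = int(np.sum(predicted & labels))
          fp = int(np.sum(predicted & ~labels))
          fn = int(np.sum(~predicted & labels))
          tn = int(np.sum(~predicted & ~labels))
          sens = tp / (tp + fn) if tp + fn else float("nan")
          spec = tn / (tn + fp) if tn + fp else float("nan")
          ppv = tp / (tp + fp) if tp + fp else float("nan")
          npv = tn / (tn + fn) if tn + fn else float("nan")
          workload_saving = (tn + fn) / len(labels)  # fraction of records excluded by the model
          return dict(sensitivity=sens, specificity=spec, ppv=ppv, npv=npv,
                      balanced_accuracy=(sens + spec) / 2, workload_saving=workload_saving)

      rng = np.random.default_rng(1)
      labels = rng.random(500) < 0.05                      # ~5% truly relevant (simulated)
      ratings = np.where(labels, rng.integers(40, 101, 500), rng.integers(0, 90, 500))
      for t in (25, 50, 75):
          print(t, metrics_at_threshold(ratings, labels, t))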
  5. BMC Res Notes. 2026 Jan 30.
       OBJECTIVE: Semi-automated tools used during the preliminary screening of articles in systematic reviews can start with a small set of seed articles and actively learn from human decisions to prioritise more relevant articles for subsequent screening. However, given that these tools are vulnerable to biases and lack clear stopping criteria, their performance in large-scale systematic reviews remains uncertain, especially in reviews covering broad subject areas that require a substantial number of representative seed articles. This article presents a hybrid approach that uses text-mining techniques combined with a semi-automated tool to effectively reduce, screen, and validate a large cohort of articles (N = 90,871).
    RESULT: A preliminary evaluation using simulations indicated that this approach has the potential to craft a comprehensive collection of seed articles that covers broad subject areas for semi-automated tools in a large-scale systematic review. The strengths and limitations of using a semi-automated tool alone in such a context are discussed. Our approach increases the efficiency of automated tools by providing a larger and more focused selection of articles to start with, optimising the learning process for those tools and reducing biases. Additionally, our approach could increase the transparency and reusability of keywords for future review updates.
    Keywords:  Automated tools; Automation; Expert knowledge; Large-scale reviews; Machine learning; Natural language processing; Preliminary screening; Systematic review; Text-mining
    DOI:  https://doi.org/10.1186/s13104-026-07651-7
  6. Res Synth Methods. 2025 Jul;16(4): 620-630
      With the increasing volume of scientific literature, there is a need to streamline the screening process for titles and abstracts in systematic reviews, reduce the workload for reviewers, and minimize errors. This study validated artificial intelligence (AI) tools, specifically Llama 3 70B via Groq's application programming interface (API) and ChatGPT-4o mini via OpenAI's API, for automating this process in biomedical research. It compared these AI tools with human reviewers using 1,081 articles after duplicate removal. Each AI model was tested in three configurations to assess sensitivity, specificity, predictive values, and likelihood ratios. The Llama 3 model's LLA_2 configuration achieved 77.5% sensitivity and 91.4% specificity, with 90.2% accuracy, a positive predictive value (PPV) of 44.3%, and a negative predictive value (NPV) of 97.9%. The ChatGPT-4o mini model's CHAT_2 configuration showed 56.2% sensitivity, 95.1% specificity, 92.0% accuracy, a PPV of 50.6%, and an NPV of 96.1%. Both models demonstrated strong specificity, with CHAT_2 having higher overall accuracy. Despite these promising results, manual validation remains necessary to address false positives and negatives, ensuring that no important studies are overlooked. This study suggests that AI can significantly enhance efficiency and accuracy in systematic reviews, potentially revolutionizing not only biomedical research but also other fields requiring extensive literature reviews.
    Keywords:  abstracting and indexing; artificial intelligence; machine learning; review literature as a topic
    DOI:  https://doi.org/10.1017/rsm.2025.15
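    The likelihood ratios mentioned above follow directly from sensitivity and specificity (LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity). A quick sketch using the figures quoted in this abstract:

      # Derive positive and negative likelihood ratios from the reported
      # sensitivity/specificity of the two best configurations in this entry.
      configs = {
          "LLA_2 (Llama 3 70B)": dict(sensitivity=0.775, specificity=0.914),
          "CHAT_2 (ChatGPT-4o mini)": dict(sensitivity=0.562, specificity=0.951),
      }

      for name, m in configs.items():
          lr_pos = m["sensitivity"] / (1 - m["specificity"])
          lr_neg = (1 - m["sensitivity"]) / m["specificity"]
          print(f"{name}: LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")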
  7. Res Synth Methods. 2025 Nov;16(6): 859-875
      Recent studies highlight the potential of large language models (LLMs) in citation screening for systematic reviews; however, the efficiency of individual LLMs for this application remains unclear. This study aimed to compare accuracy, time-related efficiency, cost, and consistency across four LLMs-GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B-for literature screening tasks. The models screened for clinical questions from the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024. Sensitivity and specificity were calculated for each model based on conventional citation screening results for qualitative assessment. We also recorded the time and cost of screening and assessed consistency to verify reproducibility. A post hoc analysis explored whether integrating outputs from multiple models could enhance screening accuracy. GPT-4o and Llama 3.3 70B achieved high specificity but lower sensitivity, while Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited higher sensitivity at the cost of lower specificity. Citation screening times and costs varied, with GPT-4o being the fastest and Llama 3.3 70B the most cost-effective. Consistency was comparable among the models. An ensemble approach combining model outputs improved sensitivity but increased the number of false positives, requiring additional review effort. Each model demonstrated distinct strengths, effectively streamlining citation screening by saving time and reducing workload. However, reviewing false positives remains a challenge. Combining models may enhance sensitivity, indicating the potential of LLMs to optimize systematic review workflows.
    Keywords:  citation screening; clinical practice guidelines; generative AI; large language models; systematic review
    DOI:  https://doi.org/10.1017/rsm.2025.10014
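    A sketch of the post hoc ensemble idea in this entry: pooling include/exclude votes from several LLMs. The voting rules and the example votes are illustrative; the study's combination method may differ.

      # Combine screening votes from multiple models. A union ("any") raises
      # sensitivity but adds false positives; stricter rules trade that back.
      def ensemble_include(votes, rule="any"):
          """votes: list of booleans (True = include) from different models."""
          if rule == "any":          # union: maximises sensitivity
              return any(votes)
          if rule == "majority":     # compromise between sensitivity and workload
              return sum(votes) > len(votes) / 2
          if rule == "all":          # intersection: maximises specificity
              return all(votes)
          raise ValueError(rule)

      record_votes = {                                     # invented votes, model order assumed
          "rec-001": [True, False, True, False],
          "rec-002": [False, False, True, False],
          "rec-003": [True, True, True, True],
      }
      for rec, votes in record_votes.items():
          decisions = {r: ensemble_include(votes, r) for r in ("any", "majority", "all")}
          print(rec, decisions)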
  8. Res Synth Methods. 2026 Mar;17(2): 365-377
      Meta-research and evidence synthesis require considerable resources. Large language models (LLMs) have emerged as promising tools to assist in these processes, yet their performance varies across models, limiting their reliability. Taking advantage of the wide availability of small (<10 billion parameter) open-source LLMs, we implemented an agreement-based framework in which a decision is taken only if at least a given number of LLMs produce the same response. The decision is otherwise withheld. This approach was tested on 1020 abstracts of randomized controlled trials in rheumatology, using two classic literature review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients, a task that sometimes required calculations. Re-examining abstracts where at least 4 LLMs disagreed with the human gold standard (dual review with adjudication) allowed the construction of an improved gold standard. Compared to a human gold standard and single large LLMs (>70 billion parameters), our framework demonstrated robust performance: several model combinations achieved accuracies above 95%, exceeding the human gold standard, on at least 85% of abstracts (e.g., 3 of 5 models, 4 of 6 models, or 5 of 7 models). Performance variability across individual models was not an issue, as low-performing models contributed fewer accepted decisions. This agreement-based framework offers a scalable solution that can replace human reviewers for most abstracts, reserving human expertise for more complex cases. Such frameworks could significantly reduce the manual burden in systematic reviews while maintaining high accuracy and reproducibility.
    Keywords:  classification; data extraction; evidence synthesis; large language model; meta-analysis
    DOI:  https://doi.org/10.1017/rsm.2025.10054
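    A minimal sketch of the agreement-based framework described above: a response is accepted only when at least k of n LLMs return the same answer, and withheld otherwise. The simulated responses and the 3-of-5 setting are illustrative.

      # Accept a decision only with sufficient cross-model agreement; otherwise
      # withhold it and route the abstract to a human reviewer.
      from collections import Counter

      def agreement_decision(responses, k):
          """responses: list of model outputs for one abstract; k: required agreement."""
          answer, count = Counter(responses).most_common(1)[0]
          if count >= k:
              return answer          # accepted automatically
          return None                # withheld -> human review

      abstract_responses = {                       # simulated outputs from 5 small LLMs
          "trial-17": ["drug", "drug", "drug", "nondrug", "drug"],
          "trial-42": ["120", "112", "120", "124", "n/a"],
      }
      for ident, resp in abstract_responses.items():
          decision = agreement_decision(resp, k=3)   # e.g., a 3-of-5 rule
          print(ident, "->", decision if decision is not None else "withheld (human review)")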
  9. Res Synth Methods. 2025 Nov;16(6): 975-989
      Systematic reviews (SRs) synthesize evidence through a rigorous, labor-intensive, and costly process. To accelerate the title-abstract screening phase of SRs, several artificial intelligence (AI)-based semi-automated screening tools have been developed to reduce workload by prioritizing relevant records. However, their performance is primarily evaluated for SRs of intervention studies, which generally have well-structured abstracts. Here, we evaluate whether these screening tools are equally effective for SRs of prognosis studies, which show greater heterogeneity between abstracts. We conducted retrospective simulations on prognosis and intervention reviews using a screening tool (ASReview). We also evaluated the effects of review scope (i.e., breadth of the research question), number of (relevant) records, and modeling methods within the tool. Performance was assessed in terms of recall (i.e., sensitivity), precision at 95% recall (i.e., positive predictive value at 95% recall), and workload reduction (work saved over sampling at 95% recall [WSS@95%]). The WSS@95% was slightly worse for prognosis reviews (range: 0.324-0.597) than for intervention reviews (range: 0.613-0.895). The precision was higher for prognosis (range: 0.115-0.400) compared to intervention reviews (range: 0.024-0.057). These differences were primarily due to the larger number of relevant records in the prognosis reviews. The modeling methods and the scope of the prognosis review did not significantly impact tool performance. We conclude that the larger abstract heterogeneity of prognosis studies does not substantially affect the effectiveness of screening tools for SRs of prognosis. Further evaluation studies, including a standardized evaluation framework, are needed to enable prospective decisions on the reliable use of screening tools.
    Keywords:  active learning; clinical guideline development; large language models; prioritized screening; semi-automation; systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2025.10025
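    The two headline metrics in this entry, work saved over sampling at 95% recall (WSS@95%) and precision at 95% recall, can be computed from a ranked screening order as sketched below (synthetic scores and labels; not study data).

      # WSS@95% and precision@95% recall from model scores and relevance labels.
      import numpy as np

      def wss_and_precision_at_recall(scores, labels, target_recall=0.95):
          scores, labels = np.asarray(scores), np.asarray(labels).astype(int)
          order = np.argsort(-scores)                  # screen highest-ranked first
          ranked = labels[order]
          cum_relevant = np.cumsum(ranked)
          n_needed = int(np.ceil(target_recall * labels.sum()))
          cutoff = int(np.argmax(cum_relevant >= n_needed)) + 1   # records screened
          n = len(labels)
          wss = (n - cutoff) / n - (1 - target_recall)            # standard WSS definition
          precision = cum_relevant[cutoff - 1] / cutoff
          return wss, precision

      rng = np.random.default_rng(2)
      labels = rng.random(2000) < 0.03                  # ~3% relevant records (simulated)
      scores = rng.random(2000) + 0.7 * labels          # model ranks relevant higher, imperfectly
      wss, prec = wss_and_precision_at_recall(scores, labels)
      print(f"WSS@95% = {wss:.3f}, precision@95% recall = {prec:.3f}")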
  10. Res Synth Methods. 2025 Nov;16(6): 990-1004
      This study aims to explore the feasibility and accuracy of utilizing large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from reference lists of published Cochrane reviews. We developed a structured prompt to guide ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 in assessing the ROB of each cohort study twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then we evaluated the accuracy of the LLMs by calculating the correct assessment rate, sensitivity, specificity, and F1 scores at the overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen's kappa (κ) and prevalence-adjusted bias-adjusted kappa. Efficiency was estimated by the mean assessment time required. This study assessed three LLMs (ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3) and revealed distinct performance across eight assessment items. Overall accuracy was comparable (80.8%-83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o's 0.55, P < 0.001). In terms of F1 scores, Moonshot-v1-128k led in population selection (F1 = 0.80 versus ChatGPT-4o's 0.67, P = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was 97.3% faster per article (32.8 seconds versus 20 minutes manually) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47-50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.
    Keywords:  cohort studies; large language models; risk of bias; systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2025.10028
  11. Res Synth Methods. 2025 Nov;16(6): 1005-1024
      Screening, a labor-intensive aspect of systematic reviews, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models like generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how these GPT models perform in classifying such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models like bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, as well as in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks like scientific multi-class study classification is inferior to smaller fine-tuned models. Nevertheless, the use of automated methods remains promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to develop and evaluate other natural language processing (NLP) approaches.
    Keywords:  animal study; clinical study; language models; natural language processing; randomized controlled trial; systematic review
    DOI:  https://doi.org/10.1017/rsm.2025.10031
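    As a hedged illustration of the fine-tuned baseline in this entry, the sketch below fine-tunes a pretrained BERT-style encoder for multi-class study-type classification with Hugging Face Transformers. The model name, the four toy labels, and the example abstracts are placeholders, not the study's corpus or configuration.

      # Fine-tune a small encoder to classify titles/abstracts into study types.
      import torch
      from torch.utils.data import Dataset
      from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                Trainer, TrainingArguments)

      LABELS = ["rct", "animal_study", "systematic_review", "case_report"]  # subset for illustration
      MODEL = "distilbert-base-uncased"  # stand-in; the study used BERT variants

      class AbstractDataset(Dataset):
          def __init__(self, texts, labels, tokenizer):
              self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
              self.labels = labels
          def __len__(self):
              return len(self.labels)
          def __getitem__(self, i):
              item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
              item["labels"] = torch.tensor(self.labels[i])
              return item

      texts = ["A randomized controlled trial of drug X versus placebo ...",
               "We assessed lesion size in a murine model of stroke ...",
               "We systematically searched MEDLINE and Embase ...",
               "We report a 54-year-old patient presenting with ..."]
      labels = [0, 1, 2, 3]   # toy training examples only

      tok = AutoTokenizer.from_pretrained(MODEL)
      model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))
      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir="study-type-clf", num_train_epochs=1,
                                 per_device_train_batch_size=2, logging_steps=1),
          train_dataset=AbstractDataset(texts, labels, tok),
      )
      trainer.train()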
  12. Res Synth Methods. 2025 Jul;16(4): 601-619
       INTRODUCTION: With the increasing accessibility of tools such as ChatGPT, Copilot, DeepSeek, Dall-E, and Gemini, generative artificial intelligence (GenAI) has been positioned as a potential time-saving research tool, especially for synthesising evidence. Our objective was to determine whether GenAI can assist with evidence synthesis by assessing its performance in terms of accuracy, error rates, and time savings compared to the traditional expert-driven approach.
    METHODS: To systematically review the evidence, we searched five databases on 17 January 2025, synthesised outcomes reporting on the accuracy, error rates, or time taken, and appraised the risk-of-bias using a modified version of QUADAS-2.
    RESULTS: We identified 3,071 unique records, 19 of which were included in our review. Most studies had a high or unclear risk-of-bias in Domain 1A (review selection), Domain 2A (GenAI conduct), and Domain 1B (applicability of results). When used for (1) searching, GenAI missed 68% to 96% (median = 91%) of studies; (2) screening, it made incorrect inclusion decisions in 0% to 29% of cases (median = 10%) and incorrect exclusion decisions in 1% to 83% (median = 28%); (3) data extraction, it made incorrect extractions in 4% to 31% (median = 14%); and (4) risk-of-bias assessment, it made incorrect judgements in 10% to 56% (median = 27%).
    CONCLUSION: Our review shows that the current evidence does not support GenAI use in evidence synthesis without human involvement or oversight. However, for most tasks other than searching, GenAI may have a role in assisting humans with evidence synthesis.
    Keywords:  automation; evidence synthesis; generative artificial intelligence (GenAI); large language models (LLMs); systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2025.16
  13. Res Synth Methods. 2025 Nov;16(6): 1035-1041
      A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases. We hypothesize that it is possible to screen articles in electronic databases using large language models (LLMs) as an alternative to search equations. To investigate this matter, we compared two methods to identify randomized controlled trials (RCTs) in electronic databases: filtering databases using the Cochrane highly sensitive search and an assessment by an LLM. We retrieved studies indexed in PubMed with a publication date between September 1 and September 30, 2024, using the sole keyword "diabetes." We compared the performance of the Cochrane highly sensitive search and the assessment of all titles and abstracts extracted directly from the database by GPT-4o-mini to identify RCTs. The reference standard was the manual screening of retrieved articles by two independent reviewers. The search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs. The Cochrane highly sensitive search filtered 2197 records and missed one RCT (sensitivity 99.5%, 95% CI 97.4% to 100%; specificity 67.8%, 95% CI 66.6% to 68.9%). Assessment of all titles and abstracts from the electronic database by GPT filtered 1080 records and included all 210 primary reports of RCTs (sensitivity 100%, 95% CI 98.3% to 100%; specificity 85.9%, 95% CI 85.0% to 86.8%). LLMs can screen all articles in electronic databases to identify RCTs as an alternative to the Cochrane highly sensitive search. This calls for the evaluation of LLMs as an alternative to rigid search strategies.
    Keywords:  Large language models; search strategy; systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2025.10034
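    The confidence intervals quoted above are reproduced by exact (Clopper-Pearson) binomial intervals on the reported counts, as sketched below; the authors' CI method is an assumption.

      # Exact (Clopper-Pearson) 95% CIs for the sensitivities reported above.
      from scipy.stats import beta

      def clopper_pearson(x, n, alpha=0.05):
          lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
          upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
          return lower, upper

      # Cochrane highly sensitive search: found 209 of 210 RCTs (missed one).
      # GPT-4o-mini assessment: found all 210 of 210 RCTs.
      for name, hits, total in [("Cochrane filter sensitivity", 209, 210),
                                ("GPT-4o-mini sensitivity", 210, 210)]:
          lo, hi = clopper_pearson(hits, total)
          print(f"{name}: {hits/total:.1%} (95% CI {lo:.1%} to {hi:.1%})")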
  14. Res Synth Methods. 2025 Mar;16(2): 350-363
      Machine learning (ML) models have been developed to identify randomised controlled trials (RCTs) to accelerate systematic reviews (SRs). However, their use has been limited due to concerns about their performance and practical benefits. We developed a high-recall ensemble learning model using Cochrane RCT data to enhance the identification of RCTs for rapid title and abstract screening in SRs and evaluated the model externally with our annotated RCT datasets. Additionally, we assessed the practical impact in terms of labour time savings and recall improvement under two scenarios: ML-assisted double screening (where ML and one reviewer screened all citations in parallel) and ML-assisted stepwise screening (where ML flagged all potential RCTs, and at least two reviewers subsequently filtered the flagged citations). Our model achieved twice the precision compared to the existing SVM model while maintaining a recall of 0.99 in both internal and external tests. In a practical evaluation with ML-assisted double screening, our model led to significant labour time savings (average 45.4%) and improved recall (average 0.998 compared to 0.919 for a single reviewer). In ML-assisted stepwise screening, the model performed similarly to standard manual screening but with average labour time savings of 74.4%. In conclusion, compared with existing methods, the proposed model can reduce workload while maintaining comparable recall when identifying RCTs during the title and abstract screening stages, thereby accelerating SRs. We propose practical recommendations to effectively apply ML-assisted manual screening when conducting SRs, depending on reviewer availability (ML-assisted double screening) or time constraints (ML-assisted stepwise screening).
    Keywords:  conducting systematic review; ensemble learning; impact on practice; title and abstract screening
    DOI:  https://doi.org/10.1017/rsm.2025.3
  15. Res Synth Methods. 2026 Jan;17(1): 42-62
      Large language models have shown promise for automating data extraction (DE) in systematic reviews (SRs), but most existing approaches require manual interaction. We developed an open-source system using GPT-4o to automatically extract data with no human intervention during the extraction process. We developed the system on a dataset of 290 randomized controlled trials (RCTs) from a published SR about cognitive behavioral therapy for insomnia. We evaluated the system on two other datasets: 5 RCTs from an updated search for the same review and 10 RCTs used in a separate published study that had also evaluated automated DE. We developed the best approach across all variables in the development dataset using GPT-4o. The performance in the updated-search dataset using o3 was 74.9% sensitivity, 76.7% specificity, 75.7% precision, 93.5% variable detection comprehensiveness, and 75.3% accuracy. In both datasets, accuracy was higher for string variables (e.g., country, study design, drug names, and outcome definitions) compared with numeric variables. In the third external validation dataset, GPT-4o showed a lower performance, with a mean accuracy of 84.4%, compared with the previous study. However, by adjusting our DE method, while maintaining the same prompting technique, we achieved a mean accuracy of 96.3%, which was comparable to the previous manual extraction study. Our system shows potential for assisting the DE of string variables alongside a human reviewer. However, it cannot yet replace humans for numeric DE. Further evaluation across diverse review contexts is needed to establish broader applicability.
    Keywords:  GPT-4o; data extraction automation; large language models; o3; systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2025.10030
  16. Res Synth Methods. 2025 Jan;16(1): 211-227
      Bibliographic aggregators like OpenAlex and Semantic Scholar offer scope for automated citation searching within systematic review production, promising increased efficiency. This study aimed to evaluate the performance of automated citation searching compared to standard search strategies and examine factors that influence performance. Automated citation searching was simulated on 27 systematic reviews across the OpenAlex and Semantic Scholar databases, across three study areas (health, environmental management and social policy). Performance, measured by recall (proportion of relevant articles identified), precision (proportion of relevant articles identified from all articles identified), and F1-F3 scores (weighted average of recall and precision), was compared to the performance of search strategies originally employed by each systematic review. The associations between systematic review study area, number of included articles, number of seed articles, seed article type, study type inclusion criteria, API choice, and performance were analyzed. Automated citation searching outperformed the reference standard in terms of precision (p < 0.05) and F1 score (p < 0.05) but failed to outperform in terms of recall (p < 0.05) and F3 score (p < 0.05). Study area influenced the performance of automated citation searching, with performance being higher within the field of environmental management compared to social policy. Automated citation searching is best used as a supplementary search strategy in systematic review production where recall is more important than precision, due to inferior recall and F3 score. However, observed outperformance in terms of F1 score and precision suggests that automated citation searching could be helpful in contexts where precision is as important as recall.
    Keywords:  automation; evidence synthesis; guideline development; learning health systems; scoping review; systematic reviews
    DOI:  https://doi.org/10.1017/rsm.2024.15
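    The F1-F3 scores above are instances of the general F-beta measure, where larger beta weights recall more heavily than precision. A short sketch with illustrative precision/recall values:

      # F-beta: beta = 1 weights recall and precision equally; beta = 3 favours recall.
      def f_beta(precision, recall, beta):
          if precision == 0 and recall == 0:
              return 0.0
          b2 = beta ** 2
          return (1 + b2) * precision * recall / (b2 * precision + recall)

      # Illustrative values only: a high-recall, low-precision citation search.
      precision, recall = 0.05, 0.95
      for b in (1, 2, 3):
          print(f"F{b} = {f_beta(precision, recall, b):.3f}")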
  17. Res Synth Methods. 2025 Nov;16(6): 953-960
      Our objective was to evaluate the recall and number needed to read (NNR) for the Cochrane RCT Classifier compared to and in combination with established search filters developed for Ovid MEDLINE and Embase.com. A gold standard set of 1,103 randomized controlled trials (RCTs) was created to calculate recall for the Cochrane RCT Classifier in Covidence, the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE and the Cochrane Embase RCT filter for Embase.com. In addition, the classifier and the filters were validated in three case studies using reports from the Swedish Agency for Health Technology Assessment and Assessment of Social Services to assess impact on search results and NNR. The Cochrane RCT Classifier had the highest recall with 99.64% followed by the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE with 98.73% and the Cochrane Embase RCT filter with 98.46%. However, the Cochrane RCT Classifier had a higher NNR than the RCT filters in all case studies. Combining the RCT filters with the Cochrane RCT Classifier reduced NNR compared to using the RCT filters alone while achieving a recall of 98.46% for the Ovid MEDLINE/RCT Classifier combination and 98.28% for the Embase/RCT Classifier combination. In conclusion, we found that the Cochrane RCT Classifier in Covidence has a higher recall than established search filters but also a higher NNR. Thus, using the Cochrane RCT Classifier instead of current state-of-the-art RCT filters would lead to an increased workload in the screening process. A viable option with a lower NNR than RCT filters, at the cost of a slight decrease in recall, is to combine the Cochrane RCT Classifier with RCT filters in database searches.
    Keywords:  literature searching; machine learning; randomized controlled trials; search filters; study classifiers; systematic review software
    DOI:  https://doi.org/10.1017/rsm.2025.10023
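    Recall and number needed to read (NNR), the quantities compared above, can be derived from simple counts: recall = relevant records retrieved / total relevant, and NNR = records retrieved per relevant record (i.e., 1/precision). The counts below are invented for illustration only.

      # Recall and NNR for a filter, a classifier, and their combination.
      def recall_and_nnr(retrieved_relevant, total_relevant, total_retrieved):
          recall = retrieved_relevant / total_relevant
          nnr = total_retrieved / retrieved_relevant   # equivalent to 1 / precision
          return recall, nnr

      scenarios = {                                    # hypothetical counts
          "RCT filter alone":        dict(retrieved_relevant=108, total_relevant=110, total_retrieved=1500),
          "Cochrane RCT Classifier": dict(retrieved_relevant=110, total_relevant=110, total_retrieved=2600),
          "Filter AND classifier":   dict(retrieved_relevant=108, total_relevant=110, total_retrieved=1200),
      }
      for name, c in scenarios.items():
          r, nnr = recall_and_nnr(**c)
          print(f"{name}: recall {r:.1%}, NNR {nnr:.1f}")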
  18. Front Digit Health. 2025 ;7 1706383
       Background: The rapid evolution of general large language models (LLMs) provides a promising framework for integrating artificial intelligence into medical practice. While these models are capable of generating medically relevant language, their application in evidence inference in clinical scenarios may pose potential challenges. This study employs empirical experiments to analyze the capability boundaries of current general-purpose LLMs within evidence-based medicine (EBM) tasks, and provides a philosophical reflection on their limitations.
    Methods: This study evaluates the performance of three general-purpose LLMs, including ChatGPT, DeepSeek, and Gemini, when directly applied to core tasks of EBM. The models were tested in a baseline, unassisted setting, without task-specific fine-tuning, external evidence retrieval, or embedded prompting frameworks. Two clinical scenarios, namely SGLT2 inhibitors for heart failure and PD-1/PD-L1 inhibitors for advanced NSCLC were used to assess performance in evidence generation, evidence synthesis, and clinical judgment. Model outputs were evaluated using a multidimensional rubric. The empirical results were analyzed from an epistemological perspective.
    Results: Experiments show that the evaluated general-purpose LLMs can produce syntactically coherent and medically plausible outputs in core evidence-related tasks. However, under current architectures and baseline deployment conditions, several limitations remain, including imperfect accuracy in numerical extraction and processing, limited verifiability of cited sources, inconsistent methodological rigor in synthesis, and weak attribution of clinical responsibility in recommendations. Building on these empirical patterns, the philosophical analysis reveals three potential risks in this testing setting, including disembodiment, deinstitutionalization, and depragmatization.
    Conclusions: This study suggests that directly applying general-purpose LLMs to clinical evidence tasks entails some limitations. Under current architectures, these systems lack embodied engagement with clinical phenomena, do not participate in institutional evaluative norms, and cannot assume responsibility for reasoning. These findings provide a directional compass for future medical AI, including grounding outputs in real-world data, integrating deployment into clinical workflows with oversight, and designing human-AI collaboration with clear responsibility.
    Keywords:  artificial intelligence; evidence mechanism; evidence-based medicine; large language model; philosophy of science
    DOI:  https://doi.org/10.3389/fdgth.2025.1706383
  19. ESMO Real World Data Digit Oncol. 2024 Dec;6 100078
       Background: Large language models encode clinical knowledge and can answer medical expert questions out-of-the-box without further training. However, this zero-shot performance is limited by outdated training data and lack of explainability impeding clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it against state-of-the-art models as well as historical urologists' performance in answering urological board questions in a fully clinician-verifiable manner.
    Materials and methods: We developed UroBot, a software pipeline based on the GPT-3.5, GPT-4, and GPT-4o models by OpenAI, utilizing retrieval augmented generation and the 2023 European Association of Urology guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation involved 10 runs with 200 European Board of Urology in-service assessment questions, with the performance measured by the mean rate of correct answers (RoCA).
    Results: UroBot-4o achieved the highest RoCA, with an average of 88.4%, outperforming GPT-4o (77.6%) by 10.8 percentage points. In addition, it is clinician-verifiable and demonstrated the highest level of agreement between runs, as measured by Fleiss' kappa (κ = 0.979). In comparison, the average performance of urologists on urological board questions is 68.7%, as reported in the literature.
    Conclusions: UroBot is a clinician-verifiable and accurate software pipeline and outperforms published models and urologists in answering urology board questions. We provide code and instructions to use and extend UroBot for further development.
    Keywords:  evidence-based urology; large language models; retrieval augmented generation
    DOI:  https://doi.org/10.1016/j.esmorw.2024.100078
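    A simplified, hedged sketch of the retrieval-augmented generation pattern behind UroBot: retrieve the guideline passages most similar to the question and prepend them to the prompt so the answer can be verified against cited text. TF-IDF retrieval and the placeholder passages stand in for the pipeline's actual retriever and the EAU guideline corpus; no API call is made here.

      # Retrieve top passages for a question and assemble a citation-grounded prompt.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      guideline_passages = [
          "Placeholder passage about diagnosis ...",
          "Placeholder passage about first-line treatment ...",
          "Placeholder passage about follow-up intervals ...",
      ]
      question = "What is the recommended first-line treatment?"

      vectorizer = TfidfVectorizer().fit(guideline_passages + [question])
      passage_vecs = vectorizer.transform(guideline_passages)
      question_vec = vectorizer.transform([question])
      scores = cosine_similarity(question_vec, passage_vecs)[0]
      top_k = scores.argsort()[::-1][:2]                      # keep the 2 best passages

      context = "\n\n".join(guideline_passages[i] for i in top_k)
      prompt = (f"Answer the board question using only the guideline excerpts below, "
                f"and cite them.\n\nExcerpts:\n{context}\n\nQuestion: {question}")
      print(prompt)   # this prompt would then be sent to the chosen LLM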
  20. Nature. 2026 Feb 04.
      Scientific progress depends on the ability of researchers to synthesize the growing body of literature. Can large language models (LLMs) assist scientists in this task? Here we introduce OpenScholar, a specialized retrieval-augmented language model (LM) that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience and biomedicine. Despite being a smaller open model, OpenScholar-8B outperforms GPT-4o by 6.1% and PaperQA2 by 5.5% in correctness on a challenging multi-paper synthesis task from the new ScholarQABench. Although GPT-4o hallucinates citations 78-90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's data store, retriever and self-feedback inference loop improve off-the-shelf LMs: for instance, OpenScholar-GPT-4o improves the correctness of GPT-4o by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT-4o responses over expert-written ones 51% and 70% of the time, respectively, compared with 32% for GPT-4o. We open-source all artefacts, including our code, models, data store, datasets and a public demo.
    DOI:  https://doi.org/10.1038/s41586-025-10072-4