bims-arines 2026-06-21 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–06–21
nineteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

Cochrane Evaluation of (Semi-) Automated Review Methods (CESAR): Protocol for an adaptive platform study within reviews.
REFLECTIVE-TIAB: cost-effective prompt optimisation for large language model-based title and abstract screening in literature reviews.
AI-Assisted Systematic Literature Review of the Economic Burden of Pneumococcal Disease: Development and Validation Study.
Evaluating the accuracy and speed of eight deduplication tools: A comparative study.
Harnessing artificial intelligence for scalable evidence synthesis in reviews: Application in a bibliometric analysis of physical activity technologies.
Performance of Zero-Shot Classifiers for Categorizing RCT Abstracts by Intervention Type: Validation Study.
Development and Validation of an AI Tool for Automated PICO Scoping for the European Joint Clinical Assessment (JCA): A Proof of Concept.
Large language models for full-text methods assessment: a case study on mediation analysis.
Information Specialist Roles in the Era of Large Language Models: Prompting Continued Professional Development.
ChatGPT assisted generation of systematic review ideas in urooncology.
Large language models for systematic reviews were reported to perform well but rarely with verifiable safeguards: a cross-sectional study.
The use and methodological reporting of large language models in qualitative research: a scoping review.
Design and methodology of the AI-empowered Clinical Evidence for Integrated Chinese-Western Medicine (ACE-iMed) platform.
Evidence-based AI: from trailblazer to trustblazer?
Automated tools for evidence quality assessment: a scoping review.
Enhancing the quality and trustworthiness of large language model-generated summaries of clinical oncology literature.
Challenges of using AI-based synthetic data in Health economics and Outcomes Research.
A framework for evaluating factual consistency in automated text summarization with large language models and prompting strategies.
AI and social science: Automatic classification tools for big data analysis in sociological research.

J Clin Epidemiol. 2026 Jun 19. pii: S0895-4356(26)00266-0. [Epub ahead of print] 112390

Cochrane Evaluation of (Semi-) Automated Review Methods (CESAR): Protocol for an adaptive platform study within reviews.

Gerald Gartlehner, Susan Banda, Max Callaghan, Jo-Ana Chase, Andreea Dobrescu, Angelika Eisele-Metzger, Ella Flemyng, Sean Gardner, Ursula Griebler, Bartosz Helfer, Pawel Jemiolo, Biljana Macura, Jan C Minx, Anna Noel-Storr, Noosheen Rajabzadeh Tahmasebi, Amin Sharifan, Joerg J Meerpohl, James Thomas.

   BACKGROUND: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews.
METHODS: Members of the Cochrane Evaluation of (Semi-) Automated Review Methods (CESAR) project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles.
RESULTS: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations.
DISCUSSION: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses. By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.
PLAIN LANGUAGE SUMMARY: Doctors and researchers rely on systematic reviews, which are thorough summaries of all available research on a health topic, to guide decisions about patient care. However, creating these reviews is a slow and demanding process, often taking more than a year to finish. Artificial intelligence (AI) tools could help speed up this work and reduce human errors, but there are currently no reliable ways to test how well these tools perform in real-world settings. This paper describes the design of a study that will rigorously test how well AI tools perform when used in actual systematic review workflows, specifically within Cochrane Reviews. The study will compare AI-assisted methods with the traditional approach, where two trained researchers independently complete each step. It will look at three main tasks: choosing which studies might be relevant based on their titles and abstracts, reading the full-text publication to confirm which studies should be included, and extracting important information from those studies. A key strength of this study is its flexible design. Instead of testing just one AI tool at a single point in time, the study allows researchers to add or remove AI tools as new ones become available, similar to how some modern drug trials are run. This approach helps the study keep up with the fast pace of AI development. Researchers will assess the AI tools based on their accuracy, the time they save, how consistent their results are, and how easy they are to use. The ultimate goal of this study is to give the research community strong evidence about when and how AI can be safely and effectively used in systematic reviews to help summarize medical research.
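
The accuracy metrics named in this protocol (sensitivity, specificity, precision) are conventionally derived from a two-by-two comparison of tool decisions against a human reference standard. The following is a minimal sketch of those standard definitions, not the CESAR analysis code; the class name, function names, and example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing AI decisions with a human reference standard."""
    tp: int  # relevant records the tool included
    fp: int  # irrelevant records the tool included
    fn: int  # relevant records the tool excluded
    tn: int  # irrelevant records the tool excluded


def sensitivity(c: ConfusionCounts) -> float:
    """Share of truly relevant records that the tool included (recall)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Share of truly irrelevant records that the tool excluded."""
    return c.tn / (c.tn + c.fp)


def precision(c: ConfusionCounts) -> float:
    """Share of tool inclusions that were truly relevant."""
    return c.tp / (c.tp + c.fp)


# Hypothetical counts for illustration only; not data from the protocol.
example = ConfusionCounts(tp=45, fp=30, fn=5, tn=920)
print(sensitivity(example), specificity(example), precision(example))
```

The sketch does not guard against empty denominators (for example, a sample with no relevant records), which a real analysis would need to handle.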

Keywords:  Study protocol; artificial intelligence; evidence synthesis; study within reviews; workflow validation

DOI:  https://doi.org/10.1016/j.jclinepi.2026.112390

Expert Rev Pharmacoecon Outcomes Res. 2026 Jun 17.

REFLECTIVE-TIAB: cost-effective prompt optimisation for large language model-based title and abstract screening in literature reviews.

Ákos Józwiak, Attila Imre, Judit Hagymásy, Judit Tittmann, Ágnes Nagy, Sándor Kovács, Przemyslaw Kardas, Job Fm van Boven, Irene Mommers, Tamás Ágh.

   BACKGROUND: Title and abstract screening is a labor-intensive stage of systematic reviews. Large language models (LLMs) can automate this process, but performance depends heavily on prompt design and model selection, which is typically manual and time-consuming. Our objective was to evaluate whether automated, reflection-driven prompt optimization improves LLM performance during title and abstract screening.
RESEARCH DESIGN AND METHODS: REFLECTIVE-TIAB uses the GEPA reflective prompt optimizer to improve prompts under an asymmetric loss penalizing false negatives. Nine LLMs screened 8,520 de-duplicated records from a COPD exacerbation predictor search. A 100-abstract gold standard was constructed from inter-model disagreements and was expert-labeled. The prompt was optimized on Llama 3.3 70B via DSPy/GEPA and evaluated across all nine models.
RESULTS: Optimization improved recall across all LLMs (+3.7% to +37.1%). Gemini 3 Flash Preview achieved the highest performance (91% accuracy, F1 81.6%) while costing 25-fold less per abstract than GPT-5.2, which ranked among the lowest-performing models. A prompt optimized on a single open-source model generalized to all nine without retraining. Total optimization cost was $6.36.
CONCLUSIONS: REFLECTIVE-TIAB provides automated, model-transferable prompt optimization for literature screening at negligible cost. Model price did not predict screening performance. The framework could substantially reduce screening workload while preserving comprehensiveness.
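
The idea of an asymmetric loss that penalizes false negatives can be illustrated with a small scoring function. This is only a sketch under stated assumptions: the abstract does not give the loss actually used in REFLECTIVE-TIAB, the weights below are hypothetical, and nothing here calls DSPy or GEPA.

```python
def asymmetric_screening_loss(labels, predictions, fn_weight=5.0, fp_weight=1.0):
    """Mean weighted error for include/exclude decisions.

    labels and predictions are equal-length sequences of booleans
    (True = include). A missed relevant record (false negative) costs
    fn_weight; a wrongly included record (false positive) costs fp_weight.
    Both weights are hypothetical illustration values.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    total = 0.0
    for truth, predicted in zip(labels, predictions):
        if truth and not predicted:
            total += fn_weight  # relevant record was excluded
        elif predicted and not truth:
            total += fp_weight  # irrelevant record was included
    return total / len(labels)


# Two hypothetical prompts evaluated on the same four labeled abstracts.
gold = [True, True, False, False]
misses_one = [True, False, False, False]   # one false negative
over_includes = [True, True, True, False]  # one false positive
print(asymmetric_screening_loss(gold, misses_one))
print(asymmetric_screening_loss(gold, over_includes))
```

With weights like these, an optimizer minimizing the loss would prefer the over-inclusive prompt, which matches the stated goal of preserving comprehensiveness.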

Keywords:  Artificial intelligence; chronic obstructive pulmonary disease; evidence synthesis; healthcare; large language model; systematic review

DOI:  https://doi.org/10.1080/14737167.2026.2691995

JMIR AI. 2026 Jun 15. 5 e81049

AI-Assisted Systematic Literature Review of the Economic Burden of Pneumococcal Disease: Development and Validation Study.

Dong Wang, Surabhi Datta, Julie Glasgow, Kyeryoung Lee, Hunki Paek, Jun Zhang, Yi Zheng, Yi-Ling Huang, Long He, Majid Rastegar-Mojarad, Kelsie Cassell, Xiaoyan Wang, Nicole Cossrow.

   Background: Automated systematic literature review (SLR) may reduce the workload and errors associated with manual review, enabling faster, up-to-date reviews even with increasing publication volumes. Large language models (LLMs) have demonstrated strong capabilities in understanding unstructured languages. However, few studies have explored the potential of a comprehensive LLM platform to streamline the entire SLR process from article screening to data extraction.
Objective: This study aimed to investigate the feasibility of applying an LLM-based system to assist with SLR development.
Methods: We developed the Intelligent Systematic Literature Review (ISLaR 2.0) platform, powered by an LLM, and applied it to a use case of the economic burden of pneumococcal disease (PD) literature. First, we established the inclusion and exclusion criteria for the SLR. Second, we defined data elements related to economic burden and domain knowledge, along with guidelines for applying these definitions. Finally, we used the criteria and data element specifications to develop LLM prompts for screening and data extraction. For data extraction, we identified relevant study characteristics and economic burden outcomes. We evaluated ISLaR 2.0's performance against a gold standard of 50 expert-curated PD articles, using standard metrics (accuracy, precision, recall, and F1-score). We also conducted a qualitative analysis to describe errors made by the system.
Results: ISLaR 2.0 performed well in abstract and full-text screening (F1-scores of 86.27 for abstract screening and 87.18 for full-text screening) and data extraction from text (F1-scores of 92.83 for study details and 79.76 for economic burden outcomes). The F1-score for data extraction of tabular economic burden outcome data was 94.83. The qualitative analysis revealed 2 main challenges in extracting economic burden details: misclassification of cost categories and failure to extract relevant information.
Conclusions: ISLaR 2.0 enabled efficient execution of an SLR regarding the economic burden of PD. The platform allowed users to flexibly define and modify criteria and data elements, supporting its use across a broad range of health research topics.

Keywords:  AI; GenAI; artificial intelligence; economic burden; generative artificial intelligence; large language models; natural language processing; pneumococcal disease; systematic literature review

DOI:  https://doi.org/10.2196/81049

Res Synth Methods. 2026 Jun 17. 1-14

Evaluating the accuracy and speed of eight deduplication tools: A comparative study.

Sarah Bateup, Helen Fulbright, Klas Moberg, Kaitlyn Hair, Emmy Peterson, Claire M Stansfield, Riaz Qureshi, Justin Clark.

  A key task in conducting systematic reviews is deduplicating the results from database searching. Deduplication using reference management software can be time-consuming and prone to error, while automated tools can be expensive and lack transparency. To support review teams, we evaluated eight deduplication tools: (1) The Automated Systematic Search Deduplicator (ASySD); (2) Covidence; (3) Deduklick; (4) EPPI-Reviewer; (5) PICO Portal; (6) Rayyan; (7) The Systematic Review Accelerator (SRA) Deduplicator: Focused; (8) The SRA Deduplicator: Relaxed. The searches of five randomly selected Cochrane reviews were rerun to create five gold standard sets. We compared the gold standard sets to the outputs of the eight deduplication tools and evaluated the results for: (1) unique records removed; (2) duplicate records retained; (3) time taken to deduplicate. Summed across all five reviews, the unique records removed in error ranged from 2 to 22; the three best tools were: (1) Rayyan; (2) Covidence; (3) SRA Deduplicator: Focused. The duplicate records retained in error ranged from 34 to 280; the three best tools were: (1) ASySD; (2) Rayyan; (3) EPPI-Reviewer. The time taken to deduplicate ranged from one minute to 20 hours and 34 minutes; the three fastest tools were: (1) SRA Deduplicator: Relaxed; (2) Deduklick; (3) Covidence. No tool performed so poorly that we would not recommend using it. However, all the tools had strengths and weaknesses: some are expensive, while others require large amounts of manual checking time. We therefore recommend that review teams compare the tools across all three outcomes and choose the tool that best suits their needs.
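
The two error counts used in this comparison can be sketched as follows. The abstract does not spell out the exact counting rules, so this assumes one plausible definition: a unique record is removed in error when a tool keeps no copy of a gold-standard record, and a duplicate is retained in error for every extra copy kept. The function name and the example identifiers are hypothetical.

```python
from collections import Counter


def dedup_errors(gold_groups, retained_ids):
    """Count deduplication errors against a gold standard.

    gold_groups maps each record ID to a group ID; records sharing a
    group ID are duplicates of one another in the gold standard.
    retained_ids is the set of record IDs a tool kept (all must appear
    in gold_groups).
    Returns (unique_removed_in_error, duplicates_retained_in_error).
    """
    kept_per_group = Counter(gold_groups[record] for record in retained_ids)
    all_groups = set(gold_groups.values())
    # A group with no surviving copy means a unique record was lost.
    unique_removed = sum(1 for group in all_groups if kept_per_group[group] == 0)
    # Every copy beyond the first in a group is a retained duplicate.
    duplicates_retained = sum(n - 1 for n in kept_per_group.values() if n > 1)
    return unique_removed, duplicates_retained


# Hypothetical gold standard: groups A and C contain duplicates, B is unique.
gold = {"r1": "A", "r2": "A", "r3": "B", "r4": "C", "r5": "C", "r6": "C"}
retained = {"r1", "r2", "r4"}  # kept both copies of A, lost B, kept one of C
print(dedup_errors(gold, retained))
```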

Keywords:  automation; deduplication; duplicate references; evidence synthesis; systematic review software; systematic reviews

DOI:  https://doi.org/10.1017/rsm.2026.10100

Digit Health. 2026 Jan-Dec;12:12 20552076261455158

Harnessing artificial intelligence for scalable evidence synthesis in reviews: Application in a bibliometric analysis of physical activity technologies.

George Thomas, Stephanie Alley, Meighan Browne, Hannes Baumann, Mitch J Duncan, Corneel Vandelanotte, Nicholas D Gilson.

   Introduction: Artificial intelligence (AI) tools offer promising opportunities to support evidence synthesis at scale. This study presents a novel AI-human hybrid screening approach to a large-scale bibliometric analysis of technologies promoting physical activity.
Methods: Records (n = 28,957) were retrieved from electronic databases and screened using ASReview, an open-source machine learning tool. Over 100 seed articles trained the model. Screening followed the SAFE framework across four phases, including (1) initial random screening to inform stopping rules, (2) active learning with human reviewers, and multi-model rescreening of (3) unlabelled and (4) excluded records to minimise risk of missed studies.
Results: In Phase 1, a random 1% sample (n = 290) was screened, identifying 20 relevant records. In Phase 2, 3,994 records were screened using active screening, identifying 2,904 relevant studies. In Phase 3, re-screening of unlabelled records (n = 410) identified 53 additional studies, while Phase 4 re-evaluation of excluded records yielded a further 226 studies. Across all phases, 3,183 records were identified as relevant, with 2,985 retained for analysis following post-screening exclusions (n = 598). Only 18% of records required manual screening, saving an estimated 592 hours.
Conclusion: AI-assisted screening offers a feasible and efficient approach for large-scale evidence synthesis when supported by structured workflows and safeguards. While methods like careful seed selection and stopping rules improve rigour, challenges remain, particularly residual risks and reliance on manual data extraction. Future work should focus on extending AI to downstream tasks and embedding human-in-the-loop approaches to ensure it serves as a reliable, transparent partner in evidence synthesis.
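
Stopping rules in active-learning screening are often expressed as a run of consecutive irrelevant records. The sketch below shows that common heuristic only; the abstract does not describe the specific rules applied under the SAFE framework, the window size is a hypothetical value, and the function is not part of ASReview.

```python
def should_stop(decisions, window=50):
    """Return True once the last `window` screened records were all irrelevant.

    decisions is the ordered list of human screening decisions so far
    (True = relevant). The window size is a hypothetical illustration
    value, not a recommendation.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(decisions) < window:
        return False  # not enough records screened to judge
    return not any(decisions[-window:])


# Hypothetical screening history: one relevant record, then 50 irrelevant.
history = [True] + [False] * 50
print(should_stop(history, window=50))
```

A rule like this trades workload against the residual risk of missed studies, which is why the study rescreens unlabelled and excluded records in later phases.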

Keywords:  artificial intelligence; bibliometric analysis; evidence synthesis; machine learning; physical activity

DOI:  https://doi.org/10.1177/20552076261455158

JMIR Med Inform. 2026 Jun 18. 14 e77943

Performance of Zero-Shot Classifiers for Categorizing RCT Abstracts by Intervention Type: Validation Study.

Diana Buitrago-Garcia, Delphine S Courvoisier, Sami Capderou, Michele Iudici, Denis Mongin.

   Background: Artificial intelligence has gained relevance due to its potential to reduce the workload in evidence synthesis or bibliometric projects. While the main focus has lately been on the use of instruction-tuned large language models, zero-shot classification models have not been tested for such tasks. These models are large language models trained on large labeled datasets and able to categorize text among proposed labels, irrespective of the text domain or the topic. They are relatively small, able to run on consumer-grade computers, and almost hyperparameter-free.
Objective: In our study, we use abstracts of randomized clinical trials in rheumatology as a case example to evaluate the performance of openly available, generalist, zero-shot classification models in classifying types of interventions against a human gold standard.
Methods: We classified all rheumatology RCT abstracts published between 2009 and 2022 (n=1,054) as "drug" or "non-drug" using two zero-shot text classification models (DeBERTa and BART) and few-shot prompting using Llama3 8B. Different labeling of categories provided to the zero-shot classification models and different prompts provided to Llama3 8B were tested. Performance was evaluated using accuracy and predictive value of both categories against a human gold standard.
Results: Most randomized controlled trials, RCTs (452/1054, 42.9%) assessed drug interventions. The DeBERTa model achieved the highest accuracy (929/1054, 88.1%; 95% CI 86%-90%) when using the "drug" and "non-drug" labels. Llama3 8B and few-shot prompting had slightly higher accuracy and predictive values. Both zero-shot and Llama3 8B models had performance on par with a human without experience in evidence synthesis (905/1054, 85.9%; 95% CI 83.6%-87.8% accuracy). Misclassifications occurred for trials where the intervention was harder to classify, such as procedures (eg, intra-articular injections), food compounds, vitamins, supplements, or biological treatments.
Conclusions: This study shows the potential of zero-shot classification models for simple classification tasks, demonstrating accuracy comparable to that of an untrained human. These models are potential tools to streamline systematic review tasks for bibliometric studies in classifying abstracts by supplementing one reviewer.
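
The accuracy figures above are proportions with 95% confidence intervals. The abstract does not state which interval method was used; as an assumption, the sketch below uses the Wilson score interval, which for 905/1054 gives roughly 83.6% to 87.8%, in line with the interval reported for the human comparator. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile. Returns (low, high).
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the abstract: DeBERTa and the untrained human reviewer.
deberta_low, deberta_high = wilson_interval(929, 1054)
human_low, human_high = wilson_interval(905, 1054)
print(f"DeBERTa: {deberta_low:.3f} to {deberta_high:.3f}")
print(f"Human:   {human_low:.3f} to {human_high:.3f}")
```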

Keywords:  LLM; automation tools; evidence synthesis; large language models; methodology

DOI:  https://doi.org/10.2196/77943

Value Health. 2026 Jun 13. pii: S1098-3015(26)02485-X. [Epub ahead of print]

Development and Validation of an AI Tool for Automated PICO Scoping for the European Joint Clinical Assessment (JCA): A Proof of Concept.

Bart-Jan Boverhof, Nikos Takatzoglou, Ken Redekop, Carin Uyl-de Groot, Jacob Jan Visser, Maureen Rutten-van Mölken.

   OBJECTIVE: The European Union Health Technology Assessment regulation (EU-HTAR) requires laborious PICO scoping (Patient population, Intervention, Comparator, Outcome) and consolidation across member states during joint clinical assessments (JCAs). We developed a tool to automate PICO extraction from HTA documents and clinical guidelines.
METHODS: We developed a proof-of-concept AI pipeline using retrieval-augmented generation and large language models to process documents, translate them to English, extract PICOs, and consolidate findings. The system was validated using two oncology cases: non-small cell lung cancer (NSCLC) and hepatocellular carcinoma (HCC). We curated datasets for both cases, comprising 49 reports totalling 5,463 pages from 16 unique countries. Performance was assessed using recall (sensitivity) and precision (positive predictive value) and compared against human data extraction.
RESULTS: The AI system demonstrated strong recall performance, achieving 0.807 for NSCLC and 0.837 for HCC. The AI outperformed human extraction in the NSCLC case (0.81 versus 0.71 recall) while matching human performance in HCC (0.84 versus 0.81). Extraction of the comparator element showed high recall (0.911 NSCLC; 0.967 HCC). The AI pipeline required 96 minutes of computational time beyond translation, saving approximately 25 hours of human screening work.
CONCLUSIONS: We showed that AI-based PICO extraction was feasible and valuable in two use-cases. The system showed good recall and had the potential to reduce workload in the JCA. It may be useful for the assessors and co-assessors in the EU-HTAR process, the industry in gaining insight into PICO expectations, and national HTA bodies in understanding the expected PICO wishes of other nations.

Keywords:  AI; Automation; European Union HTA Regulation (EU-HTAR); Generative-AI; Joint Clinical Assessment (JCA); LLMs; PICO scoping

DOI:  https://doi.org/10.1016/j.jval.2026.05.009

J Am Med Inform Assoc. 2026 Jun 17. pii: ocag108. [Epub ahead of print]

Large language models for full-text methods assessment: a case study on mediation analysis.

Wenqing Zhang, Trang Nguyen, Elizabeth A Stuart, Yiqun T Chen.

   OBJECTIVE: Systematic reviews remain labor-intensive, particularly when extracting methodological details from full texts. Using mediation analysis as a case study, we evaluated whether large language models (LLMs) can match human-expert-level full-text methodological review on key causal assumptions (eg, no unmeasured confounding, temporal ordering) and best practices (eg, sensitivity analyses, interaction assessments, covariate adjustment) for psychiatry and psychology studies.
MATERIALS AND METHODS: We evaluated 6 LLMs from 3 major families (ChatGPT-4o-mini/4o/o3/5, Claude Sonnet 4, Gemini 2.5 Flash) on 180 full-text mediation analysis articles from 2013 to 2018 previously reviewed by expert methodologists. LLMs assessed 14 binary methodological criteria ranging from straightforward checks (eg, whether the exposure was randomized) to nuanced assessments (eg, whether the temporal ordering between mediator and outcome was established). Performance was benchmarked against expert consensus labels and individual reviewers using accuracy, precision, recall, F1, AUC, and PR-AUC.
RESULTS: LLM performance strongly correlated with human reviewers across methodological criteria (accuracy correlation 0.71; F1 correlation 0.95), indicating that tasks difficult for humans were likewise challenging for models. Advanced LLMs achieved near-human accuracy on explicit methodological features but lagged behind top reviewers by up to 15% on inference-intensive tasks. Longer documents reduced model accuracy. Common model errors include overinterpreting linguistic cues and colloquial uses of technical terms.
DISCUSSION AND CONCLUSION: Our findings support a criterion-specific human-AI collaboration strategy for full-text methodological assessment and provide a reproducible framework for future testing in other evidence-synthesis settings.

Keywords:  benchmarking; causal inference; large language models; mediation analysis; systematic reviews

DOI:  https://doi.org/10.1093/jamia/ocag108

Campbell Syst Rev. 2026 Jun;22(2): 18911803261449731

Information Specialist Roles in the Era of Large Language Models: Prompting Continued Professional Development.

Hannah O'Keefe, Claire H Eastaugh, Sheila A Wallace, Fiona R Beyer.

  Prompt engineering is the formation of queries or instructions (prompts) that are deployed in large language models. These prompts are often underpinned by frameworks designed to give structure and encourage robust answers. Discussions in recent information specialists' networks and events have repeatedly highlighted that information specialists are well placed to undertake prompt engineering tasks. However, there is little published information outlining why and how information specialists are best placed for these tasks, and the shared understanding among information specialists has not filtered out to the wider research synthesis community, so progress in this area is slow. Here, we discuss the parallels between information specialist tasks and large language model engineering tasks and demonstrate that the parallels run deeper than just prompts. There are strong similarities between information retrieval and context engineering, prompt engineering, and vibe coding. In the briefest sense, we can consider context engineering to be like a search platform, prompt engineering like a structured search strategy, and vibe coding like a search engine input. Knowledge sharing and dissemination of these core concepts amongst information specialists and research synthesists will drive methods development, particularly with the rise of large language models in synthesis automation; create potential for continual professional development courses and e-learning; and expand the roles of information specialists. To initiate progress in this area, we discuss the anticipated future direction of information specialist roles.

Keywords:  artificial intelligence; evidence synthesis; information retrieval; prompt engineering

DOI:  https://doi.org/10.1177/18911803261449731

Rev Assoc Med Bras (1992). 2026 ;pii: S0104-42302026000402201. [Epub ahead of print]72(4): e20250910

ChatGPT assisted generation of systematic review ideas in urooncology.

Ahmet Emin Dogan, Gorkem Ozenc.

OBJECTIVE: The aim of this study was to analyze the performance of ChatGPT in the generation of new systematic review ideas in urooncology.
METHODS: In September 2024, we requested ChatGPT Version 4.0 to generate 10 systematic review ideas in general urooncology and also 10 ideas for each of the four subcategories: bladder cancer, prostate cancer, renal cell carcinoma, and testicular cancer. We utilized PubMed and Scopus to examine 50 ideas to determine whether prior systematic reviews had addressed them. Novelty was defined as a topic without prior systematic review.
RESULTS: ChatGPT generated 30% original systematic review ideas, with 15 out of 50 being novel. The novelty rate for general urooncology was 50%. The rates for subcategories were 10% for both bladder and prostate cancer, 50% for renal cell carcinoma, and 30% for testicular malignancies. Approximately 10% of general research concepts external to systematic reviews were novel.
CONCLUSION: ChatGPT performed very well in producing creative, apt, and partially viable systematic review ideas in urooncology. Although human judgment is still required to determine feasibility and create proposals with better accuracy, ChatGPT and other large language models can be useful aids while designing research, especially for dynamic and information-heavy disciplines such as urooncology.

DOI:  https://doi.org/10.1590/1806-9282.20250910

J Clin Epidemiol. 2026 Jun 17. pii: S0895-4356(26)00259-3. [Epub ahead of print] 112383

Large language models for systematic reviews were reported to perform well but rarely with verifiable safeguards: a cross-sectional study.

Honghao Lai, Bernardo Sousa-Pinto, Christian Cao, David Moher, Janne Estill, Jiayi Liu, Weilong Zhao, Yutong Wang, Ziying Ye, Bo Tong, Zhenhua Yang, Xufei Luo, Bingyi Wang, Yimeng Li, Pan Bei, Lu Zhang, Jinhui Tian, Yaolong Chen, Nannan Shi, Long Ge.

   OBJECTIVE: The application of large language models (LLMs) to systematic review tasks is rapidly expanding, yet the transparency and methodological rigor of these evaluations remain unclear. We aimed to assess reporting transparency, methodological quality, and how authors frame claims and caveats in studies applying LLMs to systematic review tasks.
STUDY DESIGN AND SETTING: We conducted a cross-sectional meta-epidemiological study by searching PubMed, Embase, Web of Science Core Collection, IEEE Xplore, and five other databases from inception to Dec 1, 2025, for peer-reviewed articles and preprints. We included empirical studies evaluating generative transformer-based LLMs (e.g., Gemini) for core systematic review tasks (e.g., screening, data extraction) against a reference standard. We assessed reporting transparency using an adapted Chatbot Assessment Reporting Tool (CHART) and methodological quality using an adapted Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool. We also analyzed the frequency and strength of claims and caveats mentioned by the authors. The study is registered with the Open Science Framework (https://osf.io/8edhb).
RESULTS: We identified and included 229 studies comprising 440 empirical tasks. Reporting transparency was moderate, with a mean item score of 0.52 (SD 0.30) on a 0-1 scale, where higher values indicate more complete reporting. We observed substantial gaps in reproducibility-essential domains, including protocol information (mean score 0.12) and model details (0.30). Although 60.6% of assessments were rated as having a low risk, key safeguards against overfitting and data leakage were rarely reported; for example, locking the test set before prompt optimisation, a basic protection against information leakage, was not reported in 99.8% of tasks. We identified 837 claims and 693 caveats. Authors framed claims weakly more often than strongly (66.8% vs 33.2%). Performance superiority over a comparator was the most common claim (64.8% of tasks). Readiness for practical use was claimed in 47.5% of tasks, almost always in qualified terms (93.3%).
CONCLUSIONS: Studies applying LLMs to systematic review tasks are reported with moderate transparency but often omit reproducibility-critical details necessary to assess leakage and overfitting. While authors frequently make claims about performance and practice readiness, these are typically expressed cautiously. Improved reporting standards and clearer safeguards are urgently needed before routine use of LLMs in evidence synthesis can be recommended.
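
One way to make "locking the test set before prompt optimisation" verifiable is to publish a fingerprint of the held-out record identifiers before any prompt tuning begins. The sketch below is an illustration of that idea only; it is not a procedure described in the study or required by CHART, and the function names and identifiers are hypothetical.

```python
import hashlib
import json


def lock_test_set(record_ids):
    """Return a SHA-256 fingerprint of the held-out record IDs.

    IDs are sorted first so the fingerprint does not depend on order.
    The fingerprint can be registered (for example, in a protocol)
    before any prompt optimisation starts.
    """
    canonical = json.dumps(sorted(record_ids), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_test_set(record_ids, fingerprint):
    """Check that the evaluated test set matches the registered fingerprint."""
    return lock_test_set(record_ids) == fingerprint


# Hypothetical identifiers for illustration.
registered = lock_test_set(["pmid:3", "pmid:1", "pmid:2"])
print(verify_test_set(["pmid:1", "pmid:2", "pmid:3"], registered))
print(verify_test_set(["pmid:1", "pmid:2", "pmid:4"], registered))
```

A fingerprint shows that the test set was fixed in advance; it does not by itself show that the records were never seen during optimisation.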

Keywords:  evidence synthesis; large language models; meta-epidemiological study; methodological quality; reporting transparency; systematic review

DOI:  https://doi.org/10.1016/j.jclinepi.2026.112383

BMC Med Res Methodol. 2026 Jun 16. pii: 137. [Epub ahead of print]26(1):

The use and methodological reporting of large language models in qualitative research: a scoping review.

Christian Kempny, Julian Frings, Paul Rust, Sven Meister, Leonard Fehring.

   BACKGROUND: Large language models (LLMs) are being integrated into qualitative research processes, yet the scope, function, and reporting quality of their use remain poorly understood. Existing reporting guidelines for qualitative research, including, for example, the Consolidated Criteria for Reporting Qualitative Research (COREQ), provide minimal guidance for documenting LLM use. This scoping review provides an overview of the emerging use of LLM applications in qualitative research and assesses the associated reporting practices.
METHODS: A scoping review was conducted following the PRISMA-ScR guidelines and the Joanna Briggs Institute methodological framework. Five databases (PubMed, CINAHL, PsycINFO, Business Source Premier, and Scopus) were searched for peer-reviewed empirical studies published between January 2020 and May 2025 that employed at least one LLM in a substantive qualitative research stage. The search yielded 5,049 records, of which 4,201 remained after duplicate removal. Studies were screened independently by multiple reviewers, and data were extracted using a standardized template capturing study metadata, methodological characteristics, and comprehensive LLM implementation details.
RESULTS: Seventy-five studies were included. OpenAI GPT models dominated the field, appearing in 93% of studies. LLMs were applied across the full spectrum of qualitative research, with coding assistance (n = 43) and theme identification (n = 41) as the most common applications. Thematic analysis was the predominant qualitative method (n = 38), followed by content analysis (n = 12). Technical reporting was highly inconsistent: only 13 studies reported temperature settings, 12 documented context length, and 4 provided top_p values. Approximately half of studies (45%, n = 34) did not specify the deployment configuration (API, web interface, or local), and 75% (n = 56) reported no parameter settings at all. While 61% of studies provided complete or partial prompts, 13% reported no prompting details. Agreement rates between LLM and human coders ranged from 36% to 99%, reflecting substantial variation related to task complexity, prompt engineering quality, and validation rigor. Nearly all studies (95%) discussed ethical considerations, and 97% incorporated human verification of AI outputs.
DISCUSSION: LLMs have been adopted across qualitative research workflows, yet critical methodological details are frequently underreported, undermining comparability. The findings highlight an urgent need for dedicated reporting guidelines, such as the COREQ + LLM extension, to ensure that LLM-assisted qualitative research meets standards of transparency, rigor, and interpretive depth. Future research should address the predominance of proprietary models, the limited evidence for non-English contexts, and the need for systematic comparison of models, prompting strategies, and validation approaches.

Keywords:  Artificial intelligence; COREQ; Human-AI collaboration; Large language models; Methodological transparency; Prompt engineering; Qualitative research; Reporting guidelines; Scoping review; Thematic analysis

DOI:  https://doi.org/10.1186/s12874-026-02913-1
Integr Med Res. 2026 Sep;15(3Part B): 101351

Design and methodology of the AI-empowered Clinical Evidence for Integrated Chinese-Western Medicine (ACE-iMed) platform.

Hui Liu, Ke Xu, Jie Zhang, Shouyuan Wu, Yishan Qin, Yanfang Ma, Xuan Yu, Huayu Zhang, Haodong Li, Meihua Wu, Zijing Wang, Xufei Luo, Bingyi Wang, Yuanyuan Yao, Yandong Feng, Luyuan Sun, Mengyue Dong, Yingjie Hong, Jiayi Liu, Rui Yang, Yiming Hu, Honghao Lai, Qi Zhou, Xuefeng Li, Long Ge, Yaolong Chen, Zhaoxiang Bian.

   Background: Integrated Chinese-Western medicine (ICWM) is a distinctive medical system that plays an important role in healthcare and has received increasing attention in recent years. To facilitate the dissemination of evidence in ICWM, we developed an Artificial Intelligence (AI)-empowered Clinical Evidence for Integrated Chinese-Western Medicine (ACE-iMed) platform.
Methods: A multidisciplinary working group was established, including individuals with professional backgrounds in evidence-based medicine methodology, Chinese medicine (CM), Western medicine (WM), and ICWM clinical practice and research, and computer science. Through multiple rounds of discussions, the working group defined the framework and methodology of the platform, and then applied the platform to summarize evidence for eight diseases.
Results: The ACE-iMed platform (website: www.aceimed.org) contains two interfaces. The first enables the developers to store and screen the literature, perform methodological quality assessments, and generate evidence summaries. The AI-empowered workflows showed good consistency and stability across multiple stages, including literature screening and assessment of risk of bias/methodological quality, and effectively supported the summarization of evidence for eight diseases. The second interface, intended for end users, provides synchronized access to the included literature and the generated summaries, enabling quick access to clinical question-oriented evidence resources.
Conclusion: This study introduces an AI-empowered, clinical question-oriented ICWM evidence platform. Application across eight diseases demonstrated the platform's feasibility and practical utility. The platform not only supports the developers in summarizing evidence but also provides end users with a potential pathway to access evidence and its summaries.

Keywords:  ACE-iMed platform; Artificial intelligence; Chinese medicine; Evidence; Integrated Chinese-Western medicine; Large language model

DOI:  https://doi.org/10.1016/j.imr.2026.101351
Front Artif Intell. 2026 ;9 1818128

Evidence-based AI: from trailblazer to trustblazer?

Thomas Luechtefeld, Thomas Hartung.

  Agentic AI systems can plan, call tools, and coordinate specialized sub-agents, enabling multi-step scientific workflows that exceed what single-model text generation can reliably deliver. Yet in high-stakes domains such as regulatory science and toxicology, fluent outputs are not sufficient: adoption hinges on traceability, reproducibility, context-of-use validity, and explicit uncertainty communication. This perspective argues that evidence-based medicine and evidence-based toxicology provide a mature epistemic scaffold for making agentic AI trustworthy by design. We propose an Evidence-based Agent Stack that decomposes end-to-end tasks into protocolized roles (question framing, retrieval, screening, extraction, risk-of-bias appraisal, synthesis, mechanistic/causal integration, uncertainty assessment, and evidence-to-decision translation) with mandatory provenance and versioning. Anchoring agentic workflows in systematic review practice, risk-of-bias frameworks, and emerging regulatory principles (e.g., TREAT and e-validation) can turn "trailblazing" AI into "trustblazing" AI: systems whose outputs are auditable, updateable, and aligned with decision accountability.

Keywords:  agentic AI; e-validation; evidence-based medicine; evidence-based toxicology; regulatory science; retrieval-augmented generation; risk of bias; systematic review

DOI:  https://doi.org/10.3389/frai.2026.1818128
BMC Med Res Methodol. 2026 Jun 15.

Automated tools for evidence quality assessment: a scoping review.

Jiayi Huang, Xinxin Deng, Liying Zhou, Junliang Tao, Cui Liang, Kehu Yang, Xiuxia Li.

   BACKGROUND: Evidence quality assessment is critical for informed public health decision-making, but manual approaches are time-consuming and subject to variability. Automated support tools have been proposed to improve efficiency and consistency, yet their current status has not been comprehensively or systematically mapped.
OBJECTIVE: To identify and map the characteristics, performance, and limitations of existing automated tools for evidence quality assessment.
METHODS: Following the JBI methodology and PRISMA-ScR checklist, we searched 6 English and 4 Chinese databases from their inception to February 9, 2025, to identify studies evaluating automated tools for evidence quality assessment. Eligible studies included original research on tool development, application, or validation. Study characteristics (e.g., year, country, design, tool type, technical features, reliability, and validity) were extracted and summarized descriptively.
RESULTS: Twenty studies were included, most from the United Kingdom (30%), Canada (25%), and Australia (15%). Observational designs predominated (75%), with only 10% randomized controlled trials (RCTs). Twelve distinct tools were identified, of which 65% were publicly available. Of the tools, 58% were developed for RCTs, while 50% remained experimental and required human oversight. Reported outcomes focused on sensitivity, specificity, precision, efficiency, and consistency. Despite promising results, external validity and scalability were limited.
CONCLUSION: Automated tools for evidence quality assessment show potential to enhance efficiency and consistency but remain restricted in applicability. Current tools are often tailored to clinical trials and require human supervision. Broader adaptation and rigorous validation are needed before such tools can be widely integrated into public health decision-making.

Keywords:  Automated Tools; Critical Appraisal; Decision Support; Evidence Quality Assessment; GRADE; Public Health Decision-Making; Scoping Review

DOI:  https://doi.org/10.1186/s12874-026-02868-3
JAMIA Open. 2026 Jun;9(3): ooag078

Enhancing the quality and trustworthiness of large language model-generated summaries of clinical oncology literature.

Arnulf Stenzl, Eamonn Rogers, Sophia Ananiadou, Yanshan Wang, Andrew J Armstrong, Andrea Sboner, Giovanni Cacciamani, Bob J A Schijvenaars, Kausar Riaz Ahmed, Hanna Thomsen, Timothy Wiemken, Antonio Campello, Cora N Sternberg.

   Objectives: This study evaluated the quality and trustworthiness of large language model (LLM)-generated scientific and plain language summaries (PLS) from clinical oncology literature, focusing on faithfulness (absence of hallucinations), relevance, and readability.
Materials and Methods: Ten LLM-generated scientific summaries and PLS from the INSIDE (artificial INtelligence to Support Informed DEcision making) prostate cancer dataset were evaluated. For comparison, expert-written PLS from the BioLaySumm dataset were used. A panel of 5 LLMs and 3 human experts verified faithfulness. Verification was performed on original facts and facts modified with varying levels of error (subtle, moderate, contradictory). Readability was assessed using Flesch-Kincaid Reading Ease (FRE) scores.
Results: Fact verification against the summaries was ∼100%, confirming accurate fact extraction. LLM panel vs human panel agreement was substantial (kappa 0.67), outperforming agreement among the interhuman (0.43 [95% CI, 0.34-0.52]) and inter-LLM (0.40 [0.38-0.42]) panels. Large language model scientific summaries showed high faithfulness (88.9% [88.0-89.8]) and low hallucinations (9.6% [6.5-12.7]) compared to human-written PLS (61.6% [60.1-63.1] faithfulness; 40.6% [37.8-43.4] hallucinations). The LLMs detected errors sensitively with scores decreasing as fact modifications became more severe. Finally, LLM-generated PLS were more readable than human-written versions (FRE 42.3 [interquartile range, IQR 35.27-49.41] vs 28.8 [IQR 21.02-36.18]).
Discussion: A panel of LLMs reliably assessed the faithfulness of scientific summaries to their original source and thus can help increase reliability for clinical use. The lower faithfulness in human-written PLS likely reflects extrinsic hallucinations added for context.
Conclusion: The study demonstrates a novel approach to automatically assess the quality and trustworthiness of LLM-generated scientific and PLS via faithfulness, relevance, and readability.

Keywords:  artificial intelligence; hallucinations; large language models; oncology literature; scientific summaries

DOI:  https://doi.org/10.1093/jamiaopen/ooag078
Expert Rev Pharmacoecon Outcomes Res. 2026 Jun 16.

Challenges of using AI-based synthetic data in Health economics and Outcomes Research.

Arindam Saha, Zongliang Yue, Surachat Ngorsuraches.

   INTRODUCTION: The growing demand for artificial intelligence (AI)-generated synthetic data (SD) in health economics and outcomes research (HEOR) offers both opportunities and risks. As SD might be used in the near future to inform drug pricing, reimbursement, and policy decisions, a rigorous evaluation of its associated challenges is essential.
AREAS COVERED: This non-systematic narrative review provides a conceptual overview of the generation and application of SD in HEOR, based on targeted searches of PubMed and Google Scholar with priority given to publications from 2019 onwards. We then identified four interconnected challenges: bias as a foundational upstream driver, the privacy-utility trade-off, the absence of standardized human-in-the-loop evaluation, and underdeveloped regulatory and governance frameworks.
EXPERT OPINION: The use of SD might offer opportunities to improve data accessibility; however, its adoption as standalone evidence in healthcare decisions is constrained by the absence of HEOR-specific validation standards, equity-centered evaluation metrics, and regulatory guidance. A structured hybrid ecosystem that integrates SD with real-world evidence, supported by coordinated regulatory frameworks and equity impact assessments, will be the most responsible pathway toward meaningful adoption in HEOR.

Keywords:  Synthetic data; bias in healthcare data; generative artificial intelligence; healthcare synthetic data governance

DOI:  https://doi.org/10.1080/14737167.2026.2691184
Neural Netw. 2026 Jun 06. pii: S0893-6080(26)00621-0. [Epub ahead of print]204 109160

A framework for evaluating factual consistency in automated text summarization with large language models and prompting strategies.

Md Moinul Islam, Mourad Oussalah.

  The exponential growth of textual data has intensified the need for reliable automated text summarization (ATS) systems that can extract and synthesize knowledge while maintaining factual accuracy. Current evaluation frameworks for large language models (LLMs) in summarization tasks lack comprehensive assessment of factual consistency, particularly in knowledge engineering contexts where information integrity is paramount. This paper presents a comprehensive evaluation framework that systematically assesses factual consistency in LLM-generated summaries through advanced prompting strategies and multi-dimensional evaluation metrics. Our framework integrates five prompting methodologies (Zero-shot, Few-shot, Chain-of-Thought (CoT), Structured Chain-of-Thought (SCoT), and Chain-of-Verification (CoVe)) with state-of-the-art (SOTA) factuality assessment approaches, such as FActScore, LongDocFACTScore (LDFActs), and AlignScore, across eight LLMs and five diverse datasets spanning news, scientific literature, and conversational domains. Results demonstrate that Few-shot prompting achieves optimal performance across most domains except scientific literature, with LLMs consistently outperforming human-generated summaries. Our findings reveal trade-offs between completeness and precision, with models generating 2-10 times more atomic facts than human references while maintaining comparable or superior factual accuracy. The framework provides actionable insights for researchers developing reliable summarization systems, with open-source implementation available for reproducibility.

Keywords:  Factual consistency; Factuality metrics; Information extraction; Large language model; Prompting strategy; Text summarization

DOI:  https://doi.org/10.1016/j.neunet.2026.109160
PLoS One. 2026 ;21(6): e0351336

AI and social science: Automatic classification tools for big data analysis in sociological research.

Andrea Nucita, Assunta Penna, Antonia Cava, Giancarlo Iannizzotto, Massimo Mucciardi.

This study examines the use of Social Network Sites for public institutional communication through a sociological, data-driven lens, focusing on the challenges and potential of automated classification tools for data analysis. Although Large Language Models are increasingly used to process social media data, a key research gap remains: few studies systematically assess whether AI-based categorizations are as reliable as human coding, especially when categories are semantically ambiguous. The research addresses the following questions: How reliable are AI-generated classifications compared to those made by human experts? Is human-machine agreement comparable to the level of agreement observed among human coders? To experimentally test this approach, we conducted a case study on Facebook posts published by two Italian universities (March 2020-March 2023), classified into eight categories of public institutional communication. Three researchers independently annotated the dataset. Human annotations are used as a benchmark to assess agreement patterns and to compare them with classifications produced by AI-based systems. Results show substantial interpretive ambiguity across several categories, mirrored by variability among human coders. Nonetheless, automated models achieve agreement with human classifications that is broadly comparable to inter-coder agreement. Overall, the findings support integrating AI as an additional coder within hybrid workflows to enable scalable and transparent sociological analysis of complex social media data.

DOI: https://doi.org/10.1371/journal.pone.0351336