bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-06-08
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cochrane Evid Synth Methods. 2024 Jun;2(6): e12078
       Introduction: One of the main tasks in information retrieval is the development of Boolean search strategies for systematic searches in bibliographic databases. This includes the identification of free-text terms and controlled vocabulary. IQWiG has previously implemented its objective approach to search strategy development using fee-based text analysis software. However, this implementation is not fully automated due to a lack of technical options. The aim of our project was to develop, in R, a text analysis tool for building Boolean search strategies.
    Methods: We adopt an incremental approach to software development, with the first goal being to develop a minimum viable product for the previously defined use cases. To create an interactive user interface, we use the shiny framework.
    Results: Our newly developed shiny app searchbuildR is a text analysis tool with a point-and-click user interface that automatically extracts and ranks terms from the titles, abstracts, and MeSH terms of a given test set of PubMed records. It returns searchable, interactive tables of free-text and MeSH terms. Each free-text term can also be viewed in its original context in the full titles and abstracts or in a user-defined word window. In addition, 2-word combinations are extracted and provided as an interactive table to help the user identify free-text term combinations that can be searched with proximity operators in Boolean searches (a minimal sketch of this kind of term ranking follows the entry). The results can be exported to a CSV file. The new implementation with searchbuildR was evaluated by validating its text analysis results against those of the previously used fee-based software.
    Conclusions: IQWiG has developed the shiny app searchbuildR to support the development of search strategies in systematic reviews. It is open source and can be used by researchers and information specialists without extensive R or programming skills. The package code is openly available on GitHub at www.github.com/IQWiG/searchbuildR.
    Keywords:  data mining; evidence synthesis; information storage and retrieval; natural language processing; review literature as topic; systematic reviews as topic; user-centered design
    DOI:  https://doi.org/10.1002/cesm.12078
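    A minimal Python sketch of the kind of term ranking the entry describes. This is not searchbuildR itself (which is implemented in R with shiny); the tokenizer and stop-word list are illustrative assumptions. The idea is to count how many records each unigram and each adjacent 2-word combination occurs in, then rank them as candidate free-text terms and proximity-operator pairs:

import re
from collections import Counter

# Illustrative stop-word list; a real tool would use a much fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "an", "to", "for", "with", "on"}

def tokenize(text):
    """Lowercase a record and keep informative word tokens."""
    return [t for t in re.findall(r"[a-z0-9-]+", text.lower())
            if t not in STOPWORDS and len(t) > 2]

def rank_terms(records):
    """Rank unigrams and adjacent 2-word combinations by the number of
    records they occur in (document frequency over the test set)."""
    unigrams, bigrams = Counter(), Counter()
    for record in records:
        tokens = tokenize(record)
        unigrams.update(set(tokens))
        bigrams.update({(a, b) for a, b in zip(tokens, tokens[1:])})
    return unigrams, bigrams

records = [
    "Remote ischemic preconditioning in cardiac surgery: a randomized trial.",
    "Cardiac surgery outcomes after remote ischemic preconditioning.",
]
uni, bi = rank_terms(records)
print(uni.most_common(5))  # candidate free-text terms
print(bi.most_common(5))   # candidate pairs for proximity operators
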
  2. Cochrane Evid Synth Methods. 2025 Jan;3(1): e70012
       Introduction: Using machine learning functions, such as study design classifiers, to automatically identify studies that do not meet the inclusion criteria is one way to speed up the systematic review screening process. As a qualitative study design classifier has yet to be developed, using the Cochrane randomized controlled trial (RCT) classifier in reverse is one possible way to speed up the identification of primary qualitative studies during screening (see the sketch after this entry). The objective of this study was to evaluate whether the Cochrane RCT classifier can be used to speed up the study selection process for qualitative evidence synthesis (QES).
    Methods: We performed a retrospective evaluation where we first identified QES. We then extracted the bibliographic information of the included primary qualitative studies in each QES, and uploaded the references into our data management tool, EPPI-Reviewer. We then ran the Cochrane RCT classifier on each group of included studies for each QES.
    Results: Eighty-two QES with 2828 unique primary studies were included in the analysis. Of the primary studies, 56% were classified as unlikely to be an RCT and 40% as 0-9% likely to be an RCT. The remaining 4% were classified as being 10% or more likely to be an RCT. Of these, only 1.7% were classified as being 50% or more likely to be an RCT.
    Conclusions: The Cochrane RCT classifier could be a useful tool to identify primary studies with qualitative study designs and so speed up study selection in a QES. However, it is possible that mixed methods studies or qualitative studies conducted as part of a clinical trial may be missed. Further evaluations using the Cochrane RCT classifier on all the references retrieved from a complete literature search are needed to investigate time and resource savings.
    Keywords:  Cochrane RCT classifier; artificial intelligence; classification; machine learning; qualitative evidence synthesis; systematic review automation
    DOI:  https://doi.org/10.1002/cesm.70012
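    The "reverse" use of the classifier reduces to a threshold rule over its probability scores. A minimal Python sketch with hypothetical scores (the real classifier runs inside EPPI-Reviewer and its output format may differ):

def keep_for_qes(rct_probability, threshold=0.10):
    """Split records into QES candidates (low RCT probability) and records
    flagged as likely RCTs. The 10% threshold mirrors the bands reported
    in the entry and is an assumption, not a validated cut-off."""
    kept, flagged = [], []
    for record, prob in rct_probability.items():
        (kept if prob < threshold else flagged).append(record)
    return kept, flagged

# Hypothetical classifier scores for three screened records:
scores = {"study_A": 0.02, "study_B": 0.55, "study_C": 0.08}
kept, flagged = keep_for_qes(scores)
print(kept)     # ['study_A', 'study_C'] -> proceed to QES screening
print(flagged)  # ['study_B'] -> likely RCT; beware mixed-methods misses
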
  3. Cochrane Evid Synth Methods. 2023 Jul;1(5): e12021
      Evidence reviews are important for informing decision-making and primary research, but they can be time-consuming and costly. With the advent of artificial intelligence, including machine learning, there is an opportunity to accelerate the review process at many stages, with study screening identified as a prime candidate for assistance. Despite the availability of a large number of tools promising to assist with study screening, these are not consistently used in practice and there is skepticism about their application. Single-arm evaluations suggest the potential for tools to reduce screening burden. However, their integration into practice may need further investigation through evaluations of outcomes such as overall resource use and impact on review findings and recommendations. Because the literature lacks comparative studies, it is not currently possible to determine their relative accuracy. In this commentary, we outline the published research and discuss options for incorporating tools into the review workflow, considering the needs and requirements of different types of review.
    Keywords:  machine learning; rapid review; record screening; systematic review
    DOI:  https://doi.org/10.1002/cesm.12021
  4. Sci Rep. 2025 Jun 03. 15(1): 19379
      Systematic literature review (SLR) is an important tool for evidence synthesis in Health Economics and Outcomes Research (HEOR). SLRs involve identifying and selecting pertinent publications and extracting relevant data elements from full-text articles, which can be a manually intensive procedure. We previously developed machine learning models to automatically identify relevant publications based on pre-specified inclusion and exclusion criteria. This study investigates the feasibility of applying Natural Language Processing (NLP) approaches to automatically extract data elements from the relevant scientific literature. First, 239 full-text articles were collected and annotated for 12 important variables, including study cohort, lab technique, and disease type, to support SLR summaries of Human papillomavirus (HPV) Prevalence, Pneumococcal Epidemiology, and Pneumococcal Economic Burden. The three resulting annotated corpora are shared publicly at https://github.com/Merck/NLP-SLR-corpora to provide training data and a benchmark baseline for the NLP community to further research this challenging task. We then compared three classic Named Entity Recognition (NER) algorithms, namely Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT) models, on the data element extraction task. The annotated corpora contain 4,498, 579, and 252 entity mentions for the HPV Prevalence, Pneumococcal Epidemiology, and Pneumococcal Economic Burden tasks, respectively. Deep learning algorithms achieved superior performance in recognizing the targeted SLR data elements compared with conventional machine learning algorithms: LSTM models achieved micro-averaged F1 scores of 0.890, 0.646, and 0.615 on the three tasks, respectively (see the micro-F1 sketch after this entry), whereas CRF models could not provide comparable performance on most of the elements of interest. Although BERT-based models generally achieve superior performance on many NLP tasks, we did not observe an improvement on our three tasks. The LSTM model, in particular, is preferable for deployment in supporting HEOR SLR data element extraction because of its performance, generalizability, scalability, and cost-effectiveness on our SLR benchmark datasets.
    DOI:  https://doi.org/10.1038/s41598-025-03979-5
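    For readers unfamiliar with the reported metric, micro-averaged F1 pools true positives, false positives, and false negatives across all entity types before computing precision and recall. A small self-contained Python sketch with hypothetical entity mentions:

def micro_f1(gold, predicted):
    """Micro-averaged F1 over exact-match entity mentions, each represented
    as a (document_id, start, end, label) tuple."""
    tp = len(gold & predicted)   # exact matches
    fp = len(predicted - gold)   # spurious predictions
    fn = len(gold - predicted)   # missed gold mentions
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations for two documents:
gold = {("doc1", 10, 18, "COHORT"), ("doc1", 40, 52, "DISEASE"),
        ("doc2", 5, 12, "LAB_TECHNIQUE")}
pred = {("doc1", 10, 18, "COHORT"), ("doc2", 5, 12, "DISEASE")}
print(round(micro_f1(gold, pred), 3))  # 0.4
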
  5. J Dent. 2025 May 29. pii: S0300-5712(25)00290-8. [Epub ahead of print] 105846
       INTRODUCTION: Data extraction for systematic reviews is a time-consuming and error-prone step.
    OBJECTIVE: This study aimed to evaluate the agreement between artificial intelligence and human data extraction methods.
    METHODS: Studies published in seven orthodontic journals between 2019 and 2024 were retrieved and included. Two independent reviewers extracted fifteen data sets from each study, both manually and using the Microsoft Bing AI-based tool. Files in Portable Document Format were uploaded to the AI-based tool, and specific data were requested through its chat feature. The association between the data extraction methods and study characteristics was examined, and agreement was evaluated using intraclass correlation and kappa statistics (see the agreement sketch after this entry).
    RESULTS: A total of 300 orthodontic studies were included. Slight differences between the human and AI-based data extraction methods were observed for publication years and study designs, though these were not statistically significant. Minor, non-significant inconsistencies were also found in the extraction of the number of trial arms and the mean age of participants per group. The AI-based tool was less effective in extracting variables related to the study design (P = 0.017) and the number of centers (P < 0.001). Agreement between the human and AI-based extraction methods was slight (0.16) for the type of study design, moderate (0.45) for study design classification, and substantial to perfect (0.65-1.00) for most other variables.
    CONCLUSION: AI-based data extraction, while effective for straightforward variables, is not fully reliable for complex data extraction. Human input remains essential for ensuring accuracy and completeness in systematic reviews.
    CLINICAL SIGNIFICANCE: AI-based tools can effectively extract straightforward data, potentially reducing the time and effort required for systematic reviews. This can help clinicians and researchers process large volumes of data more efficiently. However, it is important to retain human supervision to maintain the integrity and reliability of clinical evidence.
    Keywords:  AI; data extraction; systematic review
    DOI:  https://doi.org/10.1016/j.jdent.2025.105846
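    As background on the agreement statistics, Cohen's kappa corrects raw percent agreement for agreement expected by chance; on the usual Landis-Koch scale, 0.16 is "slight" and 0.61-0.80 "substantial". A minimal Python sketch with hypothetical extractions:

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical study-design extractions from ten papers:
human = ["RCT", "cohort", "RCT", "case", "RCT",
         "cohort", "RCT", "case", "RCT", "cohort"]
ai    = ["RCT", "cohort", "RCT", "RCT", "RCT",
         "case", "RCT", "case", "RCT", "cohort"]
print(round(cohens_kappa(human, ai), 2))  # 0.67: substantial agreement
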
  6. Environ Evid. 2025 Jun 02. 14(1): 9
      Uptake of AI tools in knowledge production processes is growing rapidly. In this pilot study, we explore the ability of generative AI tools to reliably extract qualitative data from a limited sample of peer-reviewed documents in the context of the community-based fisheries management (CBFM) literature. Specifically, we evaluate the capacity of multiple AI tools to analyse 33 CBFM papers and extract relevant information for a systematic literature review, comparing the results to those of human reviewers. We address how well AI tools can discern the presence of relevant contextual data, whether the outputs of AI tools are comparable to human extractions, and whether question difficulty influences extraction performance (a generic extraction-prompt sketch follows this entry). While the AI tools we tested (GPT4-Turbo and Elicit) were not reliable in discerning the presence or absence of contextual data, at least one of the AI tools consistently returned responses on par with human reviewers. These results highlight the potential utility of AI tools in the extraction phase of evidence synthesis for supporting human-led reviews, while underscoring the ongoing need for human oversight. This exploratory investigation provides initial insights into the current capabilities and limitations of AI in qualitative data extraction within the specific domain of CBFM, laying groundwork for future, more comprehensive evaluations across diverse fields and larger datasets.
    Keywords:  Artificial Intelligence; Future of science; Large language models; Natural-language processing; Scientific publication; Systematic review
    DOI:  https://doi.org/10.1186/s13750-025-00362-9
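    The extraction pattern the entry evaluates can be sketched generically. The following Python fragment is not the study's pipeline; it shows one common design, asking the model to answer a predefined extraction question and to state explicitly when the text does not contain the answer (the failure mode the study found unreliable). It assumes the openai Python SDK and an API key in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(question, paper_text):
    """Ask the model one extraction question about one paper."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": ("You extract data for a systematic review. Answer "
                         "only from the provided text. If the information "
                         "is absent, reply exactly: NOT REPORTED.")},
            {"role": "user",
             "content": f"Question: {question}\n\nPaper text:\n{paper_text}"},
        ],
    )
    return response.choices[0].message.content

# answer = extract("Who enforces the community fishing rules?", full_text)
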
  7. J Glob Health. 2025 Jun 06. 15 03019
      To help achieve the goals of accountability and research excellence, funding organisations often draw on evidence from research priority setting exercises (RPSEs), which distil a systematic and 'objective' rank-order of research priorities from data gathered from relevant stakeholders. RPSEs are, however, costly and labour-intensive. Critics of RPSEs have also highlighted certain limitations: insufficient representation of difficult-to-reach stakeholders, especially in low- and middle-income countries; a lack of genuine stakeholder engagement; wide variation in how thoroughly exercises are documented; a lack of specificity in the identified priorities; and minimal impact of the priorities. Artificial intelligence (AI) tools such as ChatGPT may help, valuably complementing conventional RPSEs. While the opacity of AI decision-making is a limitation, the advantages include speed, affordability, and a highly inclusive distillation of the vastness of existing human knowledge. We encourage research identifying the extent to which AI can replicate conventional RPSEs. We suggest that AI tools could complement conventional approaches either at the initial question generation stage or by generating supplementary insights for reflection at the data analysis stage. Where stakeholder engagement is already high and conventional RPSEs are already prevalent, AI-only studies may also be valuable.
    DOI:  https://doi.org/10.7189/jogh.15.03019
  8. Cochrane Evid Synth Methods. 2024 Feb;2(2): e12041
       Introduction: Plain language summaries (PLSs) make complex healthcare evidence accessible to patients and the public. Large language models (LLMs) may assist in generating accurate, readable PLSs. This study explored using the LLM Claude 2 to create PLSs of evidence reviews from the Agency for Healthcare Research and Quality (AHRQ) Effective Health Care Program.
    Methods: We selected 10 evidence reviews published from 2021 to 2023, representing a range of methods and topics. We iteratively developed a prompt to guide Claude 2 in creating PLSs; the prompt included specifications for plain language, reading level, length, organizational structure, active voice, and inclusive language (an illustrative prompt-and-check sketch follows this entry). PLSs were assessed for adherence to the prompt specifications, comprehensiveness, accuracy, readability, and cultural sensitivity.
    Results: All PLSs met the word-count specification. We judged one PLS as fully comprehensive and seven as mostly comprehensive. We judged two PLSs as fully capturing the PICO elements and five as having minor PICO errors. We judged three PLSs as accurately reporting the results and four as having minor result errors; three PLSs had major result errors because they incorrectly reported the total number of participants. Five PLSs met the target 6th to 8th grade reading level. Passive voice use averaged 16%. All PLSs used inclusive language.
    Conclusions: LLMs show promise for assisting in PLS creation but likely require human input to ensure accuracy, comprehensiveness, and the appropriate nuances of interpretation. Iterative prompt refinement may improve results and address the needs of specific reviews and audiences. As text-only summaries, the AI-generated PLSs could not meet all consumer communication criteria, such as textual design and visual representations. Further testing should include consumer reviewers and explore how to best leverage LLM support in drafting PLS text for complex evidence reviews.
    DOI:  https://doi.org/10.1002/cesm.12041
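    The prompt-plus-verification loop the entry implies can be sketched as follows. The prompt text is illustrative rather than the study's actual prompt, `generate` stands in for any LLM call (Claude 2 in the study), and the readability check uses the third-party textstat package:

import textstat  # third-party package: pip install textstat

PROMPT_TEMPLATE = """Write a plain language summary of the review below.
Specifications:
- 6th to 8th grade reading level
- at most {max_words} words
- active voice and inclusive language
- structure: key question, what was found, what it means

Review:
{review_text}"""

def check_reading_level(summary, lo=6.0, hi=8.0):
    """Approximate US reading grade via the Flesch-Kincaid formula."""
    grade = textstat.flesch_kincaid_grade(summary)
    return lo <= grade <= hi

# draft = generate(PROMPT_TEMPLATE.format(max_words=850, review_text=text))
# if not check_reading_level(draft):
#     ...revise the prompt or regenerate, per the iterative development above
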
  9. Natl Sci Rev. 2025 Jun;12(6): nwaf169
      Literature research, which is vital for scientific work, faces the challenge of surging information volumes that exceed researchers' processing capabilities. This paper describes an automated review-generation method based on large language models (LLMs) to overcome efficiency bottlenecks and reduce cognitive load. Our statistically validated evaluation framework demonstrates that the generated reviews match or exceed manual quality, offering broad applicability across research fields without requiring user domain knowledge. Applied to propane dehydrogenation catalysts, the method was demonstrated in two ways: first, by generating comprehensive reviews from 343 articles spanning 35 topics; and second, by evaluating data-mining capabilities using 1041 articles for experimental catalyst property analysis. Through multilayered quality control, we effectively mitigated LLM hallucinations, with expert verification confirming accuracy and citation integrity and demonstrating, with 95% confidence, that the hallucination risk was reduced to below 0.5% (see the note after this entry). The released software application enables one-click review generation, enhancing research productivity and literature-recommendation efficiency while facilitating broader scientific exploration.
    Keywords:  automated review generation; large language models; literature analysis; scientific writing
    DOI:  https://doi.org/10.1093/nsr/nwaf169
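    A note on the reported bound: the abstract does not state the calculation, but a "<0.5% with 95% confidence" hallucination bound is consistent with the standard rule of three for zero observed events (this is an assumption about their method, not a claim from the paper). If expert checks find no hallucinations among n independently sampled statements, the one-sided 95% upper confidence bound on the true rate p solves (1-p)^n = 0.05:

p_{\text{upper}} = 1 - 0.05^{1/n} \approx \frac{3}{n},
\qquad
\frac{3}{n} < 0.005 \;\Longrightarrow\; n \gtrsim 600.

    So a bound below 0.5% corresponds to roughly 600 or more verified statements with no hallucination found.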
  10. Health Econ Rev. 2025 Jun 04. 15(1): 46
       BACKGROUND: Health Technology Assessment (HTA) is a crucial tool for evaluating the worth and roles of health technologies and providing evidence-based guidance for their adoption and use. Artificial intelligence (AI) can enhance HTA processes by improving data collection, analysis, and decision-making. This study aims to explore the opportunities and challenges of utilizing AI in HTA, with a specific focus on economic dimensions. By leveraging AI's capabilities, this research examines how innovative tools and methods can optimize economic evaluation frameworks and enhance decision-making processes within the HTA context.
    METHODS: This study adopted Arksey and O'Malley's scoping review framework and conducted a systematic search in PubMed, Scopus, and Web of Science databases. It examined the benefits and challenges of AI integration into HTA, with a focus on economic dimensions.
    FINDINGS: AI significantly enhances HTA outcomes by driving methodological advancements, improving utility, and fostering healthcare innovation. It enables comprehensive assessments through robust data systems and databases. However, ethical considerations such as biases, transparency, and accountability emphasize the need for deliberate planning and policymaking to ensure responsible integration within the HTA framework.
    CONCLUSION: AI applications in HTA have significant potential to enhance health outcomes and decision-making processes. However, the development of robust data management strategies and regulatory frameworks is essential to ensure effective and ethical implementation. Future research should prioritize the establishment of comprehensive frameworks for AI integration, fostering collaboration among stakeholders, and improving data quality and accessibility on an ongoing basis.
    Keywords:  Applications; Artificial intelligence; Economic evaluation; Health technology assessment; Policy-making
    DOI:  https://doi.org/10.1186/s13561-025-00645-4