bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-01-26
five papers selected by
Farhad Shokraneh



  1. J Clin Epidemiol. 2025 Jan 17. pii: S0895-4356(25)00005-8. [Epub ahead of print] 111672
       PURPOSE: Randomised controlled trials (RCTs) are the cornerstone of evidence-based medicine. Unfortunately, not all RCTs are based on real data. This serious breach of research integrity compromises the reliability of systematic reviews and meta-analyses, leading to misinformed clinical guidelines and posing a risk to both individual and public health. While methods to detect problematic RCTs have been proposed, they are time-consuming and labour-intensive. The use of artificial intelligence, in the form of large language models (LLMs), has the potential to accelerate the data collection needed to assess the trustworthiness of published RCTs.
    METHODS: We present a case study using ChatGPT powered by OpenAI's GPT-4o to assess an RCT paper. The case study focuses on applying the TRACT Checklist and automating data table extraction to accelerate statistical analysis targeting the trustworthiness of the data. We provide a detailed step-by-step outline of the process, along with considerations for potential improvements.
    RESULTS: ChatGPT completed all tasks by processing the PDF of the selected publication and responding to specific prompts. ChatGPT addressed items in the TRACT checklist effectively, demonstrating an ability to provide precise 'yes' or 'no' answers while quickly synthesizing information from both the paper and relevant online resources. A comparison of results generated by ChatGPT and the human assessor showed an 84% (16/19) level of agreement on TRACT items. This substantially accelerated the qualitative assessment process. Additionally, ChatGPT was able to efficiently extract the data tables as Microsoft Excel worksheets and reorganize the data, with three out of four extracted tables achieving an accuracy score of 100%, facilitating subsequent analysis and data verification.
    CONCLUSION: ChatGPT demonstrates potential in semi-automating the trustworthiness assessment of RCTs, though in our experience this required repeated prompting from the user. Further testing and refinement will involve applying ChatGPT to collections of RCT papers to improve the accuracy of data capture and lessen the role of the user. The ultimate aim is a completely automated process for large volumes of papers, which seems plausible given our initial experience.
    Keywords:  RCT; artificial intelligence; data integrity; trustworthiness assessment
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.111672
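    The study above worked through ChatGPT's web interface rather than code, but the same checklist-style prompting can be scripted. The sketch below is illustrative only, assuming the OpenAI Python SDK and the pypdf package; the two checklist questions are placeholders, not actual TRACT items.

    # Illustrative sketch, not the authors' workflow: the study used the ChatGPT
    # web interface, whereas this assumes the OpenAI Python SDK and the pypdf
    # package. The checklist questions are placeholders, not real TRACT items.
    from openai import OpenAI
    from pypdf import PdfReader

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def paper_text(pdf_path: str) -> str:
        """Extract plain text from the trial report PDF."""
        reader = PdfReader(pdf_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    def assess_item(text: str, item: str) -> str:
        """Ask the model for a yes/no judgement on one checklist item."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You assess the trustworthiness of randomised trial "
                            "reports. Answer 'yes' or 'no' with a one-line reason."},
                {"role": "user", "content": f"{item}\n\nTRIAL REPORT:\n{text[:100000]}"},
            ],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        text = paper_text("trial.pdf")  # hypothetical file name
        for item in ["Was the trial prospectively registered?",        # placeholder item
                     "Are the baseline data internally consistent?"]:  # placeholder item
            print(item, "->", assess_item(text, item))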
  2. J Am Med Inform Assoc. 2025 Jan 21. pii: ocae325. [Epub ahead of print]
       OBJECTIVE: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process.
    MATERIALS AND METHODS: A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, i.e., the total number of correct responses divided by the total number of responses, was computed to assess performance.
    RESULTS: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76.
    DISCUSSION: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy.
    CONCLUSION: Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly "living" systematic reviews.
    Keywords:  data extraction; large language models; meta-analysis; natural language processing; systematic review
    DOI:  https://doi.org/10.1093/jamia/ocae325
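    A minimal sketch of the 2-reviewer logic described above, assuming ask_a and ask_b are caller-supplied functions wrapping GPT-4-turbo and Claude-3-Opus that return a short string answer for one variable; the function names and the critique prompt are illustrative, not the authors' code.

    # Sketch of the concordance / cross-critique logic, under the assumptions
    # stated above; ask_a and ask_b stand in for GPT-4-turbo and Claude-3-Opus.

    def reconcile(text, variable, ask_a, ask_b):
        """Return (answer, concordant) for one variable extracted from one paper."""
        a = ask_a(text, f"Extract the value of '{variable}'.")
        b = ask_b(text, f"Extract the value of '{variable}'.")
        if a.strip().lower() == b.strip().lower():
            return a, True                      # concordant: accept directly
        # Discordant: give each model the other's answer and ask it to reconsider.
        critique = (f"For '{variable}' you answered '{{own}}', but another reviewer "
                    f"answered '{{other}}'. Re-read the text and give a final answer.")
        a2 = ask_a(text, critique.format(own=a, other=b))
        b2 = ask_b(text, critique.format(own=b, other=a))
        if a2.strip().lower() == b2.strip().lower():
            return a2, True                     # resolved by cross-critique
        return a2, False                        # still discordant: refer to a human

    def accuracy(responses, gold):
        """Correct responses divided by total responses, as defined in the abstract."""
        return sum(r == g for r, g in zip(responses, gold)) / len(gold)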
  3. Reg Anesth Pain Med. 2025 Jan 19. pii: rapm-2024-106231. [Epub ahead of print]
       BACKGROUND: This study evaluated the effectiveness of large language models (LLMs), specifically ChatGPT 4o and a custom-designed model, Meta-Analysis Librarian, in generating accurate search strings for systematic reviews (SRs) in the field of anesthesiology.
    METHODS: We selected 85 SRs from the top 10 anesthesiology journals, according to Web of Science rankings, and extracted reference lists as benchmarks. Using study titles as input, we generated four search strings per SR: three with ChatGPT 4o using general prompts and one with the Meta-Analysis Librarian model, which follows a structured Population, Intervention, Comparator, Outcome (PICO)-based approach aligned with Cochrane Handbook standards. Each search string was used to query PubMed, and the retrieved results were compared with the PubMed-retrieved studies from the original search string in each SR to assess retrieval accuracy. Statistical analysis compared the performance of each model.
    RESULTS: Original search strings demonstrated superior performance, with a 65% (IQR: 43%-81%) retrieval rate that was statistically significantly different from both LLM groups in PubMed-retrieved studies (p=0.001). The Meta-Analysis Librarian achieved a higher median retrieval rate than ChatGPT 4o (median (IQR): 24% (13%-38%) vs 6% (0%-14%), respectively).
    CONCLUSION: The findings of this study highlight the significant advantage of original search strings over LLM-generated search strings for retrieving studies from PubMed. The Meta-Analysis Librarian demonstrated notable superiority in retrieval performance compared with ChatGPT 4o. Further research is needed to assess the broader applicability of LLM-generated search strings, especially across multiple databases.
    Keywords:  Methods; Nerve Block; Technology
    DOI:  https://doi.org/10.1136/rapm-2024-106231
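    The benchmarking step implied above (run a candidate search string against PubMed and check how many of the review's reference PMIDs it retrieves) can be sketched as follows, assuming Biopython's Entrez module; the exact scoring the authors used may differ.

    # Sketch only, assuming Biopython (Bio.Entrez); the scoring is the simple
    # "share of benchmark PMIDs retrieved" described in the abstract.
    from Bio import Entrez

    Entrez.email = "you@example.org"  # NCBI requires a contact address; placeholder

    def pubmed_pmids(query: str, retmax: int = 10000) -> set:
        """Return the set of PMIDs PubMed reports for a search string."""
        handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
        record = Entrez.read(handle)
        handle.close()
        return set(record["IdList"])

    def retrieval_rate(query: str, benchmark_pmids: set) -> float:
        """Fraction of the review's benchmark references that the query retrieves."""
        retrieved = pubmed_pmids(query)
        return len(retrieved & benchmark_pmids) / len(benchmark_pmids)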
  4. MethodsX. 2025 Jun;14:103129
      Researchers today face significant challenges reshaping the landscape of academic, government, and industry research due to the exponential growth of global research outputs and the advent of Generative Artificial Intelligence (GenAI). The annual increase in published works has made it difficult for traditional literature review and data analysis methods to keep pace, often rendering reviews outdated by the time of publication. In response, this methods article introduces a suite of new tools designed to automate a number of stages of systematic literature reviews. Designated SPARK (Systematic Processing and Automated Review Kit), the computational approaches presented in this article automate the collection, organisation, and filtering of journal articles, alongside a data extraction scaffolding technique, for use in a systematic literature review on trauma-informed policing. As global research outputs rise, so does the need for automated methods. This paper highlights how these methods can enhance research efficiency and impact.
    •Hard-coded tools can be utilised to automate research.
    •Hard-coded tools do not carry the dangers of 'hallucinations' that GenAI-infused tools may.
    •Hard-coded automation tools allow researchers to keep up to date with contemporary research outputs while maintaining a high level of control in the research process.
    Keywords:  Automation; Scopus; Google; LDA Topic Modelling; Python; SPARK: Systematic Processing and Automated Review Kit; Systematic literature review; Web of Science
    DOI:  https://doi.org/10.1016/j.mex.2024.103129
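    SPARK itself is only described, not reproduced, above; the sketch below simply illustrates the kind of hard-coded, rule-based filtering the abstract contrasts with GenAI, screening a Scopus or Web of Science CSV export with deterministic keyword rules. The terms and column names are placeholders.

    # Illustrative hard-coded screening, not SPARK's actual code; keyword lists
    # and CSV column names ("Title", "Abstract") are assumptions.
    import csv

    INCLUDE_TERMS = ("trauma-informed", "police", "policing")
    EXCLUDE_TERMS = ("editorial", "erratum")

    def keep(record: dict) -> bool:
        """Deterministic include/exclude decision on title plus abstract."""
        text = f"{record.get('Title', '')} {record.get('Abstract', '')}".lower()
        if any(term in text for term in EXCLUDE_TERMS):
            return False
        return any(term in text for term in INCLUDE_TERMS)

    def filter_export(in_csv: str, out_csv: str) -> int:
        """Filter a database export and write the retained rows; return the count."""
        with open(in_csv, newline="", encoding="utf-8") as src:
            rows = [r for r in csv.DictReader(src) if keep(r)]
        with open(out_csv, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=rows[0].keys() if rows else ["Title"])
            writer.writeheader()
            writer.writerows(rows)
        return len(rows)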
  5. ACS Environ Au. 2025 Jan 15. 5(1): 61-68
      Methods to quantitatively synthesize findings across multiple studies are an emerging need in wastewater-based epidemiology (WBE), where disease tracking through wastewater analysis is performed at broad geographical locations using various techniques to facilitate public health responses. Meta-analysis provides a rigorous statistical procedure for research synthesis, yet the manual process of screening large volumes of literature remains a hurdle for its application in timely evidence-based public health responses. Here, we evaluated the performance of GPT-3, GPT-3.5, and GPT-4 models in automated screening of publications for meta-analysis in the WBE literature. We show that the chat completion model in GPT-4 accurately differentiates papers that contain original data from those that do not, with the text of the Abstract as input, at a Precision of 0.96 and Recall of 1.00, exceeding current quality standards for manual screening (Recall = 0.95) while costing less than $0.01 per paper. GPT models performed less accurately in detecting studies reporting relevant sampling locations, highlighting the value of maintaining human intervention in AI-assisted literature screening. Importantly, we show that certain formulation and model choices generated nonsensical answers to the screening tasks, while others did not, underscoring the need for attention to robustness when employing AI-assisted literature screening. This study provides novel performance evaluation data on GPT models for document screening as a step in meta-analysis, suggesting that AI-assisted literature screening is a useful complementary technique to speed up research synthesis in WBE.
    DOI:  https://doi.org/10.1021/acsenvironau.4c00042
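    A minimal sketch of the screening-and-scoring step described above, assuming the OpenAI chat completions API; the model name, prompt wording, and answer parsing are placeholders rather than the authors' exact formulation, which the paper shows can matter.

    # Sketch under the assumptions stated above; "gpt-4" stands in for the
    # GPT-3/3.5/4 models the paper evaluated.
    from openai import OpenAI

    client = OpenAI()

    def screen(abstract: str) -> bool:
        """Ask the model whether an abstract reports original WBE measurement data."""
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Does the following abstract report original wastewater "
                           "surveillance data? Answer only 'yes' or 'no'.\n\n" + abstract,
            }],
        )
        return reply.choices[0].message.content.strip().lower().startswith("yes")

    def precision_recall(predicted, actual):
        """Precision and recall of the model's include decisions against human labels."""
        tp = sum(p and a for p, a in zip(predicted, actual))
        fp = sum(p and not a for p, a in zip(predicted, actual))
        fn = sum(a and not p for p, a in zip(predicted, actual))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall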