bims-arines Biomed News
on AI in evidence synthesis
Issue of 2024-11-10
three papers selected by
Farhad Shokraneh



  1. Syst Rev. 2024 Nov 01. 13(1): 274
       BACKGROUND: Title-abstract screening in the preparation of a systematic review is a time-consuming task. Modern natural language processing and machine learning techniques may allow title-abstract screening to be partly automated. Clear guidance on how to apply these techniques in practice is therefore highly relevant.
    METHODS: This paper presents a complete pipeline for using natural language processing techniques to make titles and abstracts usable for machine learning, and for applying machine learning algorithms to predict whether or not a publication should be forwarded to full-text screening. Guidance for the practical use of the methodology is given (see the sketch after this entry).
    RESULTS: The appealing performance of the approach is demonstrated by means of two real-world systematic reviews with meta-analysis.
    CONCLUSIONS: Natural language processing and machine learning can help to semi-automate title-abstract screening. Different project-specific considerations have to be made when applying them in practice.
    Keywords:  Automatization; Language models; Machine learning; Meta analysis; Natural language processing; Systematic review; Title-abstract screening
    DOI:  https://doi.org/10.1186/s13643-024-02688-w
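    A minimal sketch of such a screening pipeline is given below for orientation only. It is not the authors' implementation; the TF-IDF features, logistic-regression classifier, and probability-based ranking are assumed choices intended to illustrate the general workflow the abstract describes.

        # Sketch only: a generic title-abstract screening pipeline
        # (TF-IDF features + logistic regression), not the authors' code.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import Pipeline

        # Each record is the concatenated title and abstract; labels record whether
        # a human reviewer forwarded the record to full-text screening.
        texts = [
            "Randomised trial of drug X for condition Y ...",
            "Case report of a rare adverse event of drug X ...",
            "Cohort study of drug X in adults with condition Y ...",
            "Editorial on screening workload in systematic reviews ...",
        ]
        labels = [1, 0, 1, 0]  # 1 = forward to full-text screening, 0 = exclude

        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english",
                                      ngram_range=(1, 2), min_df=1)),
            ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ])
        pipeline.fit(texts, labels)

        # Rank unseen records by predicted inclusion probability so reviewers
        # can screen the most likely includes first.
        new_texts = ["Randomised trial of drug X in a paediatric population ..."]
        probs = pipeline.predict_proba(new_texts)[:, 1]
        for text, p in sorted(zip(new_texts, probs), key=lambda tp: -tp[1]):
            print(f"{p:.2f}  {text}")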
  2. Int J Technol Assess Health Care. 2024 Nov 05. 40(1): e48
       OBJECTIVES: The Health Technology Assessment International (HTAi) 2023 Annual Meeting included a novel "fishbowl" session intended to 1) probe the role of HTA in the emergence of generative pretrained transformer (GPT) large language models (LLMs) into health care and 2) demonstrate the semistructured, interactive fishbowl process applied to an emerging "hot topic" by diverse international participants.
    METHODS: The fishbowl process is a format for conducting medium-to-large group discussions. Participants are separated into an inner group and an outer group on the periphery. The inner group responds to a set of questions, whereas the outer group listens actively. During the session, participants voluntarily enter and leave the inner group. The questions for this fishbowl were: What are current and potential future applications of GPT LLMs in health care? How can HTA assess intended and unintended impacts of GPT LLM applications in health care? How might GPT be used to improve HTA methodology?
    RESULTS: Participants offered approximately sixty responses across the three questions. Among the prominent themes were: improving operational efficiency, terminology and language, training and education, evidence synthesis, detecting and minimizing biases, stakeholder engagement, and recognizing and accounting for ethical, legal, and social implications.
    CONCLUSIONS: The interactive fishbowl format enabled the sharing of real-time input on how GPT LLMs and related disruptive technologies will influence what technologies will be assessed, how they will be assessed, and how they might be used to improve HTA. It offers novel perspectives from the HTA community and aligns with certain aspects of ongoing HTA and evidence framework development.
    Keywords:  artificial intelligence; group processes; technology assessment, biomedical
    DOI:  https://doi.org/10.1017/S0266462324000382
  3. BMC Med Res Methodol. 2024 Nov 04. 24(1): 266
       BACKGROUND: Assessing the methodological quality of case reports and case series is challenging due to human judgment variability and time constraints. We evaluated the agreement in judgments between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and series.
    METHODS: We searched Scopus for systematic reviews published in 2023-2024 that cited the appraisal tool by Murad et al. A GPT-4-based agent was developed to assess methodological quality using the 8 signaling questions of the tool. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments (see the sketch after this entry).
    RESULTS: We included 797 case reports and series. The observed agreement ranged between 41.91% and 80.93% across the eight questions (the agreement coefficient ranged from 25.39% to 79.72%). The lowest agreement was noted in the first signaling question, about selection of cases. The agreement was similar in articles published in journals with impact factor < 5 vs. ≥ 5, and when excluding systematic reviews that did not use the 3 causality questions. Repeating the analysis using the same prompts demonstrated high agreement between the two GPT-4 attempts, except for the first question about selection of cases.
    CONCLUSIONS: The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case series and reports using the Murad tool. The current performance of GPT-4 seems promising but is unlikely to be sufficient for the rigor of a systematic review, and pairing the model with a human reviewer is required.
    Keywords:  Artificial intelligence; Case reports and series; Methodological quality assessment; Murad tool; Systematic review
    DOI:  https://doi.org/10.1186/s12874-024-02372-6
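    A minimal sketch of the general approach described above is given below for orientation only. It does not reproduce the authors' GPT-4 agent, prompts, or statistics; the OpenAI chat-completions call, the yes/no answer format, the paraphrase of the first signaling question, and the use of Cohen's kappa as a stand-in for the (unspecified) agreement coefficient are all assumptions.

        # Sketch only: LLM-based appraisal of one signaling question plus a simple
        # human-vs-model agreement calculation. Not the authors' code.
        from openai import OpenAI
        from sklearn.metrics import cohen_kappa_score

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def assess(article_text: str, question: str) -> str:
            """Ask the model to answer one signaling question with 'yes' or 'no'."""
            response = client.chat.completions.create(
                model="gpt-4",
                temperature=0,
                messages=[
                    {"role": "system",
                     "content": ("You assess the methodological quality of case reports "
                                 "and case series. Answer only 'yes' or 'no'.")},
                    {"role": "user",
                     "content": f"Signaling question: {question}\n\nArticle:\n{article_text}"},
                ],
            )
            return response.choices[0].message.content.strip().lower()

        # Hypothetical paraphrase of the first signaling question (selection of cases).
        question = ("Do the patients represent the whole experience of the investigator "
                    "or centre, or is the selection method unclear?")

        articles = ["<full text or abstract of case report 1>",
                    "<full text or abstract of case report 2>"]
        human = ["yes", "no"]  # published judgments extracted from the systematic reviews
        model = [assess(a, question) for a in articles]

        observed = sum(h == m for h, m in zip(human, model)) / len(human)
        kappa = cohen_kappa_score(human, model)  # stand-in for the agreement coefficient
        print(f"observed agreement = {observed:.2%}, kappa = {kappa:.2f}")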