bims-arines Biomed News
on AI in evidence synthesis
Issue of 2024-10-20
five papers selected by
Farhad Shokraneh



  1. Res Synth Methods. 2024 Oct 16.
      Several AI-aided screening tools have emerged to tackle the ever-expanding body of literature. These tools employ active learning, where algorithms sort abstracts based on human feedback. However, researchers using these tools face a crucial dilemma: When should they stop screening without knowing the proportion of relevant studies? Although numerous stopping rules have been proposed to guide users in this decision, they have yet to undergo comprehensive evaluation. In this study, we evaluated the performance of three stopping rules: the knee method, a data-driven heuristic, and a prevalence estimation technique. We measured performance via sensitivity, specificity, and screening cost, and explored the influence of the prevalence of relevant studies and the choice of the learning algorithm. We curated a dataset of abstract collections from meta-analyses across five psychological research domains. Our findings revealed differences between stopping rules on all performance measures, and the performance of each rule varied across prevalence ratios. Moreover, despite the relatively minor impact of the learning algorithm, we found that specific combinations of stopping rules and learning algorithms were most effective for certain prevalence ratios of relevant abstracts. Based on these results, we derived practical recommendations for users of AI-aided screening tools. Furthermore, we discuss possible implications and offer suggestions for future research.
    Keywords:  literature screening; machine learning; meta-analysis; stopping rules; systematic reviews
    DOI:  https://doi.org/10.1002/jrsm.1762
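
    Editor's note: the abstract above concerns when an active-learning screener has likely surfaced most relevant records. As a purely illustrative aid, the sketch below shows a simplified slope-ratio ("knee"-style) stopping check in Python. The threshold of 6, the minimum-screened floor, and the function name are assumptions for illustration, not the settings evaluated in the paper.

# Illustrative knee-style stopping rule for AI-aided screening.
# Assumption (not from the paper): slope-ratio threshold of 6 and a
# minimum of 100 screened records before stopping is considered.

def knee_stop(cumulative_includes, threshold=6.0, min_screened=100):
    """Decide whether to stop screening.

    cumulative_includes[i] = number of relevant records found among the
    first i+1 screened records (ranked by the active-learning model).
    """
    n = len(cumulative_includes)
    if n < min_screened:
        return False  # not enough evidence yet

    best_ratio = 0.0
    # Try every candidate knee point and compare the slope before vs. after it.
    for i in range(1, n - 1):
        slope_before = cumulative_includes[i] / (i + 1)
        slope_after = (cumulative_includes[-1] - cumulative_includes[i]) / (n - i - 1)
        ratio = slope_before / max(slope_after, 1e-9)  # guard division by zero
        best_ratio = max(best_ratio, ratio)

    # Stop once the yield curve has flattened enough relative to the early phase.
    return best_ratio >= threshold


# Example: 40 relevant records found early, then the yield flattens out.
curve = [min(i, 40) for i in range(1, 301)]
print(knee_stop(curve))  # True once the tail is flat enough
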
  2. ALTEX. 2024 Oct 10.
      Systematic reviews (SRs) are an important tool in implementing the 3Rs in preclinical research. With the ever-increasing amount of scientific literature, SRs require increasing time investments. Thus, using the most efficient review tools is essential. Most available tools aid the screening process; tools for data extraction and/or multiple review phases are relatively scarce. Using a single platform for all review phases allows for auto-transfer of references from one phase to the next, which enables work on multiple phases at the same time. We performed succinct formal tests of four multiphase review tools that are free or relatively affordable: Covidence, Eppi, SRDR+ and SYRF. Our tests comprised full-text screening, sham data extraction and discrepancy resolution in the context of parts of a systematic review. Screening was performed as per protocol. Sham data extraction comprised free-text, numerical, and categorical data. Both reviewers kept a log of their experiences with the platforms throughout. These logs were qualitatively summarized and supplemented with further user experiences. We show the value of all tested tools in the SR process. Which tool is optimal depends on multiple factors, including previous experience with the tool, as well as review type, review questions, and review team member enthusiasm.
    Keywords:  data extraction; literature screening; systematic review
    DOI:  https://doi.org/10.14573/altex.2409251
  3. medRxiv. 2024 Sep 23. pii: 2024.09.20.24314108. [Epub ahead of print]
       Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world two-reviewer process.
    Materials and Methods: A dataset of 10 clinical trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n=5) and held-out test sets (n=17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the two LLMs were compared for concordance. In instances with discordance, original responses from each LLM were provided to the other LLM for cross-critique. Evaluation metrics, including accuracy, were used to assess performance against the manually curated gold standard.
    Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, with an increase in accuracy to 0.76.
    Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy.
    Conclusion: Large language models, when simulated in a collaborative, two-reviewer workflow, can extract data with reasonable performance, enabling truly 'living' systematic reviews.
    DOI:  https://doi.org/10.1101/2024.09.20.24314108
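
    Editor's note: as a loose illustration of the two-reviewer workflow described above (dual extraction, concordance check, cross-critique on disagreement), here is a minimal Python sketch. The call_llm() helper, model names, and prompt wording are hypothetical placeholders, not the authors' actual prompts or API calls.

# Sketch of a dual-LLM extraction loop with cross-critique on discordance.
# call_llm() is a stand-in for whatever LLM client you use; everything here
# is an assumed scaffold, not the workflow's published implementation.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text answer."""
    raise NotImplementedError("wire this to your LLM client of choice")


def extract_variable(article_text: str, variable: str) -> str:
    prompt = f"From the trial report below, extract '{variable}'.\n\n{article_text}"
    answer_a = call_llm("model-a", prompt)
    answer_b = call_llm("model-b", prompt)

    if answer_a.strip().lower() == answer_b.strip().lower():
        return answer_a  # concordant: accept without further review

    # Discordant: show each model the other's answer and ask it to reconsider.
    revised_a = call_llm(
        "model-a",
        f"{prompt}\n\nAnother reviewer answered: '{answer_b}'. "
        "Critique that answer and state your final answer.",
    )
    revised_b = call_llm(
        "model-b",
        f"{prompt}\n\nAnother reviewer answered: '{answer_a}'. "
        "Critique that answer and state your final answer.",
    )

    if revised_a.strip().lower() == revised_b.strip().lower():
        return revised_a  # resolved by cross-critique
    return f"STILL DISCORDANT: {revised_a} | {revised_b}"  # route to a human reviewer
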