bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-01-19
five papers selected by
Farhad Shokraneh



  1. Br J Ophthalmol. 2025 Jan 15. pii: bjo-2024-326254. [Epub ahead of print]
     BACKGROUND/AIMS: Large language models (LLMs) have substantial potential to enhance the efficiency of academic research. The accuracy and performance of LLMs in a systematic review, a core part of evidence building, have yet to be studied in detail.
    METHODS: We introduced two LLM-based approaches to systematic review: an LLM-enabled fully automated approach (LLM-FA) utilising three different GPT-4 plugins (Consensus GPT, Scholar GPT and GPT web browsing modes) and an LLM-facilitated semi-automated approach (LLM-SA) using GPT-4's Application Programming Interface (API). We benchmarked these approaches using three published systematic reviews that reported the prevalence of diabetic retinopathy across different populations (general population, pregnant women and children).
    RESULTS: The three published reviews consisted of 98 papers in total. Across these three reviews, in the LLM-FA approach, Consensus GPT correctly identified 32.7% (32 out of 98) of papers, while Scholar GPT and GPT-4's web browsing modes only identified 19.4% (19 out of 98) and 6.1% (6 out of 98), respectively. On the other hand, the LLM-SA approach not only successfully included 82.7% (81 out of 98) of these papers but also correctly excluded 92.2% of 4497 irrelevant papers.
    CONCLUSIONS: Our findings suggest LLMs are not yet capable of autonomously identifying and selecting relevant papers in systematic reviews. However, they hold promise as an assistive tool to improve the efficiency of the paper selection process in systematic reviews.
    Keywords:  Epidemiology; Public health
    DOI:  https://doi.org/10.1136/bjo-2024-326254
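The LLM-SA approach described above boils down to sending each candidate record to the GPT-4 API with the review's inclusion criteria and parsing an include/exclude verdict. A minimal sketch of that screening loop follows; the function names, prompt wording, and the injectable `call_llm` callable are illustrative assumptions, not the authors' code:

```python
def build_prompt(title, abstract, criteria):
    """Assemble a screening prompt that asks for an INCLUDE/EXCLUDE verdict."""
    return (
        f"Screening criteria: {criteria}\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        "Answer with exactly INCLUDE or EXCLUDE."
    )

def parse_verdict(response_text):
    """Map the model's free-text reply to a boolean include decision."""
    return response_text.strip().upper().startswith("INCLUDE")

def screen(records, criteria, call_llm):
    """Screen records; call_llm is any callable taking a prompt, returning text.

    Injecting the LLM call keeps the loop testable and API-agnostic.
    """
    included = []
    for rec in records:
        reply = call_llm(build_prompt(rec["title"], rec["abstract"], criteria))
        if parse_verdict(reply):
            included.append(rec)
    return included
```

In practice `call_llm` would wrap the GPT-4 API; here it is left abstract so the screening logic can be exercised with a stub.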
  2. BMC Med Res Methodol. 2025 Jan 15. 25(1): 10
       PURPOSE: The process of searching for and selecting clinical evidence for systematic reviews (SRs) or clinical guidelines is essential for researchers in Traditional Chinese medicine (TCM). However, this process is often time-consuming and resource-intensive. In this study, we introduce a novel precision-preferred comprehensive information extraction and selection procedure to enhance both the efficiency and accuracy of evidence selection for TCM practitioners.
    METHODS: We integrated an established deep learning model (Evi-BERT combined rule-based method) with Boolean logic algorithms and an expanded retrieval strategy to automatically and accurately select potential evidence with minimal human intervention. The selection process is recorded in real-time, allowing researchers to backtrack and verify its accuracy. This innovative approach was tested on ten high-quality, randomly selected systematic reviews of TCM-related topics written in Chinese. To evaluate its effectiveness, we compared the screening time and accuracy of this approach with traditional evidence selection methods.
    RESULTS: Our findings demonstrated that the new method accurately selected potential literature based on consistent criteria while significantly reducing the time required for the process. Additionally, in some cases, this approach identified a broader range of relevant evidence and enabled the tracking of selection progress for future reference. The study also revealed that traditional screening methods are often subjective and prone to errors, frequently resulting in the inclusion of literature that does not meet established standards. In contrast, our method offers a more accurate and efficient way to select clinical evidence for TCM practitioners, outperforming traditional manual approaches.
    CONCLUSION: We proposed an innovative approach for selecting clinical evidence for TCM reviews and guidelines, aiming to reduce the workload for researchers. While this method showed promise in improving the efficiency and accuracy of evidence-based selection, its full potential requires further validation. Additionally, it may serve as a useful tool for editors to assess manuscript quality in the future.
    Keywords:  Evidence-based medicine; Systematic review; TCM literature; Text mining
    DOI:  https://doi.org/10.1186/s12874-024-02430-z
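The procedure above couples a deep learning model with Boolean logic over retrieved text. A hedged sketch of what such a rule-based Boolean filter might look like; the rule structure (an OR group, AND terms, and NOT terms) and the example terms are assumptions, not the authors' implementation:

```python
def boolean_match(text, must_any, must_all=(), must_not=()):
    """Rule-based Boolean inclusion filter.

    must_any: at least one term must appear (OR group).
    must_all: every term must appear (AND terms).
    must_not: no term may appear (NOT terms).
    """
    t = text.lower()
    if not any(term in t for term in must_any):
        return False
    if any(term not in t for term in must_all):
        return False
    if any(term in t for term in must_not):
        return False
    return True
```

A filter like this is transparent and deterministic, which is what allows the selection process to be recorded and backtracked as the abstract describes.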
  3. Cureus. 2024 Dec;16(12): e75748
      INTRODUCTION: The application of natural language processing (NLP) for extracting data from biomedical research has gained momentum with the advent of large language models (LLMs). However, the effect of different LLM parameters, such as temperature settings, on biomedical text mining remains underexplored, and a consensus on what settings can be considered "safe" is missing. This study evaluates the impact of temperature settings on LLM performance for a named entity recognition task and a classification task in clinical trial publications.
    METHODS: Two datasets were analyzed using GPT-4o and GPT-4o-mini models at nine different temperature settings (0.00-2.00). The models were used to extract the number of randomized participants and to classify abstracts as randomized controlled trials (RCTs) and/or as oncology-related. Different performance metrics were calculated for each temperature setting and task.
    RESULTS: Both models provided correctly formatted predictions for more than 98.7% of abstracts across temperatures from 0.00 to 1.50. While the number of correctly formatted predictions decreased beyond that point, with the most notable drop between temperatures 1.75 and 2.00, the other performance metrics remained largely stable.
    CONCLUSION: Temperature settings at or below 1.50 yielded consistent performance across text-mining tasks, with performance declining at higher settings. These findings align with research on temperature settings for other tasks, suggesting stable performance within a controlled temperature range across various NLP applications.
    Keywords:  large language models; natural language processing; temperature; text mining; transformer
    DOI:  https://doi.org/10.7759/cureus.75748
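The headline metric in the study above is the share of correctly formatted predictions at each temperature. A minimal sketch of how that check and rate could be computed, assuming the expected format for the participant-extraction task is a bare non-negative integer (the regex and this format assumption are illustrative, not taken from the paper):

```python
import re

def is_valid_count(reply):
    """Accept a model reply only if it is a bare non-negative integer."""
    return re.fullmatch(r"\d+", reply.strip()) is not None

def format_validity_rate(replies):
    """Fraction of replies that parse as the expected integer format."""
    if not replies:
        return 0.0
    return sum(is_valid_count(r) for r in replies) / len(replies)
```

Running such a validator over each batch of model outputs, one batch per temperature setting, yields the per-temperature validity curve the abstract reports.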
  4. J Oral Facial Pain Headache. 2024 Jun;38(2): 74-81
    The objective was to develop and evaluate a comprehensive search strategy (SS) and automated classifier (AC) for retrieving temporomandibular disorders (TMD) research articles. An initial version of the SS and AC was created by compiling terms from various sources, including previous systematic reviews (SRs), and by consulting with TMD specialists. Performance was assessed using the relative recall (RR) method against a sample of all the primary studies (PS) included in 100 TMD-related SRs, with RR calculated for both SS and AC based on their ability to capture/classify TMD PSs. Adjustments were made iteratively. A validation was performed against PSs included in all TMD-relevant SRs published from January to April 2023. The analysis included 1271 PSs from 100 SRs published between 2002 and 2022. The initial SS had a relative recall of 89.34%, while the AC detected 70.05% of the studies. After adjustments, the fifth version reached 99.5% and 89.5% relative recall, respectively. Validation with 28 SRs from 2023 showed a search strategy sensitivity of 99.67% and an AC sensitivity of 88.04%. In conclusion, the proposed SS demonstrated excellent performance in retrieving TMD-related research articles, with only a small percentage not correctly classified by the AC. The SS can effectively support evidence synthesis related to TMD, while the AC can aid in creating an open-access, continuously updated digital repository for all relevant TMD evidence.
    Keywords:  Automated classification; Evidence-based dentistry; Research methodology; Systematic review; Temporomandibular disorders
    DOI:  https://doi.org/10.22514/jofph.2024.015
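Relative recall, the evaluation metric used above, is simply the fraction of known relevant primary studies (those already included in the reference set of systematic reviews) that the search strategy or classifier retrieves. A one-function sketch for clarity:

```python
def relative_recall(captured, total):
    """Relative recall: known relevant studies retrieved / all known relevant studies.

    captured: count of reference-set primary studies the strategy retrieved.
    total: count of all primary studies in the reference set.
    """
    if total <= 0:
        raise ValueError("total must be a positive count")
    return captured / total
```

With a reference set like the 1271 primary studies in the abstract, iterating the strategy and recomputing this ratio after each revision is exactly the tuning loop the authors describe.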